HPC Computing Systems Professional with Los Alamos National Laboratory
Where: Los Alamos, New Mexico
To apply: email Jeremy Hughes at [email protected]
For consideration, applicants should submit a resume along with a cover letter addressing how their knowledge, skills, and abilities meet the minimum requirements.
What You Will Do:
The High Performance Computing (HPC) Division at Los Alamos National Laboratory provides scientific computing resources consisting of some of the largest HPC systems in the world, including a large (19K+ node) Cray system called Trinity, as well as numerous large commodity cluster systems. Our HPC Computing System Professional (CSP) Team within the HPC Systems Group (HPC-SYS) provides vanguard production monitoring, support, testing, and maintenance for existing systems, as well as deployment support for future systems. Visit the HPC website to learn more: https://www.lanl.gov/org/ddste/aldsc/hpc/index.php
The CSP Team is seeking our next dynamic team member to help deploy and maintain our existing and future HPC systems. Mentoring of students, junior staff, and peers in technical and professional growth activities is highly valued, as is maintaining state-of-the-art technical expertise and knowledge within HPC and developing new skills in related disciplines.
This position will be filled at either the CSP 2, CSP 3 or CSP 4 level, as dictated by the current programmatic needs and skills of the selected candidate. Job responsibilities will be assigned in accordance with the level at which the selected candidate is hired.
Computing System Professional 2 ($73,800 – $120,400)
A successful Computing System Professional 2 candidate will participate in periodic on-call responsibilities and actively grow their HPC skill base and expertise across networking, data storage, and system administration as part of the HPC-SYS Triage Team. Specific tasks the selected candidate will engage in include deploying and testing new hardware, troubleshooting and diagnosing system failures, and modifying existing systems, software, and methods, while actively participating in knowledge sharing across teams.
Minimum Job Requirements
- Demonstrated knowledge of building, configuring, troubleshooting, and administering Linux computer/support systems, including Linux command line interface skills, and experience scripting in Bash, Perl, Python, or similar languages.
- Demonstrated effective communication skills, including demonstrated ability to work productively with customers and suppliers.
- Demonstrated ability to work in a team environment.
- Ability to obtain a Q clearance, which typically requires U.S. citizenship.
- Proven track record of continuous learning to advance technical skill sets and knowledge.
Education:
Position typically requires a bachelor’s degree from an accredited college or university and a minimum of four years of related experience, or an equivalent combination of education and experience. At this level, applicable advanced vendor and/or professional certification is desirable.
Computing System Professional 3 ($89,900 – $148,300)
A successful Computing System Professional 3 candidate will participate in periodic on-call responsibilities and apply subject matter expertise in one or more core topical areas (system, network, or data storage administration), both independently and in collaboration with other members of the team or group, after receiving initial direction and requirements from technical project leads. The selected candidate will also actively grow their HPC skill base and expertise across networking, data storage, and system administration as part of the HPC-SYS Triage Team. Specific tasks the selected candidate will engage in include deploying and testing new hardware, troubleshooting and diagnosing system failures, and modifying existing systems, software, and methods, while actively participating in knowledge sharing across teams. In addition, the selected candidate will have the opportunity to develop technical products such as documentation, presentations, technical papers, and reports to communicate findings internally.
Education:
Position typically requires a bachelor’s degree from an accredited college or university and a minimum of eight years of related experience, or an equivalent combination of education and experience. At this level, applicable advanced vendor and/or professional certification is desirable.
Minimum Job Requirements
In addition to the Job Requirements outlined above, qualification at the CSP-3 level requires:
- Proven ability to work independently and in a team environment to analyze problems, propose solutions to management, and deploy and document implemented solutions.
- Demonstrated experience building, configuring, and administering production Linux computer/support systems, including strong command line Linux operating system skills, working knowledge of or experience with hardware and software security practices, and intermediate experience scripting in Bash, Perl, Python, or similar languages.
- Demonstrated experience in automating tasks using programming and scripting.
- Ability to program in a compiled or interpreted language.
- Broad experience in network administration, including knowledge of TCP/IP, Ethernet, and/or High-Speed Networks (such as InfiniBand or Omni-Path) and/or broad experience in data storage administration, including knowledge of storage system hardware.
- Experience communicating technical information to both technical and non-technical personnel.
- Demonstrated ability to communicate technical strategy, accomplishments, and challenges to management team, as well as cross-organizationally.
Computing System Professional 4 ($98,900 – $165,100)
A successful Computing System Professional 4 candidate will, in addition to the duties outlined above, work as a technical leader to develop innovative advanced concepts, theories, methods, techniques, and approaches to address specialized system problems, including proposing and implementing solutions for current problems and future HPC technologies in conjunction with junior and senior administrators and technical staff within and across teams; proactively examine our HPC infrastructure by creating experiments and tooling to validate solutions and to detect and diagnose hardware health issues; and analyze best practices and technical results and share them with peers internally and at conferences, workshops, and meetings, as well as participate in strategic partnerships. The successful candidate will exercise independent judgment in methods, techniques, and evaluation criteria to achieve results while working closely with fellow HPC administrators as a leader and mentor to define and implement solutions at both tactical and strategic levels. In addition, the selected CSP 4 candidate will interact and/or collaborate with people from other teams, groups, divisions, directorates, and programs to develop, implement, and/or communicate technical solutions, and will work to enhance the technical and professional expertise of other staff and students through active mentoring and training activities.
Minimum Job Requirements
In addition to the Job Requirements outlined above, qualification at the Computing System Professional level 4 requires:
- Ability to leverage broad expertise or unique knowledge to contribute to development of technical objectives and principles as well as to achieve goals in creative and effective ways.
- Demonstrated experience building, configuring and managing parallel or distributed file systems.
- Broad demonstrated knowledge of production HPC system management topics, including at least three of the following areas: data storage, Linux system administration, networking, programming, operating systems, and configuration management, with depth in one or more areas.
- Demonstrated programming experience including compiled languages and advanced scripting.
- Demonstrated ability to evaluate competing computing subsystem technologies.
- Demonstrated ability to initiate, design, and lead technical efforts.
- Experience interacting with vendors and colleagues within the industry, including presenting technical results and practices to peers locally and at conferences.
Education:
Position typically requires a bachelor’s degree from an accredited college or university and a minimum of twelve years of related experience, or an equivalent combination of education and experience. At this level, advanced vendor and/or professional certifications are highly desirable and postgraduate coursework may be expected.
Desired Skills for all levels
- Experience working in a production computing environment, preferably with HPC data centers, large topology systems or at large scale.
- Experience supporting a scientific user base and/or experience managing computers in a DOE or DOD classified environment.
- Demonstrated experience with centralized configuration management in a heterogeneous computing environment.
- Demonstrated experience working with authentication services such as LDAP.
- Demonstrated experience maintaining various system services (Kerberos, NFS, SSH, Samba, etc.)
- Experience integrating operational metrics into a monitoring system such as Splunk.
- Experience configuring networks, network switches, and firewalls. Experience with multiple network technologies (e.g., Ethernet, IB, OPA).
- Experience with multiple Linux distributions; experience diagnosing system software problems; familiarity with Cfengine, Chef, Puppet, Ansible, Salt, or similar configuration and automation tools and practices; experience with revision control systems such as RCS, Subversion, or Git; and/or experience with low-level system administration tools such as perf, strace, tcpdump, and vmstat.
- Knowledge of parallel/distributed file systems (e.g., Lustre, GPFS, Panasas, GlusterFS).
- Knowledge of file systems such as ZFS, EXT, XFS and their underlying structures/characteristics.
- Experience with object storage and RESTful storage interfaces.
- Experience with archival storage systems.
- Experience configuring networks, network switches, and systems.
- Experience configuring network firewalls.
- Basic understanding of relational databases and database design methodologies.
- Contribution to open source or non-work-related projects.
- Demonstrated experience leading and mentoring teams, students, or junior team members.
- An active DOE Q clearance.
- An active SCI clearance.