Los Alamos National Laboratory, Multiple HPC Intern Summer Opportunities | Michigan Institute for Computational Discovery and Engineering

For questions about internships and instructions on how to apply, please email [email protected]

HPC Data Movement and Storage Team: Upcoming Student Project Opportunities

PROJECT: EMERGING STORAGE SYSTEM(S) EVALUATION

(Lead Mentor: Dominic Manno)

Storage systems are evolving as technologies such as flash become economically viable. Vendors implementing cutting-edge hardware solutions often approach LANL for insight into how these systems could move into the real world (HPC applications). Work in this area includes potential modifications to filesystems, filesystem configuration and tuning, testing hardware, fixing bugs, and finding bottlenecks anywhere in the stack in order to increase efficiency and make the storage system faster.

Preferred skills:

● Interest in HPC and storage systems
● Comfortable with computer hardware
● Strong analytical skills
● Benchmarking experience
● Experience with Linux and scripting (bash, csh, Python, etc.)
● Comfortable with C programming

PROJECT: FILE SYSTEM(S) FEATURE AND TOOLSET EVALUATIONS

(Lead Mentor: Dominic Manno)

File systems evolve along with user requirements. New features are implemented to accommodate changing workloads and technology. LANL’s storage team must evaluate new features and their impact on HPC applications. This work will explore file system features, modifications to current build procedures and processes, and the impact on the storage team’s metric collection tooling. Work in this area includes building source code (kernel included), configuring Linux servers, configuring a basic distributed file system, benchmarking, experiment design, analysis of data, and scripting.

Preferred skills:

● Knowledge of and interest in filesystems
● Experience with Linux and Command Line Interface
● Experience with code build systems and software
● Interest in HPC and storage systems at scale
● Benchmarking experience

ABOUT THE HPC DATA MOVEMENT AND STORAGE TEAM:
The High Performance Computing (HPC) Data Storage Team provides vanguard production support, research, and development for existing and future systems that feed and unleash the power of the supercomputer. The Data Storage Team designs, builds, and maintains some of the largest, fastest, and most complex data movement and storage systems in the world, including systems supporting 100 Petabytes of capacity. We provide storage systems spanning the full range of tiers from the most resilient archival systems to the pinnacle of high-speed storage, including all-flash file systems and systems supplying bandwidth that exceeds a terabyte per second to some of the largest and fastest supercomputers in the world. Innovators and builders at heart, the Data Storage Team seeks highly motivated, productive, inquisitive, and multi-talented candidates who are equally comfortable working independently and as part of a team. Team member duties include: designing, building, and maintaining world-class data movement and storage systems; evaluating and testing new technology and solutions; system administration of HPC storage infrastructure in support of compute clusters; diagnosing, solving, and implementing solutions for various system operational problems; tuning file systems to increase performance and reliability of services; and process automation.

HPC Platforms Team: Upcoming Student Project Opportunities

PROJECT: HPC CLUSTER REGRESSION
(Lead Mentor: Alden Stradling)

Building on work done by our interns this summer, we are continuing the process of adapting existing regression testing software to do system-level regression testing. Using the LANL-developed Pavilion2 framework in combination with Node Health Check (NHC) for more detailed information, our interns are moving the system from proof-of-concept in a virtualized test cluster to production-style systems to measure effectiveness and system performance impact, and to flesh it out as a running system. Also on the agenda is to make test creation and propagation simple, allowing regression detection to be added at the same time as fixes are made to the system.
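As a rough illustration of what a system-level regression check involves, the sketch below compares a set of measured values against a stored baseline and reports anything that fell outside a tolerance. It does not use the Pavilion2 or NHC interfaces; the function name, the check names, and the 10% tolerance are all hypothetical and chosen only for the example.

```python
def find_regressions(baseline, current, tolerance=0.10):
    """Return the names of checks that regressed relative to the baseline.

    Assumes every metric is "higher is better". A check regresses when its
    current value is more than `tolerance` (as a fraction) below the baseline
    value, or when it is missing from the current results entirely.
    """
    regressions = []
    for name, expected in sorted(baseline.items()):
        measured = current.get(name)
        if measured is None or measured < expected * (1.0 - tolerance):
            regressions.append(name)
    return regressions


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    baseline = {"mem_bandwidth_gbs": 100.0, "nodes_up": 64}
    current = {"mem_bandwidth_gbs": 85.0, "nodes_up": 64}
    for name in find_regressions(baseline, current):
        print(f"regression detected: {name}")
```

Keeping the baseline as plain data is one way to let a new check be added alongside the fix that motivated it: the fix lands together with one more baseline entry.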

Preferred skills:

• Interest in HPC and modern infrastructure management at scale
• Problem solving and creativity
• Configuration Management
• Version Control
• Programming experience in Bash, Python, or Perl
• Strong background in UNIX and familiarity using CLI

About the HPC Platforms Team
The High Performance Computing (HPC) Platforms Team provides vanguard system and runtime support for some of the largest and fastest supercomputers in the world, including multi-petaop systems (e.g., the recently deployed 40 Peta operations per second Trinity Supercomputer). Troubleshooters and problem-solvers at heart, the HPC Platforms Team seeks highly motivated, productive, inquisitive, and multi-talented candidates who are equally comfortable working independently and as part of a team. Team member duties include: system deployment, configuration, and full system administration of LANL’s world-class compute clusters; evaluating and testing new technology and solutions; system administration of HPC network infrastructure in support of compute clusters; diagnosing, solving, and implementing solutions for various system operational problems; system software management and maintenance, including security posture maintenance; tuning operating systems to increase performance and reliability of services; developing tools to support automation, optimization, and monitoring efforts; interacting with vendors; and communicating and collaborating with other groups, teams, projects, and sites.

HPC Design Group: Upcoming Student Project Opportunities

PROJECT: OPTIMIZING “SPACK CONTAINERIZE” FOR USE WITH CHARLIECLOUD

(Lead Mentor: Tim Randles)

The Spack software package manager has the ability to output software build recipes as Dockerfiles. These Dockerfiles often require hand-editing to work well with Charliecloud. In this project you will work with the Charliecloud team at Los Alamos to identify common problems with Spack Dockerfiles. You will then determine if these problems are best addressed by making changes to Charliecloud’s Dockerfile support or if there are improvements that should be proposed to Spack’s containerize functionality. The intern will be expected to implement suggested changes. At the end of the summer the intern will present their work.

PROJECT: BUILDING A GITLAB TEST INFRASTRUCTURE USING THE ANSIBLE REPOSITORY

(Lead Mentor: Cory Lueninghoener)

Use GitLab’s CI/CD pipeline and runner functionality to build an automated test infrastructure for check-ins to our Git-backed Ansible repository. The work would start with getting familiar with GitLab’s automated pipeline capabilities and running tasks on code check-in, then move on to simple linting tests that run each time a change is checked in. From there, it could move on to running larger test suites on VMs or in containers, all the way up to building and testing virtual clusters and tagging good cluster image releases.
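To give a sense of the "simple linting tests" stage, here is a minimal sketch of a check that a CI job could run on each check-in. It is not part of the LANL repository or of GitLab itself: the function names and the two rules (no tab characters, no trailing whitespace in YAML files) are assumptions made for illustration, and a real setup would more likely call an established linter. The script exits non-zero when it finds a problem, which is how a pipeline job normally signals failure.

```python
import sys
from pathlib import Path


def lint_text(text):
    """Return a list of (line_number, message) problems found in the text."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if "\t" in line:
            # Crude rule: flags any tab, since YAML does not allow tabs for indentation.
            problems.append((number, "tab character"))
        if line != line.rstrip():
            problems.append((number, "trailing whitespace"))
    return problems


def lint_tree(root):
    """Lint every .yml/.yaml file under root; return {path: problems} for failing files."""
    results = {}
    for path in sorted(Path(root).rglob("*")):
        if path.suffix in {".yml", ".yaml"} and path.is_file():
            problems = lint_text(path.read_text(encoding="utf-8"))
            if problems:
                results[str(path)] = problems
    return results


if __name__ == "__main__":
    # Usage: python lint_yaml.py [directory]  (script name is hypothetical)
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    failures = lint_tree(root)
    for path, problems in failures.items():
        for number, message in problems:
            print(f"{path}:{number}: {message}")
    sys.exit(1 if failures else 0)
```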

About the HPC DES Group:
The High Performance Computing Design Group focuses on future technologies and systems related to HPC while providing technical resources when needed to the more production-focused HPC Groups. Areas of focus include I/O and storage, future HPC architectures, system management, hardware accelerators, and reliability and resiliency. Production timescales of projects vary from weeks in the future for production deployments to 10 years or more for some of the reliability and future architecture work.

Where You Will Work:
Our diverse workforce enjoys a collegial work environment focused on creative problem solving, where everyone’s opinions and ideas are valued. We are committed to work-life balance, as well as both personal and professional growth. We consider our creative and dedicated scientific professionals to be our greatest assets, and we take pride in cultivating their talents, supporting their efforts, and enabling their successes. We provide mentoring to help new staff build a solid technical and professional foundation, and to smoothly integrate into the culture of LANL.

Los Alamos, New Mexico enjoys excellent weather, clean air, and outstanding public schools. This is a safe, low-crime, family-oriented community with frequent concerts and events as well as quick travel to many top ski resorts, scenic hiking & biking trails, and mountain climbing. The short drive to work includes stunning views of rugged canyons and mesas as well as the Sangre de Cristo mountains. Many employees choose to live in the nearby state capital, Santa Fe, which is known for world-class restaurants, art galleries, and opera.

About LANL:
Located in northern New Mexico, Los Alamos National Laboratory (LANL) is a multidisciplinary research institution engaged in strategic science on behalf of national security. LANL enhances national security by ensuring the safety and reliability of the U.S. nuclear stockpile, developing technologies to reduce threats from weapons of mass destruction, and solving problems related to energy, environment, infrastructure, health, and global security concerns.

The High Performance Computing (HPC) Division provides production high performance computing systems services to the Laboratory. HPC Division serves all Laboratory programs requiring a world-class high-performance computing capability to enable solutions to complex problems of strategic national interest. Our work starts with the early phases of acquisition, development, and production readiness of HPC platforms, and continues through the maintenance and operation of these systems and the facilities in which they are housed. HPC Division also manages the network, parallel file systems, storage, and visualization infrastructure associated with the HPC platforms. The Division directly supports the Laboratory’s HPC user base and aids, at multiple levels, in the effective use of HPC resources to generate science. Additionally, we engage in research activities that we deem important to our mission.