High Performance Computing System Engineer
Job Code: 4833
Job Grade: K
The Stanford Research Computing Center (SRCC) is seeking outstanding applicants for the position of HPC System Engineer. Embedded with world-class researchers in the School of Earth, Energy and Environmental Sciences, you will join a dynamic and growing team of technology specialists supporting the computational and data needs of Stanford’s research community. This position will specifically focus on managing and supporting HPC clusters.
The hiring range for this position is $125,000 - $140,000.
The successful candidate will be someone who:
- Has built, managed, secured and supported HPC clusters before and is comfortable handling all aspects of that work, from racking servers, to configuring networking, to installing software for end-users, to providing one-on-one instruction and support
- Thrives when working in an academic environment
- Is passionate about technology and is driven by challenge and intellectual curiosity
- Is self-motivated to learn, sometimes on their own time
- Has user support experience and actually likes working with end-users on a daily basis
- Is extremely detail-oriented, documents well, and understands the importance of documentation
- Isn’t afraid of hardware
- Loves problem-solving
- Understands the need to ensure the usability of systems from the end-users’ perspectives
The SRCC is jointly sponsored by University IT (UIT) and the Office of the Dean of Research. The SRCC team of 18 cyberinfrastructure professionals offers research computing platforms, consultation, tool and software development, system engineering, and system administration in support of computational and data-intensive research across the Stanford campus.
This position will provide system administration, engineering and specialized technical consultation for existing and future systems and services for research computing workloads. The position will also specifically have responsibilities for managing high performance computing infrastructure in the School of Earth, Energy and Environmental Sciences and for providing technical consultation to researchers there. The work will include hands-on installation, management and support of complex compute environments, including filesystems and storage platforms, Linux server environments, containers, job schedulers, scientific tools, and application software.
- Support and administration of research computing clusters, servers and storage systems, including installation, network and security configuration, monitoring, producing and maintaining system documentation for users, maintenance, application software build/configuration, upgrading, patching, and complex user problem solving. Those systems will be housed in Stanford data centers.
- Provision computing platforms and associated storage and networking for research environments, incorporating novel technical solutions as needed to meet research requirements. Install, test and configure software tools, libraries and compilers to meet researchers’ needs.
- Customize environments as requested by research teams, with specific focus on the optimization of end-users’ experiences
- Provide advanced cyberinfrastructure training and consultation for faculty, postdocs and graduate students in the School of Earth, Energy and Environmental Sciences.
- Ensure systems are configured and managed in accordance with Stanford policies and any regulatory requirements specific to data sources and classifications.
- Conceive, design, develop, optimize, integrate, and maintain information technology at a complex level.
- Troubleshoot highly complex problems for which the analysis and resolution require extensive knowledge of many diverse system components
- Develop long range technology plans.
- Provide leadership and IT solutions for complex problems
*Other duties may be assigned.
Education and Experience
Bachelor's degree and eight years of related increasingly technical work experience or a combination of education and relevant experience. Strong, demonstrated knowledge of Linux and demonstrated experience managing multiuser compute clusters and associated storage environments are required as well.
Knowledge, Skills and Abilities
Advanced knowledge of Linux and of HPC cluster management and operation is required; experience managing, using, supporting and consulting on research computing cyberinfrastructure in an academic or research environment is strongly preferred. Proven ability to deliver outstanding system and service administration and end-user support in a thorough and timely manner is needed. This position requires that you be able to juggle multiple competing priorities, work quickly and accurately, and demonstrate initiative in conceptualizing and moving technical projects successfully to completion. The successful candidate must be able to perform independent analysis, troubleshooting and problem resolution, and must also work collaboratively with other team members and across organizational group boundaries. An essential component of the job is keeping up with and mastering current and emerging technologies that facilitate researchers’ computing work and that streamline and automate system administration tasks; this requires a demonstrated passion for and curiosity about the breadth of HPC technologies and tools, as well as technology trends in general.
This position requires hands-on experience building and supporting multi-tenant Linux servers/clusters and their associated networks, file systems and storage devices in production research environments. Specifically, the technical knowledge needed to be successful in this position includes:
- Expert demonstrated knowledge of Linux and managing Linux-based environments, including securing systems, and day-to-day troubleshooting, monitoring, support, software packaging, and working within industry-wide best practices
- Experience administering, configuring, and supporting systems with accelerators, and shared file systems and large-scale storage platforms. This includes hardware installation, configuration, upgrades and repairs
- Knowledge of and experience utilizing data and system security techniques, practices and standards as they relate to multi-user systems, storage and networks
- Hands-on experience installing, configuring and supporting job schedulers and resource managers (e.g., SLURM, OGE, LSF, Torque, Maui) is desirable.
- Familiarity with deploying virtualization technologies and basic knowledge of container technologies
- Exceptional written and verbal communication skills
- Experience using shell scripts (bash), programming languages (Python), and automated system management tools (e.g., Puppet)
- Familiarity with TCP/IP, Internet Routing Protocols, private and public networks, VLANs, Firewalls, Load Balancers, addressing schemes, subnet creation and subnet masking. Proven ability to troubleshoot basic network issues and communicate and work with a team of network engineers to solve possible network design issues
- Familiarity with the intersection of storage and networking disciplines: transport media, speeds of media, storage networks, IP based storage delivery, other storage delivery technologies
- Experience with some of the following applications: Git, Apache, Kerberos, LDAP
- Software installation and maintenance experience supporting research codes and clients
- Exceptional client service and communication, focusing on proactive system administrator actions and interactions to reduce or remove barriers to clients’ efficient use of resources to advance research
This position requires the ability to lift and manipulate storage and compute servers up to 40 pounds, rack and unrack equipment, and occasionally climb ladders. The position will support equipment in off-campus locations, so having a valid driver’s license is necessary. The position is expected to respond to critical system problems off-hours and must also be available for routine on-site system maintenance and patching, typically scheduled for evenings and weekends so as to minimize the disruption of research work. The position is expected to rotate on-call duties during winter break and other closures.
- Interpersonal Skills: Demonstrates the ability to work well with Stanford colleagues and clients and with external organizations.
- Promote Culture of Safety: Demonstrates commitment to personal responsibility and value for safety; communicates safety concerns; uses and promotes safe behaviors based on training and lessons learned.
- Subject to and expected to comply with all applicable University policies and procedures, including but not limited to the personnel policies and other policies found in the University’s Administrative Guide, http://adminguide.stanford.edu/.
Stanford is an equal employment opportunity and affirmative action employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, protected veteran status, or any other characteristic protected by law. Stanford welcomes applications from all who would bring additional dimensions to the University’s research, teaching and clinical missions.
Consistent with its obligations under the law, the University will provide reasonable accommodation to any employee with a disability who requires accommodation to perform the essential functions of the