High-Performance Computing System Administrator
Bargaining Unit: None - Not included in the union (Yale Union Group)
Time Type: Full time
Duration Type: Regular
Compensation Grade: Administration & Operations
Compensation Grade Profile: Supervisor; Senior Associate (P5)
Work Location: Central Campus
Worksite Address: 160 St. Ronan Street New Haven, CT 06511
Work Week: Standard (M-F equal number of hours per day)
Searchable Job Family: Computing and Information Systems
Total # of hours to be worked: 37.5
The Yale Center for Research Computing (YCRC) seeks a High-Performance Computing System Administrator to join the center’s Engineering team to provide hardware and software administration for a growing number of high-performance computing (HPC) clusters used in faculty research.
The YCRC provides support that spans the Yale School of Medicine and Faculty of Arts & Sciences and encompasses Yale’s HPC clusters, multiple petabytes of high-performance storage, and technologies for computational science and the analysis, sharing, and management of large-scale research data.
The successful candidate will support the infrastructure behind all of the above, including hardware, system and resource-management software, networking, storage, monitoring and security measures. This is a highly collaborative effort, so frequent interaction with other system administrators, research-support staff, management, vendors and researchers is a regular part of the role. The successful candidate will also participate in designing, recommending and vetting architectures, specifications, and configurations of new systems.
The Yale Center for Research Computing is a component of the Provost’s Office, and is governed jointly by the Vice Provost for Research, the Deputy Dean(s) for Research at the Yale School of Medicine, and the Chief Information Officer of the University.
Responsibilities Specific to this Role
- Configure and support HPC clusters.
- Install, administer and maintain hardware, system software, networking, accounts, and security measures.
- Deploy and support data storage and backup.
- Diagnose and correct system issues, whether they affect correct operation or performance.
- Restore system integrity as quickly as possible following an outage to minimize downtime.
- Manage end-user accounts.
- Triage and solve user-submitted tickets, especially when they relate to the infrastructure.
- Track resource usage using monitoring and queuing software.
- Develop and maintain documentation for team members and end users.
- Research developments in HPC architecture and new technologies, processes, and methodologies.
- Patch system firmware and software as needed.
- Work with the team to determine specifications for new systems and tailor them to meet business needs.
- Conduct training and user education.
- Perform other duties as assigned.
- Provides technical expertise in resolving user system deficiencies and determines appropriate action.
- Provides system services and analyzes system performance for stakeholders and intended end users. Performs all activities necessary to activate a new operating system or new release of an existing system, including analysis, design, implementation, and related documentation. Analyzes system performance and modifies programs to increase the efficiency of the operation. Reinstates integrity of system as quickly as possible following an outage in order to minimize item and data loss.
- Recommends and authorizes system upgrades and software installations.
- Designs, develops and implements new system tools.
- Analyzes the execution time of commonly used instructions to identify and replace those that are inefficient or slow in operation.
- Analyzes, evaluates and takes steps to circumvent problems and restores systems to operating condition.
- Contributes to the determination of specifications and determines the combination of options needed to tailor an operating system to meet business needs.
- Conducts training and user education.
- Researches new technologies, processes, and methodologies.
- May perform other duties as assigned.
Required Education and Experience
Bachelor's Degree in a related field and a minimum of two years of related work experience or an equivalent combination of education and experience.
Required Skill/Ability 1:
Proven expertise with Linux operating system distributions. Expertise with bash and at least one other scripting language.
Required Skill/Ability 2:
Demonstrated expertise with Linux system administration, including OS, networking, storage, and security.
Required Skill/Ability 3:
Proven ability to work in a team environment in a fast-moving technology field.
Required Skill/Ability 4:
Excellent verbal and writing skills. Ability to interact well with team members and end users. Ability to work independently and across units.
Required Skill/Ability 5:
Attention to detail with the proven ability to take the care necessary to be entrusted with a system that hundreds of users depend on for research computation and the storage of research data.
Preferred Education, Experience and Skills:
Experience with HPC clusters, preferably including their administration. Extensive knowledge of clustering tools, e.g., Slurm, xCAT. Experience with technology in a research environment. Expertise with high-speed networking such as InfiniBand and 10/40 Gigabit Ethernet. Familiarity with large storage systems and parallel file systems such as GPFS and Lustre.
Background Check Requirements
All candidates for employment will be subject to pre-employment background screening for this position, which may include motor vehicle, DOT certification, drug testing and credit checks based on the position description and job requirements. All offers are contingent upon the successful completion of the background check. Please visit www.yale.edu/hronline/careers/screening/faqs.html for additional information on the background check requirements and process.
The intent of this job description is to provide a representative summary of the essential functions that will be required of the position and should not be construed as a declaration of specific duties and responsibilities of the particular position. Employees will be assigned specific job-related duties through their hiring departments.
Affirmative Action Statement:
Yale University considers applicants for employment without regard to, and does not discriminate on the basis of, an individual’s sex, race, color, religion, age, disability, status as a veteran, or national or ethnic origin; nor does Yale discriminate on the basis of sexual orientation or gender identity or expression. Title IX of the Education Amendments of 1972 protects people from sex discrimination in educational programs and activities at institutions that receive federal financial assistance. Questions regarding Title IX may be referred to the University’s Title IX Coordinator, at TitleIX@yale.edu, or to the U.S. Department of Education, Office for Civil Rights, 8th Floor, Five Post Office Square, Boston MA 02109-3921. Telephone: 617.289.0111, Fax: 617.289.0150, TDD: 800.877.8339, or Email: firstname.lastname@example.org.
Yale University is a tobacco-free campus.