Site Reliability Engineer
Reporting to the Yale School of Management (Yale SOM) Manager of Infrastructure Services, the Site Reliability Engineer is responsible for the design, implementation, and management of infrastructure-related systems and services. These technologies encompass High Performance Computing, Virtualization, and Cloud Solutions.
With a deep technical understanding of modern scheduling, performance optimization, parallel file systems, and networking technologies, the Site Reliability Engineer scripts and builds infrastructure services through code. Working closely with faculty and staff, the engineer helps define and implement solutions to complex research problems and deploys services rapidly and repeatably.
- Work as a member of the operations team to maintain and enhance existing compute and infrastructure services.
- Identify and implement performance and stability improvements based on data and usage patterns.
- Establish strategy and processes consistent with the goals of the school, in its academic and research programs.
- Engage with customers to develop a keen understanding of their goals, requirements, and technical needs – and help to define and deliver high-value solutions that meet these needs.
- Build automation using software and system expertise. This automation will cover the full solution stack, from creating and managing infrastructure to scaling data processing for massive amounts of research data.
- Explore cutting-edge, cloud-native solutions to provide services to the institution cost-effectively.
- Generate and analyze weekly, monthly, quarterly, and annual operational reports to ensure achievement of metric-driven goals and operational efficiency.
- Manage relationships with all constituencies within the school, in other university departments, outside vendors, and consultants.
- Provide consultation and make recommendations for allocation of resources, product selection, and priorities to Yale SOM IT senior staff as well as the Yale SOM community.
- Ensure the highest level of service delivery for current and evolving technologies in a complex, multi-vendor, multi-platform environment.
- Develop relationships with Technology Partners and help them innovate, paving the way for the next generation of technology solutions.
- Assist in troubleshooting user-related issues, including job submission, package management, permissions, and service management.
- Other projects and duties as assigned.
Required Education and Experience
Bachelor’s degree and four years of relevant experience in service delivery or an equivalent combination of education and experience.
Required Skill/Ability 1:
Experience with Red Hat Satellite, CloudForms, and Ansible Tower or similar services. Working knowledge of configuration management and infrastructure as code (IaC) – Ansible, Terraform, etc. Experience with DevOps tools: Git, Continuous Integration, and Continuous Deployment.
Required Skill/Ability 2:
Experience with at least one job scheduler such as LSF, SLURM, xCAT, OpenShift, Kubernetes, Docker Swarm, or similar. Experience with large storage systems and parallel file systems – GPFS, Lustre, BeeGFS. Excellent interpersonal skills and superior customer service orientation to manage the needs of faculty, staff, and students.
Required Skill/Ability 3:
Self-driven, with a positive, can-do attitude. Team player with the ability to work collegially with peers, colleagues, and school leadership. Proven excellent oral and written communication and presentation skills.
Required Skill/Ability 4:
Ability to manage operations in a fast-paced and changing environment. Demonstrated project management skills on large, complex projects. Demonstrated ability to lead service delivery teams effectively.
Required Skill/Ability 5:
Ability to work under pressure and manage projects and priorities in a highly complex and dynamic environment. Ability to represent the school well in working collegially with peers and colleagues within and outside the university.
Preferred Education, Experience and Skills:
Four years of relevant experience in service delivery. Experience with system administration of Linux servers – RHEL. Experience with at least one Cloud Platform – Azure or AWS. Global orientation; experience working across countries and regions, and fluency in more than one language.
Weekend Hours Required?
Evening Hours Required?
Background Check Requirements
All candidates for employment will be subject to pre-employment background screening for this position, which may include motor vehicle, DOT certification, drug testing and credit checks based on the position description and job requirements. All offers are contingent upon the successful completion of the background check. Please visit www.yale.edu/hronline/careers/screening/faqs.html for additional information on the background check requirements and process.
The intent of this job description is to provide a representative summary of the essential functions that will be required of the position and should not be construed as a declaration of specific duties and responsibilities of the particular position. Employees will be assigned specific job-related duties through their hiring departments.
Affirmative Action Statement:
Yale University considers applicants for employment without regard to, and does not discriminate on the basis of, an individual’s sex, race, color, religion, age, disability, status as a veteran, or national or ethnic origin; nor does Yale discriminate on the basis of sexual orientation or gender identity or expression. Title IX of the Education Amendments of 1972 protects people from sex discrimination in educational programs and activities at institutions that receive federal financial assistance. Questions regarding Title IX may be referred to the University’s Title IX Coordinator, at TitleIX@yale.edu, or to the U.S. Department of Education, Office for Civil Rights, 8th Floor, Five Post Office Square, Boston MA 02109-3921. Telephone: 617.289.0111, Fax: 617.289.0150, TDD: 800.877.8339, or Email: firstname.lastname@example.org.
Yale University is a tobacco-free campus.