Software Engineer, Service Reliability

California, United States
06 Nov 2018
End of advertisement period: 06 Jan 2019
Contract Type: Full Time

Job Code: 4823
Job Grade: L
Exemption: Exempt

Are you an individual ready to make an impact, leveraging your programming skills to change the world? Do you have experience scripting, integrating disparate systems, and developing tools? Do you say no to the status quo? Are you looking to join an exciting team that has over 100 years’ experience developing and supporting tools for Stanford University?

As a Service Reliability Engineer, you will need to be curious and possess deep technical knowledge. You will need to engage and build deep expertise with Stanford data and tools, apply high standards to the code around you, and develop an ability to identify highly impactful projects in a complex domain.

This position is responsible for building, documenting, and maintaining monitoring systems for all voice and network services. This includes traffic reporting and shaping, custom integration scripts, and a deep understanding of network topology and voice systems. You will be responsible for building, maintaining, and enhancing the internal monitoring systems and scripts that the Stanford Communications Services network and voice operations teams rely on to help resolve critical service-impacting issues.

We are looking for passion, attention to detail, pride in one's work, a desire to work cross-functionally, a sense of ownership, and ideas and opinions of your own. Above all, you will need to be able to work both independently and collaboratively with team members: asking questions to understand the requirements, estimating the level of effort for your development, and managing to a timeline. You need to be a problem solver and someone who uses ingenuity to solve hard problems. If you’re an enthusiastic team player who cares about the infrastructure, remains calm in a crisis, collaborates cross-functionally, and loves writing code to improve operations, we want to talk to you.

So… what do we do? At the end of the day, our organization’s mission is to deliver world-class service and technological solutions in support of research, teaching and learning, administration, and healthcare at Stanford. Our team gets pulled into projects requiring custom development of any kind. This development typically involves quickly learning third-party APIs, developing automation critical to the implementation or maintenance of an application or service, automating tasks in support of reliability, or developing tools where no software solution exists on the commercial market.

The Service Reliability Engineer is a member of the Automation and Service Reliability team in the Communication Services division of University IT (UIT), the central IT organization for Stanford University. This position will report directly to the Director of Automation and Service Reliability.


  • Work independently, including asking questions to understand the requirements, estimating the level of effort for your development, and managing development to a timeline.
  • Experience with the installation and configuration of Cisco routers in a TCP/IP-based networking environment is desired.
  • Learn quickly and apply knowledge to one's work to deliver solutions quickly.
  • Lead projects, as necessary, for special systems and software development in areas of complex problems.
  • Design and develop features both in the cloud and on premises that enable radically simplified automation.
  • Propose, conceptualize, design, implement, and develop solutions for difficult and complex applications independently.
  • Leverage third-party APIs to build integrations with other systems.
  • Possess expert programming and troubleshooting skills in order to resolve highly complex problems where the analysis and resolution require extensive knowledge of many diverse system components, such as authentication, networking, firewalls, databases, operating systems, storage, and server hardware.
  • Participate in meetings with senior-level staff and possess professional-services-level soft skills.
  • Oversee testing, debugging, change control, and documentation for projects.
  • Improve the physical design of existing systems to optimize performance.


Education & Experience:

Bachelor's degree and eight years of relevant experience, or a combination of education and relevant experience.

Knowledge, Skills and Abilities:

  • Domain experience with building solutions or applications on premises or in the cloud.
  • Proficient understanding of code versioning tools, such as Git.
  • Experience with continuous integration.
  • Experience administering Linux operating systems such as Debian and Red Hat.
  • Demonstrated experience leading activities on structured team development projects.
  • Ability to learn quickly and adapt to new technologies and programming tools.
  • Able to understand and often predict the emergent behavior of complex systems.
  • Demonstrated experience in designing, developing, testing, deploying, and securing applications.
  • Demonstrated experience leveraging third-party APIs to build integrations.
  • Demonstrated experience with application technologies such as servlet, JSP, JSTL, XHTML, XSLT, PHP, CSS, JavaScript, Perl, Flash, jQuery, JSON, and AJAX. Python experience required.
  • Strong understanding of data design, architecture, relational databases, and data modeling.
  • Must have an understanding of networking protocols (TCP/IP, HTTP, SSL, DNS, FTP, etc.), monitoring protocols (SNMP, Syslog), and a working knowledge of firewalls.
  • Must have extensive experience in Web Services (REST, SOAP), XML, and data persistence layer framework design and development.
  • Thorough understanding of all aspects of software development life cycle and quality control practices.
  • Must have a deep understanding of and good working experience with application servers such as Apache and Tomcat.
  • Experience with graph and chart libraries is a plus.
  • Bonus points for experience with network monitoring/configuration applications like InfoSim Stablenet, MicroFocus Network Automation, Splunk, Statseeker, or Cacti.
  • Bonus points for network operations experience (switches, routers, firewalls, VPNs, etc.).
  • Strong communication skills with both technical and non-technical clients.

Certifications and Licenses:



  • Constantly perform desk-based computer tasks.
  • Frequently sit, grasp lightly/fine manipulation.
  • Occasionally stand/walk, write by hand.
  • Rarely use a telephone, lift/carry/push/pull objects that weigh up to 10 pounds.

* - Consistent with its obligations under the law, the University will provide reasonable accommodation to any employee with a disability who requires accommodation to perform the essential functions of his or her job.

  • May work extended hours, evenings, and weekends.


  • Interpersonal Skills: Demonstrates the ability to work well with Stanford colleagues and clients and with external organizations.
  • Promote Culture of Safety: Demonstrates commitment to personal responsibility and value for safety; communicates safety concerns; uses and promotes safe behaviors based on training and lessons learned.
  • Subject to and expected to comply with all applicable University policies and procedures, including but not limited to the personnel policies and other policies found in the University's Administrative Guide,

Stanford is an equal opportunity employer and all qualified applicants will receive consideration without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, veteran status, or any other characteristic protected by law.