HPC System Senior Administrator
Organization Name: Research Computing
Khalifa University is a world-class, research-intensive institution in Abu Dhabi, the capital city of the United Arab Emirates (UAE). The University’s mission is to seamlessly integrate research and education to produce world leaders and critical thinkers in science, engineering and medicine, and also to be a catalyst towards Abu Dhabi’s 2030 vision for a knowledge-based economy. Khalifa University has three campuses in the city of Abu Dhabi – the KU Main Campus, Masdar City Campus and the Sas Al Nakhl Campus, the latter two housing the internationally renowned Masdar Institute and Petroleum Institute, respectively.
The HPC System Senior Administrator is responsible for providing technical leadership in the design, development, installation and maintenance of hardware and software for the central High-Performance Computing (HPC) systems and/or research computing services at Khalifa University.
The role holder undertakes and leads system administration and support of HPC and other Linux-based systems operated and/or supported by Research Computing (RC). Examples may include but are not limited to technical support, account creation, issue diagnosis and resolution, access rights, system monitoring, training, system configuration, software maintenance, network configuration, creation of user documentation, system testing and improvement.
- Oversee system administration and service ownership of HPC clusters and infrastructure, including specialized servers, storage, and networking.
- Supervise operations of HPC resources and conduct routine HPC production operations activities
- Manage system users and provide end-user support on how to use the resources, including assistance with parallelization, debugging and optimization issues in their applications
- Keep end-users informed about resource issues, system changes, and any other updates
- Liaise with research teams to discuss issues and implement service improvements
- Install research software on the central research computing platform
- Conduct periodic checks and updates, when necessary, in order to maximize the up-time of the systems
- Monitor the usage and performance of the various systems used and their maintenance status
- Monitor HPC infrastructure health and utilization, and address operational issues both proactively and reactively.
- Organize and execute regular system maintenance and enhancement activities, including diagnosing and solving system operational problems and automating common processes when possible.
- Develop and manage service level agreements for the HPC services, and implement operational procedures including web-based access to HPC resources.
- Assist faculty in setting up their individual and group computing capabilities
- Facilitate a robust HPC user community from a variety of disciplines
- Coordinate with the HPC user community to establish service roadmap recommendations and service enhancements.
- Establish and manage training materials for the service desk, local support partners and the HPC user community; delivery of training will be coordinated with training and development groups.
- Collect, analyze, and report usage data to relevant parties including the HPC user communities and interested administrators.
- Prepare academic reports for management and other customized reports, when required
- Interact regularly with a wide range of internal and external constituents, faculty and staff members; provide information to academic staff and students
- Adhere to the University's information security and confidentiality policies and procedures, and report breaches or other security risks accordingly
- Coordinate with other departments to facilitate the accomplishment of tasks and responsibilities, as and when needed
- Perform any other tasks assigned by the Line Manager
- A Bachelor’s Degree in a related field.
- Proven experience as an HPC system administrator, preferably working with users in a research environment.
- Demonstrable experience in the integration of Linux systems in a large, multi-user networked environment. Examples of specific skills may include remote file systems, network authentication, email, DNS, NTP, LDAP, Active Directory, virtual machines, system installation and maintenance, provisioning systems, job scheduling, system log analysis and hardware monitoring.
- Knowledge of compilation techniques and usage methodologies for a range of scientific, technical, research-focused and HPC applications.
- Proven experience providing technical support to a user base with a wide range of experience and skill levels.
- An understanding of the importance of, and issues surrounding, network security, and experience of its implementation.
- Experience of performance evaluation and of building open & closed source code is desirable.
- Experience in writing and maintaining scripts using scripting languages such as Bash & Perl is essential.
- Demonstrable understanding of service issues in a multi-user, networked environment, e.g. risks of account compromise or privilege escalation, advising users on safe permissions, and good password practice.
- A minimum of 7-9 years of relevant experience.
- A minimum of 3-5 years of relevant experience
How To Apply
Applicants should submit an online application via the portal at http://www.ku.ac.ae/pages/careers. A complete application includes a curriculum vitae, a cover letter, a photo and the names and contact information of three references.
Should you require further assistance or face any issue with the online application, please feel free to contact the Recruitment Team (RecruitmentTeam@ku.ac.ae).