Senior High Performance Computer Cluster System Engineer
The HPC cluster systems engineer is responsible for managing and supporting all HPC and Grid systems in the University data center and at distributed locations.
- Solves HPC and Grid-related problems on a daily basis.
- In support of change management within the data center, provides the CSC with information about the HPC systems.
- Verifies all HPC systems daily using the monitoring tools and proactively intervenes to solve problems.
- Analyzes solution components, understands systems integration challenges, and identifies technology gaps.
- Resolves these gaps, or proposes solutions to them, to reach future performance targets and functionality requirements.
- Prototypes features, performs integration checkout of various software components, and collaborates with component developers and solutions architects.
- Develops and drives validation test content and evaluates system components.
- Engages with industry partners as required to identify and investigate best-known methods used in the HPC community, and applies those methods.
- Collaborates with architects and developers to define architectural requirements for high-end HPC clusters.
- Responsible for system integration and validation of UAEU HPC clusters.
- Responsible for monitoring all HPC and Grid services.
- Co-ordinates work with vendors for support.
- Tests and deploys HPC systems.
- Knowledge of IT Service Management frameworks.
- Maintains accurate and comprehensive documentation and diagrams of the enterprise HPC system, backup infrastructure, communications flow, and routing.
- Other duties as assigned.
- Bachelor's degree in Computer Engineering or Computer Science required
- 3-6 years of experience
- HPC Cluster Administration
- Advanced Red Hat Linux Administration
- Knowledge of server hardware components, diagnostics, and replacement of defective items.
- Good communication and report-writing skills.
- Must be able to work under pressure in a fast-paced work environment.
- Must be able to work flexible hours, including evenings, weekends, holidays, and overtime as required, and should be available on call 24/7 in case of a major service outage.
- Strong problem solving, testing, and network troubleshooting skills
- Cluster solutions integration and administration
- Linux operating systems and OS components for HPC clusters
- Cluster provisioning, systems management, resource management middleware
- Cluster interconnect fabrics and software stack
- HPC Cluster storage solutions
- Parallel programming models for HPC clusters
Division: Information Technology Division-CIO
Department: Infrastructure & Core Techno. Section
Job Close Date: Open until filled
Job Category: Staff
Salary: 11,000 to 23,000 AED