Senior High Performance Computer Cluster System Engineer
- Full Time
The HPC cluster systems engineer is responsible for managing and supporting all HPC and Grid systems for the University data center and distributed locations.
- Solves HPC and Grid related problems on a daily basis.
- In support of change management within the data center, provides the CSC with information about the HPC systems.
- Verifies all HPC systems daily using the monitoring tools and proactively intervenes to solve problems.
- Analyzes solution components, understands systems integration challenges, and identifies technology gaps.
- Resolves or proposes solutions to these gaps to reach future performance targets and functionality requirements.
- Prototypes features, performs integration checkout of various software components, and collaborates with component developers and solutions architects.
- Develops and drives validation test content and evaluates system components.
- Engages with industry partners as required to identify and investigate best-known methods used in the HPC community, and applies those methods.
- Collaborates with architects and developers to define architectural requirements for high-end HPC clusters.
- Responsible for system integration and validation of UAEU HPC clusters.
- Responsible for monitoring all HPC and Grid services.
- Coordinates work with vendors for support.
- Tests and deploys HPC systems.
- Knowledge of IT Service Management frameworks.
- Maintains accurate and comprehensive documentation and diagrams of the enterprise HPC system, backup infrastructure, communications flow, and routing.
- Other duties as assigned.
- Bachelor's degree in Computer Engineering or Computer Science required
- 3-6 years of experience
- HPC Cluster Administration
- Advanced Red Hat Linux Administration
- Knowledge of server hardware components, diagnostics, and replacement of defective items.
- Good communication and report-writing skills.
- Must be able to work under pressure in a fast-paced work environment.
- Must be able to work flexible hours, including evenings, weekends, holidays, and overtime as required, and be available on call 24/7 in case of a major service outage.
- Strong problem solving, testing, and network troubleshooting skills
- Cluster solutions integration and administration
- Linux operating systems and OS components for HPC clusters
- Cluster provisioning, systems management, resource management middleware
- Cluster interconnect fabrics and software stack
- HPC Cluster storage solutions
- Parallel programming models for HPC clusters
Division: Information Technology Division-CIO
Department: Infrastructure & Core Techno. Section
Job Close Date: Open until filled
Job Category: Staff
Salary: 11,000 to 23,000 AED