HPC SYSTEMS ADMINISTRATOR

Responsibilities:

a. Cluster Management:
   i. Install, configure, and maintain HPC and AI cluster hardware and software.
   ii. Monitor cluster performance and resource utilization.
   iii. Ensure high availability and reliability of cluster resources.
   iv. Manage job scheduling and workload management systems (e.g., SLURM, PBS, Torque).

b. System Administration:
   i. Perform regular system updates, patches, and upgrades.
   ii. Manage user accounts, permissions, and security protocols.
   iii. Implement and maintain system backup and recovery procedures.
   iv. Troubleshoot and resolve hardware and software issues.

c. Performance Optimization:
   i. Analyze and optimize system performance and resource usage.
   ii. Implement tuning parameters and configurations for improved performance.
   iii. Collaborate with users to optimize their applications and workflows for HPC and AI environments.

d. User Support:
   i. Provide technical support and training to users.
   ii. Assist users with job submission, monitoring, and troubleshooting.
   iii. Develop and maintain documentation and user guides.

e. Security and Compliance:
   i. Implement and enforce security policies and procedures.
   ii. Monitor system security and respond to incidents.
   iii. Ensure compliance with relevant regulations and organizational policies.

f. Research and Development:
   i. Stay updated with the latest technologies and trends in HPC and AI.
   ii. Evaluate and recommend new hardware, software, and methodologies.
   iii. Collaborate with research teams to support new projects and initiatives.

g. Collaboration:
   i. Work closely with IT, research, and development teams to understand their requirements and provide solutions.
   ii. Participate in cross-functional projects and initiatives.
   iii. Communicate effectively with stakeholders and management.

h. Documentation and Reporting:
   i. Maintain detailed documentation of system configurations, procedures, and changes.
   ii. Generate reports on system performance, usage, and incidents.
   iii. Document and share best practices and lessons learned.

Qualifications:

a. Education:
   i. Bachelor's degree in Computer Science, Information Technology, Engineering, or a related field. Master's degree preferred.

b. Experience:
   i. Proven experience in HPC/AI and/or mainframe cluster administration.
   ii. Strong background in Linux/Unix system administration.
   iii. Experience with job scheduling and workload management systems.
   iv. Familiarity with AI and machine learning frameworks and tools.

c. Skills:
   i. Proficiency in scripting languages (e.g., Python, Bash).
   ii. Strong analytical and problem-solving skills.
   iii. Excellent communication and teamwork skills.
   iv. Knowledge of networking, storage, and security concepts.

d. Certifications (Optional):
   i. Certified HPC Administrator (CHPCA)
   ii. IBM or other mainframe certifications
   iii. Red Hat Certified Engineer (RHCE)
   iv. AWS Certified Solutions Architect
   v. Nutanix Certified Professional (NCP)
   vi. Nutanix Certified Advanced Professional (NCAP)
   vii. Nutanix Platform Expert (NPX)
   viii. Other relevant certifications

Job Types: Full-time, Permanent

Experience:
Linux/Unix: 2 years (Preferred)
HPC cluster administration: 2 years (Preferred)
HPC/AI cluster installation and configuration: 2 years (Preferred)