[Remote] Lead Engineer – HPC Operations

Remote Full-time
Note: The job is a remote job and is open to candidates in USA. Core42 is a leader in AI-powered cloud and digital infrastructure, driving transformative technology solutions globally. They are seeking a highly skilled Lead Engineer – HPC Operations to oversee the daily operations and support of high-performance computing clusters designed to power large-scale AI and ML workloads, ensuring stable and high-performing infrastructure.ResponsibilitiesOversee the daily operational management of HPC infrastructure, including compute, storage, networking, and scheduler components (e.g., Slurm, Kubernetes, etc.)Drive efforts to optimize the efficiency and performance of HPC systems, ensuring maximum resource utilization and minimizing downtimeServe as the primary technical escalation point for L2 support teams, ensuring rapid and effective resolution of incidents and service requestsContinuously monitor system health, performance, and resource utilization using advanced monitoring tools (e.g., Prometheus, Grafana, DCGM)Manage user environments for AI/ML workloads, including container orchestration (e.g., Docker, Kubernetes) and workflow tools (e.g., MLflow, Kubeflow)Define and enforce job scheduling policies, priorities, and partitions within Slurm and/or Kubernetes environments to ensure resource fairness, efficiency, and workload optimizationLead root cause analysis (RCA) of operational issues, contributing to post-mortem documentation and driving continuous improvement initiativesProvide mentorship and technical guidance to junior engineers, fostering skills development and knowledge sharing across teams. Participate in on-call rotation as necessaryEnsure adherence to security and operational policies, assisting in audits and maintaining documentation for change and incident management processesSkillsBachelor's or Master's degree in Computer Science, Engineering, or a related technical fieldMinimum of 8 years of experience in HPC operations, systems engineering, or DevOps roles, with at least 2 years in a leadership or ownership capacityAdvanced expertise in configuring, optimizing, and maintaining complex HPC environments, including hardware, software, and storage systemsHands-on experience managing Slurm clusters and/or Kubernetes-based environments for AI/ML workloadsIn-depth knowledge of GPU resource management, workload schedulers, and performance tuning for AI/ML workloadsProficiency with monitoring and observability frameworks such as Prometheus, Grafana, and DCGMStrong scripting and automation skills, including Python, Bash, Ansible, and TerraformSolid understanding of Linux (RHEL/CentOS/Ubuntu), networking technologies (RDMA, InfiniBand, RoCE), and storage solutions (NFS, Lustre, Ceph)BenefitsBonus and benefits on topCompany OverviewCore42 is a developer of foundation models to empower organizations in different industries. It was founded in 2021, and is headquartered in Abu Dhabi, Abu Dhabi, ARE, with a workforce of 1001-5000 employees. Its website is https://www.core42.ai.

Apply Now →

Similar Jobs

Experienced Registered Behavior Technician for In-Home ABA Therapy - Atlanta, GA

Remote

Immediate Hiring: Experienced Registered Behavioral Technician (RBT) for Clinic-Based ABA Therapy Services

Remote

Experienced Registered Behavioral Technician (RBT) - ABA Therapy for Children with Autism Spectrum Disorder

Remote

Experienced Registered Nurse - Telehealth: Providing Remote Care Coordination and Patient Support

Remote

Experienced Substitute Teacher for Riverside County Schools - Join Scoot Education's Innovative Team

Remote

Experienced Substitute Teacher for San Bernardino County - Flexible Schedules & Competitive Pay

Remote

Experienced School Year Instructional Coach for High-Dosage Tutoring Programs in Edgewater Park, NJ

Remote

Experienced School Year Tutor for K-8 Students in Math and Literacy - Mickleton, NJ

Remote

Experienced Secondary Social Studies Teacher for Kansas - Flexible Hybrid Remote Arrangement

Remote

USPS Office Helper

Remote

Care Review Clinician, Prior Authorization

Remote

Inventory Specialist – Amazon Store

Remote

Experienced Full Stack Social Media Customer Support Representative – Remote Workforce Management

Remote

Join Today: Urgently Require Nursing Assistant I-3 West Acute

Remote

Senior Fraud Investigator - REMOTE

Remote

Director of Event Marketing/Community Engagement - **Wilmington, DE**

Remote

Admin & Operation Coordinator (Japanese-English Bilingual) — Remote, Eastern Time

Remote

Director, Origination (Data Center Tech Space)

Remote

[Remote] Senior Data Engineer [Remote-US]

Remote

Experienced Customer Support Representative – Remote Client Engagement & Error Resolution

Remote
← Back