[Remote] Principal GPU Infrastructure Engineer – AI/HPC Systems
Note: This is a remote position open to candidates in the USA.

Axiom Recruit is partnering with a rapidly scaling technology business that is building advanced compute infrastructure for next-generation AI systems. They are seeking a Principal GPU Infrastructure Engineer to design and operate large-scale GPU environments supporting demanding enterprise-grade workloads across high-performance compute platforms.

Responsibilities
- Own the lifecycle management of large-scale GPU infrastructure, from provisioning and firmware validation through to operational reliability
- Lead operations across high-density, liquid-cooled compute environments supporting next-generation AI workloads
- Build automated observability and remediation systems using Prometheus, Grafana, NVIDIA DCGM, and infrastructure automation tooling
- Drive NetBox DCIM integration, asset management, IPAM, and infrastructure compliance across complex compute environments
- Act as a senior technical lead for infrastructure operations, incident response, vendor management, and enterprise-level infrastructure support

Skills
- Strong experience managing large-scale GPU, HPC, or high-performance compute infrastructure
- Deep hands-on expertise with NVIDIA GPU systems, including H200, B200, or B300 environments
- Advanced knowledge of InfiniBand, NVLink, NVSwitch, and high-throughput networking architectures
- Strong Linux systems engineering background with infrastructure automation using Python or Go
- Experience with observability and monitoring tooling including Prometheus, Grafana, NVIDIA DCGM, and SNMP
- Proven experience across bare-metal provisioning, infrastructure lifecycle management, and automated/self-healing systems
- Experience with liquid-cooled or high-density compute environments
- Familiarity with NVIDIA Mission Control and GPU cluster management
- Exposure to confidential compute technologies and attestation workflows
- Experience building infrastructure standards in fast-scaling environments

Benefits
- Competitive salary and benefits package
- Opportunity to build next-generation AI infrastructure
- Exposure to cutting-edge GPU and HPC environments
- Strong ownership across infrastructure and automation
- Engineering-led culture working on mission-critical systems

Company Overview
Axiom Recruit is a Web3/Blockchain/AI recruitment firm. It was founded in 2019 and is headquartered in Dubai, UAE, with a workforce of 11-50 employees. Its website is https://www.axiomrecruit.com/.