[Remote] Staff AI/ML Infrastructure Engineer
Note: This is a remote position open to candidates in the USA.

Vultr is on a mission to make high-performance cloud infrastructure easy to use, affordable, and locally accessible for enterprises and AI innovators around the world. The Staff AI/ML Infrastructure Engineer will drive the design, performance, and reliability of the AI infrastructure platform. The role requires deep GPU systems knowledge and strong automation experience.

Responsibilities
- Design and maintain GPU and bare metal infrastructure in containerized and physical environments
- Build scalable GPU clusters in partnership with networking and provisioning teams
- Ensure reliable, high-performance provisioning of GPU infrastructure
- Develop automated testing systems for GPU-based platforms
- Implement infrastructure solutions for diverse AI/ML workloads
- Benchmark, test, and troubleshoot GPU performance at scale
- Collaborate with hardware vendors on drivers, firmware, and support
- Resolve hardware, software, and performance issues across environments
- Optimize rail and cluster performance across architectures
- Lead technical direction and mentor engineers on infrastructure best practices

Skills
- 5+ years of experience working with bare metal infrastructure and hardware automation
- Hands-on experience with modern NVIDIA/AMD GPU platforms and high-performance networking (RoCE, InfiniBand)
- Deep knowledge of BIOS, BMC, firmware, NICs, Redfish/IPMI, and PCIe systems
- Strong Linux systems experience, including device drivers and package management
- Experience building infrastructure automation using Python and Bash
- Familiarity with GPU drivers, firmware ecosystems, and vendor collaboration
- Experience designing and delivering complex infrastructure products
- Proven ability to lead projects and mentor engineers
- Experience optimizing multi-cluster GPU environments
- Exposure to machine learning software stacks and GPU workloads

Benefits
- 100% company-paid insurance premiums for employee medical, dental, and vision plans
- 401(k) plan that matches 100% up to 4%, with immediate vesting
- Professional Development Reimbursement of $2,500 each year
- 11 Holidays + Paid Time Off Accrual + Rollover Plan
- Commitment matters to Vultr! Increased PTO at 3-year and 10-year anniversaries + 1 month paid sabbatical every 5 years + Anniversary Bonus each year
- $500 stipend for remote office setup in first year + $400 each following year
- Internet reimbursement up to $75 per month
- Gym membership reimbursement up to $50 per month
- Company-paid Wellable subscription

Company Overview
Vultr is an AI cloud infrastructure platform offering latest-generation NVIDIA GPUs and AMD CPUs and GPUs across 32 worldwide regions. It was founded in 2014 and is headquartered in West Palm Beach, Florida, USA, with a workforce of 201-500 employees. Its website is https://www.vultr.com.

Company H1B Sponsorship
Vultr has a track record of offering H1B sponsorships, with 1 in 2024. Please note that this does not guarantee sponsorship for this specific role.