[Remote] Platform Engineer (GPU)
Note: This is a remote job open to candidates in the USA.

Vero is an exciting AI infrastructure startup that collaborates closely with NVIDIA and other key organizations to shape the future of data centers. The Platform Engineer (GPU) will be responsible for the operation, optimization, and reliability of large-scale GPU clusters supporting AI/ML and HPC workloads, focusing on performance tuning and systems management.

Responsibilities
- Support the reliability, performance, and day-to-day operations of large-scale GPU infrastructure supporting AI/ML and HPC workloads
- Optimize Kubernetes platforms to maximize efficiency, utilization, and stability in production
- Develop reusable Terraform and Ansible modules to enable scalable, low-drift deployments
- Maintain high availability through strong observability, SLO/SLI ownership, and incident response practices
- Troubleshoot complex cross-layer issues and manage platform lifecycle (upgrades, scaling, security, multi-tenancy) in production environments

Skills
- 3+ years of experience in Platform Engineering, SRE, DevOps, or infrastructure roles
- Robust experience with GPU infrastructure and HPC clusters
- Proven experience operating and scaling large distributed systems in high-availability environments
- Kubernetes
- Terraform and Ansible
- Strong background in monitoring, observability, and incident response (Prometheus, Grafana, etc.)
- Slurm (or similar workload schedulers)

Benefits
- Huge equity upside
- Medical, dental, and vision insurance for the employee and family
- Equity Scheme
- Bonus
- 401(k) with a generous employer match
- Company-paid Life Insurance
- Flexible Spending Account
- Mental Wellness Benefits
- Flexible PTO

Company Overview
We help founders and leaders build high-impact teams by connecting them with exceptional talent globally, with a focus on the US. The company was founded in 2019 and is headquartered in London, City of London, GB, with a workforce of 11-50 employees. Its website is https://www.wearevero.io/.