[Remote] Senior Compute Platform Engineer
Note: This is a remote position open to candidates in the USA.

Stack AV is developing revolutionary AI and advanced autonomous systems to enhance safety and efficiency in the trucking transportation industry. The Senior Compute Platform Engineer will be responsible for designing and operating high-scale batch compute systems and workflow orchestration systems, ensuring reliability and efficiency in complex workloads.

Responsibilities
- Design and operate distributed systems for scheduling and executing large-scale batch workloads across Kubernetes clusters
- Build and maintain compute platform abstractions
- Optimize utilization of compute resources
- Develop and improve multi-tenant scheduling strategies
- Improve reliability and fault tolerance of large-scale distributed jobs and platform components
- Collaborate with teams across the company to understand workload requirements and improve platform capabilities
- Contribute to platform tooling, automation, and CI/CD workflows

Skills
- 7+ years of experience building and operating distributed systems or infrastructure platforms
- Strong experience with Kubernetes and container orchestration in production-grade environments
- Proficiency developing in Golang and Python
- Experience designing and operating large-scale batch compute systems
- Strong debugging and problem-solving skills in complex distributed systems
- Ability to collaborate across teams and communicate technical concepts clearly
- Experience with at least one batch scheduling system such as Kueue, Armada, Volcano, or Slurm

Company Overview
Stack AV operates in the transportation industry and develops advanced autonomous systems. It was founded in 2023 and is headquartered in Pittsburgh, Pennsylvania, USA, with a workforce of 51-200 employees. Its website is https://www.stackav.com.