[Remote] Platform Engineer II/III
Note: This is a remote position open to candidates in the USA.

Zone 5 Technologies is redefining what's possible in unmanned aircraft systems, developing cutting-edge autonomous solutions. They are seeking a Platform Engineer to architect and operate scalable compute infrastructure that powers their autonomous vehicle simulation and testing framework.

Responsibilities
- Design and implement auto-scaling compute infrastructure for simulation workloads using cloud platforms
- Build and maintain on-premises GPU and CPU clusters for simulation and machine learning training
- Architect hybrid cloud solutions that optimize cost and performance across cloud and local compute resources
- Implement job scheduling and orchestration systems using Kubernetes for thousands of concurrent simulations
- Design storage solutions for large-scale simulation data, logs, and artifacts using cloud and local storage systems
- Deploy and maintain robotics simulation environments at scale
- Build CI/CD pipelines for automated simulation testing of autonomy software
- Create infrastructure for distributed parameter sweeps, Monte Carlo testing, and regression suites
- Develop monitoring and observability systems for simulation fleet health and resource utilization
- Implement data pipelines for simulation results ingestion, analysis, and visualization
- Write and maintain infrastructure as code for reproducible infrastructure deployment
- Build automation tools and CLI utilities to simplify developer access to compute resources
- Implement GitOps workflows for infrastructure changes and configuration management
- Create self-service interfaces for engineers to launch and manage simulation jobs
- Develop cost monitoring and optimization strategies for cloud and on-prem resources
- Monitor and optimize infrastructure performance, reliability, and cost efficiency
- Troubleshoot complex distributed systems issues across networking, storage, and compute layers
- Implement backup, disaster recovery, and business continuity strategies
- Maintain security best practices including IAM, secrets management, and network isolation
- Collaborate with autonomy, ML, and robotics teams to understand compute requirements and optimize workflows
- Design and implement network architectures for distributed simulation workloads across AWS and on-premises environments
- Configure VPCs, subnets, security groups, and routing for secure, high-performance compute clusters
- Establish hybrid cloud connectivity (VPN, Direct Connect, site-to-site tunnels) between on-premises and cloud resources
- Optimize network performance for large data transfers, multi-node communication, and distributed workloads
- Support internal infrastructure network design and provide technical guidance to engineering programs
- Troubleshoot network issues including latency, packet loss, and connectivity problems across distributed systems

Skills
- Bachelor's in Computer Science, Software Engineering, or related technical field; equivalent industry experience also welcome
- 2-5+ years of experience in platform engineering, DevOps, SRE, or cloud infrastructure roles
- Strong hands-on experience with Kubernetes for container orchestration and workload management
- Experience with cloud computing platforms and services (compute, storage, networking)
- Deep understanding of Linux system administration and troubleshooting
- Strong networking fundamentals including TCP/IP, routing, DNS, VPNs, and security
- Understanding of infrastructure as code principles and configuration management
- Proficiency in scripting and automation (Python, Bash, or similar)
- Experience building and maintaining CI/CD pipelines
- Solid grasp of distributed systems concepts, job scheduling, and resource management
- Ability to design infrastructure from first principles and make architectural decisions
- Experience building infrastructure for simulation, robotics, or autonomous systems workloads
- Understanding of GPU computing and accelerated workload management
- Knowledge of job scheduling systems for batch and parallel workloads
- Experience managing on-premises clusters and hybrid cloud architectures
- Familiarity with robotics middleware (ROS/ROS2) or simulation platforms
- Understanding of cost optimization for compute-intensive workloads
- Experience with monitoring, logging, and observability systems
- Knowledge of containerization technologies and image management
- Background in data engineering, MLOps, or machine learning infrastructure
- Experience with network performance analysis and troubleshooting
- Understanding of software-defined networking and network automation
- Familiarity with security compliance requirements in aerospace/defense environments

Benefits
- Competitive total compensation package
- Comprehensive benefit package options including medical, dental, vision, life, and more
- 401k with company match
- 4 weeks of paid time off each year
- 12 annual company holidays

Company Overview
Zone 5 Technologies is an aviation component manufacturing company that develops and tests unmanned aircraft systems. It was founded in 2011 and is headquartered in San Luis Obispo, California, USA, with a workforce of 201-500 employees. Its website is https://www.zone5tech.com.