[Remote] Infrastructure Operations Engineer
Note: This is a remote job open to candidates in the USA.

Voltage Park is your enterprise AI factory, offering scalable compute power and AI infrastructure. They are seeking a highly skilled Infrastructure Operations Engineer to ensure the stability, scalability, and performance of their compute, storage, and platform infrastructure, supporting AI/ML workloads at scale.

Responsibilities
- At the direction of the Manager of Infrastructure Operations, design, build, and roll out new platforms and patterns to minimize incidents and enable customer-facing and internal features
- Deploy updates and improvements to support both Voltage Park's internal and end-customer use cases
- Collaborate with colleagues in the Infrastructure Engineering, Network Operations, Customer Success, and Software and Platform Development teams
- Participate in the on-call rotation, which is evenly distributed across all team members in a primary/secondary pattern: you serve as primary, then move to the secondary position

Skills
- 8+ years working with Linux as a server / hosting platform, extra points for Ubuntu experience
- 5+ years of experience with AWS
- 2+ years of experience with Kubernetes and strong container fundamentals
- 2+ years of experience with Terraform and Ansible
- 2+ years with network attached storage management (via NFS, Ceph, or other protocols), extra points for experience with VAST storage systems
- Experience with monitoring systems (Prometheus, ELK stack)
- Familiarity with the GitOps workflow
- Software development experience using Python, Go, Bash, or other languages for automation and for connecting systems and APIs together
- Deep networking fundamentals, extra points for experience with datacenter-level networks, 400Gb Ethernet, and InfiniBand
- Experience building and delivering complex systems
- Effective at navigating tradeoffs between design, risk, cost, and outcomes
- Comfortable with navigating ambiguity
- Strong written and oral communication
- Experience with bare metal hardware troubleshooting and provisioning, extra points for working with Dell hardware
- Experience with GPU servers, either on bare metal or under virtualization
- Deep experience with network switches, routers, and firewalls, particularly SONiC switches, Palo Alto firewalls, and Juniper Networks equipment
- Experience with VAST storage systems

Company Overview
Voltage Park is a cloud platform providing on-demand and reserved GPU infrastructure for AI and machine learning workloads. It is a sub-organization of Lightning AI. It was founded in 2023 and is headquartered in Berkeley, California, USA, with a workforce of 51-200 employees. Its website is https://voltagepark.com/.