[Remote] Infrastructure Operations Engineer

Remote Full-time
Note: This is a remote position open to candidates in the USA.

Lightning AI is the company behind PyTorch Lightning, building an end-to-end platform for developing AI systems. They are seeking an experienced Infrastructure Operations Engineer to help scale and operate their next-generation AI infrastructure platform, focusing on reliability, automation, and operational efficiency.

Responsibilities
- At the direction of the Manager of Infrastructure Operations, design, build, and roll out new platforms and patterns to minimize incidents and enable customer-facing and internal features
- Deploy updates and improvements to support both Voltage Park’s internal and end customer use cases
- Collaborate with colleagues in Infrastructure Engineering, Network Operations, Customer Success, and Software and Platform Development Teams
- Participate in the on-call rotation, which is evenly distributed across all team members in a primary/secondary pattern: you serve as primary, then move to the secondary position

Skills
- 8+ years working with Linux as a server/hosting platform; extra points for Ubuntu experience
- 5+ years of experience with AWS
- 2+ years of experience with Kubernetes and strong container fundamentals
- 2+ years of experience with Terraform and Ansible
- 2+ years with network-attached storage management (via NFS, Ceph, or other protocols); extra points for experience with VAST storage systems
- Experience with monitoring systems (Prometheus, ELK stack)
- Familiarity with the GitOps workflow
- Software development experience using Python, Go, Bash, or other languages for automation and for connecting systems and APIs together
- Deep networking fundamentals; extra points for experience with datacenter-level networks, 400Gb Ethernet, and InfiniBand
- Experience building and delivering complex systems
- Effective at navigating tradeoffs between design, risk, cost, and outcomes
- Comfortable with navigating ambiguity
- Strong written and oral communication
- Experience with bare metal hardware troubleshooting and provisioning; extra points for working with Dell hardware
- Experience with GPU servers, either on bare metal or under virtualization
- Deep experience with network switches, routers, and firewalls, particularly SONiC switches, Palo Alto firewalls, and Juniper Networks as vendors
- Experience with VAST storage systems

Benefits
- Comprehensive medical, dental, and vision coverage (U.S.); private medical and dental insurance (U.K.)
- Retirement and financial wellness support (U.S.); pension contribution (U.K.)
- Generous paid time off, plus holidays
- Paid parental leave
- Professional development support
- Wellness and work-from-home stipends
- Flexible work environment

Company Overview
The AI development platform - From idea to AI, Lightning fast ⚡️. Code together. Prototype. Train on GPUs. Scale. Serve. Lightning AI was founded in 2019 and is headquartered in New York, New York, USA, with a workforce of 51-200 employees. Its website is https://www.pytorchlightning.ai.
