[Remote] Site Reliability Engineer

Remote Full-time
Note: This is a remote position open to candidates in the USA.

Runpod is a rapidly growing company that provides a foundational platform for developers to build and run custom AI systems. As a Site Reliability Engineer, you will ensure the stability and resilience of Runpod's distributed platform by partnering with engineering teams, improving system design, and enhancing observability to prevent incidents.

Responsibilities
- Define and implement SLIs/SLOs for critical services
- Lead incident response and coordinate cross-team mitigation efforts
- Conduct blameless postmortems and ensure corrective actions are completed
- Perform production readiness reviews for new services and features
- Identify systemic risks and drive preventative improvements
- Design and improve monitoring, alerting, and dashboards (Prometheus, Grafana, etc.)
- Improve signal-to-noise ratio in alerts and reduce alert fatigue
- Build internal tooling for reliability tracking and reporting
- Improve visibility into GPU performance and distributed systems health
- Automate recurring operational workflows
- Build tools and scripts (Python, Go, Bash) to eliminate manual processes
- Improve deployment safety through automation and guardrails
- Strengthen CI/CD reliability and release processes
- Partner with engineering teams to improve system resilience
- Provide guidance on fault tolerance, scalability, and failure handling
- Contribute to architectural discussions with a reliability-first mindset

Skills
- 5+ years of experience in SRE, Reliability Engineering, or Production Engineering
- Strong Linux systems and networking expertise
- Experience managing containerized production systems
- Strong understanding of distributed systems and failure modes
- Experience defining and managing SLIs/SLOs
- Proven incident response and postmortem leadership experience
- Strong scripting or programming skills
- Experience with monitoring and alerting systems
- Excellent written communication skills
- Successful completion of a background check
- Experience with GPU infrastructure or AI/ML platforms
- Experience improving reliability in high-growth or large-scale environments
- Familiarity with GPU observability tooling
- Experience with Infrastructure as Code
- Experience working in startup environments
- Experience building internal reliability platforms or frameworks

Benefits
- Meaningful equity in a fast-growing company: everyone on the team receives stock options, so your impact drives our growth and you share in the upside.
- Generous medical, dental & vision plans
- Flexible PTO: take the time you need to recharge
- Most roles are remote-first, with inclusive, collaborative teams that use Slack as the main form of internal communication
- Join a passionate team on the cutting edge of AI infrastructure, where culture, learning, and ownership are at the heart of how we scale.

Company Overview
Runpod is a cloud platform designed for GPUs, enabling developers to deploy customized full-stack AI applications. It was founded in 2022 and is headquartered in Mount Laurel, New Jersey, USA, with a workforce of 51-200 employees. Its website is https://www.runpod.io.

Company H1B Sponsorship
Runpod has a track record of offering H1B sponsorships, with 4 in 2025 and 3 in 2024. Please note that this does not guarantee sponsorship for this specific role.
