[Remote] Site Reliability Engineer

Remote Full-time

Note: This is a remote position open to candidates in the USA.

Runpod is a rapidly growing company that provides a foundational platform for developers to build and run custom AI systems. As a Site Reliability Engineer, you will ensure the stability and resilience of Runpod’s distributed platform by partnering with engineering teams, improving system design, and enhancing observability to prevent incidents.

Responsibilities
- Define and implement SLIs/SLOs for critical services
- Lead incident response and coordinate cross-team mitigation efforts
- Conduct blameless postmortems and ensure corrective actions are completed
- Perform production readiness reviews for new services and features
- Identify systemic risks and drive preventative improvements
- Design and improve monitoring, alerting, and dashboards (Prometheus, Grafana, etc.)
- Improve signal-to-noise ratio in alerts and reduce alert fatigue
- Build internal tooling for reliability tracking and reporting
- Improve visibility into GPU performance and distributed systems health
- Automate recurring operational workflows
- Build tools and scripts (Python, Go, Bash) to eliminate manual processes
- Improve deployment safety through automation and guardrails
- Strengthen CI/CD reliability and release processes
- Partner with engineering teams to improve system resilience
- Provide guidance on fault tolerance, scalability, and failure handling
- Contribute to architectural discussions with a reliability-first mindset

Skills
- 5+ years of experience in SRE, Reliability Engineering, or Production Engineering
- Strong Linux systems and networking expertise
- Experience managing containerized production systems
- Strong understanding of distributed systems and failure modes
- Experience defining and managing SLIs/SLOs
- Proven incident response and postmortem leadership experience
- Strong scripting or programming skills
- Experience with monitoring and alerting systems
- Excellent written communication skills
- Successful completion of a background check
- Experience with GPU infrastructure or AI/ML platforms
- Experience improving reliability in high-growth or large-scale environments
- Familiarity with GPU observability tooling
- Experience with Infrastructure as Code
- Experience working in startup environments
- Experience building internal reliability platforms or frameworks

Benefits
- Meaningful equity in a fast-growing company: everyone on the team receives stock options, so your impact drives our growth and you share in the upside.
- Generous medical, dental & vision plans
- Flexible PTO: take the time you need to recharge
- Most roles are remote-first, with an inclusive, collaborative team that uses Slack as the main form of internal communication
- Join a passionate team on the cutting edge of AI infrastructure, where culture, learning, and ownership are at the heart of how we scale.

Company Overview
Runpod is a cloud platform designed for GPUs, enabling developers to deploy customized full-stack AI applications. It was founded in 2022 and is headquartered in Mount Laurel, New Jersey, USA, with a workforce of 51-200 employees. Its website is https://www.runpod.io.

Company H1B Sponsorship
Runpod has a track record of offering H1B sponsorships, with 4 in 2025 and 3 in 2024. Please note that this does not guarantee sponsorship for this specific role.
