[Remote] Manager Site Reliability Operations
Note: This is a remote job open to candidates in the USA.

Mercury Insurance is a well-recognized company known for its achievements and culture, recently named one of America's Best Midsize Employers for 2026. The Site Reliability Operations Manager will lead a team responsible for observability, real-time monitoring, and incident management across production platforms, ensuring operational excellence and service reliability.

Responsibilities
- Lead the Site Reliability Operations team, including the Network Operations Center (NOC), responsible for observability, real-time monitoring, incident response, and operational excellence for key enterprise services; set direction, priorities, and success metrics for the team
- Partner with Product Management, Engineering, SRE, and the rest of the infrastructure team to embed CI/CD and release best practices into operations, including automated build/test/deploy, health checks, rollbacks, release monitoring via the NOC, and change-management guardrails
- Oversee service reliability monitoring and incident management: ensure appropriate observability (metrics, logs, traces, dashboards), well-tuned alerting thresholds, escalation paths, and effective communications to stakeholders and leadership during incidents
- Own and mature the Problem Management function for the team: drive root cause analysis (RCA) of recurring or high-severity incidents, standardize post-incident reviews, and ensure corrective actions and follow-ups are implemented and verified
- Define, track, and report operational and reliability metrics (e.g., availability, MTTR, incident volume, change failure rate, deployment frequency, problem resolution time); provide regular insights and recommendations to Technology Operations leadership
- Champion automation and “operations as code” (infrastructure as code, configuration as code, automated runbooks), working with engineering teams to reduce manual toil and improve consistency, speed, and safety of operations and releases
- Recruit, develop, coach, and evaluate team members; provide performance feedback, make salary and promotion recommendations, and foster a high-performing, collaborative culture aligned with Mercury’s core values
- Provide leadership coverage for 24x7 mission-critical support through the NOC and on-call rotations; ensure sustainable on-call practices, high-quality runbooks, and continuous improvement of tooling and processes

Skills
- Bachelor's degree in Computer Science, Information Systems, Engineering, or a related field, or an equivalent combination of education and work experience
- 7+ years of experience in IT operations, SRE, DevOps, or related roles supporting mission-critical systems
- 3+ years of experience in a lead or management role overseeing technical teams in a 24x7 environment
- Strong understanding of CI/CD pipelines (build, test, security scanning, deployment, rollback) and how they support reliable operations
- Solid knowledge of observability practices and tools (metrics, logs, traces, dashboards, alerts) and how to design actionable monitoring and alerting for production systems
- Deep familiarity with incident and problem management processes, including root cause analysis methods and post-incident review facilitation
- Working knowledge of DevOps/SRE concepts such as SLOs/SLIs, error budgets, resilience patterns, automation to reduce toil, and blameless culture
- Demonstrated ability to lead and influence cross-functional teams, build relationships, and collaborate effectively with engineering, InfoSec, infrastructure, and business stakeholders
- Excellent communication skills, both written and verbal; able to clearly communicate technical issues, risks, and recommendations to technical and non-technical audiences, including senior leadership
- Strong analytical and problem-solving skills; able to analyze operational data and trends to identify risks, drive decisions, and prioritize improvements
- Self-motivated, adaptable, and able to operate with minimal supervision in a fast-changing environment
- Ability to work extended hours, nights, or weekends as needed to support critical releases or resolve major incidents
- Advanced coursework, certifications, or experience in Site Reliability Engineering, DevOps, Cloud platforms, or ITIL
- Experience leading teams that support services deployed via modern CI/CD pipelines and running on cloud and/or container platforms (e.g., Kubernetes/OpenShift, AWS)
- Experience integrating operations functions with DevOps/SRE teams, including shared ownership of reliability goals and metrics

Benefits
- Competitive compensation
- Flexibility to work from anywhere in the United States for most positions
- Paid time off (vacation time, sick time, 9 paid Company holidays, volunteer hours)
- Incentive bonus programs (potential for holiday bonus, referral bonus, and performance-based bonus)
- Medical, dental, vision, life, and pet insurance
- 401(k) retirement savings plan with company match
- Engaging work environment
- Promotional opportunities
- Education assistance
- Professional and personal development opportunities
- Company recognition program
- Health and wellbeing resources, including free mental wellbeing therapy/coaching sessions, child and eldercare resources, and more

Company Overview
Mercury Insurance offers quality insurance ranging from personal auto insurance to homeowners insurance to mechanical breakdown protection. It was founded in 1962 and is headquartered in Los Angeles, California, USA, with a workforce of 5001-10000 employees. Its website is http://www.mercuryinsurance.com.

Company H1B Sponsorship
Mercury Insurance has a track record of offering H1B sponsorships, with 7 in 2026, 22 in 2025, 23 in 2024, 14 in 2023, 15 in 2022, 8 in 2021, and 13 in 2020. Please note that this does not guarantee sponsorship for this specific role.