Manager of Reliability Operations

Remote Full-time
About Nexcess

Nexcess brings together a portfolio of hosting, cloud, and digital experience brands to deliver high-performance infrastructure and services to businesses worldwide.

Our platforms power mission-critical applications for thousands of customers. Reliability is foundational to everything we do. We operate complex environments spanning virtualization, storage, networking, and application hosting, where performance, availability, and consistency matter at scale.

This is a permanent, full-time, remote position.

US Pay Band: $110K - $150K. Actual compensation will vary based on experience, skills, and location.

About the Role

We’re looking for a Manager of Reliability Operations to lead how we detect, respond to, and learn from failures across our platform ecosystem.

This role sits at the intersection of Operations and Engineering, bringing structure to incident response, accountability to follow-through, and clarity to reliability insights. You’ll ensure that what we learn from production directly improves how our platforms are built, operated, and scaled.

What You’ll Do

Own Reliability Operations & Incident Command
- Continuously evolve and improve incident management, change management, and post-incident practices
- Establish clear standards for incident declaration, severity, escalation, and communication
- Ensure consistent execution across teams and continuous process improvement
- Own the incident command function, including roles, structure, and operating procedures
- Lead or oversee major incident response in a 24/7 production environment
- Build and manage on-call incident commander rotations with global coverage

Drive Learning, Accountability & Reliability Strategy
- Own post-incident reviews, ensuring strong root cause analysis and clear documentation
- Translate incident trends into actionable reliability improvements
- Drive completion of corrective actions across teams; escalate when needed
- Define and maintain service performance and reliability targets (availability, latency, error rates)
- Own observability strategy, including monitoring, alerting, and signal quality
- Improve detection, reduce time to resolution, and increase platform resilience
- Partner with Engineering and Operations on capacity planning, patching, and lifecycle decisions
- Ensure reliability insights directly inform platform and infrastructure roadmaps
- Collaborate with Security on vulnerability response, patch prioritization, and compliance alignment

Operate Across a Complex Platform Environment
- Work across environments including virtualization platforms (VMware), distributed storage (Ceph), Linux-based systems, and hybrid cloud infrastructure
- Support platforms that span dedicated hosting, managed applications, and high-availability cloud services
- Ensure reliability practices scale across multiple products, brands, and customer environments
- Provide regular, data-driven reporting to leadership on availability, incident trends, and operational performance
- Act as the central authority on reliability insights across teams

What You Bring
- Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent practical experience)
- 7+ years of experience in systems operations, site reliability, or platform engineering
- 2+ years of experience leading teams or major operational functions
- Proven experience managing incidents in a 24/7 production environment
- Strong background in troubleshooting, root cause analysis, and operational improvement
- Experience with change management practices

Platform & Tooling Experience
- Monitoring and observability platforms (e.g., Datadog, Prometheus, Grafana, New Relic)
- Incident management and alerting tools (e.g., PagerDuty, Opsgenie)
- Infrastructure and platform technologies (Linux systems, VMware, Ceph, cloud platforms)
- Logging and telemetry systems (centralized logging, metrics, tracing)
- Ability to translate complex technical data into clear insights
- Strong communication skills, especially in high-pressure situations

Nice to Have
- Background in Computer Science, Engineering, or a related field
- Experience in managed hosting, cloud infrastructure, or SaaS environments
- Experience defining and tracking system reliability and performance targets
- Familiarity with ITIL or similar operational frameworks
- Exposure to VMware, Ceph, Linux, and Windows platforms
- Relevant certifications (AWS, RHCE, etc.)

We Offer:
- Traditional and Roth 401k with company matching
- A collaborative team culture
- Consistent/set work hours
- Challenging non-redundant daily duties
- A voice in how things get done

Disclaimer: This job description is only a summary of the typical functions of the position. It is not intended to be an exhaustive or comprehensive list of all job responsibilities, tasks, or duties. Additional duties and tasks may be assigned as part of the job function. Liquid Web Inc. reserves the right to modify, interpret, or apply this job description in a way that best supports the organizational needs. The job description in no way creates or implies an employment contract. The employment contract remains “at will”.

Equal Employment Opportunity Policy: Liquid Web is committed to offering equal employment opportunity without regard to age, color, disability, gender, gender identity, genetic information, marital status, military status, national origin, race, religion, sexual orientation, veteran status, or any other legally protected characteristic.

Manual Data for Pay Scale

Suggested Range for Manager of Process Operations - $110 - $150

PayFactors:
- Reliability Engineer: $140K
- Process Analyst: $85K
- Median Manager of Process Operations: $112K

Higher-Scope Variant: SRE Manager
This is an SRE but includes some leadership and strategy in scope.
- Average: $132,583
- Typical range: $114K – $151K
- Top end: $175K

Salary.com - SRE Leadership
- Average: $165K
- Range: $150K - $185K

Market data shows baseline Reliability Manager roles averaging around $100K, but those positions are typically scoped to team-level responsibilities or localized operational support.
Roles with broader operational ownership, particularly those responsible for defining and enforcing incident management, change management, and post-incident processes across an organization, trend higher, with comparable operations and SRE leadership roles averaging $115K – $130K and extending into the $150K+ range at senior levels.

This position is not simply managing reliability within a team; it owns how the organization operates during incidents, how work is prioritized and escalated, and how accountability is enforced across functions. Because the role is responsible for driving consistency, governance, and follow-through across Operations, Engineering, and Security, it aligns more closely with senior operational leadership than traditional reliability management. Positioning the role in the $110K – $150K range ensures we can attract candidates with the experience to build, standardize, and scale these processes effectively.



