[Remote] Site Reliability and DevOps Engineering Lead

Remote, Full-time
Note: This is a remote position open to candidates in the USA.

Merative is a company that provides trusted clinical decision support solutions through its Micromedex platform. They are seeking a highly skilled Platform Reliability & DevOps Engineering Lead to ensure the platform is highly available, performant, scalable, and secure, while also driving the platform reliability and DevOps strategy.

Responsibilities

- Lead, mentor, and grow Platform / DevOps engineers
- Build a high-performing Platform team
- Drive accountability for platform reliability and delivery outcomes
- Lead vendors to deliver capabilities in production
- Ensure platform capabilities accelerate product delivery and remove bottlenecks
- Define and enforce platform engineering standards and DevOps practices across all teams and vendors
- Lead capacity planning, performance optimization, and cost efficiency
- Define operational standards, runbooks, and reliability practices
- Be accountable for platform reliability outcomes at enterprise/product level
- Act as technical authority across platform, reliability, and delivery
- Define platform strategy and roadmap
- Govern delivery across internal teams and vendors
- Own SLIs, SLOs, and error budgets
- Lead resilience engineering, observability, and failure design
- Drive proactive risk reduction and continuous improvement
- Own incident management frameworks and continuous improvement
- Own end-to-end pipeline architecture and release automation
- Standardize, secure, and fully automate pipelines
- Drive continuous integration, delivery, and validation practices
- Lead Sev1 response, escalation, and recovery
- Own RCA and drive systemic fixes (not point fixes)
- Embed AI into monitoring, risk prediction, and CI/CD optimization
- Drive automation to reduce operational toil and improve decision-making

Skills

- Bachelor's degree in Computer Science, Engineering, or a related field
- 6-10 years of hands-on experience in software operations, DevOps, and Site Reliability Engineering, including managing large-scale, mission-critical systems
- Clear and confident communication skills, with the ability to lead teams and collaborate effectively across engineering, product, and architecture teams
- Proven track record of ensuring high availability and performance in production environments, with expertise in fault-tolerant, distributed system design
- Excellent understanding of modern software delivery pipelines and DevOps practices, including CI/CD, configuration management, and version control (Git)
- Exceptional problem-solving skills, with experience diagnosing complex system issues under pressure and driving them to resolution
- Strong proficiency in at least one programming or scripting language (e.g., Python, Bash, or Java) for automation and tool integration
- Self-driven and proactive, with a passion for automating manual processes and continuously improving systems to enhance reliability and team productivity
- Proven experience releasing into and running mission-critical, high-availability SaaS platforms
- Experience technically leading a Platform team and influencing stakeholders and vendors
- Stakeholder engagement across Product, Architecture, and Operations
- Deep expertise in Site Reliability Engineering (SLI/SLO, error budgets, incident management)
- DevOps operating models and platform engineering (engineering transformation)
- CI/CD architecture and release automation
- Cloud, Systems & Infrastructure (DB2, Oracle, Infinispan, OpenLiberty)
- Automation-first engineering with proven usage of AI (self-healing, triage)
- Java application platforms and runtimes (performance tuning, troubleshooting, production operations)
- Strong experience with Cloud platforms (Azure preferred)
- Distributed systems and fault-tolerant architectures
- Performance tuning and scaling
- Database optimization (DB2, Oracle, PostgreSQL)
- Multi-region / active-active environments
- Monitoring, logging, and tracing frameworks
- Experience embedding reliability practices into the SDLC
- Hands-on experience with DB2, Oracle, Infinispan, OpenLiberty, and Azure
- Infrastructure as Code (Terraform or similar)
- Containerization and orchestration (Docker/Kubernetes)

Benefits

- Remote first / work from home culture
- Flexible vacation to help you rest, recharge, and connect with loved ones
- Paid leave benefits
- Health, dental, and vision insurance
- 401k retirement savings plan
- Infertility benefits
- Tuition reimbursement, life insurance, EAP, and more

Company Overview

Merative is an IT services company that offers products to improve decision-making and performance. It is a sub-organization of Francisco Partners. It was founded in 2022 and is headquartered in Ann Arbor, Michigan, USA, with a workforce of 1001-5000 employees. Its website is https://www.merative.com.
