[Remote] Sr Site Reliability Engineer
Note: This is a remote position open to candidates in the USA.

Commence is a company focused on data-centric transformation in healthcare, aiming to improve health outcomes through efficient processes. They are seeking a Senior Site Reliability Engineer to ensure the reliability and operational health of their healthcare data platform, collaborating with engineering teams and managing incident responses.

Responsibilities
- Design, implement, and own observability infrastructure including metrics, logging, tracing, and alerting across distributed systems
- Define and enforce SLOs, SLIs, and error budgets in partnership with product and engineering teams
- Lead incident response: triage, coordinate remediation, conduct blameless post-mortems, and drive systemic fixes
- Build and maintain CI/CD pipelines that support rapid, safe delivery of changes to production
- Collaborate with engineering teams on infrastructure changes; able to read, modify, and contribute to existing infrastructure-as-code (Terraform or CloudFormation)
- Design and operate highly available, fault-tolerant systems, including auto-scaling, failover, and disaster recovery strategies
- Reduce operational toil through automation; eliminate manual processes before they become habits
- Collaborate with software engineers to establish reliability-first design patterns and review architectures for operational risk
- Manage Kubernetes or container orchestration environments at scale
- Ensure systems meet compliance and security requirements, particularly those applicable to healthcare data (HIPAA, SOC 2)
- Provide technical mentorship and guidance to engineers across the organization on reliability practices
- Participate in on-call rotation with a commitment to continuously reducing the need for it

Skills
- 7+ years of experience in SRE, platform engineering, or DevOps roles
- Exceptional problem-solving under pressure: demonstrated track record of diagnosing complex, high-stakes system failures and building durable solutions
- Deep hands-on experience with AWS services including EC2, EKS/ECS, Lambda, RDS, S3, CloudWatch, and related tooling
- Familiarity with infrastructure-as-code (Terraform or CloudFormation); able to contribute to existing configurations
- Experience designing and operating distributed systems with strict availability and latency requirements
- Proficiency in at least one scripting or systems language (Python, Go, Bash, or similar) for automation and tooling
- Experience with container orchestration (Kubernetes, ECS) in production environments
- Expertise in observability tooling (OpenSearch, Prometheus/Grafana, or equivalent)
- Hands-on experience with CI/CD platforms (GitHub Actions, Jenkins, CircleCI, or similar)
- Proven ability to define and operationalize SLOs and error budgets
- Experience with relational and NoSQL databases: performance tuning, replication, and backup strategies
- Strong working knowledge of networking fundamentals: DNS, load balancing, VPCs, TLS
- Excellent communication skills; able to translate technical risk into business impact for non-engineering stakeholders
- AWS Certifications (Solutions Architect, DevOps Engineer, or SysOps Administrator)
- Experience in healthcare technology or other regulated industries (HIPAA, SOC 2, FedRAMP)
- Familiarity with chaos engineering practices and tooling
- Experience with data pipeline reliability (ETL/ELT workflows, streaming systems)
- Exposure to AI/ML infrastructure and the reliability challenges unique to model serving
- Familiarity with additional cloud platforms (Azure, Google Cloud)
- Contributions to open-source reliability or infrastructure tooling

Company Overview
Commence delivers an AI-driven healthcare data platform and clinical expertise that support analytics, decisions, and workflow improvement. It is headquartered in Virginia Beach, Virginia, USA, with a workforce of 501-1000 employees. Its website is https://commence.ai.