[Remote] Senior Site Reliability Engineer
Note: This is a remote position open to candidates in the USA.

HavocAI is a leader in collaborative autonomy, focused on solving complex human problems through advanced technology. The company is seeking a Senior Site Reliability Engineer to ensure the availability, performance, and resilience of mission-critical services while collaborating with various teams to improve operational maturity and reliability standards.

Responsibilities
- Design and evolve reliability architecture for distributed and cloud-hosted systems
- Define and implement SRE best practices, including SLIs, SLOs, error budgets, and capacity planning
- Partner with platform and application teams to design systems for reliability, scalability, and operability
- Identify and mitigate systemic reliability risks across infrastructure, applications, services, and data pipelines
- Establish reliability patterns that support autonomy, simulation, and mission-critical cloud workloads
- Lead incident response processes, including on-call rotations, escalation paths, and post-incident reviews
- Conduct root cause analysis for complex production incidents and drive long-term corrective actions
- Improve operational readiness through runbooks, automation, resilience testing, and production-readiness reviews
- Reduce operational toil through tooling, automation, and process improvements
- Help build a culture of ownership, accountability, and continuous improvement across production systems
- Design, implement, and maintain observability systems for metrics, logging, tracing, alerting, and service health
- Ensure services and data pipelines are observable, debuggable, and performant in production
- Drive performance analysis and tuning across infrastructure, application, and service layers
- Improve alert quality, reduce noise, and ensure operational signals are actionable
- Partner with engineering teams to define meaningful reliability and performance metrics
- Build automation to improve system reliability, deployment safety, and recovery processes
- Partner with DevOps and Cloud Platform teams on CI/CD reliability, rollout strategies, and safe deployment patterns
- Support and improve Kubernetes-based environments and containerized workloads
- Contribute to infrastructure-as-code practices and platform automation
- Help define operational standards for cloud infrastructure, deployment workflows, and production services
- Collaborate with security teams to ensure secure and resilient system design
- Participate in disaster recovery planning, backup strategy, and resilience testing
- Maintain strong operational practices around access control, secrets management, change management, and production access
- Support secure operations for systems that may serve defense, autonomy, or mission-sensitive use cases

Skills
- 7+ years of experience in SRE, infrastructure engineering, systems engineering, or related roles
- Strong experience operating large-scale distributed production systems
- Deep understanding of Linux systems, networking, cloud infrastructure, and distributed systems fundamentals
- Hands-on experience with Kubernetes and container orchestration
- Programming or scripting experience in Go, Python, or similar languages
- Experience designing and operating observability systems for production environments
- Proven ability to lead incident response and drive reliability improvements
- Strong communication skills and ability to collaborate across engineering teams
- Ability to operate calmly and effectively under pressure
- Must be a U.S. citizen and eligible to obtain a U.S. Government security clearance if required
- Experience supporting autonomy, robotics, simulation, real-time systems, or data-intensive platforms
- Familiarity with AWS and large-scale cloud infrastructure
- Experience with chaos engineering, fault injection, or resilience testing
- Knowledge of CI/CD systems and progressive delivery practices
- Experience working in high-reliability, safety-critical, defense, or mission-critical environments
- Experience with Infrastructure as Code tools such as Terraform or Pulumi
- Experience with Prometheus, Grafana, OpenTelemetry, Datadog, ELK/OpenSearch, or similar observability tools

Benefits
- 100% employer-paid health, dental, and vision insurance for you and your family
- Life insurance (employer paid)
- Ability to participate in the company's 401(k) program (matching)
- Unlimited PTO policy with an enforced 2-week minimum
- Equity package
- Work / home office stipend
- Global Entry
- 16-week paid parental leave
- Monthly health and wellness stipend

Company Overview
Havoc is the leader in all-domain collaborative autonomy. It was founded in 2024 and is headquartered in Providence, Rhode Island, USA, with a workforce of 51-200 employees. Its website is https://havocai.com/.