[Remote] Senior Site Reliability Engineer
Note: This is a remote position open to candidates in the USA. HavocAI is a leader in collaborative autonomy, focused on solving complex human problems through advanced technology. The company is seeking a Senior Site Reliability Engineer to ensure the availability, performance, and resilience of mission-critical services while collaborating with various teams to improve operational maturity and reliability standards.
Responsibilities
• Design and evolve reliability architecture for distributed and cloud-hosted systems
• Define and implement SRE best practices, including SLIs, SLOs, error budgets, and capacity planning
• Partner with platform and application teams to design systems for reliability, scalability, and operability
• Identify and mitigate systemic reliability risks across infrastructure, applications, services, and data pipelines
• Establish reliability patterns that support autonomy, simulation, and mission-critical cloud workloads
• Lead incident response processes, including on-call rotations, escalation paths, and post-incident reviews
• Conduct root cause analysis for complex production incidents and drive long-term corrective actions
• Improve operational readiness through runbooks, automation, resilience testing, and production-readiness reviews
• Reduce operational toil through tooling, automation, and process improvements
• Help build a culture of ownership, accountability, and continuous improvement across production systems
• Design, implement, and maintain observability systems for metrics, logging, tracing, alerting, and service health
• Ensure services and data pipelines are observable, debuggable, and performant in production
• Drive performance analysis and tuning across infrastructure, application, and service layers
• Improve alert quality, reduce noise, and ensure operational signals are actionable
• Partner with engineering teams to define meaningful reliability and performance metrics
• Build automation to improve system reliability, deployment safety, and recovery processes
• Partner with DevOps and Cloud Platform teams on CI/CD reliability, rollout strategies, and safe deployment patterns
• Support and improve Kubernetes-based environments and containerized workloads
• Contribute to infrastructure-as-code practices and platform automation
• Help define operational standards for cloud infrastructure, deployment workflows, and production services
• Collaborate with security teams to ensure secure and resilient system design
• Participate in disaster recovery planning, backup strategy, and resilience testing
• Maintain strong operational practices around access control, secrets management, change management, and production access
• Support secure operations for systems that may serve defense, autonomy, or mission-sensitive use cases
Skills
• 7+ years of experience in SRE, infrastructure engineering, systems engineering, or related roles
• Strong experience operating large-scale distributed production systems
• Deep understanding of Linux systems, networking, cloud infrastructure, and distributed systems fundamentals
• Hands-on experience with Kubernetes and container orchestration
• Programming or scripting experience in Go, Python, or similar languages
• Experience designing and operating observability systems for production environments
• Proven ability to lead incident response and drive reliability improvements
• Strong communication skills and ability to collaborate across engineering teams
• Ability to operate calmly and effectively under pressure
• Must be a U.S. Citizen and eligible to obtain a U.S. Government security clearance if required
• Experience supporting autonomy, robotics, simulation, real-time systems, or data-intensive platforms
• Familiarity with AWS and large-scale cloud infrastructure
• Experience with chaos engineering, fault injection, or resilience testing
• Knowledge of CI/CD systems and progressive delivery practices
• Experience working in high-reliability, safety-critical, defense, or mission-critical environments
• Experience with Infrastructure as Code tools such as Terraform or Pulumi
• Experience with Prometheus, Grafana, OpenTelemetry, Datadog, ELK/OpenSearch, or similar observability tools
Benefits
• 100% Employer-Paid Health, Dental, and Vision Insurance for you and your family
• Life Insurance (Employer Paid)
• Ability to participate in the company's 401(k) program (with matching)
• Unlimited PTO policy with an enforced 2-week minimum
• Equity Package
• Work / Home Office Stipend
• Global Entry
• 16-Week Paid Parental Leave
• Monthly Health and Wellness Stipend