[Remote] Senior Site Reliability Engineer
Note: This is a remote position open to candidates in the USA.

Mozilla's Thunderbird is a trusted open-source email application, and the team is seeking a Senior Site Reliability Engineer to establish and maintain the infrastructure that users depend on. The role involves designing and developing CI/CD systems, diagnosing production incidents, and implementing improvements for system reliability.

Responsibilities
- Operate and evolve our EKS-based Kubernetes platform, supporting service migrations, platform improvements, and reliability initiatives
- Design and develop CI/CD systems supporting websites, services, and Thunderbird desktop releases, contributing to pipeline reliability and OIDC-based authentication across GitHub Actions workflows
- Write and maintain infrastructure in Pulumi and/or Terraform/OpenTofu across multiple AWS accounts
- Operate and evolve our observability stack (VictoriaMetrics, VictoriaLogs, Grafana, Vector) and partner with engineering teams to incorporate instrumentation and monitoring into service design
- Apply security-conscious infrastructure practices, including least-privilege IAM, secrets management via AWS Secrets Manager and External Secrets Operator, and network segmentation
- Diagnose and debug production incidents; drive root-cause analysis and post-incident improvements to prevent recurring problems
- Participate in on-call rotation and collaborate with SDEs and fellow SREs to ship, maintain, and monitor new builds and support service onboarding
- Contribute to runbooks, architecture documentation, and team processes

Skills
- 7+ years of experience in infrastructure, platform engineering, or site reliability roles, including hands-on production Kubernetes experience in workload operations, troubleshooting, and cluster management
- Hands-on experience with infrastructure-as-code on AWS using Terraform, OpenTofu, or Pulumi
- Security awareness in day-to-day infrastructure work: identity, least privilege, secrets hygiene, and network controls
- Demonstrated ownership mindset with the ability to proactively identify issues, drive work to completion, and communicate risks early
- Excellent async written communication skills; comfortable working with a geographically distributed team
- Ability to collaborate effectively with software engineers and non-engineering stakeholders to improve platform reliability and operational efficiency
- Ability to learn, evaluate, and responsibly use emerging technologies, including AI-enabled tools, to improve work processes
- Experience with GitOps workflows (ArgoCD or Flux)
- Familiarity with Keycloak or similar identity platforms (OIDC, SAML, federation)
- Knowledge of email protocols and/or experience operating email infrastructure (SMTP, IMAP)
- Prior work in or alongside an open-source community
- French, German, Japanese, or other language proficiency in addition to English

Benefits
- Fully remote work & schedule flexibility
- Company-provided laptop
- Annual bonus program
- Monthly remote work stipend
- Annual professional development stipend
- Industry conferences
- Company all-hands and team gatherings
- 24 days PTO per year (prorated)
- Your birthday
- Year-end company shutdown
- 9 wellbeing days
- Public holidays
- Other paid leave
- Quarterly wellbeing stipend for personal / family activities
- 401(k) / RRSP contributions
- Health, dental, & vision insurance
- Disability insurance
- Life insurance
- Employee assistance program
- Paid parental leave
- Paid sick days

Company Overview
Mozilla provides internet solutions and offers Firefox, Thunderbird, and Raindrop. It was founded in 1998 and is headquartered in Mountain View, California, USA, with a workforce of 501-1000 employees. Its website is https://www.mozilla.org.