[Remote] Staff Site Reliability Engineer
Note: This is a remote position open to candidates in the USA.

Thrive Market is an online, membership-based market focused on making healthy and sustainable living accessible. They are seeking a Staff Site Reliability Engineer to establish their SRE practice, define reliability metrics, and ensure system scalability during rapid growth.

Responsibilities
- Define, implement, and own Service Level Objectives (SLOs) and Service Level Indicators (SLIs) across critical platform services
- Build and maintain comprehensive monitoring, alerting, and observability systems using tools like Datadog, Prometheus, Grafana, or similar platforms
- Establish error budgets and use them to balance feature velocity with reliability investments
- Lead incident response efforts, conduct blameless postmortems, and drive systemic improvements that prevent recurrence
- Design and implement chaos engineering practices to proactively identify failure modes before they impact members
- Architect and optimize our Kubernetes-based container orchestration platform for reliability, performance, and cost efficiency
- Support large infrastructure migrations, ensuring a smooth transition with minimal disruption to business operations
- Contribute to the evaluation and execution of potential platform migrations, with a focus on reliability planning and risk mitigation
- Design and implement automated deployment pipelines that enable rapid, error-free releases with feature flags and built-in rollback/roll-forward capabilities
- Develop and own disaster recovery plans, capacity planning models, and system hardening initiatives
- Collaborate closely with product engineering teams to help them scale their infrastructure in AWS and adopt SRE best practices
- Help establish SRE as a practice at Thrive Market, defining the team's charter, processes, and engagement model with product engineering teams
- Champion a culture of operational excellence, continuous improvement, and data-driven reliability decisions
- Create and maintain technical documentation covering architecture decisions, runbooks, incident response procedures, and operational playbooks
- Participate in weekly on-call rotations and help build sustainable on-call practices that avoid burnout
- Identify systemic problems and inefficiencies across the engineering organization and make strategic recommendations for improvement

Skills
- B.S. in Computer Science or equivalent professional experience
- 7+ years of hands-on experience in SRE, DevOps, or Infrastructure Engineering, with a proven track record of improving reliability at rapidly growing companies
- Deep expertise in Kubernetes (K8s), including cluster management, Helm charts, service meshes, and production-grade container orchestration
- Strong systems engineering background with advanced proficiency in Linux administration
- Advanced scripting and automation skills in Bash, Python, Golang, Ruby, or similar languages
- Extensive experience with core AWS services including EC2, ECS/EKS, S3, VPC, IAM, CloudWatch, Route 53, RDS, and Lambda
- Strong experience with Infrastructure as Code tools (Terraform, CloudFormation, Pulumi, or similar)
- Hands-on experience defining and implementing SLOs, SLIs, and error budgets in production environments
- Deep understanding of CI/CD pipelines and deployment strategies (blue-green, canary, rolling deployments)
- Expertise in monitoring and observability platforms (Datadog, Prometheus, Grafana, New Relic, or similar)
- Strong knowledge of web application infrastructure, networking, load balancing, and security best practices
- Excellent communication skills with the ability to lead incident response and facilitate blameless postmortems
- Experience with e-commerce platforms (Magento, Shopify, or comparable) and the unique reliability challenges they present at scale
- Experience with ConcourseCI, GitHub Actions (GHA), or similar deployment frameworks
- Experience with chaos engineering tools and practices (Gremlin, Litmus, Chaos Monkey, or similar)
- Familiarity with GitOps workflows (ArgoCD, Flux) and service mesh technologies (Istio, Linkerd)
- Experience building and managing cost-optimization strategies for cloud infrastructure
- Background in establishing SRE practices in organizations transitioning from traditional DevOps models
- Experience with configuration management tools (Ansible, Chef, Puppet, or similar)

Benefits
- Comprehensive health benefits (medical, dental, vision, life and disability)
- Competitive salary (DOE) + equity
- 401(k) plan
- 9 Observed Holidays
- Flexible Paid Time Off
- Subsidized ClassPass Membership with access to fitness classes and wellness and beauty experiences
- Ability to work in our beautiful office in Playa Vista
- Free Thrive Market membership with exclusive employee discount
- Coverage for Life Coaching & Therapy Sessions on our holistic mental health and well-being platform

Company Overview
Thrive Market is a membership-based online company that offers natural and organic food products. It was founded in 2013 and is headquartered in Los Angeles, California, USA, with a workforce of 501-1000 employees. Its website is https://thrivemarket.com.