[Remote] Staff Site Reliability Engineer

Remote Full-time
Note: This is a remote position open to candidates in the USA.

Thrive Market is an online, membership-based market focused on making healthy and sustainable living accessible. They are seeking a Staff Site Reliability Engineer to establish their SRE practice, define reliability metrics, and ensure system scalability during rapid growth.

Responsibilities
- Define, implement, and own Service Level Objectives (SLOs) and Service Level Indicators (SLIs) across critical platform services
- Build and maintain comprehensive monitoring, alerting, and observability systems using tools like Datadog, Prometheus, Grafana, or similar platforms
- Establish error budgets and use them to balance feature velocity with reliability investments
- Lead incident response efforts, conduct blameless postmortems, and drive systemic improvements that prevent recurrence
- Design and implement chaos engineering practices to proactively identify failure modes before they impact members
- Architect and optimize our Kubernetes-based container orchestration platform for reliability, performance, and cost efficiency
- Support large infrastructure migrations, ensuring a smooth transition with minimal disruption to business operations
- Contribute to the evaluation and execution of potential platform migrations, with a focus on reliability planning and risk mitigation
- Design and implement automated deployment pipelines that enable rapid, error-free releases with feature flags and built-in rollback/roll-forward capabilities
- Develop and own disaster recovery plans, capacity planning models, and system hardening initiatives
- Collaborate closely with product engineering teams to help them scale their infrastructure in AWS and adopt SRE best practices
- Help establish SRE as a practice at Thrive Market, defining the team's charter, processes, and engagement model with product engineering teams
- Champion a culture of operational excellence, continuous improvement, and data-driven reliability decisions
- Create and maintain technical documentation covering architecture decisions, runbooks, incident response procedures, and operational playbooks
- Participate in weekly on-call rotations and help build sustainable on-call practices that avoid burnout
- Identify systemic problems and inefficiencies across the engineering organization and make strategic recommendations for improvement

Skills
- B.S. in Computer Science or equivalent professional experience
- 7+ years of hands-on experience in SRE, DevOps, or Infrastructure Engineering, with a proven track record of improving reliability at rapidly growing companies
- Deep expertise in Kubernetes (K8s), including cluster management, Helm charts, service meshes, and production-grade container orchestration
- Strong systems engineering background with advanced proficiency in Linux administration
- Advanced scripting and automation skills in Bash, Python, Golang, Ruby, or similar languages
- Extensive experience with core AWS services including EC2, ECS/EKS, S3, VPC, IAM, CloudWatch, Route 53, RDS, and Lambda
- Strong experience with Infrastructure as Code tools (Terraform, CloudFormation, Pulumi, or similar)
- Hands-on experience defining and implementing SLOs, SLIs, and error budgets in production environments
- Deep understanding of CI/CD pipelines and deployment strategies (blue-green, canary, rolling deployments)
- Expertise in monitoring and observability platforms (Datadog, Prometheus, Grafana, New Relic, or similar)
- Strong knowledge of web application infrastructure, networking, load balancing, and security best practices
- Excellent communication skills with the ability to lead incident response and facilitate blameless postmortems
- Experience with e-commerce platforms (Magento, Shopify, or comparable) and the unique reliability challenges they present at scale
- Experience with ConcourseCI, GitHub Actions (GHA), or similar deployment frameworks
- Experience with chaos engineering tools and practices (Gremlin, Litmus, Chaos Monkey, or similar)
- Familiarity with GitOps workflows (ArgoCD, Flux) and service mesh technologies (Istio, Linkerd)
- Experience building and managing cost-optimization strategies for cloud infrastructure
- Background in establishing SRE practices in organizations transitioning from traditional DevOps models
- Experience with configuration management tools (Ansible, Chef, Puppet, or similar)

Benefits
- Comprehensive health benefits (medical, dental, vision, life and disability)
- Competitive salary (DOE) + equity
- 401k plan
- 9 Observed Holidays
- Flexible Paid Time Off
- Subsidized ClassPass Membership with access to fitness classes and wellness and beauty experiences
- Ability to work in our beautiful office in Playa Vista
- Free Thrive Market membership with exclusive employee discount
- Coverage for Life Coaching & Therapy Sessions on our holistic mental health and well-being platform

Company Overview
Thrive Market is a membership-based online company that offers natural and organic food products. It was founded in 2013 and is headquartered in Los Angeles, California, USA, with a workforce of 501-1000 employees. Its website is https://thrivemarket.com.

