Senior Site Reliability Engineer (Fleet Management)

Remote Full-time
Requirements
• Have 6+ years of experience in software development and operating distributed systems,
• Are proficient in Go, Python, or a similar language, with a strong commitment to code quality and testing practices (writing unit, integration, and E2E tests),
• Have deep experience using and extending containerization technologies, preferably Kubernetes,
• Have a solid understanding of Linux operating system internals and networking concepts (e.g., filesystems, TCP/IP, DNS, TLS),
• Possess a customer-focused mindset, treating internal developers as your primary users,
• Have strong operational ownership, including a track record of debugging complex production issues and driving them to resolution,
• Prefer automation over manual processes ("allergic to ops work"): we are a small team of software engineers with a strong bias toward building software solutions to eliminate toil,
• (Desirable) Experience designing and implementing secure, multi-tenant runtime environments from first principles,
• (Desirable) Proficiency with Kubernetes ecosystem tools such as Helm, Kustomize, Gatekeeper, and Kyverno, and with extension mechanisms such as CRDs/Operators, CRI, and CSI,
• (Desirable) Expertise in cloud infrastructure platforms, including AWS, GCP, or Azure,
• (Desirable) Proficiency in provisioning infrastructure using tools like Terraform, Crossplane, and AWS Controllers for Kubernetes (ACK),
• (Desirable) Advanced knowledge of Linux system internals and networking concepts specifically relevant to containers, such as namespaces and cgroups

What the job involves
• Platform Engineering is the department within SRE that is responsible for a range of critical infrastructure and operational functions that support the broader engineering organization,
• Among these are our multi-cloud-provider Kubernetes infrastructure, networking, load balancing (including our public-facing edge and internal service mesh), and observability and alerting systems,
• The Fleet Management team provides the core runtime environment that empowers our developers to build and ship products to delight our customers,
• We manage the end-to-end lifecycle of our Kubernetes fleet, alongside the critical components that ensure cluster reliability and security (e.g., CoreDNS, cert-manager, and Gatekeeper),
• As our infrastructure scales to support new use cases and products, we are spearheading a migration from Terraform-based Infrastructure as Code (IaC) to an Operator-driven lifecycle management model,
• Contribute to developing and maintaining a scalable and secure runtime environment on top of Kubernetes that supports product needs across MongoDB,
• Provide internal support for our Kubernetes ecosystem, partnering with engineering teams to help them solve domain-specific problems,
• Participate in a 24/7 on-call rotation to resolve critical issues,
• Prioritize blameless post-mortems and dedicate engineering time to systemic fixes, ensuring you aren’t paged for the same issue twice

