[Remote] Cloud Site Reliability Engineer

Remote Full-time
Note: This is a remote job open to candidates in the USA.

SambaNova is at the forefront of AI computing, specializing in generative AI platforms for enterprise and government organizations. They are seeking a Cloud Site Reliability Engineer to ensure the reliability, performance, and scalability of their AI Inferencing Service, focusing on maintaining exceptional uptime and efficient resource utilization.

Responsibilities
- Take shared ownership of the production inferencing service, including its availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning across multiple regions
- Participate in a balanced on-call rotation to provide 24/7 support for the service
- Lead the response to incidents affecting the inferencing service, driving blameless post-mortems and implementing corrective actions to prevent recurrence
- Develop and maintain advanced monitoring, alerting, and dashboarding (using tools like Prometheus, Grafana, Datadog) to gain deep insights into service health, model performance (e.g., latency, throughput, error rates), and accelerator utilization
- Proactively identify and eliminate performance bottlenecks
- Design and implement auto-scaling policies to handle variable inference loads cost-effectively
- Manage and evolve our cloud infrastructure (on AWS, GCP, and/or Azure along with on-prem) using tools like Terraform and Ansible, ensuring it is secure, repeatable, and scalable
- Champion automation by building and improving CI/CD pipelines for the seamless and safe deployment of new model versions and service updates
- Forecast infrastructure needs based on product roadmaps and usage trends
- Work with finance and engineering teams to manage cloud costs and optimize spending
- Define, measure, and report on Service Level Objectives (SLOs) and Indicators (SLIs) for the inferencing platform, using data to drive prioritization and reliability investments

Skills
- Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience
- 3-5+ years of experience in a Site Reliability Engineer, DevOps, or related role supporting a large-scale, customer-facing service in a public cloud environment (AWS, GCP, Azure)
- Strong programming/scripting skills in languages like Python, Go, or Java
- Proven experience with containerization and orchestration technologies (Docker, Kubernetes)
- Deep understanding of monitoring and observability principles and tools (e.g., Prometheus, Grafana, ELK Stack, Datadog)
- Solid experience with Infrastructure as Code (e.g., Terraform, CloudFormation)
- Familiarity with CI/CD principles and tools (e.g., Jenkins, GitHub Actions, ArgoCD)
- Excellent problem-solving skills and a systematic approach to troubleshooting complex distributed systems
- Experience in a hybrid environment bridging cloud and on-premise/data center infrastructure
- Direct experience supporting ML/AI inferencing services in production
- Familiarity with GPU-accelerated computing and optimizing workloads for NVIDIA GPUs for purposes of mapping to RDUs
- Knowledge of model serving frameworks like vLLM, SGLang or Ray
- Understanding of MLOps principles and practices
- Experience with managing and tuning databases (SQL or NoSQL) and caching systems (Redis, Memcached)
- Strong Linux/Unix system administration fundamentals

Benefits
- Equity
- Excellent benefits
- A flexible work environment
- 95% premium coverage for employee medical insurance
- 77% premium coverage for dependents
- Health Savings Account (HSA) with employer contribution
- Dental, Vision, Short/Long term Disability, Basic Life, Voluntary Life, and AD&D insurance plans
- Flexible Spending Account (FSA) options like Health Care, Limited Purpose, and Dependent Care
- A full subscription to Headspace
- Gympass+ membership with access to physical gyms
- One Medical membership
- Counseling services with an Employee Assistance Program

Company Overview
SambaNova is an AI hardware and software company that specializes in providing infrastructure for AI and machine learning applications. It was founded in 2017 and is headquartered in Palo Alto, California, USA, with a workforce of 201-500 employees. Its website is https://sambanova.ai.

Company H1B Sponsorship
SambaNova has a track record of offering H1B sponsorships, with 6 in 2026, 29 in 2025, 23 in 2024, 37 in 2023, 41 in 2022, 35 in 2021, 29 in 2020. Please note that this does not guarantee sponsorship for this specific role.
