[Remote] Senior Site Reliability Engineer

Remote Full-time
Note: This is a remote position open to candidates in the USA.

Luma is dedicated to building multimodal AI to enhance human capabilities, relying on a robust GPU infrastructure. They are seeking a Senior Site Reliability Engineer to architect, maintain, and scale their infrastructure across on-prem and multi-vendor clouds, ensuring high availability and performance for their AI systems.

Responsibilities
- Architect for Reliability & Scale: Participate in critical re-architecture sessions to redesign our systems for higher efficiency and scale. You won't just maintain existing clusters; you will help define how our next-generation infrastructure operates.
- Own Multi-Cloud GPU Clusters: Take end-to-end ownership of our production clusters for training and inference across AWS and OCI, ensuring high availability and peak performance.
- Drive Security & Compliance: Assist in achieving and maintaining security certifications (SOC 2 Type 1 & 2, ISO standards) by implementing robust infrastructure security practices in a fast-moving AI startup environment.
- Deep Linux Performance Tuning: Use your mastery of Linux systems to troubleshoot and optimize performance at the OS and kernel level.
- Build Robust Automation: Write high-quality tools and automation in Python, Go, or Bash to manage, monitor, and heal our infrastructure without relying on heavy operational toil.
- Debug Complex Hardware/Software Failures: Serve as the final escalation point for the most challenging GPU, networking (InfiniBand/RDMA), and system-level issues, often collaborating directly with hardware vendors like NVIDIA.

Skills
- 5+ years of experience as an SRE, production engineer, or infrastructure engineer in a fast-paced, large-scale environment.
- Deep Linux Mastery: You possess deep, hands-on expertise in Linux, containerized systems, and debugging low-level system performance.
- Expert in Technologies: You have working experience with Terraform, Airflow, and Ray.
- Cloud Infrastructure Expert: You have strong experience with providers like AWS or OCI.
- Tenacious Troubleshooter: You thrive on solving complex, low-level problems where hardware and software intersect.
- Startup DNA: You are energetic and thrive in a less structured, fast-paced environment.
- Security-Minded: You possess a working knowledge of security best practices and familiarity with compliance frameworks, such as SOC 2 and ISO.
- Expert in High-Performance Networking: You have practical experience with InfiniBand, RDMA, or RoCE and understand how to optimize throughput for massive distributed training jobs.
- Deep expertise with GPU tooling for NVIDIA and AMD GPUs, like DCGM or ROCm.
- Experience managing large-scale GPU clusters for AI/ML workloads (training or inference).
- Familiarity with job management systems based on Kubernetes or orchestration frameworks like Ray.
- Deep expertise in data pipelines and infrastructure.

Company Overview
Luma AI's mission is to build Multimodal AGI: AI that can generate, understand, and operate in the physical world. It was founded in 2021 and is headquartered in Palo Alto, California, USA, with a workforce of 51-200 employees. Its website is https://lumalabs.ai.

Company H1B Sponsorship
Luma has a track record of offering H1B sponsorships, with 2 in 2026, 10 in 2025, and 3 in 2024. Please note that this does not guarantee sponsorship for this specific role.