[Remote] Principal Site Reliability Engineer - AI Infrastructure Operations

Remote Full-time
Note: This is a remote role open to candidates in the USA. Nscale is a GPU cloud provider focused on AI, offering high-performance infrastructure for AI start-ups and large enterprises. They are seeking a Principal Site Reliability Engineer to lead reliability strategy, design foundational systems, and drive operational excellence across their AI Infrastructure Operations team.

Responsibilities

- Owning and evolving the long-term reliability strategy for Nscale’s AI and HPC infrastructure
- Designing and leading the development of large-scale control-plane systems, automation frameworks, and operational tooling
- Defining reliability standards, SLO frameworks, and operational best practices used across multiple teams
- Acting as a senior technical escalation point during critical incidents, guiding resolution and ensuring systemic fixes
- Identifying structural reliability risks and driving cross-functional initiatives to address them at the architectural level
- Partnering with Engineering, Network Operations, and Fleet Operations leadership to influence platform design and operational maturity
- Mentoring senior and mid-level engineers, raising the overall quality and effectiveness of SRE practices
- Driving measurable improvements in availability, MTTR, cost efficiency, and operational scalability

Skills

- 10+ years of experience in Site Reliability Engineering, Systems Engineering, or Software Engineering roles operating complex, large-scale infrastructure
- Expert-level software engineering skills, with a strong track record of building production-grade automation and systems
- Deep expertise in Linux, networking, and distributed systems design at scale
- Extensive experience debugging and resolving failures across hardware, OS, networking, and application layers
- Proven ability to lead technical initiatives across teams without direct authority
- Strong systems-thinking mindset, with the ability to balance reliability, velocity, and cost
- Deep hands-on experience with AI or HPC platforms, including GPUs, high-speed interconnects (InfiniBand/RDMA), and workload schedulers (e.g. SLURM)
- Experience designing observability systems for high-cardinality, high-throughput environments
- Familiarity with Kubernetes at scale and hybrid or bare-metal cloud architectures
- A history of driving step-change improvements in reliability, scalability, or operational efficiency

Benefits

- Highly competitive package (base + equity) with reviews every 12 months.
- In addition to base salary, this role may be eligible for bonus, equity, and/or commission programs.
- Nscale may offer a competitive benefits package including medical, dental, vision, flexible paid time off, parental leave, and retirement plan participation.
- Human-First Flexibility: We treat you as humans first. 🫶🏽 Our flexible workplace trusts Nscalers to deliver, giving you the autonomy to shape your day around life's moments.
- Join our thriving remote-first team. Geography is no barrier to impact or connection. We build seamless virtual collaboration, empowering you, wherever you work.

Company Overview

Nscale builds AI data centers and provides GPU cloud infrastructure that companies use to train, run, and scale large AI models. It was founded in 2024 and is headquartered in London, England, GBR, with a workforce of 201-500 employees. Its website is https://www.nscale.com.
