Senior Site Reliability Engineer - AI Infrastructure Operations

Remote, Full-time

About Nscale

Nscale is the GPU cloud engineered for AI—purpose-built to deliver high-performance, cost-efficient infrastructure for AI-native startups and global enterprises. We enable organizations to accelerate innovation, reduce the complexity of AI development, and achieve meaningful business outcomes through scalable, sustainable compute.

Our culture is defined by ownership, accountability, and rapid innovation. We operate with urgency and transparency, and every team member contributes to building the infrastructure powering the future of AI.

The Opportunity

Nscale's AI Infrastructure Operations team supports one of the most demanding AI platforms in the industry. We are looking for a Senior Site Reliability Engineer to help design, build, and operate reliable, scalable infrastructure across our GPU cloud.

This role is focused on hands-on engineering, system reliability, and operational excellence. You will work across software, systems, and infrastructure to improve performance, automate operations, and ensure platform stability at scale.

What You'll Be Doing
• Design, build, and improve automation, tooling, and infrastructure systems supporting AI and HPC workloads
• Contribute to the development of control-plane systems and operational frameworks
• Define and implement SLOs, SLIs, and monitoring strategies to ensure system reliability
• Participate in incident response and root cause analysis, driving improvements to reduce recurrence
• Identify and address reliability and performance bottlenecks across systems
• Collaborate with Engineering, Network, and Fleet teams to improve system design and operational processes
• Drive improvements in availability, scalability, and operational efficiency
• Mentor junior engineers and contribute to a strong engineering and reliability culture

What You Bring
• 5–8+ years of experience in SRE, Systems Engineering, or Software Engineering in production environments
• Strong software engineering skills with experience building automation and infrastructure tooling
• Solid understanding of Linux systems, networking, and distributed systems
• Experience troubleshooting issues across infrastructure, OS, networking, and application layers
• Familiarity with monitoring, alerting, and observability tools
• Ability to balance reliability, performance, and delivery speed

Preferred Experience
• Experience with AI or HPC environments, including GPUs or high-performance systems
• Exposure to high-speed networking (InfiniBand/RDMA)
• Familiarity with Kubernetes, cloud platforms, or bare-metal environments
• Experience with observability systems in high-scale environments

For information on how Nscale handles candidate personal data, please see our Employee & Candidate Privacy Notice.
