[Remote] Principal Software Engineer, DGX Cloud Production Engineering
Note: This is a remote job open to candidates in the USA.

NVIDIA is a leader in groundbreaking developments in Artificial Intelligence, High-Performance Computing, and Visualization. They are looking for Principal Software Engineers to help shape the technical direction for production engineering, Kubernetes-based operations, automation, and reliability across large-scale GPU clusters.

Responsibilities
- Define and execute the technical strategy for DGX Cloud cluster operations, building the automation, GitOps, and Day 2 reliability needed to operate large-scale GPU clusters across NVIDIA Cloud Partners (NCPs) and on-prem environments
- Lead design and implementation of systems for cluster lifecycle, validation, repair, upgrades, observability, and readiness
- Establish patterns for Kubernetes-based GPU cluster operations across partner and on-prem environments
- Identify and eliminate operational toil through software, APIs, automation, and agent-assisted workflows
- Set technical standards for production readiness, SLOs, incident response, handoff gates, and operational acceptance
- Mentor engineers and influence platform, infrastructure, storage, networking, security, and workload teams

Skills
- 15+ years of experience building and operating large-scale distributed systems or cloud infrastructure
- Deep experience with Kubernetes, Linux, infrastructure automation, and production operations
- Strong programming experience in Go, Python, or similar
- Proven ability to lead complex cross-org technical initiatives
- Experience designing reliable systems with clear SLOs, observability, incident response, and automation
- BS/MS in Computer Science or equivalent experience
- Experience with GPU clusters, AI/ML infrastructure, Kubernetes operators, GitOps, BMaaS/VMaaS, managed Kubernetes, or multi-cloud fleet operations
- Experience building internal platforms, control planes, lifecycle automation, or production readiness frameworks
- Track record of turning operational pain into reusable software, APIs, and engineering standards

Benefits
- Equity
- Benefits

Company Overview
NVIDIA is a computing platform company operating at the intersection of graphics, HPC, and AI. It was founded in 1993 and is headquartered in Santa Clara, California, USA, with a workforce of 10001+ employees. Its website is https://www.nvidia.com.

Company H1B Sponsorship
NVIDIA has a track record of offering H1B sponsorships, with 448 in 2026, 1872 in 2025, 1354 in 2024, 976 in 2023, 835 in 2022, 601 in 2021, and 529 in 2020. Please note that this does not guarantee sponsorship for this specific role.