[Remote] Principal Network Architect - AI Infrastructure
Note: This is a remote position open to candidates in the USA.

Nscale is a GPU cloud company designed for AI, providing high-performance infrastructure for AI startups and enterprises. They are seeking a Principal Network Architect to lead the development and operational excellence of their global AI networking infrastructure, focusing on RDMA and InfiniBand technologies to enhance AI training outcomes.

Responsibilities
- Own the technical direction and operational lifecycle management of Nscale's high-performance RDMA network fabrics
- Define long-term architecture, reliability strategy, and operational standards for AI interconnect networks
- Lead availability and performance improvement initiatives across globally distributed GPU clusters
- Act as a technical authority (SME) across networking, influencing platform-wide decisions
- Support the design, build, and evolution of large-scale InfiniBand and RoCE fabrics
- Drive deep debugging and resolution of complex cross-layer issues (hardware, firmware, kernel, distributed workloads)
- Lead incident response and postmortems, ensuring systemic fixes and long-term improvements
- Define and enforce standards across congestion control and traffic engineering, routing (BGP, ECMP, fabric-level routing strategies), firmware lifecycle and change management, and network observability and telemetry
- Develop and scale automation frameworks for network provisioning, validation, and operations
- Build tooling to support high-reliability, low-touch network operations at scale
- Improve operational efficiency across hundreds of thousands of endpoints and high-throughput links
- Lead complex technical initiatives across Network, SRE, Compute, and Platform teams
- Serve as technical lead on critical programs, coordinating engineers and stakeholders
- Influence product and infrastructure roadmaps based on operational insights and customer needs
- Mentor senior engineers and raise the bar for technical rigor and execution

Skills
- 10+ years of experience in network engineering in hyperscale, AI, or HPC environments
- Deep expertise in RDMA, InfiniBand, and/or large-scale RoCE fabrics
- Strong understanding of RDMA internals and performance tuning
- Strong understanding of congestion control and fabric failure modes
- Strong understanding of distributed system communication patterns
- Expert-level knowledge of data center networking protocols (BGP, OSPF, ECMP)
- Proven ability to debug multi-layer issues across network, system, and application layers
- Strong programming/scripting skills for automation (Python, Go, etc.)
- Experience designing high-scale, highly available network systems
- Demonstrated ability to lead complex technical programs without direct authority
- Experience acting as a senior escalation point for critical production issues
- Strong ability to drive cross-team alignment and execution
- Systems-level thinking balancing performance, reliability, scalability, and cost
- Experience with NVIDIA / Mellanox networking platforms
- Familiarity with distributed AI training frameworks and GPU communication patterns
- Experience building network observability systems at scale
- Background influencing infrastructure strategy in high-growth environments

Benefits
- Highly competitive package (base + equity) with reviews every 12 months.
- Join the fastest-growing tech startup: your chance to push boundaries, collaborate with brilliant minds, and make your mark on cutting-edge AI.
- Expect a dynamic progression plan tailored to your ambitions. Grow by trying new things, leading, challenging the status quo, and owning your impact, always with our full support.
- Human-First Flexibility: We treat you as humans first. Our flexible workplace trusts Nscalers to deliver, giving you the autonomy to shape your day around life's moments.
- Join our thriving remote-first team. Geography is no barrier to impact or connection. We build seamless virtual collaboration, empowering you wherever you work.

Company Overview
Nscale builds AI data centers and provides GPU cloud infrastructure that companies use to train, run, and scale large AI models. It was founded in 2024 and is headquartered in London, England, with a workforce of 201-500 employees. Its website is https://www.nscale.com.