[Remote] Infrastructure Software Engineer, Fleet & Automation

Remote, Full-time
Note: This is a remote position open to candidates in the USA.

Nscale is a GPU cloud company focused on AI infrastructure, providing high-performance solutions for AI development. As an Infrastructure Software Engineer for Fleet & Automation, you will ensure the performance and scalability of AI and High-Performance Computing environments by building and maintaining automation and control systems.

Responsibilities

- Perform technical architecture, roadmap, and implementation work for workflow automation systems, driving architecture decisions that balance automation complexity, reliability, and maintainability
- Identify and resolve performance and scalability issues
- Establish technology and product direction in collaboration with other tech leads, managers, and senior leadership
- Own end-to-end delivery of device provisioning, validation, testing, and remediation workflows at scale
- Design and build workflow orchestration systems for hardware lifecycle management, including GPU nodes and network switches
- Partner with Infrastructure, Platform, and SRE teams to translate operational needs into robust, scalable automation
- Establish engineering standards for reliability, observability, and operational excellence across all services
- Help set up engineering best practices in collaboration with the broader engineering team
- Build production-grade Python systems for hardware lifecycle automation, leveraging AI tools to accelerate delivery
- Assess the impact of new hardware product programs on the team's software stack, and explore AI-driven process improvement and automation
- Collaborate with cross-functional teams (product, design, operations, infrastructure) to build efficient, interoperable, and maintainable automated systems

Skills

- Bachelor's degree in Computer Science, Computer Engineering, a relevant technical field, or equivalent practical experience
- 5+ years of relevant experience building large-scale infrastructure applications, or similar experience
- Experience using languages such as C, C++, and Java, and scripting languages such as Python, for API design and unit testing
- Deep understanding of Linux operating systems and networking fundamentals (TCP/IP, BGP), and familiarity with configuration management tools (e.g., Ansible, Terraform)
- Experience building, running, and debugging large-scale infrastructure and stateful and stateless services for distributed systems or networks, and experience with compute technologies, storage, or hardware architecture
- Experience integrating with infrastructure tooling such as DCIMs, NetBox, OpenStack, and bare metal APIs (MAAS, Ironic, IPMI)
- Master's degree or PhD in Engineering, Computer Science, or a related technical field
- Experience designing, analyzing, and improving the efficiency, scalability, and performance of various system resources
- Direct experience with AI/HPC infrastructure, including NVIDIA GPUs, InfiniBand or high-speed Ethernet fabrics, and related management software (e.g., NCCL, SLURM)
- Experience with advanced observability and monitoring systems (Prometheus, Grafana, OpenTelemetry) for complex, high-cardinality telemetry data
- Familiarity with cloud-native technologies (Kubernetes, Docker) and infrastructure-as-code principles
- Demonstrated ability to integrate AI tools to optimize or redesign workflows and drive measurable impact (e.g., efficiency gains, quality improvements)
- Familiarity with SLO and metrics measurement, and with integrating logs, telemetry, and metrics into tools that improve the operator experience

Benefits

- Medical
- Dental
- Vision
- Flexible paid time off
- Parental leave
- Retirement plan participation

Company Overview

Nscale builds AI data centers and provides GPU cloud infrastructure that companies use to train, run, and scale large AI models. It was founded in 2024 and is headquartered in London, England, GBR, with a workforce of 201-500 employees. Its website is https://www.nscale.com.

