4 Remote NVIDIA Engineers

Remote Full-time
About the position

We are seeking a highly skilled AI Infrastructure & Kubernetes Platform Engineer with a proven track record in deploying and managing NVIDIA DGX-based AI clusters, orchestrating containerized AI workloads using Kubernetes, and ensuring secure, high-throughput operations across InfiniBand-powered networks. The ideal candidate will hold a combination of Kubernetes certifications (CKA, CKAD, CKS) and NVIDIA certifications (NCA-AIIO, NCP-AIO, NCP-AII, NCP-AIN), coupled with hands-on training in DGX, BlueField, and high-speed network operations. This position plays a key role in supporting AI/ML infrastructure at scale, enabling efficient training and inference for complex models, and integrating NVIDIA's cutting-edge compute, storage, and fabric solutions with modern DevOps practices.

Responsibilities
• Deploy and manage NVIDIA DGX BasePODs and SuperPODs for high-performance AI workloads.
• Oversee DGX system lifecycle operations including provisioning, monitoring, firmware upgrades, and capacity planning.
• Operate Base Command Manager to manage GPU clusters, schedule workloads, and integrate with MLOps tools.
• Perform DGX node health validation, NCCL interconnect testing, and NVLink topology verification following new deployments or hardware changes.
• Architect secure and scalable Kubernetes clusters optimized for GPU-accelerated workloads using NVIDIA GPU Operator.
• Leverage expertise from CKA/CKAD/CKS to develop, deploy, and secure AI applications on Kubernetes.
• Implement CI/CD pipelines and GitOps methodologies for deploying and managing ML workflows.
• Administer InfiniBand networks and BlueField DPUs using Unified Fabric Manager (UFM).
• Enable NVLink/NVSwitch performance across GPU nodes and tune fabric configurations for minimal latency and maximum throughput.
• Use BlueField for offloading storage, firewalling, and telemetry, enhancing AI workload security and performance.
• Apply best practices from the CKS certification to secure containerized AI environments.
• Configure runtime security, secrets management, network segmentation, and auditing using DPU-enhanced Kubernetes deployments.
• Support zero-trust architecture initiatives by enforcing workload identity, RBAC policies, and supply chain integrity across AI container images and model artifacts.
• Monitor GPU, CPU, and I/O performance using NVIDIA DCGM, Prometheus, Grafana, and Base Command APIs.
• Tune system performance and model training pipelines for cost-efficiency and throughput.
• Build and maintain operational runbooks, incident response playbooks, and SLA reporting dashboards covering GPU utilization, thermal thresholds, and fabric health.

Requirements
• NVIDIA certification is required; candidates without one will not be interviewed
• Kubernetes certifications (CKA, CKAD, CKS)
• NVIDIA certifications (NCA-AIIO, NCP-AIO, NCP-AII, NCP-AIN)
• Hands-on training in DGX, BlueField, and high-speed network operations
• Expertise with DGX System, BasePOD, and SuperPOD Administration
• Expertise with BlueField DPU Configuration & Operations
• Expertise with InfiniBand Fabric and UFM Management
• Expertise with Base Command Manager for workload orchestration

