Infrastructure Engineer (InfiniBand / NCCL)

Remote Full-time
We are seeking an Infrastructure Engineer with a focus on InfiniBand/NCCL to join our Infrastructure Engineering team. Our engineers design and build automation, tooling, and systems that bridge the gap between physical infrastructure and the platforms that power large-scale AI/ML and HPC workloads.

This role combines the breadth of a core infrastructure engineer with a specialty in high-performance networking and GPU communication. You'll help ensure our InfiniBand fabric and NCCL stack are tuned, reliable, and efficient at scale, supporting some of the world's largest GPU clusters.

This is a fully remote position, although candidates must be based in the continental United States. Unfortunately, we are unable to provide sponsorship for this role.

Responsibilities
• Design, build, and maintain automation, APIs, and frameworks to manage physical infrastructure at scale.
• Develop and extend systems for server lifecycle management.
• Implement and tune InfiniBand networking and NCCL configurations for multi-GPU communication.
• Collaborate with Network, Platform, and Infrastructure Operations teams to support new infrastructure rollouts.
• Diagnose and improve performance across GPU, NVSwitch, PCIe, and InfiniBand layers.
• Write clear design documents and technical documentation to capture best practices.

Qualifications
• 8+ years of professional experience in infrastructure engineering, HPC, or related domains.
• Strong experience with Linux in production environments.
• Proficiency in Python or similar languages for automation.
• Deep understanding of InfiniBand networking (CX7 HCAs, fabrics, partitioning, GPUDirect).
• Familiarity with NCCL, CUDA, and GPU topology optimization.
• Knowledge of containerization and orchestration concepts.
• Strong written and verbal communication skills.

Ideal Experiences
• Experience with Dell PowerEdge XE9680 or other GPU-dense servers.
• Prior work with NVIDIA H100s, NVSwitch, and large-scale NCCL testing.
• Familiarity with Mellanox OFED, UCX, and Redfish/iDRAC for management.
• Broader experience across infrastructure areas (storage, virtualization, networking).

Culture
• Enjoy collaborating with a motivated, execution-focused team.
• Comfortable operating with autonomy while aligning to company objectives.
• Value precision, documentation, and knowledge-sharing.
• Excited to grow as both a domain specialist (InfiniBand/NCCL) and a generalist infrastructure engineer.

Voltage Park is an equal opportunity employer and makes employment decisions on the basis of merit. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, protected veteran status, or any other characteristic protected under federal, state, or local law. If you require an accommodation during the job application process, please notify your recruiter.
