[Remote] Machine Learning Infrastructure Engineer

Remote Full-time
Note: This is a remote job open to candidates in the USA.

TRM Labs is a company dedicated to building a safer world through AI-powered intelligence solutions. The Senior Software Engineer, ML Infrastructure will design and operate scalable GPU-backed infrastructure that supports TRM's AI systems, collaborating with various teams to ensure effective model deployment and optimization.

Responsibilities

Design and operate GPU cluster infrastructure: Build and manage GPU-backed environments in cloud settings, including orchestration, autoscaling, resource isolation, and workload management across multiple concurrent models and users.
Optimize high-throughput inference: Implement and tune serving systems that maximize token throughput, batching efficiency, GPU occupancy, and cost effectiveness across interactive and batch workloads.
Enable distributed inference strategies: Support and operationalize model parallelism, tensor parallelism, and other distributed serving patterns for large-scale models.
Implement model optimization and compilation workflows: Integrate and optimize acceleration stacks such as TensorRT, ONNX Runtime, vLLM, FlashAttention, and related tooling to improve performance and reduce inference cost.
Schedule heterogeneous workloads: Design systems that manage multiple models, multiple users, and mixed workload types across heterogeneous accelerators (e.g., NVIDIA GPUs, Inferentia), ensuring predictable performance under varying demand.
Build observability into ML infrastructure: Instrument systems to measure GPU load, memory utilization, batching efficiency, queue depth, and token throughput, and use that data to continuously improve performance and reliability.
Partner across engineering teams: Work closely with infrastructure, ML, and product teams to ensure models transition smoothly from experimentation to production-grade, highly available services.

Skills

Bachelor's degree (or equivalent) in Computer Science or a related field.
5+ years of experience building and operating distributed systems or infrastructure in production environments.
Experience deploying and operating ML/LLM inference workloads on GPU clusters in cloud environments (AWS and/or GCP).
Deep understanding of high-throughput inference systems, including batching strategies, token throughput optimization, and the trade-offs between latency, throughput, and cost.
Experience with one or more ML serving frameworks such as Triton Inference Server, vLLM, Ray Serve, ONNX Runtime, or HuggingFace Optimum.
Experience optimizing GPU load, memory efficiency, and performance bottlenecks in production systems.
Familiarity with distributed inference strategies, including model parallelism and tensor parallelism.
Experience working with Kubernetes or equivalent orchestration systems in cloud environments.
Adaptable. Goals can change fast. You anticipate and react quickly.
Autonomous. You own what you work on. You move fast and get things done.
Excellent communication. You communicate complex ideas effectively to both technical and non-technical audiences, verbally and in writing.
Collaborative. You work effectively in a cross-functional team and with people at all levels in an organization.
Familiarity with heterogeneous accelerators (e.g., Inferentia) is a plus.
CUDA familiarity and experience debugging GPU-related issues are a plus.

Company Overview

TRM Labs is a software company that offers blockchain, transaction monitoring, and analytics to help financial institutions and governments. It was founded in 2018 and is headquartered in San Francisco, California, USA, with a workforce of 201-500 employees. Its website is https://trmlabs.com.

Company H1B Sponsorship

TRM Labs has a track record of offering H1B sponsorships, with 2 in 2026, 1 in 2025, 4 in 2024, 3 in 2023, 3 in 2022, and 1 in 2021. Please note that this does not guarantee sponsorship for this specific role.
