[Remote] Machine Learning Infrastructure Engineer
Note: This is a remote position open to candidates in the USA.

TRM Labs is a company dedicated to building a safer world through AI-powered intelligence solutions. The Senior Software Engineer, ML Infrastructure will design and operate scalable GPU-backed infrastructure that supports TRM's AI systems, collaborating with various teams to ensure effective model deployment and optimization.

Responsibilities

- Design and operate GPU cluster infrastructure: Build and manage GPU-backed environments in cloud settings, including orchestration, autoscaling, resource isolation, and workload management across multiple concurrent models and users.
- Optimize high-throughput inference: Implement and tune serving systems that maximize token throughput, batching efficiency, GPU occupancy, and cost effectiveness across interactive and batch workloads.
- Enable distributed inference strategies: Support and operationalize model parallelism, tensor parallelism, and other distributed serving patterns for large-scale models.
- Implement model optimization and compilation workflows: Integrate and optimize acceleration stacks such as TensorRT, ONNX Runtime, vLLM, FlashAttention, and related tooling to improve performance and reduce inference cost.
- Schedule heterogeneous workloads: Design systems that manage multiple models, multiple users, and mixed workload types across heterogeneous accelerators (e.g., NVIDIA GPUs, Inferentia), ensuring predictable performance under varying demand.
- Build observability into ML infrastructure: Instrument systems to measure GPU load, memory utilization, batching efficiency, queue depth, and token throughput, and use data to continuously improve performance and reliability.
- Partner across engineering teams: Work closely with infrastructure, ML, and product teams to ensure models transition smoothly from experimentation to production-grade, highly available services.

Skills

- Bachelor's degree (or equivalent) in Computer Science or related field
- 5+ years of experience building and operating distributed systems or infrastructure in production environments
- Experience deploying and operating ML/LLM inference workloads on GPU clusters in cloud environments (AWS and/or GCP)
- Deep understanding of high-throughput inference systems, including batching strategies, token throughput optimization, and the trade-offs between latency, throughput, and cost
- Experience with one or more ML serving frameworks such as Triton Inference Server, vLLM, Ray Serve, ONNX Runtime, or HuggingFace Optimum
- Experience optimizing GPU load, memory efficiency, and performance bottlenecks in production systems
- Familiarity with distributed inference strategies including model parallelism and tensor parallelism
- Experience working with Kubernetes or equivalent orchestration systems in cloud environments
- Adaptable. Goals can change fast. You anticipate and react quickly.
- Autonomous. You own what you work on. You move fast and get things done.
- Excellent communication. You communicate complex ideas effectively to both technical and non-technical audiences, verbally and in writing.
- Collaborative. You work effectively in a cross-functional team and with people at all levels in an organization.
- Familiarity with heterogeneous accelerators (e.g., Inferentia) is a plus
- CUDA familiarity and experience debugging GPU-related issues are a plus

Company Overview

TRM Labs is a software company that offers blockchain, transaction monitoring, and analytics to help financial institutions and governments. It was founded in 2018 and is headquartered in San Francisco, California, USA, with a workforce of 201-500 employees. Its website is https://trmlabs.com.

Company H1B Sponsorship

TRM Labs has a track record of offering H1B sponsorships, with 2 in 2026, 1 in 2025, 4 in 2024, 3 in 2023, 3 in 2022, and 1 in 2021. Please note that this does not guarantee sponsorship for this specific role.