[Remote] Staff Machine Learning Systems Engineer (MLOps)

Remote, Full-time
Note: This is a remote position open to candidates in the USA.

Hims & Hers is the leading health and wellness platform, on a mission to help the world feel great through the power of better health. The company is seeking a Staff Machine Learning Systems Engineer to design, build, and operate the production infrastructure that powers AI across the company, focusing on critical systems that support AI teams in a regulated healthcare environment.

Responsibilities

- Own and scale the AI compute and deployment platform
- Own and evolve our containerized application deployment platform and related systems for AI workloads, encompassing general process and job orchestration (e.g. Kubernetes): cluster operations, node lifecycle, autoscaling (Karpenter), storage (EBS CSI), and workload isolation across staging and production
- Build and maintain GitOps-based deployment pipelines (Helm/Kustomize overlays, environment promotion) that let teams ship AI services safely and repeatably
- Design ephemeral/preview environments, feature-branched deployments, and nightly release pipelines so teams can validate AI changes in production-like conditions before release
- Drive efficiency and cost management across compute, autoscaling, and inference infrastructure
- Operate and scale inference infrastructure and a multi-provider LLM AI gateway (e.g. Bedrock, Vertex, and other providers), including credentials, rate limits, and failover
- Build reliable serving patterns for LLM-powered workflows: routing, grounding, tool execution, and context assembly at the platform level
- Create reusable infrastructure abstractions and contracts that standardize how AI services are deployed, configured, and consumed across the company
- Own the LLM/AI observability and tracing stack, provisioning and scaling systems like Langfuse, Datadog (dd-trace), OpenTelemetry tracing (OTLP), and the underlying datastores (e.g. ClickHouse), so AI behavior is auditable and debuggable in production
- Build analytics and monitoring pipelines that surface latency, error, quality, and regression signals to engineering and clinical stakeholders
- Define SLOs, alerting, on-call runbooks, and incident response for AI infrastructure; lead troubleshooting and continuously raise platform reliability
- Own and improve the monorepo build system and CI/CD pipelines for AI workloads, including eval workflows, Docker image builds, automated PR checks and convention enforcement, and cross-platform test execution
- Own shared infrastructure tooling, CLIs, and IaC modules (Terraform, Scalr) that AI and product engineers use daily
- Identify and eliminate platform bottlenecks (reducing CI/CD cycle times, build latency, and deployment friction) to improve developer velocity across the Applied AI organization
- Build IAM, OIDC, and secrets management as first-class infrastructure: scoped, least-privilege roles, write-only secret rotation, and cross-account access audits
- Encode security-by-default, scope boundaries, and access controls into the platform so AI services are HIPAA-compliant and privacy-first
- Partner with clinical, legal, security, and data platform teams (including Databricks/Unity Catalog access governance) to enforce compliant, auditable data access
- Drive multi-quarter infrastructure initiatives, from cluster and deployment architecture to inference platform, GPU compute strategy, and observability evolution
- Write and lead technical design documents and design reviews, define infrastructure standards and development-workflow conventions, and contribute to technical governance across AI engineering
- Mentor engineers on reliability engineering, infrastructure-as-code, and MLOps best practices, and bridge the gap between prototypes and production-grade systems

Skills

- 8+ years of professional experience in infrastructure, platform, DevOps, or SRE engineering, with at least 3 years focused on ML/AI systems in production
- Deep, hands-on experience with Kubernetes (ideally EKS) and the cloud-native ecosystem: autoscaling, GitOps, Helm/Kustomize, operating clusters at scale, and general process/job orchestration
- Strong infrastructure-as-code skills (Terraform) and experience designing secure cloud architectures: IAM, OIDC, secrets management, and least-privilege access
- Strong proficiency in Python, with experience building production infrastructure tooling, CLIs, and data/observability pipelines
- 2+ years of experience operating LLM-based systems in production (LLMOps): inference routing, serving, tracing, and the reliability patterns needed to run them at scale
- Hands-on experience with observability/tracing stacks (Datadog, OpenTelemetry, Langfuse, or equivalent) and metrics/log/trace pipelines
- Experience designing and maintaining CI/CD pipelines, build systems, and developer tooling for fast-moving engineering teams
- A systems-and-operations mindset: you think about failure modes, SLOs, observability, security, and long-term maintainability before shipping
- Experience writing and leading technical design documents (TDDs/RFCs) for infrastructure-scale initiatives
- Strong collaboration skills across engineering, ML, product, security, and clinical teams
- A deep appreciation for safety, privacy, and security, ideally with experience in a regulated domain such as healthcare, fintech, or life sciences
- Experience with AWS (EKS, Bedrock, S3, CloudFront, IAM) and multi-cloud (GCP/Vertex AI) inference routing
- Experience with Databricks (MLflow, Unity Catalog, Spark, Delta) and data platform access governance
- Experience provisioning LLM observability infrastructure (Langfuse, ClickHouse, OpenTelemetry/OTLP tracing, LogFire) and LLM behavior monitoring
- Experience with Karpenter, cluster autoscaling, and cost optimization for ML compute
- Experience with monorepo build systems (Pants, Bazel) and large-scale CI/CD
- Experience building automated PR-review / convention-enforcement pipelines and developer-workflow standards
- Familiarity with Vertex AI Agent Builder, Vertex AI Model Registry, or GCP managed AI/ML services as a stretch growth area
- Contributions to open-source infrastructure, IaC modules, SDKs, or developer tooling projects

Benefits

- Competitive salary & equity compensation for full-time roles
- Unlimited PTO, company holidays, and quarterly mental health days
- Comprehensive health benefits including medical, dental & vision, and parental leave
- Employee Stock Purchase Program (ESPP)
- 401k benefits with employer matching contribution
- Offsite team retreats

Company Overview

Hims & Hers Health, Inc. (better known as Hims & Hers) is a multi-specialty telehealth platform building a virtual front door to the healthcare system. It was founded in 2017 and is headquartered in San Francisco, California, USA, with a workforce of 501-1000 employees. Its website is https://www.hims.com.
