[Remote] Executive Director, AI Infrastructure & Platform Engineering

Remote Full-time
Note: This is a remote position open to candidates in the USA.

CVS Health is dedicated to shaping a more connected and compassionate health experience. They are seeking an Executive Director for AI Infrastructure & Platform Engineering, responsible for leading the development and operational excellence of their AI compute platform, ensuring high availability and reliability for frontier AI workloads.

Responsibilities

- Define and execute the long-range vision and strategy for AI infrastructure and platform engineering, with availability (>99.99%), reliability, and platform performance as the primary measures of success
- Recruit, hire, develop, and retain a high-performing engineering organization spanning infrastructure, network, platform reliability, observability, security, 24/7 operations, change and release management, and FinOps
- Establish clear ownership, accountability, and performance expectations across all functional teams; foster a culture of operational excellence, engineering rigor, and continuous improvement
- Provide executive-level communication to senior leadership on platform status, milestones, risk posture, and strategic initiatives
- Own the physical layer of the AI compute environment: GPU compute, storage, network fabric, capacity planning, and hardware lifecycle accountability
- Direct bare-metal Kubernetes and OpenShift operations, including cluster administration, GPU quota governance, infrastructure-as-code adoption, and availability baseline enforcement
- Govern high-performance network fabric operations: RoCE v2, spine-leaf topology, lossless Ethernet tuning, congestion management, and segmentation
- Establish and enforce operational baselines across every layer of the stack (hardware, fabric, platform, and workload), with deviations detected, escalated, and resolved within defined SLAs
- Direct Innovation POD strategy to develop self-healing and autonomous capabilities that proactively prevent service degradation before it impacts availability
- Build and sustain a high-performing 24/7 operations model designed for sustainable, predictable coverage with no mandatory overtime and measurable team health and retention
- Drive end-to-end observability across the physical and platform layers, with continuous feedback loops connecting monitoring data to incident response, change decisions, and improvement cycles
- Oversee change management so every modification is risk-assessed, monitored during rollout, and baseline-validated post-deployment
- Ensure configuration consistency and drift detection across all platform components to prevent baseline degradation over time
- Lead GPU FinOps governance (utilization optimization, tenant quota enforcement, and cost reduction) in partnership with the Finance organization
- Empower the Security SRE Lead to maintain a world-class security posture across the infrastructure and platform layers, with robust compliance to frameworks including HIPAA and NIST AI RMF
- Govern access controls, audit logging, vulnerability management, and network segmentation across the AI compute environment
- Lead the operational transition from program-launch staffing to permanent CVS-owned operations, governing phased handoffs, competency validation, and milestone sign-offs to ensure minimal disruption to platform availability and business operations
- Establish and lead the long-term operating model by institutionalizing key technical, architectural, and delivery leadership capabilities into permanent CVS roles, ensuring the organization is fully self-sustaining at program close
- Own vendor relationships, contract performance, and accountability across the hardware, networking, platform, and managed-services stack
- Manage budget ownership for the AI infrastructure and platform engineering organization, including capital planning and operational expense governance

Skills

- 10+ years of engineering leadership experience, with substantial time directly owning physical infrastructure at data center scale, including hardware lifecycle, capacity planning, and facility coordination (power, cooling, rack-and-stack execution)
- Hands-on production ownership of bare-metal Kubernetes or OpenShift. Managed cloud services (EKS, GKE, AKS) alone do not substitute for the practitioner expertise this role requires
- Fluency with high-speed cluster fabrics (RoCE v2, InfiniBand, EVPN-VXLAN, or carrier-grade equivalent) and the operational discipline these fabrics require (PFC, ECN, lossless tuning, congestion management)
- 5+ years leading multiple technical teams simultaneously, including 24/7 operations organizations, with measurable team health, retention, and performance outcomes
- Proven success establishing and enforcing operational baselines, SLO / SLI / error-budget frameworks, and observability-driven continuous improvement in physical-infrastructure-anchored environments
- Hardware lifecycle, vendor accountability, and facility coordination experience, including capacity planning, RMA management, and multi-vendor escalation
- Experience leading operational transitions or organizational build-outs at scale, with business continuity and minimal disruption as non-negotiables
- Executive-level stakeholder communication, vendor negotiation, and budget ownership
- Bachelor's degree in Computer Science, Computer Engineering, Electrical Engineering, or related technical field
- Hands-on experience with Cisco UCS, NVIDIA HGX / DGX / Blackwell systems, and VAST or comparable distributed NVMe storage
- Direct experience operating GPU clusters of 32 or more GPUs in production environments, including HPC, AI training, research computing, or comparable workloads
- NVIDIA AI Enterprise, NVIDIA Run:AI, NVIDIA Base Command Manager, or comparable GPU orchestration platform experience
- Healthcare or other regulated-industry background (HIPAA, NIST AI RMF, SOX, FedRAMP, ITAR)
- Chaos engineering and AI-driven operations experience, such as predictive alerting and automated remediation patterns
- Background in innovation programs, POD structures, or centers of excellence

Benefits

- This position is eligible for a CVS Health bonus, commission or short-term incentive program in addition to the base pay range listed above.
- This position also includes an award target in the company's equity award program.
- This full-time position is eligible for a comprehensive benefits package designed to support the physical, emotional, and financial well-being of colleagues and their families.
- The benefits for this position include medical, dental, and vision coverage, paid time off, retirement savings options, wellness programs, and other resources, based on eligibility.

Company Overview

CVS Health is a health solutions company that provides integrated healthcare services to its members. It was founded in 1963 and is headquartered in Woonsocket, Rhode Island, USA, with a workforce of more than 10,000 employees. Its website is https://www.cvshealth.com/.

