[Remote] Senior Machine Learning Engineer - Agentic AI
Note: This is a remote position open to candidates in the USA.

The University of Texas MD Anderson Cancer Center is a leading institution in cancer care and research, seeking a Senior Machine Learning Engineer - Agentic AI. This role focuses on designing and operating enterprise-scale agentic AI platform capabilities to ensure the safe and governed deployment of AI systems within healthcare environments.

Responsibilities
- Lead the design, evolution, and operation of the enterprise agentic AI platform in collaboration with enterprise architects and platform ML engineers
- Build platform components that enable interoperability between first-party and third-party agents, including identity, state, memory, tool access, orchestration, auditability, and policy enforcement
- Define and document standardized integration patterns connecting agents with enterprise business systems, data platforms, APIs, and health IT systems
- Provide reusable platform services, reference implementations, and SDKs that reduce risk and accelerate delivery for applied teams
- Design and operate validation and de-risking frameworks, including simulation, sandboxing, shadow execution, canary releases, and continuous behavior monitoring
- Establish and enforce platform standards for agent development, including interfaces, execution contracts, evaluation hooks, safety constraints, and observability requirements
- Participate in platform governance, release coordination, and incident response, supporting investigation and remediation of agent-related failures
- Implement platform safeguards such as fallback mechanisms, rollback strategies, approval gates, rate limiting, audit trails, and kill-switch capabilities
- Partner with software engineering, security, IT, and health IT stakeholders to deploy agentic AI capabilities in secure enterprise environments
- Support responsible AI practices through traceability of prompts, policies, tools, models, and agent actions, and through documentation of known failure modes and limitations

Skills
- Bachelor's degree in Computer Science, Software Engineering, Data Science, Physics, Math & Statistics, or another related engineering discipline
- Five years of experience in machine learning engineering, data science, data engineering, and/or software engineering
- At least 5 years of industry experience in data science
- 3+ years as a Senior ML Engineer focused on agentic AI systems
- Experience building AI or ML platforms that serve multiple downstream teams and production workloads
- Strong proficiency in Python and integration of modern ML frameworks (e.g., PyTorch) with large language models and agent systems
- Hands-on experience with agentic AI frameworks such as LangGraph, LangChain, AutoGen, CrewAI, Semantic Kernel, or equivalent
- Working knowledge of agentic AI protocols and interoperability standards (e.g., MCP, agent-to-agent communication, structured tool invocation)
- Experience implementing planner-executor loops, hierarchical agents, and multi-agent coordination patterns
- Familiarity with workflow orchestration tools (Airflow, Prefect, Temporal) and distributed execution frameworks (Ray or equivalent)
- Experience deploying containerized AI platforms using Kubernetes in enterprise cloud environments with lineage, auditability, and controlled promotion to production
- Ability to reason at the systems and platform level, balancing safety, performance, flexibility, and usability
- Experience designing quantitative evaluation strategies for agentic systems, including success rates, latency, cost, recovery behavior, and safety metrics
- Strong understanding of enterprise data governance, security, and privacy requirements, including healthcare and health IT considerations
- Ability to identify systemic risks stemming from agent autonomy, non-determinism, tool access, and multi-agent interactions
- Experience analyzing failure modes caused by prompt drift, model updates, tool changes, and cross-system dependencies
- Collaborate effectively with architects, applied MLEs, data scientists, software engineers, and IT partners
- Produce clear documentation covering platform architecture, APIs, integration patterns, validation frameworks, and operational runbooks
- Communicate platform capabilities, risks, and limitations to leadership and partner teams
- Contribute to internal standards and shared practices that improve safety, scalability, and consistency of agentic AI development
- Provide hands-on technical guidance, mentorship, and troubleshooting support to platform adopters
- Present technical and non-technical concepts clearly in meetings and institutional forums
- Master's degree or PhD with a concentration in science, engineering, or a related field
- Experience designing, deploying, and maintaining agentic AI systems that operate autonomously and collaboratively across distributed environments
- Experience in monitoring and troubleshooting autonomous agents post-deployment, including performance degradation, clinical incidents, model updates, or corrective actions
- Experience raising the technical bar for team members, such as establishing reproducibility practices, review standards, or shared patterns
- Experience technically evaluating third-party agentic AI platforms within clinical workflows

Benefits
- Paid medical benefits
- Paid time off (PTO)
- Strong retirement plans
- Tuition benefits
- Educational opportunities
- Individual and team recognition
- Referral Bonus Available?

Company Overview
The University of Texas MD Anderson Cancer Center is one of the world's most respected centers devoted exclusively to cancer patient care, research, education and prevention. It was founded in 1941 and is headquartered in Houston, Texas, USA, with a workforce of more than 10,000 employees. Its website is https://www.mdanderson.org/.