[Remote] AI Platform Reliability Engineer
Note: This is a remote position open to candidates in the USA.

Oracle Health is seeking an AI Platform Reliability Engineer to ensure our AI agent platform and AI-enabled analytics workflows are reliable, observable, measurable, and safe in production. This role will focus on the operational foundation for production AI systems and support data reliability use cases.

Responsibilities
- Build and maintain observability, logging, tracing, and monitoring for AI agents, agent tools, and AI-enabled analytics workflows
- Implement release, rollout, rollback, and versioning controls for prompts, models, tools, and configurations
- Design and support production evaluation practices to detect regressions, silent failures, quality drift, and performance issues
- Contribute to data monitoring and reliability workflows, including detection of stopped processing, data gaps, freshness issues, schema drift, and anomalies
- Support incident response, triage, root-cause analysis, and operational reporting for AI and data reliability issues
- Partner with architects and AI engineers to ensure systems are production-ready, measurable, and maintainable
- Implement latency, throughput, and cost monitoring controls for AI-enabled systems
- Help enforce operational safeguards, auditability, and controlled deployment practices for enterprise AI platforms

Skills
- 6 to 10+ years of experience
- Ability to read, write, and speak English
- Experience in observability, release safety, and operational tooling
- Experience with monitoring, tracing, evaluation in production, rollback controls, alerting, versioning, runtime diagnostics, and quality safeguards
- Experience in data reliability use cases such as detection of stopped processing, data gaps, freshness issues, schema drift, and anomaly conditions
- Experience in incident response, triage, root-cause analysis, and operational reporting for AI and data reliability issues
- Ability to partner with architects and AI engineers to ensure systems are production-ready, measurable, and maintainable
- Experience in implementing latency, throughput, and cost monitoring controls for AI-enabled systems
- Ability to enforce operational safeguards, auditability, and controlled deployment practices for enterprise AI platforms

Benefits
- May be eligible for bonus and equity
- Medical, dental, and vision insurance, including expert medical opinion
- Short term disability and long term disability
- Life insurance and AD&D
- Supplemental life insurance (Employee/Spouse/Child)
- Health care and dependent care Flexible Spending Accounts
- Pre-tax commuter and parking benefits
- 401(k) Savings and Investment Plan with company match
- Paid time off: Flexible Vacation is provided to all eligible employees assigned to a salaried (non-overtime eligible) position. Accrued Vacation is provided to all other employees eligible for vacation benefits. For employees working at least 35 hours per week, the vacation accrual rate is 13 days annually for the first three years of employment and 18 days annually for subsequent years of employment. Vacation accrual is prorated for employees working between 20 and 34 hours per week. Employees working fewer than 20 hours per week are not eligible for vacation.
- 11 paid holidays
- Paid sick leave: 72 hours of paid sick leave upon date of hire. The balance refreshes each calendar year, and any unused balance carries over each year up to a maximum cap of 112 hours.
- Paid parental leave
- Adoption assistance
- Employee Stock Purchase Plan
- Financial planning and group legal
- Voluntary benefits including auto, homeowner, and pet insurance

Company Overview
Oracle is an integrated cloud application and platform services company that sells a range of enterprise information technology solutions. It was founded in 1977 and is headquartered in Austin, Texas, USA, with a workforce of 10,001+ employees. Its website is https://www.oracle.com/.