[Remote] Sr. Site Reliability Engineer (AI Platforms)
Note: This is a remote position open to candidates in the USA.

Optomi, in partnership with a premier client in the financial services industry, is seeking a Site Reliability Engineer to establish and scale reliability practices for AI-powered applications and services in production. This role will drive production readiness, observability, incident management, and automation while partnering closely with engineering teams to ensure highly available, resilient systems.

Responsibilities
- Define and enforce production readiness standards for AI services and agent-based applications prior to deployment
- Establish and manage SLIs, SLOs, and error budgets, including burn-rate monitoring and alerting
- Ensure services have appropriate runbooks, rollback procedures, monitoring, and on-call ownership
- Track reliability metrics and enforce operational standards across engineering teams
- Instrument AI services and agent pipelines using structured JSON logging, custom metrics, and distributed tracing
- Build dashboards and alerting for service health, latency, error rates, dependency performance, and agent execution metrics
- Identify and address observability gaps unique to AI systems, including context limitations, model timeouts, tool invocation failures, and partial task execution
- Develop monitoring strategies that surface reliability risks before production impact occurs
- Build and maintain automation that supports production readiness reviews, incident analysis, SLO monitoring, and reliability validation
- Develop tooling and workflows that automate operational checks and reliability enforcement
- Maintain reliability standards, operational documentation, runbooks, and service ownership mappings
- Continuously evolve reliability controls as new failure patterns emerge across AI-powered systems
- Lead incident response and post-incident review efforts for production services
- Perform root cause analysis and drive remediation efforts through completion
- Identify recurring failure patterns and implement systemic reliability improvements
- Support on-call operations and validate escalation processes for critical services
- Review application architectures, infrastructure designs, and code changes through a reliability lens
- Evaluate resiliency patterns such as retries, circuit breakers, health checks, graceful degradation, and rollback strategies
- Partner with engineering teams to address reliability risks before production deployment

Skills
- 4+ years of experience in Site Reliability Engineering, Platform Engineering, DevOps, or Production Operations
- Hands-on experience managing production services and reliability programs
- Strong understanding of SLI/SLO frameworks, error budgets, and operational excellence practices
- Experience building monitoring, alerting, and observability solutions using platforms such as Datadog, Dynatrace, New Relic, Grafana, or similar
- Strong scripting or programming experience with Python, TypeScript, or comparable languages
- Experience with distributed systems observability, including structured logging, metrics, and tracing
- Experience supporting AI/ML, automation, or data-driven platforms in production
- Strong background leading incident response and post-incident review processes
- Experience integrating operational workflows with ticketing and documentation platforms
- Experience working within regulated or highly available production environments

Company Overview
Optomi is an IT staffing firm that serves its consultants, clients, and employees through its consultant-focused approach. It was founded in 2012 and is headquartered in Roswell, Georgia, USA, with a workforce of 501-1000 employees. Its website is http://www.optomi.com/.

Company H1B Sponsorship
Optomi has a track record of offering H1B sponsorships, with 7 in 2025, 6 in 2024, 2 in 2023, 5 in 2022, 8 in 2021, and 7 in 2020. Please note that this does not guarantee sponsorship for this specific role.