[Remote] Sr. Site Reliability Engineer (AI Platforms)

Remote Full-time
Note: This is a remote position open to candidates in the USA.

Optomi, in partnership with a premier client in the financial services industry, is seeking a Site Reliability Engineer to establish and scale reliability practices for AI-powered applications and services in production. This role will drive production readiness, observability, incident management, and automation while partnering closely with engineering teams to ensure highly available, resilient systems.

Responsibilities

- Define and enforce production readiness standards for AI services and agent-based applications prior to deployment
- Establish and manage SLIs, SLOs, and error budgets, including burn-rate monitoring and alerting
- Ensure services have appropriate runbooks, rollback procedures, monitoring, and on-call ownership
- Track reliability metrics and enforce operational standards across engineering teams
- Instrument AI services and agent pipelines using structured JSON logging, custom metrics, and distributed tracing
- Build dashboards and alerting for service health, latency, error rates, dependency performance, and agent execution metrics
- Identify and address observability gaps unique to AI systems, including context limitations, model timeouts, tool invocation failures, and partial task execution
- Develop monitoring strategies that surface reliability risks before production impact occurs
- Build and maintain automation that supports production readiness reviews, incident analysis, SLO monitoring, and reliability validation
- Develop tooling and workflows that automate operational checks and reliability enforcement
- Maintain reliability standards, operational documentation, runbooks, and service ownership mappings
- Continuously evolve reliability controls as new failure patterns emerge across AI-powered systems
- Lead incident response and post-incident review efforts for production services
- Perform root cause analysis and drive remediation efforts through completion
- Identify recurring failure patterns and implement systemic reliability improvements
- Support on-call operations and validate escalation processes for critical services
- Review application architectures, infrastructure designs, and code changes through a reliability lens
- Evaluate resiliency patterns such as retries, circuit breakers, health checks, graceful degradation, and rollback strategies
- Partner with engineering teams to address reliability risks before production deployment

Skills

- 4+ years of experience in Site Reliability Engineering, Platform Engineering, DevOps, or Production Operations
- Hands-on experience managing production services and reliability programs
- Strong understanding of SLI/SLO frameworks, error budgets, and operational excellence practices
- Experience building monitoring, alerting, and observability solutions using platforms such as Datadog, Dynatrace, New Relic, Grafana, or similar
- Strong scripting or programming experience with Python, TypeScript, or comparable languages
- Experience with distributed systems observability, including structured logging, metrics, and tracing
- Experience supporting AI/ML, automation, or data-driven platforms in production
- Strong background leading incident response and post-incident review processes
- Experience integrating operational workflows with ticketing and documentation platforms
- Experience working within regulated or highly available production environments

Company Overview

OPTOMI is an IT staffing firm that serves its consultants, clients, and employees through its consultant-focused approach. It was founded in 2012 and is headquartered in Roswell, Georgia, USA, with a workforce of 501-1000 employees. Its website is http://www.optomi.com/.

Company H1B Sponsorship

Optomi has a track record of offering H1B sponsorships, with 7 in 2025, 6 in 2024, 2 in 2023, 5 in 2022, 8 in 2021, and 7 in 2020. Please note that this does not guarantee sponsorship for this specific role.
