Junior Cloud Automation Engineer

Remote, Full-time
Senior AI Site Reliability Engineer (AI SRE):

OpenKyber is hiring a Senior AI Site Reliability Engineer to lead the reliability, scalability, and performance of our production AI/ML platform. This role is deeply technical and hands-on, with end-to-end ownership of stability for mission-critical model serving, data pipelines, and GPU-intensive workloads. You will architect resilient systems, drive automation, and set reliability standards for OpenKyber's AI products.
Responsibilities:
• Own SLOs/SLAs for availability, latency, performance, and cost across AI services
• Architect and operate highly available, fault-tolerant AI/ML infrastructure
• Lead incident response, deep-dive troubleshooting, root cause analysis (RCA), and postmortems
• Deploy, monitor, and scale ML models and real-time inference services
• Manage the model lifecycle (training, validation, deployment, rollback)
• Detect and mitigate model drift, data skew, and inference degradation
• Build observability for model accuracy, data quality, pipelines, and system health
• Implement logging, tracing, and alerting for AI workloads
• Automate CI/CD and MLOps pipelines; manage IaC (Terraform, CloudFormation)
• Optimize cloud compute (GPU/CPU) for performance and cost efficiency
• Ensure secure handling of data, models, and APIs, and meet compliance requirements

Must Have Skills:
• 7+ years in SRE, DevOps, or Platform Engineering
• Proven experience running production AI/ML systems at scale
• Strong Python; Go/Java a plus
• Deep expertise with Linux, Docker, Kubernetes
• Cloud experience with AWS, Google Cloud Platform, or Azure
• Strong understanding of model serving, inference pipelines, data pipelines, feature stores
• Experience with GPU workloads and performance tuning
• Advanced troubleshooting across data, model, and infrastructure layers
• Observability tools: Prometheus, Grafana, Datadog, OpenTelemetry
• ML monitoring (model metrics, drift detection, inference health)
• CI/CD, MLOps, IaC (Terraform, CloudFormation)

Nice to Have:
• Experience with Kubeflow, MLflow, SageMaker, Vertex AI
• Background in ML or data science
• Experience with real-time, high-throughput inference systems
• Exposure to AI governance, explainability, or responsible AI

Success Indicators:
• AI services consistently exceed reliability and performance targets
• Incidents decrease through strong operational rigor and automation
• Models are deployed safely, quickly, and with confidence
• Engineering teams rely on the platform and tooling you build

For applications and inquiries, contact: [email protected]
