Junior Cloud Automation Engineer

Remote Full-time
Senior AI Site Reliability Engineer (AI SRE):

OpenKyber is hiring a Senior AI Site Reliability Engineer to lead the reliability, scalability, and performance of our production AI/ML platform. This role is deeply technical and hands-on, owning end-to-end stability for mission-critical model serving, data pipelines, and GPU-intensive workloads. You will architect resilient systems, drive automation, and set reliability standards for OpenKyber's AI products.
Responsibilities:
• Own SLOs/SLAs for availability, latency, performance, and cost across AI services
• Architect and operate highly available, fault-tolerant AI/ML infrastructure
• Lead incident response, deep-dive troubleshooting, RCA, and postmortems
• Deploy, monitor, and scale ML models and real-time inference services
• Manage model lifecycle (training, validation, deployment, rollback)
• Detect and mitigate model drift, data skew, and inference degradation
• Build observability for model accuracy, data quality, pipelines, and system health
• Implement logging, tracing, and alerting for AI workloads
• Automate CI/CD and MLOps pipelines; manage IaC (Terraform, CloudFormation)
• Optimize cloud compute (GPU/CPU) for performance and cost efficiency
• Ensure secure handling of data, models, APIs, and compliance requirements

Must-Have Skills:
• 7+ years in SRE, DevOps, or Platform Engineering
• Proven experience running production AI/ML systems at scale
• Strong Python; Go/Java a plus
• Deep expertise with Linux, Docker, Kubernetes
• Cloud experience with AWS, Google Cloud Platform, or Azure
• Strong understanding of model serving, inference pipelines, data pipelines, feature stores
• Experience with GPU workloads and performance tuning
• Advanced troubleshooting across data, model, and infrastructure layers
• Observability tools: Prometheus, Grafana, Datadog, OpenTelemetry
• ML monitoring (model metrics, drift detection, inference health)
• CI/CD, MLOps, IaC (Terraform, CloudFormation)

Nice to Have:
• Experience with Kubeflow, MLflow, SageMaker, Vertex AI
• Background in ML or data science
• Experience with real-time, high-throughput inference systems
• Exposure to AI governance, explainability, or responsible AI

Success Indicators:
• AI services consistently exceed reliability and performance targets
• Incidents decrease through strong operational rigor and automation
• Models are deployed safely, quickly, and with confidence
• Engineering teams rely on the platform and tooling you build

For applications and inquiries, contact: [email protected]
