Site Reliability Engineer (SRE) - Azure | DevSecOps | IaC | Governance | Observability

Remote | Full-time
Description

We are seeking a Site Reliability Engineer (SRE) who will drive stability, reliability, and performance across our Azure and GCP-based platforms.
This role blends operational excellence, proactive incident management, and strong collaboration with DevOps, Cloud, and Security teams.

The ideal candidate will have hands-on experience with multi-cloud environments (Azure and GCP), IaC (Terraform/Ansible), CI/CD (Jenkins/GitHub Actions), and modern observability and AI-Ops systems. The engineer will also contribute to governance, cost optimization, and automation strategies that reduce toil and prevent issues before they occur. A key aspect of this role is the ability to perform deep-dive troubleshooting of application performance and errors by analyzing logs and traces in platforms like Grafana and Datadog.

This position includes rotational 24/7 support coverage and requires strong ownership in managing major incidents, RCA processes, and continuous service improvements.

Key Responsibilities

Reliability & Incident Management
• Serve as a key member of the 24/7 on-call rotation, responding to and managing incidents across production and pre-production environments.
• Lead incident bridges, coordinate root cause analysis (RCA), and ensure post-incident reviews drive systemic improvements.
• Maintain clear communication with cross-functional teams and leadership during major incidents.

Monitoring, AI-Ops, Alerts & Prevention
• Build, tune, and maintain observability dashboards (Azure Monitor, GCP Operations Suite, Prometheus, Grafana, Datadog, Log Analytics).
• Perform deep-dive troubleshooting of application and service-level issues using distributed tracing and log analysis (Grafana, Datadog) to pinpoint root causes beyond infrastructure.
• Define SLOs, SLIs, and error budgets to proactively identify and mitigate reliability risks before customer impact.
• Integrate AI-Ops tools for anomaly detection, predictive alerting, and automated incident correlation.
• Continuously enhance alert quality, reduce false positives, and automate runbooks for faster recovery.
• Analyze trends to prevent recurring issues and support teams in resilience engineering.

Requirements

Required Skills & Experience
• 5+ years in Site Reliability, DevOps, Cloud Operations, or customer support roles.
• Demonstrated experience in application-level troubleshooting by analyzing logs and traces to identify bugs, performance bottlenecks, and error conditions.
• Expertise in Azure and GCP cloud operations and distributed system reliability.
• Understanding of Terraform, Ansible, and CI/CD pipelines (Jenkins, GitHub Actions).
• Experience with observability and AI-Ops tools (Azure Monitor, GCP Operations Suite, Grafana, Prometheus, Datadog, etc.).
• Solid grasp of incident management frameworks (P1-P3 handling, RCA, PIRs, on-call rotations).
• Excellent analytical, troubleshooting, and communication skills.

Desired Behaviours
• Proactive Prevention: Identifies and resolves risks before they escalate into incidents.
• AI-Driven Mindset: Applies AI and automation to improve reliability and reduce human intervention.
• Accountability: Owns service reliability and communicates with clarity.
• Collaboration: Works seamlessly with platform, DevOps, and product teams.
• Efficiency: Focuses on automation to reduce manual effort and improve MTTR.
• Continuous Improvement: Learns from failures, iterates processes, and enhances documentation.

The pay range for this opportunity is from $129,000 to $143,000 + performance-related bonus + benefits. This range represents the anticipated low and high end of the salary for this position. This role is also eligible to receive an annual bonus that aligns with individual and company performance. Actual salaries will vary and are based on factors such as a candidate's qualifications, skills, and competencies.
