[Remote] Senior Site Reliability Engineer

Remote Full-time

Note: This is a remote position open to candidates in the USA.

Hard Rock Digital is a team focused on becoming the best online sportsbook, casino, and social gaming company in the world. They are seeking a Senior Site Reliability Engineer who will maintain and improve the reliability, scalability, and performance of Java-based applications while pioneering AI-driven operations. The role involves designing and building AI workflows, managing observability tools, and collaborating with cross-functional teams to enhance system reliability.

Responsibilities
Ensure the availability, reliability, and performance of high-traffic Java-based applications in a distributed environment
Troubleshoot and resolve complex issues across production and non-production environments
Participate in pre- and post-deployment performance testing and monitoring to continuously improve application performance
Optimize Java application performance with a focus on JVM tuning, efficient resource utilization, and horizontal scaling
Deploy and manage the Grafana stack (Grafana, Prometheus, Loki, Mimir, Alloy) to deliver real-time monitoring, logging, and alerting
Implement and refine observability strategies that enhance visibility into application and infrastructure health
Create and maintain dashboards, alerts, and log queries for comprehensive system health monitoring
Integrate AI/ML models into the observability pipeline for anomaly detection, predictive alerting, and intelligent alert correlation and noise reduction
Design, build, and operate agentic AI workflows that automate operational tasks such as alert triage, root cause analysis, runbook execution, and incident summarization
Develop tool-calling LLM agents that interact with infrastructure APIs (Kubernetes, Grafana, Jira, Slack, PagerDuty) to execute diagnostic and remediation actions autonomously or with human-in-the-loop approval
Build and maintain MCP (Model Context Protocol) servers and integrations that expose internal systems as tool surfaces for AI agents
Evaluate, select, and operationalize LLM frameworks and orchestration platforms (e.g., LangChain, LangGraph, CrewAI, n8n, or custom solutions) for production-grade agentic systems
Implement guardrails, evaluation harnesses, and feedback loops to ensure AI agent outputs are accurate, safe, and continuously improving
Champion the adoption of AI-assisted development and operations practices across the SRE and broader engineering organization
Support the operations team's incident response efforts, conduct post-mortems, and identify root causes to prevent recurrence
Leverage AI tools to accelerate incident timelines, auto-generate post-mortem drafts, and surface patterns across historical incidents
Document and share lessons learned, contributing to a culture of continuous improvement
Identify repetitive operational workflows and engineer AI-augmented or fully automated replacements
Build self-service tools and chatbot interfaces that allow engineering teams to query system status, retrieve logs, and execute standard operating procedures through natural language
Measure and report on toil reduction metrics to quantify the impact of automation initiatives
Work closely with developers, architects, and data/ML engineers to design solutions that improve reliability and leverage AI capabilities
Collaborate with DevOps and NOC teams to support the application platform
Communicate SRE practices, AI/automation capabilities, and operational insights to technical and non-technical stakeholders
Provide feedback on application performance, potential improvements, and observability metrics

Skills
Degree in Computer Science or a related field, or equivalent professional experience
5+ years in SRE, DevOps, or similar infrastructure roles with experience managing large-scale, high-availability production systems
3+ years of hands-on experience managing production Kubernetes clusters, including a deep understanding of architecture, networking, storage, and security
Experience with cluster autoscaling (Karpenter), upgrades, and multi-cluster management
Proficiency with kubectl, Helm, Kubernetes operators, and container orchestration troubleshooting
Advanced expertise with the Grafana observability stack: dashboards, alerting, visualization, and Grafana Alloy for telemetry collection
Proficiency in PromQL and experience with Loki for log aggregation and analysis
Hands-on experience managing Java-based applications in distributed environments, including JVM tuning and optimization
Cloud platform expertise (AWS preferred; GCP or Azure also valued)
Familiarity with Infrastructure as Code tools such as Terraform/Terragrunt or Ansible
ArgoCD proficiency for GitOps workflows and continuous deployment
Strong scripting abilities in Python, Bash, or Go, with experience building CI/CD pipelines and deployment automation
Proven track record with on-call rotations, incident response, and root cause analysis
1+ years of practical experience building or operating AI/LLM-powered tools, agents, or workflows in a production or production-adjacent context
Demonstrated ability to design agentic systems that use tool calling, retrieval-augmented generation (RAG), or multi-step reasoning to accomplish operational tasks
Experience integrating LLM APIs (e.g., Anthropic Claude, OpenAI, or open-source models) into backend services or automation pipelines
Familiarity with at least one agentic orchestration framework or workflow engine (LangChain, LangGraph, CrewAI, n8n, Temporal, or equivalent)
Understanding of prompt engineering best practices, including structured outputs, system prompts, and few-shot examples
Familiarity with AI-assisted coding tools (Claude Code, Codex, Cursor) and their integration into engineering workflows
Experience building or consuming MCP (Model Context Protocol) servers to expose internal tools to AI agents
Awareness of AI safety, hallucination mitigation, and human-in-the-loop design patterns for autonomous systems
Hands-on experience with vector databases (Pinecone, Weaviate, pgvector) for RAG-based knowledge retrieval
Experience with LLM evaluation frameworks (e.g., Galileo, LangSmith, Braintrust) for monitoring agent quality in production
Contributions to open-source AI/ML or SRE tooling projects
Background in data engineering or ML pipelines that complements SRE responsibilities

Benefits
Competitive pay and benefits
Flexible vacation allowance
A hybrid / remote working environment
Startup culture backed by a secure, global brand

Company Overview
Hard Rock Digital is building the future of online sports betting and interactive gaming. It was founded in 2020 and is headquartered in Austin, Texas, USA, with a workforce of 501-1000 employees. Its website is https://www.hardrockdigital.com/.

Company H1B Sponsorship
Hard Rock Digital has a track record of offering H1B sponsorships, with 3 in 2025, 4 in 2024, 5 in 2022, and 1 in 2021. Please note that this does not guarantee sponsorship for this specific role.
