[Remote] Lead Site Reliability Engineer
Note: This is a remote position open to candidates in the USA.

Gradle Technologies is an AI-native company focused on transforming software development through its Develocity platform. The company is seeking a Lead Site Reliability Engineer to define the SRE vision, set operational standards, and ensure reliability across production services while mentoring a growing team.

Responsibilities
- Operate and maintain all Develocity instances and supporting services in production
- Define and evolve SRE standards, practices, and operating models, including on-call, incident response, postmortems, and SLOs
- Participate in a follow-the-sun on-call rotation, acting as a technical escalation point for complex or high-severity incidents
- Lead incident response and blameless retrospectives, ensuring learnings result in measurable reliability improvements
- Set reliability priorities using risk, customer impact, business goals, SLOs, and error budgets
- Identify systemic reliability risks and continuously evolve Develocity's SaaS operations as the platform and customer base grow
- Lead and influence architectural and design reviews to ensure reliability, scalability, and operability
- Drive automation across deployment, upgrades, monitoring, self-healing, recovery, and operational workflows
- Build and maintain comprehensive observability for all managed services, including logging, metrics, tracing, and alerting
- Own disaster recovery, backups, and business continuity planning and execution
- Partner with engineering leadership to balance feature delivery with reliability and operational excellence
- Mentor and coach SREs, supporting technical growth and strong operational practices
- Help onboard new SREs and contribute to hiring by defining and assessing SRE excellence at Develocity
- Communicate clearly with customers during incidents and maintenance windows
- Optimize performance, resource utilization, and operational costs

Skills
- 7+ years in SRE, DevOps, or an equivalent role operating production services at scale
- Experience leading reliability initiatives across multiple teams or services
- Demonstrated ability to influence technical direction without direct authority
- Experience designing and operating systems with SLOs and error budgets, and exercising strong judgment in balancing reliability, velocity, and cost
- Strong Kubernetes experience in production environments
- Cloud infrastructure expertise, preferably AWS (EKS, RDS, S3, EC2)
- Proficiency with observability tools (Prometheus, Grafana) and Infrastructure as Code (Terraform)
- Track record of incident management and response in a 24/7 on-call environment
- Scripting proficiency (Python, Bash) for automation
- Strong written and verbal English communication skills
- Experience as a founding or early SRE establishing practices in a growing SaaS organization
- Familiarity with Develocity
- JVM language experience (Java, Kotlin)
- Experience with customer-facing and executive-level incident communications

Benefits
- A ground-floor role in a new SRE team: you'll shape how we do things, not inherit someone else's decisions.
- Real ownership of production systems used by engineers at companies you've heard of.
- Direct interaction with customers when things go wrong (and when they go right).
- A culture that values automation over heroics.
- In-person meetings, such as our annual company offsite and team meetings.
- Work from home in a remote-first environment.
- Competitive salaries and equity grants.

Company Overview
Gradle Technologies is the award-winning developer productivity company behind Gradle Build Tool, one of the most used build systems in the world, and Develocity®, the leading developer observability platform. It was founded in 2014 and is headquartered in San Francisco, California, USA, with a workforce of 51-200 employees. Its website is https://gradle.com/.

Company H1B Sponsorship
Gradle Technologies has a track record of offering H1B sponsorships, with 1 in 2025, 1 in 2024, and 2 in 2022.
Please note that this does not guarantee sponsorship for this specific role.