[Remote] Principal Site Reliability Engineer
Note: This is a remote job open to candidates in the USA.

Accela is an industry leader in designing and delivering government software to improve efficiency and citizen engagement. The Principal Site Reliability Engineer will be responsible for the reliability, scalability, performance, and operational excellence of Accela's Civic Platform, working closely with various engineering teams to modernize infrastructure and ensure high availability and security of SaaS offerings.

Responsibilities
Serve as a technical leader for reliability engineering, operational excellence, and platform modernization across the Civic Platform
Drive platform modernization initiatives, including the continued evolution from VM-based architectures toward containerized and cloud-native services, in partnership with DevOps Engineering, Database Engineering, Security, and Development teams
Lead efforts that improve and sustain the availability, performance, scalability, security, and cost efficiency of Accela's SaaS offerings
Define, implement, and operate service level objectives (SLOs), service level agreements (SLAs), and error budgets for critical platform services, using data to drive prioritization and risk-based decision making
Lead observability initiatives across metrics, distributed tracing, logging, and monitoring platforms to improve system visibility and accelerate issue detection and resolution
Drive Root Cause Analysis (RCA) efforts for complex production incidents, facilitate blameless postmortems, and ensure corrective actions are implemented and tracked to completion
Design, develop, and maintain automation, tooling, and software solutions that improve reliability, operational efficiency, scalability, and developer productivity
Serve as a senior technical escalation point during production incidents and for platform changes that impact availability, performance, security, or compliance
Partner with Security and Compliance teams to ensure platform operations meet regulatory and compliance requirements, including SOC 2, HIPAA, FedRAMP, StateRAMP, and PCI-DSS
Translate operational metrics, reliability trends, and platform health data into actionable insights for engineering leadership and executive stakeholders
Mentor engineers across the Cloud Engineering organization and influence engineering best practices through technical leadership and collaboration

Skills
8+ years of experience in Site Reliability Engineering, Software Engineering, Cloud Infrastructure, or related disciplines within a SaaS environment, including experience leading complex technical initiatives
Demonstrated technical leadership driving platform modernization in containerized and orchestrated environments, including Kubernetes or equivalent technologies
Hands-on experience operating and supporting large-scale SaaS platforms on Microsoft Azure
Experience developing automation and operational tooling using Python, PowerShell, Bash, or similar scripting languages
Deep expertise designing, operating, analyzing, and troubleshooting complex distributed systems across the application, infrastructure, networking, and operating system layers
Strong experience with modern observability platforms, including monitoring, logging, metrics, and distributed tracing
Demonstrated success leading incident response, Root Cause Analysis, and continuous improvement initiatives
Experience establishing and maturing Incident, Problem, and Change Management practices
Strong written and verbal communication skills with the ability to effectively communicate technical concepts to engineering leadership and executive stakeholders
Experience using Git and GitHub-based development workflows
Experience with Infrastructure-as-Code practices and tooling, particularly Terraform
Experience with configuration management platforms such as Ansible
Experience supporting SaaS platforms subject to public-sector compliance frameworks, including SOC 2, HIPAA, FedRAMP, StateRAMP, and PCI-DSS
Experience implementing GitOps deployment methodologies using tools such as Argo CD or Flux
Experience implementing and operating OpenTelemetry-based observability solutions
Cloud FinOps experience, including cost optimization and resource efficiency initiatives within Microsoft Azure environments
Strong Linux systems administration experience alongside Microsoft Windows expertise
Experience leveraging AI-assisted engineering tools such as GitHub Copilot, Claude Code, or similar technologies to improve engineering productivity, incident response, automation, and operational efficiency

Benefits
Annual bonus target
Flexible time off
Comprehensive medical, dental, and vision plans
Family planning benefits
401(k) retirement savings plan with company match
Health savings account with company contributions
Flexible spending account
Life, accident, and disability coverage
Business travel insurance
Employee assistance programs
Other well-being benefits

Company Overview
Accela provides market-leading solutions that help governments modernize and build thriving communities. It was founded in 1999 and is headquartered in San Ramon, California, USA, with a workforce of 201-500 employees. Its website is https://www.accela.com.

Company H1B Sponsorship
Accela has a track record of offering H1B sponsorships, with 5 in 2025, 7 in 2024, 10 in 2023, 2 in 2022, 5 in 2021, and 17 in 2020. Please note that this does not guarantee sponsorship for this specific role.