[Remote] Manager, Site Reliability Engineering
Note: This is a remote job open to candidates in the USA.

Paradigm is a software company transforming the residential, construction & building product industries. They are seeking a Manager of Site Reliability Engineering to lead a high-performing team, promote modern SRE practices, and enhance reliability across their Azure-based platform.

Responsibilities
- Lead and grow a team of site reliability engineers. Provide guidance, mentorship, and career development
- Contribute to and mature SRE practices across production services: SLOs, SLIs, error budgets, toil reduction, and blameless post-mortems that turn incidents into lasting improvements
- Oversee the incident management lifecycle end-to-end, including detection, response, resolution, post-incident review, and systemic improvement
- Design on-call rotations, runbooks, and escalation procedures that balance service reliability with engineer well-being and sustainable work practices
- Drive measurable reductions in MTTR and MTTD through improved observability, intelligent automation, and predictive monitoring
- Build automation to eliminate manual operational work, including provisioning, deployment, scaling, self-healing, and reporting
- Implement chaos engineering practices to validate system resilience and surface weaknesses before they cause outages
- Partner with engineering and product teams to embed reliability requirements into the development lifecycle, from design through deployment
- Collaborate with the observability team to ensure comprehensive instrumentation, smart alerting, and actionable dashboards across all critical services
- Measure, report, and advocate for reliability improvements with both technical and executive stakeholders, using data to drive investment decisions

Skills
- Bachelor's degree in Engineering or a related field, or equivalent experience
- 7+ years in site reliability engineering, DevOps, or infrastructure engineering, with at least 1 year in people management (or demonstrated tech lead experience with direct influence over team processes and career growth)
- Hands-on experience running production systems on Azure (including proficiency with key services such as AKS, App Services, Service Bus, Event Grid, and Azure Monitor) or comparable cloud platforms
- Proven track record implementing SRE practices with measurable reliability improvements, and familiarity with modern observability platforms (Datadog, Prometheus/Grafana, or equivalent)
- Experience leading incident response for high-severity production issues and running effective post-mortems
- Strong background in automation, infrastructure as code (Terraform, Bicep, or similar), and systematically eliminating manual operational work
- Production-grade operational experience with Kubernetes container orchestration
- Ability to automate workflows and build scripts using Python, Bash, PowerShell, or Go
- Strong communication, with the ability to make complex technical issues clear for both engineers and executives
- Data-driven approach. You use metrics and telemetry to guide decisions, not gut feel
- You are collaborative cross-functionally and build trust and alignment naturally
- AI-enhanced observability experience is preferred
- Experience with AI coding assistants and CI/CD systems (GitHub Actions, Azure DevOps, ArgoCD) with automation capabilities is preferred
- Knowledge of distributed systems patterns is preferred
- Exposure to AIOps platforms or using LLMs for operational automation is preferred

Company Overview
Paradigm provides a software platform that focuses on the building products industry. It was founded in 1999 and is headquartered in Middleton, Wisconsin, USA, with a workforce of 501-1000 employees. Its website is http://myparadigm.com/.

Company H1B Sponsorship
Paradigm has a track record of offering H1B sponsorships, with 1 in 2026, 1 in 2025, 4 in 2024, 1 in 2023, 1 in 2022, 4 in 2021, and 1 in 2020. Please note that this does not guarantee sponsorship for this specific role.