[Remote] Senior Site Reliability Engineer
Note: This is a remote position open to candidates in the USA.

Ellucian is a company that powers innovation for higher education, serving over 21 million students globally. They are seeking a Senior Site Reliability Engineer to ensure the reliability, performance, and cost-efficiency of their production systems, focusing on DevOps practices and incident management.

Responsibilities
- Own and improve system reliability, availability, and performance for production environments
- Design, implement, and manage monitoring, alerting, and observability using DataDog (required)
- Lead incident response efforts, including troubleshooting, mitigation, and post-incident reviews
- Perform detailed root cause analysis (RCA) and drive permanent resolutions
- Partner with engineering and DevOps teams to build scalable, resilient infrastructure
- Automate operational processes to improve efficiency and reduce risk
- Analyze and optimize infrastructure and application costs
- Define and manage SLIs/SLOs to meet reliability targets
- Continuously improve deployment, monitoring, and operational practices

Skills
- 5+ years of experience in Site Reliability Engineering, DevOps, or similar roles
- Strong, hands-on expertise with DataDog (APM, logs, metrics, dashboards, alerting)
- Experience with cloud platforms (AWS, Azure, or GCP)
- Proficiency in DevOps practices and tools (CI/CD, Infrastructure as Code such as Terraform)
- Strong troubleshooting skills and experience conducting root cause analysis in distributed systems
- Experience with containers and orchestration (Docker, Kubernetes)
- Scripting or programming experience (Python, Bash, or similar)
- Proven ability to analyze and optimize cloud costs
- Experience with cost management tools (e.g., AWS Cost Explorer, Azure Cost Management)
- Familiarity with cloud security and compliance best practices
- Experience supporting high-availability, customer-facing systems
- Strong collaboration and communication skills

Benefits
- Comprehensive health coverage: medical, dental, and vision
- Flexible time off
- Thrive Flex Lifestyle Account (LSA) that allows you to contribute towards your health, financial, or learning interests
- 401k with match and BrightPlan to help you save for the future
- Parental leave
- 5 charitable days to support the community that supports us
- Telemedicine
- Wellness
  - Headspace Care (mental health)
  - Wellbeats (virtual fitness classes)
  - RethinkCare & Wellthy: caregiver support
- Diversity and inclusion programs which provide access to internal employee resource groups
- Employee referral bonuses to encourage the addition of great new people to the team
- We foster a learning culture with:
  - Education Assistance Program
  - Professional development opportunities

Company Overview
Ellucian delivers the software, services, and insights that help your institution thrive. It was founded in 1968 and is headquartered in Fairfax, Virginia, USA, with a workforce of 1001-5000 employees. Its website is http://www.ellucian.com.

Company H1B Sponsorship
Ellucian has a track record of offering H1B sponsorships, with 2 in 2026, 31 in 2025, 27 in 2024, 28 in 2023, 31 in 2022, 33 in 2021, and 30 in 2020. Please note that this does not guarantee sponsorship for this specific role.