[Remote] Principal Site Reliability Engineer
Note: This is a remote position open to candidates in the USA. Upstart is a leading AI lending marketplace on a mission to reduce the cost and complexity of borrowing for all Americans. They are seeking a Principal Site Reliability Engineer to lead the SRE team in ensuring the reliability and observability of their production systems while driving best practices and collaborating with cross-functional teams to enhance operational excellence.
Responsibilities
• Lead the definition, advocacy, and adoption of SRE principles across engineering teams
• Partner with leadership to shape long-term reliability, resiliency, and observability strategies
• Champion distributed tracing, real user monitoring (RUM), and key performance metrics such as Largest Contentful Paint (LCP) to improve system visibility and user experience
• Build and scale self-healing systems to minimize manual intervention and reduce downtime
• Drive enterprise-wide improvements to incident response processes, including those related to Machine Learning systems
• Collaborate closely with Development Productivity and Quality teams to improve engineering velocity without sacrificing reliability
• Influence technical and operational roadmaps through data-driven insights and hands-on technical contributions
• Own and deliver cross-functional initiatives from concept through execution, applying program management skills to align stakeholders and achieve results
Skills
• Bachelor's degree in Computer Science, Engineering, Mathematics, or a related field (or its equivalent), plus 8 years of experience
• Combined experience with both Software Engineering and Site Reliability Engineering, with a balanced background in both disciplines
• Proven track record as an SRE thought leader and evangelist, driving adoption of reliability best practices across organizations
• Strong communication and mentoring skills to influence engineers across disciplines
• Proficiency in Python, Go, JavaScript/TypeScript
• Proficiency with Infrastructure as Code (Terraform, CDK, CloudFormation, etc.)
• Experience building internal tooling from scratch in agile development environments
• Expertise with observability, distributed tracing, RUM, LCP, and performance monitoring tools (e.g., Datadog, Prometheus)
• Experience with on-call and incident management, including large-scale or ML-related incidents
• Strong background in automation and building self-healing systems
• Hands-on experience with LLM/GenAI to improve SRE efficiency and processes
• Program management skills, including the ability to propose innovative solutions, influence leadership, improve processes, and drive cross-functional projects to completion
• Experience with service mesh
• Full stack development skills
• Experience building or extending observability platforms
• Background in Development Productivity or Quality Platforms
• Experience in high-scale SaaS, microservice-oriented cloud environments
Benefits
• Target bonuses
• Equity compensation
• Generous benefits packages (including medical, dental, vision, and 401(k))
• Competitive compensation, including base pay, bonus opportunities, and annual equity grants that vest quarterly
• Retirement benefits to help you plan for the future, including a 401(k) or Group Retirement Savings Plan with a company match of $2 for every $1 contributed, up to $15,000 annually (USD in the US, CAD in Canada)
• Employee Stock Purchase Plan (ESPP) with discounted stock purchase options for eligible employees (US only)
• Comprehensive health coverage designed to support you and your family, including medical, dental, vision, and wellness resources for the US and supplemental health coverage for Canada
• Health Savings Account contributions from Upstart for eligible plans (US only)
• Income protection benefits, including life insurance and disability coverage for added financial security
• Paid time off, sick leave, and company holidays, in line with local requirements
• Paid family and parental leave to support caregiving and major life moments (duration varies by country)
• Family-centered benefits to support fertility, parenthood, and caregiving needs
• Employee Assistance Program (EAP) offering mental health support and life-centered resources
• Financial wellness resources, including access to financial planning tools and a financial concierge service (US only)
• Annual wellness allowance to support your physical and emotional well-being and personal development, based on what matters most to you
• Annual productivity allowance to invest in relevant tools and resources you need to do your best work, no matter where you work from
• Connection and community through team events, a