[Remote] Senior Manager, Engineering - Observability Platform (Remote Eligible)
Note: This is a remote position open to candidates in the USA.

Smartsheet is a company that empowers teams to manage work seamlessly and scale solutions smarter. They are seeking a Senior Manager of Engineering for the Observability Platform team to lead the development of a centralized platform that provides full-stack visibility into complex systems across the company. This role involves engineering strategy and execution, focusing on observability tooling and AI integrations to enhance platform reliability.

Responsibilities
- Lead a team of engineers focused on observability platform engineering, driving the build-out of a unified observability stack used by all engineering teams at Smartsheet
- Own and evolve the platform's technical roadmap, consolidating multiple tooling platforms and AI observability tooling into a coherent, scalable capability
- Define platform standards, contribute to architectural direction, and ensure the team operates with engineering rigor and strong operational habits
- Build and scale the team, hiring senior engineers and establishing effective global practices across distributed stakeholders
- Lead design and delivery of centralized observability infrastructure covering metrics pipelines, distributed tracing, alerting frameworks, and log analytics across Smartsheet services
- Drive SLO/SLA definition and tooling for platform-wide reliability visibility, partnering closely with infrastructure, platform engineering, and on-call teams
- Own governance, including instrumentation standards, cost optimization, and rollout of advanced capabilities such as APM, RUM, and custom dashboards
- Lead architecture, scaling, and operational practices for log analytics across high-throughput production workloads
- Establish shared observability libraries, agents, and SDKs that reduce instrumentation burden for application engineering teams
- Build and maintain AI/ML observability integrations in partnership with the AI Platform team
- Partner with the Data & AI Platform team to integrate MLflow tracing, Inference Tables, and LLM-as-judge evaluation pipelines into the observability stack
- Develop dashboards and alerting for agentic AI workloads, including latency, token consumption, error rates, and evaluation metric drift
- Contribute to the AI governance and cost observability program, providing telemetry for model usage, cost attribution, and compliance reporting
- Serve as the primary engineering partner for platform consumers across Data & AI, Commerce, Infrastructure, and Security teams, ensuring observability needs are met across workstreams
- Lead complex, cross-functional observability projects with high ambiguity, managing delivery risk, communicating clearly to senior stakeholders, and building alignment across teams
- Partner with delivery partners to coordinate instrumentation across platform modernization and migration workstreams
- Contribute to quarterly and annual platform goals, reporting on key reliability and observability metrics to engineering leadership
- Communicate platform status, risks, and roadmap progress to Engineering leadership and more senior audiences in a clear, executive-ready format
- Embed on-call culture and incident management discipline into the team, ensuring clear runbooks, fast MTTR, and post-incident learning loops
- Drive cost governance for observability tooling, including spend optimization and efficient resource management
- Champion AI-assisted engineering practices within the team, applying tooling and automation to reduce toil and accelerate delivery

Skills
- 10+ years of software or platform engineering experience, with strong fundamentals in distributed systems, infrastructure, and backend services
- 3 years of engineering management experience, including direct team building, performance management, and cross-functional delivery ownership
- Deep hands-on expertise with observability tooling: Datadog (APM, metrics, logs, alerting), OpenSearch or Elasticsearch, distributed tracing (OpenTelemetry or equivalent), and SLO/SLA management at scale
- Proven experience operating observability platforms for high-availability, high-throughput production environments
- Experience building and scaling engineering teams in distributed or international settings
- Strong execution track record on complex, cross-functional infrastructure programs with high ambiguity
- Clear, direct communication (written and verbal) with both technical and non-technical audiences, including leadership and executive stakeholders
- Proactive risk identification and status communication without prompting
- Experience managing vendors, external delivery partners, and third-party integrations in a platform context
- Hands-on experience with AI/ML observability: MLflow tracing, LLM evaluation pipelines, or observability for agentic AI systems
- Familiarity with Amazon Bedrock, ECS Fargate, or LangGraph-based multi-agent architectures
- Experience with cloud cost governance and FinOps practices for observability tooling
- Exposure to data platform observability and data quality monitoring in a lakehouse context
- Experience establishing internal developer platforms, shared libraries, or platform-as-a-service offerings for application teams
- Prior work in SaaS environments with enterprise compliance requirements (SOC 2, FedRAMP, HIPAA)

Benefits
- Employer-subsidized medical/vision and dental coverage for full-time employees
- 401k Match to help you save for your future (50% of your contribution up to the first 6% of your eligible pay)
- Monthly stipend to support your work and productivity
- Flexible Time Away Program, plus Sick Time Off
- US employees are automatically covered under Smartsheet-sponsored life insurance and short-term and long-term disability plans
- US employees receive 12 paid holidays per year
- Up to 24 weeks of Parental Leave
- Personal paid Volunteer Day to support our community
- Opportunities for professional growth and development, including access to Udemy online courses
- Company Funded Perks, including a counseling membership, local retail discounts, and your own personal Smartsheet account
- Teleworking options from any registered location in the U.S. (role specific)

Company Overview
Smartsheet is a cloud-based work management platform that empowers collaboration, drives better decision-making, and accelerates innovation. It was founded in 2005 and is headquartered in Bellevue, Washington, USA, with a workforce of 1001-5000 employees. Its website is https://www.smartsheet.com.