[Remote] Staff Site Reliability Engineer
Note: This is a remote position open to candidates in the USA.

SimSpace is an AI Proving Ground that empowers organizations to train, test, and outmaneuver adversaries in various environments. The Staff Site Reliability Engineer will define the technical vision, lead architecture efforts, and secure the infrastructure for the SimSpace cyber range platform, ensuring reliable and scalable deployments.

Responsibilities
Design and architect the overarching infrastructure strategy that enables consistent, repeatable, and secure deployments across SimSpace-hosted data centers, customer-provided hardware, and highly restricted air-gapped environments
Lead the evolution of our CI/CD and Kubernetes platforms
Drive advanced application packaging, templating, and configuration management strategies using Jsonnet and Grafana Tanka (alongside Kustomize)
Define, measure, and govern Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets across the engineering organization
Partner with product and engineering leadership to balance feature delivery with platform stability
Architect our enterprise observability strategy using the Grafana stack
Design frameworks for proactive monitoring, complex anomaly detection, and distributed tracing that give teams unparalleled visibility into system health, pod scaling, and latency bottlenecks
Drive the infrastructure security posture at an architectural level
Embed advanced container security, zero-trust network segmentation, and automated compliance policies directly into our deployment pipelines and runtime environments
Serve as a strategic partner and consultant to development teams
Advocate for an 'SRE culture' by designing self-service tooling, establishing 'paved roads' for developers, and reducing operational toil across the entire engineering org
Act as an Incident Commander during complex, high-severity outages
Drive blameless post-mortems and engineer long-term, systemic, and architectural fixes to ensure classes of failures never repeat
Act as a technical mentor to senior and mid-level engineers
Raise the baseline of engineering excellence across the company by coaching, documenting best practices, and leading by example

Skills
8+ years of experience in Site Reliability, Platform, or DevOps engineering, with a proven track record of operating at a Staff, Principal, or Lead level to drive organization-wide infrastructure initiatives
Deep software engineering skills (beyond scripting), with the ability to architect complex, production-quality systems
Language agnostic, but highly proficient in at least one modern language (e.g., Go, Python)
Deep, architectural understanding of Kubernetes in multi-tenant and multi-cluster production environments
Expert-level knowledge of Jsonnet and Grafana Tanka for managing complex, scalable Kubernetes configurations and application packaging
Extensive experience architecting sophisticated CI/CD pipelines and GitOps workflows using GitHub Actions, ArgoCD, and infrastructure-as-code principles at an enterprise scale
Systems-level thinking with the ability to design architectures that span self-hosted, on-premises, VMware-based, and air-gapped deployment models
Deep expertise with observability platforms (Grafana stack preferred) and a proven ability to design alerting and monitoring strategies for complex distributed systems
Strong background in infrastructure security architecture, including container hardening, network security, vulnerability management, and delivering software to heavily regulated or customer-managed environments
Exceptional communication and stakeholder management skills
Ability to influence cross-functional leadership, negotiate reliability tradeoffs, and align engineering teams behind a unified technical vision

Benefits
Bonuses tied to company performance and individual contributions
Comprehensive medical, dental, and vision benefits, plus savings plans. Coverage starts on day one!
Access to company-paid counseling, coaching, and resources for you and your family through Spring Health
401(k) retirement savings plan featuring a company match
Unlimited vacation and dedicated health & wellness days
Paid leave plans to support you and your loved ones during life’s most important moments
Equity stock options at hire, with annual performance-based grants
Earn $1,500–$3,500 for every qualified hire through our employee referral program
Fully and partially subsidized membership plans and equipment discounts to help you reach your personalized fitness goals
Access to a LinkedIn Learning membership to prioritize your personal and professional development
Monthly reimbursements for meaningful connections with teammates through our SocialSpace Community
Legal plan coverage, pet insurance, wellness reimbursements, and more to simplify life’s details

Company Overview
SimSpace combines high-fidelity, military-grade cyber ranges and training content with unique user and adversary emulation techniques. It was founded in 2015 and is headquartered in Boston, Massachusetts, USA, with a workforce of 201-500 employees. Its website is https://www.simspace.com/.