[Remote] Senior Site Reliability Engineer, Core AI Infrastructure
Note: This is a remote role open to candidates in the USA.

Coinbase is a remote-first company focused on increasing economic freedom, and it is seeking a Senior Site Reliability Engineer to join its IT Operations team. The role involves owning the reliability and automation of critical AI infrastructure, ensuring systems are resilient, observable, and secure at scale.

Responsibilities
- Own the reliability, monitoring, and incident response lifecycle for AI infrastructure services, including on-call support for AWS deployment pipelines, root cause analysis, and blameless retros
- Build automation and tooling to streamline operational IT workflows, eliminate manual tasks, and improve deployment velocity across CI/CD frameworks and Kubernetes environments
- Partner with the Coinbase Infrastructure team to extend CI/CD frameworks supporting IT services and enterprise network platforms, and with Security and Compliance to integrate surveillance tooling into deployment pipelines
- Strengthen observability and documentation standards across IT engineering by defining metrics, implementing monitoring solutions, and maintaining technical documentation that sets a standard of excellence
- Develop full-stack applications in Go or Python that power internal AI products and infrastructure

Skills
- 5+ years of experience automating and supporting cloud infrastructure (AWS) and network environments, with hands-on use of infrastructure-as-code tools (Terraform, Ansible, Chef, Puppet, or Salt)
- Proven experience deploying, managing, and troubleshooting containerized workloads using Docker and Kubernetes in production environments
- Proficiency in at least one scripting or programming language (Python, Bash, Ruby, or Go) and version control workflows using Git-based CI/CD pipelines
- Track record of leading incident response in environments with strict SLAs, including root cause analysis, blameless retros, and measurable reliability improvements
- Utilizes generative AI responsibly, maintaining human oversight to deliver business-ready outputs and drive measurable improvements in workflow efficiency, cost, and quality
- Expertise with Linux, Bash, Ruby, Python, and/or Go
- Expertise automating EC2 or container deployments with Terraform
- Strong network security fundamentals
- Experience managing and leveraging log aggregation
- Experience working in a highly regulated environment
- Experience in a fast-paced, high-growth company
- Experience in a remote-first IT environment

Benefits
Total compensation may also include equity and bonus eligibility, and benefits (medical, dental, vision, 401(k)).

Company Overview
Coinbase is a crypto exchange and wallet platform that allows merchants and consumers to buy, sell, and store digital currencies. It was founded in 2012 and is headquartered in San Francisco, California, USA, with a workforce of 1001-5000 employees. Its website is https://www.coinbase.com.