[Remote] Site Reliability Engineer, Inference Infrastructure
Note: This is a remote job open to candidates in the USA.

Cohere is a company focused on scaling intelligence to serve humanity through AI systems. They are seeking a Site Reliability Engineer to join their Model Serving team, which is responsible for developing and operating AI platforms that deliver large language models through API endpoints with high performance and reliability.

Responsibilities
Build self-service systems that automate managing, deploying and operating services. This includes our custom Kubernetes operators that support language model deployments
Automate environment observability and resilience. Enable all developers to troubleshoot and resolve problems
Take the steps required to ensure we hit defined SLOs, including participation in an on-call rotation
Build strong relationships with internal developers and influence the Infrastructure team's roadmap based on their feedback
Develop our team through knowledge sharing and an active review process

Skills
5+ years of engineering experience running production infrastructure at a large scale
Experience designing large, highly available distributed systems with Kubernetes, and GPU workloads on those clusters
Experience with Kubernetes dev and production coding and support
Experience with GCP, Azure, AWS, OCI, multi-cloud on-prem / hybrid serving
Experience in designing, deploying, supporting, and troubleshooting in complex Linux-based computing environments
Experience in compute/storage/network resource and cost management
Excellent collaboration and troubleshooting skills to build mission-critical systems, and ensure smooth operations and efficient teamwork
The grit and adaptability to solve complex technical challenges that evolve day to day
Familiarity with computational characteristics of accelerators (GPUs, TPUs, and/or custom accelerators), especially how they influence latency and throughput of inference
Strong understanding of or working experience with distributed systems
Experience in Golang, C++ or other languages designed for high-performance scalable servers

Benefits
An open and inclusive culture and work environment
Work closely with a team on the cutting edge of AI research
Weekly lunch stipend, in-office lunches & snacks
Full health and dental benefits, including a separate budget to take care of your mental health
100% Parental Leave top-up for up to 6 months
Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
Remote-flexible, offices in Toronto, New York, San Francisco, London and Paris, as well as a co-working stipend
6 weeks of vacation (30 working days!)

Company Overview
Cohere develops enterprise artificial intelligence software and provides language models, retrieval tools, and workplace platforms. It was founded in 2019 and is headquartered in Toronto, Ontario, Canada, with a workforce of 201-500 employees. Its website is https://cohere.com.

Company H1B Sponsorship
Cohere has a track record of offering H1B sponsorships, with 11 in 2025, 14 in 2024, 13 in 2023, 5 in 2022, and 2 in 2021. Please note that this does not guarantee sponsorship for this specific role.