[Remote] Director of Cloud Operations
Note: This is a remote job open to candidates in the USA.

Firstup is a company dedicated to improving employee experience through innovative communication solutions. They are seeking a Director of Cloud Operations to lead their cloud infrastructure and operational practices, ensuring reliability and efficiency across their SaaS platform while fostering a high-performing team.

Responsibilities
- Own the availability, performance, and resilience of our multi-region AWS platform
- Drive improvements in system reliability through well-defined SLIs/SLOs, error budgets, and proactive engineering practices
- Lead efforts to reduce MTTR and improve incident response effectiveness across the organization
- Guide architecture decisions for microservices, Kubernetes (EKS), and serverless workloads to ensure scalability and fault tolerance
- Advance our observability strategy using Datadog, ensuring actionable insights across infrastructure and applications
- Establish and refine incident management practices, including on-call processes, escalation paths, and post-incident reviews
- Act as an incident commander for critical events and contribute to the on-call rotation
- Elevate operational standards through automation, standardization, and adoption of modern best practices
- Drive cost optimization initiatives across AWS environments without compromising performance or reliability
- Leverage AI and automation to improve operational efficiency, accelerate root cause analysis, and enhance system insights
- Continuously improve CI/CD pipelines (CircleCI) and infrastructure-as-code practices (Terraform)
- Lead, mentor, and support a distributed team of CloudOps engineers across the US and UK
- Foster a culture of accountability, learning, and continuous improvement
- Provide technical guidance while enabling the team to grow in ownership and capability
- Ensure stability and support for existing customers while maintaining clear operational boundaries with the cloud platform

Skills
- 10+ years in cloud infrastructure, SRE, or DevOps roles
- 3+ years of experience leading CloudOps/SRE teams
- Proven track record of leading operational or platform transformations in a SaaS environment
- Experience operating multi-region, customer-facing systems at scale
- Strong hands-on experience with AWS (multi-region architectures)
- Strong hands-on experience with Kubernetes (EKS) and containerized environments
- Solid understanding of microservices and distributed systems design
- Familiarity with serverless architectures and modern cloud-native patterns
- Deep experience with incident management, on-call operations, and reliability engineering practices
- Strong understanding of SLO/SLI frameworks, monitoring strategies, and performance optimization
- Demonstrated ability to balance hands-on technical work with team leadership
- Collaborative, pragmatic leader who can influence across teams and functions
- Passion for building and supporting high-performing teams
- Focus on continuous improvement, with a bias toward measurable outcomes
- Infrastructure as Code (Terraform preferred)
- CI/CD pipelines (CircleCI or similar)
- Observability platforms (Datadog or equivalent)

Benefits
- Excellent PTO program
- Great health benefits
- A casual and friendly environment
- Remote work
- A leadership team who truly believes in your growth, both personally and professionally

Company Overview
Firstup is an employee communication and engagement platform connecting companies with their valued asset: their employees. It was founded in 2010 and is headquartered in San Bruno, California, USA, with a workforce of 201-500 employees. Its website is https://firstup.io/.

Company H1B Sponsorship
Firstup has a track record of offering H1B sponsorships, with 1 in 2024. Please note that this does not guarantee sponsorship for this specific role.