[Remote] Senior Cloud Operations Engineer
Note: This is a remote position open to candidates in the USA.

The Linux Foundation is a driving force in fostering open source collaboration and supporting communities across a range of projects, including PyTorch. They are seeking a Senior Cloud Operations Engineer who will focus on the infrastructure operations of the PyTorch project, automating processes, optimizing cloud-native tools, and ensuring a robust and scalable cloud environment.

Responsibilities
- Manage multi-cloud environments, primarily focusing on AWS services (EKS, EC2, S3, IAM, ELB)
- Contribute to architectural exercises with the open source community and technical leads to validate new cloud infrastructure
- Implement and maintain infrastructure-as-code using Terraform via pytorch/ci-infra and pytorch/test-infra
- Optimize cloud resource utilization and implement FinOps practices for cost management and reporting
- Design, implement, and maintain CI/CD pipelines using GitHub Actions and ARC, including runner configurations and other elements of the CI ecosystem
- Debug and triage issues in build and test pipelines, including experience with unit testing
- Develop monitoring and alerting solutions for CI/CD workflows and critical infrastructure
- Manage and optimize Cloudflare CDN deployments for PyTorch assets (R2/S3)
- Implement best practices for CDN and overall infrastructure security
- Develop comprehensive monitoring and observability solutions using Datadog, AWS CloudWatch, and other telemetry data collection and processing tools
- Review and recommend monitoring solutions as project and community needs evolve
- Participate in on-call rotations supporting operations and incident response using incident.io
- Establish and maintain escalation procedures and resolution processes
- Participate in ci-infra and multi-cloud working groups and support architecture decisions
- Collaborate with external contributors and promote DevOps best practices
- Manage GitHub repositories, including user onboarding and access control
- Attend and contribute to technical meetings, including Infrastructure, CI Workflow, and Technical Advisory Council sessions
- Develop and maintain technical documentation for infrastructure and processes
- Provide guidance on developer best practices and tooling
- Create and update runbooks for common operational tasks and incident response

Skills
- Ability to work with communities made up of industry specialists and collaborate outside of the Linux Foundation
- Bachelor's degree in Computer Science, Engineering, or related field
- 7+ years of experience in cloud operations with significant AWS expertise
- Strong knowledge of infrastructure-as-code principles and tools, particularly Terraform
- Proficiency in scripting languages (Python, TypeScript, Bash) and containerization technologies (Docker, Kubernetes)
- Experience with Cloudflare CDN management and optimization
- Expertise in implementing and managing monitoring solutions, specifically Datadog and AWS CloudWatch
- Familiarity with incident management tools and processes, particularly incident.io
- Demonstrated experience in CI/CD pipeline design and implementation
- Strong problem-solving skills and ability to troubleshoot complex systems
- Excellent communication skills and experience collaborating with open source communities
- Experience with PyTorch or other open source communities
- Multi-cloud expertise across AWS, GCP, and Azure
- GitHub ARC experience
- Knowledge of FinOps principles and cloud cost optimization strategies
- Contributions to open source projects, especially in infrastructure management roles
- Familiarity with the Linux Foundation or similar open source foundations
- Experience mentoring other engineers and fostering a collaborative team environment

Benefits
- The Linux Foundation maintains a predominantly remote workforce
- Committed to hiring top-notch talent
- Providing a flexible and supportive work culture
- Collaboration is embedded in our DNA
- Work closely together while not being confined to a traditional office space

Company Overview
The Linux Foundation is the organization of choice for the world's top developers and companies to build ecosystems that accelerate open technology development and commercial adoption. It was founded in 2000 and is headquartered in San Francisco, California, USA, with a workforce of 201-500 employees. Its website is http://www.linuxfoundation.org.