[Remote] Senior Cloud DevOps & Infrastructure Engineer
Note: This is a remote position open to candidates in the USA.

Diverse Lynx is seeking a Senior Cloud DevOps & Infrastructure Engineer with a focus on GCP and AI. The role involves designing, deploying, and maintaining secure and scalable cloud infrastructure, primarily on GCP within a multi-cloud environment, while implementing GitOps best practices and supporting AI/ML workloads.

Responsibilities
- Infrastructure as Code (IaC): Architect and provision production-grade infrastructure using Terraform. Manage state files and modules, and ensure infrastructure immutability.
- AI/ML: Work with LLMs in a multi-cloud environment.
- Kubernetes & Containerization: Design and manage clusters. Create and optimize Dockerfiles (multi-stage builds, distroless/hardened images). Manage complex deployments using Helm charts.
- CI/CD & GitOps: Build end-to-end CI/CD pipelines using GitLab CI. Implement GitOps workflows to synchronize infrastructure and application state.
- MLOps: Design, configure, and manage scalable and secure cloud infrastructure for MLOps.
- AI Infrastructure Support: Configure and maintain environments suitable for AI/ML workloads (GPU node pools, LLM integration, large model serving, high-performance storage).
- Production Support & Troubleshooting: Act as the primary escalation point for deployment failures and for network and infrastructure issues. Perform Root Cause Analysis (RCA).
- Security & Compliance: Implement 'Secure by Design' principles. Maintain good knowledge of network security, identity and privileged access management, and landing zone concepts for cloud platforms (Azure, AWS).
- Multi-Cloud Strategy: While GCP is primary, maintain and support secondary environments in AWS (and potentially Azure) to ensure business continuity.

Skills
- 6 to 8 years of experience in Cloud Infrastructure & DevOps Engineering
- Expert in Kubernetes, Terraform, and GitLab CI/CD
- Experience supporting AI/ML workloads
- Deep expertise in GCP (Compute Engine, GKE, Cloud Storage, IAM)
- Strong working knowledge of AWS (EC2, EKS, S3, IAM)
- Programming languages: Python required; knowledge of Java, C#, or JavaScript is a plus
- Advanced proficiency in Kubernetes
- Ability to write and manage custom Helm charts
- Experience with Ingress Controllers (Nginx), Service Mesh, and Autoscaling (HPA/VPA/Cluster Autoscaler)
- Expert-level knowledge of GitLab CI/CD (writing .gitlab-ci.yml, runners, artifacts, caching)
- Understanding of GitOps principles
- Strong hands-on experience with Terraform for provisioning cloud resources across multiple environments (Dev/Stage/Prod)
- Proficiency in Bash/Shell scripting and Python
- Strong Linux administration skills
- Experience setting up monitoring with cloud-native tools, Prometheus, and Grafana
- Experience with Azure cloud infrastructure
- Knowledge of Identity Providers (Keycloak, Azure AD/Entra ID) and OIDC integration
- Understanding of ITIL processes (Incident/Change Management) and tools such as ServiceNow and JIRA
- Basic understanding of Python/Flask/FastAPI applications to assist developers in troubleshooting

Company Overview
Diverse Lynx is a WBENC- and NMSDC-certified partner that helps organizations turn diversity goals into measurable impact through staffing and contingent workforce solutions. It was founded in 2002 and is headquartered in Princeton, New Jersey, US, with a workforce of 1001-5000 employees. Its website is http://www.diverselynx.com.

Company H1B Sponsorship
Diverse Lynx has a track record of offering H1B sponsorships, with 1 in 2024 and 1 in 2021. Please note that this does not guarantee sponsorship for this specific role.