[Remote] DevOps Engineer - Atlanta, GA, Birmingham, AL, Louisville, KY, Richmond, VA, Charlotte, NC
Note: This is a remote position open to candidates in the USA.

Dice is seeking an experienced Site Reliability Engineer (SRE) / DevOps Engineer with expertise in Incident Management and cloud-native platforms. The role involves ensuring the reliability and performance of distributed systems, managing incident responses, and implementing automation and governance strategies.

Responsibilities
- Manage and improve platform reliability, availability, and performance across production environments
- Lead and participate in incident management, root cause analysis, remediation planning, and post-incident reviews
- Drive change control processes and ensure operational governance standards are followed
- Monitor and manage error budgets while implementing reliability improvements
- Design, build, and maintain scalable cloud infrastructure and automation frameworks
- Deploy and manage containerized applications using Kubernetes and Docker
- Develop and maintain CI/CD pipelines to support efficient software delivery
- Implement Infrastructure as Code (IaC) solutions for automated provisioning and configuration management
- Establish observability strategies using monitoring, logging, and alerting platforms
- Collaborate with development, infrastructure, security, and business teams to ensure platform stability
- Troubleshoot complex production issues across cloud, networking, infrastructure, and application layers
- Continuously improve operational processes, automation, and system resilience

Skills
- 7+ years of experience in Site Reliability Engineering (SRE), DevOps, Cloud Infrastructure, or Production Operations
- Strong experience managing workloads in cloud environments: Microsoft Azure, Amazon Web Services (AWS), Google Cloud Platform (GCP)
- Hands-on experience with: Kubernetes, Docker, CI/CD Pipelines, Infrastructure as Code (IaC)
- Strong scripting and automation expertise using: Python, Bash, PowerShell, Go (Golang)
- Experience with observability and monitoring platforms: Datadog, Grafana, Prometheus, Splunk
- Strong understanding of: Networking concepts, Linux Administration, Windows Administration, Distributed Systems, Cloud-Native Architectures
- Experience with: Incident Response, Production Troubleshooting, Operational Governance
- Experience implementing reliability engineering best practices and SRE methodologies
- Experience supporting large-scale enterprise production environments
- Familiarity with high-availability and disaster recovery architectures
- Experience automating operational workflows and infrastructure management
- Knowledge of security best practices within cloud environments
- Experience working in Agile and DevOps-driven organizations

Company Overview
Dice is a job-search platform for technology professionals. It is a sub-organization of DHI Group, was founded in 1990, and is headquartered in Santa Clara, California, USA, with a workforce of 201-500 employees. Its website is http://www.dice.com.