[Remote] DevOps Engineer
Note: This is a remote position open to candidates in the USA.

EPAM Systems is a major technology company specializing in infrastructure supporting AI research, and it is seeking a DevOps Engineer to help maintain production Kubernetes-based systems. The role focuses on site reliability engineering, observability, and SQL production support duties, ensuring system reliability and performance across an Azure Stack environment.

Responsibilities
- Design, maintain and progressively improve observability solutions, including dashboards and visual reports built with Grafana or comparable monitoring tools
- Set up, implement and oversee metrics, SLIs, SLOs and alerting approaches to guarantee reliability and transparency across production systems
- Deliver business-hours operational support for Kubernetes-based production environments, involving initial troubleshooting, log review and metric-based investigations
- Assist with SQL-based systems as part of production operations, contributing to issue examination and performance diagnostics
- Examine incidents and system behavior to pinpoint root causes, take part in post-incident reviews and suggest enhancements for monitoring and reliability practices
- Work hand in hand with engineering, platform and research teams to raise observability standards, refine operational processes and strengthen overall system stability
- Contribute to documentation, knowledge-sharing activities and ongoing improvement initiatives within the team

Skills
- At least 2 years of relevant hands-on professional experience
- Demonstrated track record in Site Reliability Engineering (SRE), DevOps, Production Support or equivalent roles working with production systems
- Practical exposure to observability and monitoring stacks including Grafana, Prometheus, Elastic Stack, Datadog or similar tools
- Strong command of Linux systems, supported by solid troubleshooting and log analysis capabilities
- Working experience supporting Kubernetes-based environments in production settings
- Background in delivering SQL production support, including query troubleshooting and basic performance diagnostics
- Confident scripting skills in Python, Bash or similar languages for automation and day-to-day operational activities
- Capability to investigate incidents, determine underlying causes and drive continuous improvement efforts
- Effective communication and teamwork skills for working successfully with distributed and cross-functional teams
- Proficient English communication skills, both spoken and written, at a B2+ level or higher
- Experience handling APIs and integration patterns to link services together and enable system interoperability
- Knowledge of databases, covering administration, tuning and production-level support activities
- Exposure to Infrastructure as Code development and maintenance for automating environment provisioning and configuration
- Practical experience using Microsoft Azure to manage cloud resources and run production workloads

Benefits
- International projects with top brands
- Work with global teams of highly skilled, diverse peers
- Healthcare benefits
- Employee financial programs
- Paid time off and sick leave
- Upskilling, reskilling and certification courses
- Unlimited access to the LinkedIn Learning library and 22,000+ courses
- Global career opportunities
- Volunteer and community involvement opportunities
- EPAM Employee Groups
- Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn

Company Overview
EPAM leverages its core engineering expertise as a leading global product development and digital platform engineering services company. It was founded in 1993 and is headquartered in Newtown, Pennsylvania, USA, with a workforce of 10,001+ employees. Its website is https://www.epam.com.

Company H1B Sponsorship
EPAM Systems has a track record of offering H1B sponsorships, with 11 in 2026, 120 in 2025, 172 in 2024, 232 in 2023, 373 in 2022, 359 in 2021 and 502 in 2020. Please note that this does not guarantee sponsorship for this specific role.