[Remote] Staff Site Reliability Engineer - Kubernetes
Note: This is a remote position open to candidates in the USA.

Okta is a company focused on securing identities in the AI era, and it is seeking a Staff Site Reliability Engineer to build and manage Kubernetes platforms. The role involves architecting reliable, scalable, and secure Kubernetes-based platforms on AWS, ensuring high availability and performance while optimizing costs and automation.

Responsibilities

- Kubernetes Platform Creation: Design, implement, and maintain highly available, scalable, and fault-tolerant Kubernetes platforms. Ensure clusters are optimized for production workloads, providing high resilience and operational efficiency.
- AWS Infrastructure Management: Build, manage, and optimize AWS cloud infrastructure, including EKS, ECS, S3, VPCs, RDS, IAM, and more. Implement best practices for cost management, scaling, and security within AWS.
- Helm Management: Utilize Helm to automate and streamline the deployment of applications and services to Kubernetes clusters. Create, maintain, and manage Helm charts for production-ready deployments.
- Karpenter Implementation: Implement and manage Karpenter to dynamically scale Kubernetes clusters in response to workload demands.
- Istio Service Mesh Management: Configure and manage Istio to provide service-to-service communication, security, and observability within the Kubernetes clusters. Enable fine-grained traffic management, service discovery, and policy enforcement.
- Platform Automation & Scaling: Automate the deployment, scaling, and management of infrastructure and applications.
  Work with CI/CD pipelines to ensure a seamless flow from development to production with minimal downtime.
- Incident Management & Troubleshooting: Respond to incidents, troubleshoot, and resolve system issues related to performance, availability, and security in a timely and effective manner.
- Security & Compliance: Design and implement secure cloud infrastructure with appropriate access controls, network security, and compliance frameworks.
- Documentation & Knowledge Sharing: Create and maintain detailed documentation for Kubernetes platform setup, operational procedures, and best practices. Promote knowledge sharing across teams.

Skills

- 4+ years of experience with Kubernetes/Helm
- 4+ years of experience with Terraform
- 5+ years of experience with AWS
- Experience with multi-region cloud environments
- Proven experience with AWS (EC2, RDS, S3, CloudFormation, IAM, etc.) and solid understanding of cloud-native architectures
- Strong expertise in Kubernetes platform creation, management, and optimization (e.g., setting up highly available clusters, networking, and storage)
- Hands-on experience with Helm for Kubernetes application deployment and management
- Practical experience with Karpenter for dynamic scaling of Kubernetes clusters and optimizing resource usage
- Expertise in managing and securing Istio for service mesh, including traffic management, security, and observability features
- Proficiency in CI/CD pipelines and automation tools (e.g., Jenkins, GitLab, CircleCI, Terraform, Ansible, Spinnaker)
- Strong scripting and automation skills in Python, Bash, or Go for infrastructure management and platform automation
- Experience with monitoring, logging, and alerting tools such as Prometheus, Grafana, CloudWatch, and ELK Stack
- Understanding of security best practices for cloud platforms and Kubernetes (e.g., role-based access control (RBAC), encryption, and compliance frameworks)
- Familiarity with Docker and containerization principles
- Bachelor's degree in Computer Science, Engineering, or related field
  (or equivalent professional experience)
- Certifications (Preferred): CKA (Certified Kubernetes Administrator), CKAD (Certified Kubernetes Application Developer), or AWS Certified DevOps Engineer are highly desirable

Benefits

- Equity (where applicable)
- Bonus
- Benefits, including health, dental, and vision insurance
- 401(k)
- Flexible spending account
- Paid leave (including PTO and parental leave)
- Immersive, in-person onboarding experience designed to accelerate your impact and connect you to our mission and team from day one

Company Overview

Okta is a management platform that secures critical resources from cloud to ground for workforce and customers. It was founded in 2009 and is headquartered in San Francisco, California, USA, with a workforce of 5001-10000 employees. Its website is http://www.okta.com.