DevOps Engineer

Remote Full-time
About your new role:

As our Founding DevOps Engineer, you will own the reliability of a high-throughput distributed platform processing network telemetry, voice, and security data for a global customer base. Your mandate: keep the platform fast, available, and scalable as CommandLink grows, enabling fast, iterative deployments without sacrificing uptime.

You'll work hands-on across VMs, firewalls, Kubernetes clusters, Kafka and Flink pipelines, OpenSearch, and Azure infrastructure, designing systems to fail gracefully and recover automatically, not just monitoring them. You'll bring strong platform judgment to decisions that directly impact customer uptime, data latency, and our ability to scale new product lines without rearchitecting from scratch.

Working closely with Engineering and Product leaders, you'll embed reliability into how we build. That means driving SLO definition, incident response, and postmortems, as well as building the automation that makes on-call sustainable long-term.

You'll also lead a genuine greenfield initiative: transforming our infrastructure into a fully code-defined Infrastructure as Code model, bringing consistency, repeatability, and engineering rigor to how we provision, manage, and evolve the platform.

Key Responsibilities:

- Own platform reliability end-to-end: define and enforce SLOs/SLIs, build alerting strategies, lead incident response, and drive blameless postmortems
- Kubernetes cluster operations: manage HA multi-node and cloud clusters in production, handle rolling upgrades, resource quotas, autoscaling, network policies, and pod disruption budgets
- Distributed data infrastructure: operate and scale Kafka clusters, Flink streaming jobs, and OpenSearch clusters under sustained high-throughput workloads, including rebalancing, partition management, index lifecycle policies, and shard tuning
- Temporal workflow platform: maintain and scale Temporal server deployments; work with engineering to design workflows for durability and backpressure
- Azure/AWS/GCP infrastructure: manage and optimise Azure/GCP/AWS environments including Kubernetes, networking, monitoring, vaults, and IAM; contribute to the IaC codebase (Terraform or Bicep)
- CI/CD and deployment pipelines: improve build, release, and deployment pipelines to enable safe, fast, and automated delivery across environments
- Observability: build and maintain a comprehensive observability stack (metrics, logs, traces, and dashboards) that gives engineers actionable signals rather than noise
- Security and compliance: work with the security team to harden infrastructure, enforce least-privilege policies, and support compliance requirements
- Capacity planning: proactively model growth, identify bottlenecks before they become incidents, and lead scaling initiatives for critical components
- Take on additional responsibilities and projects as needed to support the success of the team and organization

What you'll need for success:

Essential:

- 6+ years in a Site Reliability Engineering, DevOps, or Platform Engineering role in a production environment
- Deep, hands-on Kubernetes experience: cluster administration, HA configurations, networking (CNI, ingress, service mesh), and storage, not just application deployment
- Proven experience operating Apache Kafka at scale: topic management, consumer group tuning, broker operations, and monitoring lag
- Experience with Apache Flink or equivalent stream processing frameworks in production
- OpenSearch / Elasticsearch cluster operations: index management, scaling strategies, performance tuning, and snapshot management
- Azure/AWS/GCP cloud platform expertise: AKS, virtual networking, managed identities, monitoring, and cost management
- Solid understanding of distributed systems theory: CAP theorem, consensus protocols, failure modes, backpressure, and circuit breaking
- Infrastructure as Code mindset: Terraform, Helm, or equivalent
- Temporal workflow engine: deployment, operation, and scaling (or strong experience with an equivalent durable execution platform such as Cadence or Conductor)
- Strong scripting and automation skills (Bash, PHP, Python, or Go)
- Experience designing and operating high-availability architectures across multiple availability zones or regions

Nice to Have:

- Experience with Vector (from Datadog) for log and metric collection and routing pipelines
- Datadog for APM, infrastructure monitoring, log management, or dashboards
- Experience with service meshes (Istio, Linkerd, or Cilium)
- Familiarity with chaos engineering practices (Chaos Monkey, LitmusChaos, or similar)
- Contributions to open source infrastructure tooling
- Experience working in or with network/telco SaaS products
- Knowledge of eBPF-based networking or observability tools

Why you'll love life at Command|Link

Join us at CommandLink, where you'll have the opportunity to shape the future of business communication. We value the innovative spirit and seek individuals ready to bring their unique vision and expertise to a team that values bold ideas and strategic thinking. Are you ready to make an impact?

- Room to grow at a high-growth company
- An environment that celebrates ideas and innovation
- Your work will have a tangible impact
- Flexible time off
- Fun events at cool locations
- Employee referral bonuses to encourage the addition of great new people to the team

At CommandLink, we're committed to creating a fair, consistent, and efficient hiring experience. As part of our process, we use AI-assisted tools to help review and analyze applications. These tools support our recruiting team by identifying qualifications and experience that align with the requirements of each role.

AI tools are used only to assist in the evaluation process; they do not make final hiring decisions. Every application is reviewed by a member of our recruiting or hiring team before any decisions are made.

