[Remote] Cloud Engineer - Senior (Observability - Datadog)

Remote, Full-time
Note: This is a remote position open to candidates in the USA. Leidos supports various government contracts and is seeking a Senior Cloud Engineer to enhance its enterprise observability platform. This role involves engineering and operating observability solutions across hybrid cloud environments, focusing on performance, reliability, and capacity management.

Responsibilities

- Engineer and operate the enterprise observability stack (Datadog or comparable), including metrics, logs, traces, APM, RUM, synthetic monitoring, and network performance monitoring
- Build, tune, and maintain dashboards, monitors, SLOs/SLIs, and alerting policies that produce actionable signal and minimize noise
- Instrument services, infrastructure, and containerized workloads using agents, OpenTelemetry, and language-specific APM tracers (Java, .NET, Python, Node.js, Go) with consistent span tagging, W3C TraceContext propagation, and unified service tagging across the estate
- Develop and maintain integrations between observability platforms, ITSM (ServiceNow), CI/CD pipelines, and on-call/paging workflows
- Define and enforce a unified tagging standard (environment, service, version, team/ownership, data classification, cost center) across metrics, logs, and traces; manage tag cardinality, governance, and custom business tags to keep telemetry queryable, attributable, and cost-controlled
- Design and deliver monitoring coverage for Microsoft Azure and AWS workloads, including PaaS services, serverless, networking, identity, managed databases, and cloud-native data services
- Engineer managed database observability across AWS RDS/Aurora (MySQL, PostgreSQL, SQL Server, Oracle), Azure SQL/PostgreSQL/MySQL, and NoSQL/cache services (DynamoDB, Cosmos DB, ElastiCache/Redis), including query-level performance analytics, slow-query and execution-plan capture, lock/deadlock/wait analysis, connection pool and session monitoring, replication lag, storage/IOPS saturation, and backup/HA health, correlating database spans with upstream APM traces
- Engineer container-platform observability for OpenShift/Kubernetes, covering cluster health, control plane, nodes, pods, namespaces, ingress, service mesh, and workload APM
- Build standardized, reusable monitoring modules deployable via infrastructure-as-code (Terraform, Bicep, ARM) and CI/CD
- Support hybrid visibility across on-premises, cloud, and containerized workloads with correlated telemetry
- Lead data-driven investigation and resolution of complex performance, latency, saturation, and reliability issues across the estate
- Use APM distributed traces, service/dependency maps, continuous code profiling (CPU, memory, lock contention), database query analytics, exception/error tracking, and RUM-to-backend trace correlation to isolate bottlenecks in applications, platforms, middleware, and downstream dependencies
- Partner with engineering teams to define and implement remediation, tuning, and architectural improvements based on telemetry evidence
- Define and implement trace-based SLOs, deployment tracking, and change-correlation workflows so performance regressions are detected and attributed to specific releases, versions, or configuration changes
- Provide senior technical leadership during major incidents, delivering impact analysis, contributing to root-cause analysis, and owning post-incident observability gaps
- Analyze operational telemetry and trend data to identify capacity risks, recurring constraints, and opportunities for efficiency
- Build and maintain capacity and performance dashboards and reports that communicate posture, risk, and recommendations to technical and leadership stakeholders
- Define capacity thresholds, alert baselines, and trigger points for scaling, technology refresh, and resource reallocation
- Drive continuous improvement of observability coverage, alert quality, runbook linkage, and operational maturity aligned to SEC SLA/KPI expectations

Skills

- Citizenship/Work Authorization: Must meet contract requirements
- Clearance: Ability to obtain and maintain SEC Public Trust (or higher if required)
- Minimum 8 years of experience in IT infrastructure or platform engineering roles, including 5+ years focused on observability, performance engineering, or site reliability engineering
- Demonstrated experience engineering and operating an enterprise observability platform (Datadog strongly preferred; equivalent experience with Dynatrace, New Relic, Splunk Observability, or Grafana/Prometheus stacks considered)
- Proven experience building APM and distributed tracing coverage for production multi-tier applications (including language-specific tracer deployment, custom instrumentation of business transactions, service/dependency mapping, continuous profiling, and RUM-to-backend trace correlation) across cloud and containerized workloads
- Proven experience leading complex production performance and reliability problem-solving from telemetry to remediation
- Hands-on experience monitoring Kubernetes or OpenShift clusters and containerized workloads in production
- Enterprise observability platforms (Datadog or comparable): metrics, logs, traces, APM, RUM, synthetic, NPM
- Instrumentation with OpenTelemetry, Datadog agents/SDKs, and language-specific APM tracers (Java, .NET, Python, Node.js, Go), including custom spans, trace sampling strategies, W3C TraceContext propagation, and continuous profiling
- Microsoft Azure and AWS monitoring services and integrations (Azure Monitor, Log Analytics, CloudWatch, AWS X-Ray)
- Container and Kubernetes/OpenShift observability, including cluster, workload, and service mesh telemetry
- Cloud database monitoring: AWS RDS/Aurora (including Performance Insights), Azure SQL/PostgreSQL/MySQL (Query Performance Insight), and NoSQL/cache (DynamoDB, Cosmos DB, ElastiCache/Redis); query-level performance tuning, execution-plan analysis, and Datadog DBM or equivalent deep database APM
- Infrastructure-as-code for monitoring (Terraform, Bicep, ARM) and CI/CD-driven monitor/dashboard deployment
- APM and distributed tracing: service/dependency maps, trace analytics, RUM-to-backend correlation, exception/error tracking, deployment tracking, and trace-based SLOs
- Unified tagging strategy and cardinality governance across metrics/logs/traces (environment, service, version, ownership, data classification, cost center), including custom tag enrichment and tag-driven access/cost controls
- Alert engineering, SLO/SLI design, error budget management, and alert-noise reduction
- Performance engineering, capacity analysis, and telemetry-driven root-cause analysis
- Integration of observability with ITSM (ServiceNow) and on-call/paging workflows
- Experience supporting federal agency IT environments under FISMA/FedRAMP/NIST-aligned security and compliance requirements
- Datadog certification (Fundamentals and/or Administrator) or comparable enterprise observability certification
- Hands-on experience with Red Hat OpenShift Virtualization (CNV/KubeVirt) or other KubeVirt-based container virtualization observability
- Experience with eBPF-based observability tooling and service mesh telemetry (Istio, Linkerd)
- Experience implementing SLOs and error budgets at enterprise scale and integrating them into operational governance
- Experience with cost-aware observability practices, including telemetry volume optimization and retention tuning
- Experience integrating observability outputs with executive reporting, SLA/KPI dashboards, and capacity forecasting
- ITIL 4 Foundation
- AWS Certified Solutions Architect - Associate (or higher)
- Microsoft Certified: Azure Administrator Associate (or higher)
- Red Hat Certified Specialist in OpenShift Administration (or equivalent)
- HashiCorp Terraform Associate

Company Overview

Leidos is an industry and technology leader serving government and commercial customers with smarter, more efficient digital and mission innovations. It was founded in 2002 and is headquartered in Bedford, Massachusetts, USA, with a workforce of more than 10,000 employees. Its website is http://www.revealimaging.com.

