[Remote] Platform Engineering Manager
Note: This is a remote position open to candidates in the USA.

Tango is a company focused on helping businesses make smarter decisions through technology and data. They are looking for a Platform Engineering Manager to lead the development and operation of their AI-native Internal Developer Platform, ensuring efficient cloud infrastructure and driving modernization efforts across the organization.

Responsibilities

Own and execute the Platform roadmap: compute, networking, identity, observability, shared services, and AI/ML tooling across AWS and Azure
Lead cloud modernization against the AWS and Azure Well-Architected Frameworks across all five pillars: operational excellence, security, reliability, performance efficiency, and cost optimization
Define golden paths - standardized self-service workflows for service scaffolding, DB provisioning, environment spin-up, and AI workload deployment - with escape hatches for edge cases
Own multi-cloud strategy; ensure consistent IAM, networking, and FinOps governance across providers
Drive OpenTofu/Ansible as source of truth for all infrastructure; enforce GitOps and policy-as-code for governance, auditability, and security
Build and mature CI/CD pipelines (GitHub Actions, ArgoCD) to maximize deployment frequency, reduce lead time, and enable zero-ticket self-service provisioning
Own org-wide observability: metrics, logs, traces, and alerting - extended to AI/LLM signals (token usage, model latency, inference cost, agent task completion rates)
Operate a centralized observability platform (Datadog/Signoz, OpenTelemetry, Grafana/Prometheus/Loki, or equivalent) delivered via golden paths; define SLIs/SLOs as onboarding defaults for all services
Ensure full-stack coverage across infrastructure, Kubernetes, APM, distributed tracing, AI pipelines, and cost anomaly detection
Build and operate a self-service shared services catalog: secrets management, API gateways, model registries, and LLM gateways
Rationalize duplicative per-team infrastructure; maintain shared services to production SLA standards with clear ownership and consistent security controls
Own GPU/accelerated compute, model serving, vector databases, RAG pipelines, and LLM API gateway management (AWS Bedrock, Azure OpenAI, Anthropic)
Build AI golden paths for self-service model deployment and LLM integration; design agentic infrastructure including orchestration runtimes, tool registries, memory/state services, and human-in-the-loop workflows
Establish governance, cost controls, prompt injection guardrails, and model access policies for AI API usage and inference spend
Collaborate on the migration program: partner with peer managers to plan and execute structured workload migrations onto the platform with hands-on support - not just documentation
Define onboarding playbooks covering golden paths, shared services, observability setup, CI/CD cutover, and AI capability onboarding; track and report adoption metrics to leadership
Identify and remove migration blockers - technical gaps, missing services, or organizational friction - and feed them into the platform roadmap
Build a self-service developer portal (Backstage, GitHub, or equivalent) with service catalogs, golden paths, and AI/agentic workflow templates; track DORA metrics and developer experience KPIs
Hire, develop, and retain high-performing platform engineers; build AI fluency across the team and foster a platform-as-a-product culture with feedback loops, OKRs, and iterative roadmapping
Lead architecture reviews; make pragmatic build-vs-buy decisions; partner with security and compliance on governance priorities
Embed secure-by-default guardrails: IaC scanning, RBAC, secrets management, container hardening, and AI-specific controls (prompt injection defense, model access governance, data residency)
Own cloud cost optimization across AWS and Azure, including AI inference spend; maintain SOC 2/ISO 27001 compliance posture

Skills

8+ years in infrastructure, DevOps, or platform engineering; 2+ years in engineering management
Cloud: Deep hands-on AWS and Azure expertise, including multi-cloud architecture, IAM, networking, compute, and AI/ML services (SageMaker, Bedrock, Azure OpenAI, Azure ML)
IaC & CI/CD: Terraform required; GitOps, policy-as-code; GitHub Actions / ArgoCD at scale
IDP: Proven track record building an IDP with self-service workflows, golden paths, and a developer portal (Backstage, GitHub, or equivalent)
Observability: OpenTelemetry, Datadog, Signoz, or Prometheus/Grafana at scale; SLI/SLO definition and enforcement
Shared Services: Built and operated multi-team shared service catalogs with production-grade SLAs
Adoption: Led structured platform migration and adoption programs in partnership with peer engineering leaders
Kubernetes & WAF: Kubernetes cluster management, Helm, RBAC, service mesh; AWS and Azure Well-Architected Framework reviews
Strong cross-functional influencing skills; comfortable as a peer to engineering managers and product leaders
AWS SA Pro / Azure Expert / CKA/CKAD
Python, Go, or Bash

Benefits

Health, dental, and vision insurance
A 401(k) plan with company match
Generous paid time off
Flexible Work Environment: Whether remote, hybrid, or in-office, we support work arrangements that promote productivity and balance

Company Overview

Tango builds software that unites real estate, lease accounting, and facilities management into a single platform. It was founded in 2008 and is headquartered in Dallas, Texas, USA, with a workforce of 201-500 employees. Its website is https://tangoanalytics.com/.