[Remote] Senior Platform Telemetry Engineer
Note: This is a remote position open to candidates in the USA.

NVIDIA is a leading technology company known for its innovations in GPU technology and AI computing. They are seeking a Senior Platform Telemetry Engineer to design and implement fleet management solutions for scaling AI infrastructure, while collaborating with customers and teams to ensure effective product development and delivery.

Responsibilities
- Drive next-generation fleet management solutions for scaling AI infrastructure using GPUs and Grace solutions from NVIDIA. Work with customers, product management, and other architects to narrow down requirements for implementation and ensure speed-of-light product development.
- Bring clarity to the architecture for fleet health monitoring and fault-remediation solutions at scale. Work with customers and other architects to understand their health monitoring requirements, making the best use of available in-band and out-of-band capabilities. Detail the architecture and build POCs to validate it.
- Educate customers about the product architecture and take their feedback to make necessary changes. Write architecture specs and design documents, and own end-to-end delivery of the product by working across teams. Review the code produced from the architecture specs.
- Ensure the product is properly tested by working with the development team to enhance unit testing and put a proper test plan in place.
- Drive product life cycles with QA teams to productize the code, and be responsible as a product owner.
- Articulate requirements in Jira and bug management tools, and work out an end-to-end execution plan in collaboration with other managers.
- Contribute to all phases of product development, from product definition, architecture, and design through implementation, debugging, testing, and early customer support.

Skills
- BS, MS, or PhD in EE/CS or a related field (or equivalent experience)
- 5+ years of hands-on coding experience
- Strong knowledge of time series databases like InfluxDB & Prometheus
- Strong knowledge of building and consuming REST APIs (Redfish is a big plus)
- Strong knowledge of telemetry visualization solutions like Grafana & Influx
- Strong knowledge of firmware architecture, including optimizing firmware for low-latency APIs
- Strong knowledge of analyzing algorithms for time & space complexity and projecting system resource requirements
- Proven record of delivering scalable solutions
- Strong and demonstrable skill in C/C++ and Python
- Programming and debugging experience on server platforms
- Experience with SCM (e.g., Git, Perforce) and project management tools like Jira
- Excellent written and oral communication skills
- Excellent work ethic
- Great sense of teamwork
- Love of producing quality work and commitment to finishing your tasks every single day
- Self-starter who loves to find creative solutions to complicated problems and is hands-on with coding
- Experience building telemetry collection & analysis engines
- Experience with Redfish
- Experience with notification systems like PagerDuty
- Active Open Compute (OCP) and DMTF contributor in relevant areas
- Hands-on experience with x86 or ARM system architecture
- Familiarity with Confidential Compute
- Experience with ML and multi-variable optimization techniques

Education Requirements
- BS, MS, or PhD in EE/CS or a related field (or equivalent experience)

Benefits
- Equity
- Benefits

Company Overview
NVIDIA is a computing platform company operating at the intersection of graphics, HPC, and AI. It was founded in 1993 and is headquartered in Santa Clara, California, USA, with a workforce of 10,001+ employees. Its website is https://www.nvidia.com.

Company H1B Sponsorship
NVIDIA has a track record of offering H1B sponsorships, with 448 in 2026, 1872 in 2025, 1354 in 2024, 976 in 2023, 835 in 2022, 601 in 2021, and 529 in 2020. Please note that this does not guarantee sponsorship for this specific role.