[Remote] Forward Deployed Engineer: AI + HPC
Note: This is a remote position open to candidates in the USA.

Cedana is a company focused on maximizing AI and HPC cluster utilization and reliability. As a Forward Deployed Engineer, you will lead technical engagements with customers, deploying Cedana's solutions in various environments and optimizing platform performance.

Responsibilities

- Engineer solutions at client sites: Lead customer integrations. Install, configure, and deploy Cedana into SLURM, Kubernetes, and Dynamo environments.
- Drive product innovation from the field: Identify technical gaps while embedded with clients, then provide product feedback for new capabilities that become core product features.
- Measure and optimize platform performance: Measure reliability, throughput, and performance using our internal tools. Design and implement policy-based migration automations to optimize reliability, throughput, and performance.
- Own critical deployments: Ensure our platform performs reliably for clients' critical operations, debugging issues across the full stack. Debug install issues against unfamiliar customer infrastructure, and escalate to engineering when necessary.
- Improve scalability: Build and own the internal installation playbook so that the second customer in each segment is onboarded faster than the first.
- Respect our customers: Understand how to make their lives easier and minimize their time and overhead.

Skills

- Team management experience. Requires strong project and time management skills, delivering milestones on time, and effective
- 3-10 years of software engineering experience with a track record of configuring and managing SLURM deployments.
- A multi-month enterprise or research deployment you led end-to-end, from scoping through signoff. You write effective status updates to keep your team updated and on schedule.
- Production experience standing up SLURM in a customer or research environment. You've configured slurmctld, slurmdbd, accounting, cgroup integration, and GPU resource selection.
- Strong Linux fundamentals: systemd, cgroups v2, namespaces, networking, filesystems, firewalls, kernel module loading, and PAM session modules. You can read strace and dmesg output and form a hypothesis.
- Experience with Kubernetes operations, including operators, CRDs, CNIs, device plugins, and node-level debugging. You've debugged a controller in production even if you haven't written one from scratch.
- Experience on an HPC integrator field team.
- Client-facing technical experience working directly with customers.
- Background in national lab user services or university research computing.
- You've developed SLURM plug-ins and understand their architecture and how they fit into the overall platform.
- Familiarity with CRIU, container runtimes, GPU driver internals, and distributed training stacks.
- Hands-on experience with NVIDIA Dynamo, Determined, Ray, Kueue, KServe, or comparable AI orchestration.
- Contributions to open-source schedulers or job systems (SLURM, Flux, Torque, PBS).
- A passion for debugging a weird cgroup issue at 11pm just as much as writing a clean install playbook the next morning.

Benefits

- 100% covered medical, dental, and vision insurance for employees and families
- Unlimited PTO policy
- 401(k) plan

Company Overview

Cedana is VMware for GPUs. We enable enterprises to orchestrate and operationalize intelligence precisely, reliably, and efficiently. The company was founded in 2023 and is headquartered in New York, New York, USA, with a workforce of 2-10 employees. Its website is https://www.cedana.ai.