[Remote] Principal Engineer, Compute Platform
Note: This is a remote position open to candidates in the USA.

Pinterest is a platform that inspires creativity and innovation, and they are seeking a Principal Engineer to lead the consolidation and modernization of their compute infrastructure. This role involves designing and building a shared compute platform to support large-scale workloads, enhancing operational efficiency, and collaborating with various teams to meet unique customer needs.

Responsibilities
- Solving the challenges of replacing isolated pools of dedicated compute resources with a very large-scale shared compute platform, shifting from machine-based designs to container-based designs
- Working with leads across various platforms, especially stateful and data platforms, to build the right features and migration paths that work for them
- Owning and driving up utilization on the shared compute platform by designing and implementing workload stacking, optimization and bin packing, safe oversubscription, etc.
- Working with multiple customers with unique requirements to make sure the platform addresses their needs and is not only a viable but a desirable solution for running their workloads
- Leading a group of engineers around design topics, execution, trade-offs, migration paths, observability, performance, and operability for the platform
- Evolving the platform towards a multi-cloud abstraction layer to enable running workloads across multiple cloud providers
- Being a role model for setting a high bar for production quality and engineering excellence in delivering a foundational technology that empowers the entire company
- Working closely with partners around capacity planning, cost visibility, fungibility of virtual machine instance types, and efficiency
- Putting special focus on the delivery of GPU resources through the platform, to enable and expedite AI workloads
- Leveraging AI tools to increase the velocity and ease of migrations, and creating self-service solutions for the customers of the platform as needed
- Helping the team apply AI to the operational aspects of running the cluster: discovering issues, investigating them, and identifying their root causes
- Expediting feature development using AI coding tools, and being a thought leader on striking the right balance between speed and safety by designing safeguards and layers of defense

Skills
- Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent experience
- 12+ years of relevant industry experience with large-scale, production distributed systems
- 5+ years of experience with Kubernetes in production
- Experience working across SWE and SRE or Production Engineering teams to deliver robust production systems
- Ability to work with cross-functional partners across multiple organizations
- Passion for automation, reducing toil, and building proper tooling for getting the job done
- Experience with running distributed data systems and migrating them to Kubernetes is highly preferred

Benefits
- The position is also eligible for equity.
- Information regarding the culture at Pinterest and benefits available for this position can be found here.
- In-Office Requirement Statement: This role will need to be in the office for in-person collaboration 1-2 times/quarter and therefore can be situated anywhere in the country.

Company Overview
Pinterest is a visual bookmarking tool for saving and discovering creative ideas. It was founded in 2010 and is headquartered in San Francisco, California, USA, with a workforce of 1001-5000 employees. Its website is https://www.pinterest.com/.