About Our Team
The Compute Infrastructure Team at OpenAI operates a large fleet of GPUs and compute clusters that power the models behind ChatGPT and our API, and supports the training demands of our upcoming models. We specialize in running a state-of-the-art GPU fleet, providing a unified platform on which OpenAI teams can run production Applied AI and research training workloads.
Our mission is to harness the potential of AI responsibly, ensuring its benefits are shared while prioritizing safety over unrestrained growth.
Role Overview
As a Technical Program Manager on our engineer-centric TPM team, you will own the end-to-end delivery of large-scale GPU clusters, working closely with engineers to bring up clusters across external providers and partners. You will manage a portfolio spanning hardware, networking, power, and cooling: driving execution, managing risk, and maintaining clear alignment from operational teams through leadership to deliver scalable, production-ready capacity.
This position is based in San Francisco, CA, with a hybrid schedule of three days per week in the office. We offer relocation assistance to new employees.
Key Responsibilities
- Own end-to-end delivery of new Compute SKUs and large-scale GPU clusters across our external partner network, and support capacity planning for both training and inference workloads.
- Drive multi-threaded program initiatives involving hardware, networking, power, and cooling—taking ownership of plans, interdependencies, and critical pathways.
- Collaborate with chip providers to mitigate risks associated with long-term onboarding to new hardware platforms, engaging with teams across kernels, communications, hardware, and scheduling.
- Develop and implement program mechanisms such as roadmaps, milestones, risk registers, and runbooks to ensure predictable delivery at scale.
- Work alongside engineering teams to improve cluster turn-up reliability, repeatability, and automation, reducing time-to-serve for new capacity.
- Facilitate cross-functional readiness involving security, finance, operations, and product/research stakeholders to ensure the launch of production-ready compute capabilities.
- Manage integrations and transitions among teams and partners to guarantee seamless execution, transparent communication, and prompt issue resolution.
- Identify operational bottlenecks and systemic deficiencies, driving sustainable improvements across tooling, processes, and partner interactions.

