About the job
About the Team
The Scaling team at OpenAI forms the architectural and engineering foundation of our infrastructure. We innovate and implement advanced systems that facilitate the deployment and operation of next-generation AI models. Our responsibilities encompass system software, networking, platform architecture, fleet-level monitoring, and performance enhancement.
About the Role
We are seeking a skilled software engineer who can turn early-stage, sometimes chaotic, pre-production hardware into stable, operational systems. You will play a central part in bootstrapping and imaging systems, integrating them with the Kubernetes control plane, and ensuring observability. Your role will bridge early hardware bring-up, provisioning automation, fleet and cluster management, and integration with lab or cloud services, effectively converting new SKUs into usable capacity for our internal stakeholders.
Key Responsibilities
Manage the comprehensive bring-up and bootstrapping process for new systems and compute nodes, transitioning from bare metal or early access in lab or production/cloud settings to schedulable fleet capacity, including image building, user-data/configuration, cluster joining, and readiness gates.
Develop and maintain high-quality golden image and provisioning workflows across lab and production environments, working with partner-provided base images while ensuring OS/version compatibility.
Collaborate with partner teams to integrate nodes into our fleet infrastructure and Infrastructure as Code (IaC) pipelines (Terraform, Chef, etc.), so that cloud resources meet our internal lifecycle expectations.
Work closely with scheduling and platform owners to ensure new hardware is accessible and properly scheduled, addressing pool definitions, network connectivity, routing, admission controls, and platform-specific requirements.
Ensure registration and inventory accuracy, providing hands-on support to track nodes and their metadata from end to end.
Partner with teams to establish baseline health and telemetry monitoring for bring-up, including critical health signals, pass/fail assessments, and automated reporting for initial ramp decisions.
Troubleshoot issues across various layers, including PXE/boot-loader, UEFI/BIOS, BMC, OS bring-up, NIC/network accessibility, kubelet/control-plane connectivity, storage limitations, and early lab/rack scenarios.

