Software Engineer - Hardware and System Bring-up for Industrial Compute

OpenAI, San Francisco
On-site, Full-time

Experience Level

Entry Level

Qualifications

  • Proficiency in systems engineering and hardware bring-up processes.

  • Experience with Kubernetes and container orchestration.

  • Familiarity with Infrastructure as Code tools such as Terraform and Chef.

  • Strong problem-solving skills and the ability to debug complex systems.

  • A collaborative mindset with the ability to work across teams effectively.

About the job

About the Team
The Scaling team at OpenAI forms the architectural and engineering foundation of our infrastructure. We innovate and implement advanced systems that facilitate the deployment and operation of next-generation AI models. Our responsibilities encompass system software, networking, platform architecture, fleet-level monitoring, and performance enhancement.

About the Role
We are seeking a skilled software engineer proficient in transforming early-stage, sometimes chaotic, pre-production hardware into stable, operational systems. You will be pivotal in bootstrapping, imaging, integrating with the Kubernetes control plane, and ensuring observability. Your role will bridge early hardware bring-up, provisioning automation, fleet and cluster management, and integration with lab or cloud services—effectively converting new SKUs into usable capacity for our internal stakeholders.

Key Responsibilities

  • Manage the comprehensive bring-up and bootstrapping process for new systems and compute nodes, transitioning from bare metal or early access in lab or production/cloud settings to schedulable fleet capacity, including image building, user-data/configuration, cluster joining, and readiness gates.

  • Develop and maintain golden image and provisioning workflows across lab and production environments, working with partner-provided base images while ensuring OS/version compatibility.

  • Collaborate with partner teams to integrate nodes into our fleet infrastructure and Infrastructure as Code (IaC) pipelines (Terraform, Chef, etc.), guaranteeing that cloud resources align seamlessly with our internal lifecycle expectations.

  • Work closely with scheduling and platform owners to ensure new hardware is accessible and properly scheduled, addressing pool definitions, network connectivity, routing, admission controls, and platform-specific requirements.

  • Ensure registration and inventory accuracy, providing hands-on support to track nodes and their metadata from end to end.

  • Partner with teams to establish baseline health and telemetry monitoring for bring-up, including critical health signals, pass/fail assessments, and automated reporting for initial ramp decisions.

  • Troubleshoot issues across various layers, including PXE/boot-loader, UEFI/BIOS, BMC, OS bring-up, NIC/network accessibility, kubelet/control-plane connectivity, storage limitations, and early lab/rack scenarios.

About OpenAI

OpenAI is dedicated to advancing digital intelligence in a way that is most likely to benefit humanity as a whole. We focus on creating safe and beneficial AI technologies through innovative research and development.
