Software Engineer, Fleet Management

OpenAI, San Francisco
Hybrid, Full-time

Qualifications

Proven experience in software engineering within large infrastructure environments; comprehensive understanding of cluster-level systems (Kubernetes, CI/CD, Terraform, cloud providers); deep knowledge of server-level systems (Linux, containerization, Chef, firmware management); passion for optimizing and enhancing compute fleet performance; adaptable to dynamic work environments.

About the job

Join the Fleet team at OpenAI, where we empower groundbreaking research and innovative product development by maintaining a robust computing environment. Our team manages extensive systems that encompass data centers, GPUs, and networking, ensuring peak performance, high availability, and efficiency. Our mission is to facilitate the seamless operation of OpenAI's models at scale, supporting both internal research initiatives and external products such as ChatGPT, while prioritizing safety, reliability, and responsible AI deployment over unchecked expansion.

About the Position

As a Software Engineer specializing in Operating Systems & Orchestration, you will play a crucial role in developing systems that manage our hardware, configurations, vendors, and the teams utilizing our infrastructure. Your work will involve designing and implementing solutions that fuse individual nodes and servers into cohesive clusters, directly enhancing the AI research experience. This role is located in San Francisco, CA, and follows a hybrid work model, requiring three days in the office each week. We also provide relocation assistance for new hires.

Key Responsibilities:

  • Architect and develop systems to manage extensive cloud and bare-metal infrastructures at scale.

  • Create tools that correlate low-level hardware metrics with high-level job scheduling and cluster management algorithms.

  • Utilize Large Language Models (LLMs) to streamline vendor operations and enhance infrastructure workflows.

  • Automate infrastructure processes to minimize repetitive tasks and bolster system reliability.

  • Work collaboratively with hardware, infrastructure, and research teams to ensure smooth integration across all components.

  • Continuously refine tools, automation, processes, and documentation to boost operational effectiveness.

Ideal Candidate Profile:

  • Demonstrates strong software engineering capabilities with experience in large-scale infrastructure environments.

  • Possesses extensive knowledge of cluster-level systems (e.g., Kubernetes, CI/CD pipelines, Terraform, cloud platforms).

  • Has deep expertise in server-level systems (e.g., systemd, containerization, Chef, Linux kernels, firmware management, host routing).

  • Is passionate about enhancing the performance and reliability of large compute fleets.

  • Thrives in fast-paced environments and is eager to tackle complex challenges.

About OpenAI

OpenAI is at the forefront of AI research, dedicated to advancing technology responsibly and ethically. Our Fleet team ensures that our computing environment meets the demands of our cutting-edge research and innovative product development, fostering an atmosphere where creativity and collaboration thrive. We prioritize a balanced approach to growth, focusing on safety and reliability in AI deployment.
