About Our Team
The Inference team at OpenAI is dedicated to translating our cutting-edge research into accessible, transformative technology for consumers, enterprises, and developers. By leveraging our advanced AI models, we enable users to reach new levels of innovation and productivity. Our primary focus is improving model inference efficiency and accelerating research progress through optimized inference capabilities.
About the Role
We are seeking talented engineers to expand and optimize OpenAI's inference infrastructure, specifically targeting emerging GPU platforms. This role encompasses a wide range of responsibilities, from low-level kernel optimization to high-level distributed execution. You will collaborate closely with our research, infrastructure, and performance teams to ensure seamless operation of our largest models on cutting-edge hardware.
This position offers a unique opportunity to influence and advance OpenAI’s multi-platform inference capabilities, with a strong emphasis on optimizing performance for AMD accelerators.
Your Responsibilities Include:
Overseeing the deployment, accuracy, and performance of the OpenAI inference stack on AMD hardware.
Integrating our internal model-serving infrastructure (e.g., vLLM, Triton) into diverse GPU-backed systems.
Debugging and optimizing distributed inference workloads across memory, network, and compute layers.
Validating the correctness, performance, and scalability of model execution on large GPU clusters.
Collaborating with partner teams to design and optimize high-performance GPU kernels for accelerators using HIP, Triton, or other performance-focused frameworks.
Working with partner teams to develop, integrate, and fine-tune collective communication libraries (e.g., RCCL) to parallelize model execution across multiple GPUs.
Ideal Candidates Will:
Possess experience in writing or porting GPU kernels using HIP, CUDA, or Triton, with a strong focus on low-level performance.
Be familiar with collective communication libraries such as NCCL/RCCL and understand their role in high-throughput model serving.
Have experience with distributed inference systems and be adept at scaling models across multiple accelerators.
Enjoy tackling end-to-end performance challenges across hardware, system libraries, and orchestration layers.
Be eager to join a dynamic, agile team focused on building innovative infrastructure from the ground up.