About the job
About Our Team
Join the Future of Computing Research team at OpenAI, an applied research group within the Consumer Devices division. Our mission is to pioneer new methods and models that advance our overarching goal of developing artificial general intelligence (AGI) for the benefit of humanity.
Role Overview
As the Inference Technical Lead, you will collaborate with world-class machine learning researchers and designers to push the boundaries of model capabilities. This position is based in San Francisco, CA, with a hybrid work model of four days per week in the office; relocation assistance is available for new hires.
Key Responsibilities
Assess and select silicon platforms, including GPUs, NPUs, and specialized accelerators, for deploying OpenAI models on-device and at the edge.
Collaborate closely with research teams to co-design model architectures that satisfy real-world constraints such as latency, memory, power, and bandwidth.
Conduct system performance analyses to identify trade-offs among model design, memory hierarchy, compute throughput, and hardware capabilities.
Partner with hardware vendors and internal infrastructure teams to bring up new accelerators, ensuring efficient execution of transformer workloads.
Lead a team of engineers in implementing the low-level inference stack, encompassing kernel development and runtime systems.
Translate emerging research capabilities into scalable, production-ready inference solutions.
Ideal Candidate Profile
Proven experience in evaluating or deploying workloads on GPUs, NPUs, or other specialized accelerators.
Strong understanding of transformer model performance characteristics, including attention mechanisms, KV-cache behavior, and memory-bandwidth requirements.
Experience designing or optimizing high-performance computing systems, such as inference engines, distributed runtimes, or hardware-aware ML pipelines.
Background in building or leading teams focused on low-level performance-critical software, including CUDA kernels, compilers, or ML runtimes.
Demonstrated ability to thrive in a fast-paced, innovative environment.