About the job
About Voltai
At Voltai, we are pioneering world models and agents that can learn, evaluate, plan, experiment, and interact with the physical world. Our journey begins with hardware, specifically electronics systems and semiconductors, where we harness AI to design and innovate beyond human cognitive capabilities.
About the Team
Our team includes former Stanford professors, Stanford AI Lab (SAIL) researchers, and medalists from competitions such as the International Physics Olympiad (IPhO) and the International Olympiad in Informatics (IOI). We are backed by top-tier Silicon Valley investors and industry leaders, including CEOs and presidents from Google, AMD, Broadcom, and Marvell.
About the Role
As a Research Engineer specializing in CUDA kernel engineering, you will design, integrate, and optimize cutting-edge CUDA kernels that power our AI models, accelerating semiconductor design and verification. Your work will support large-scale model training, inference, and reinforcement learning systems that reason about circuit layouts, generate and validate RTL, and optimize chip architectures, all while efficiently utilizing thousands of GPUs.
You will build tools, performance benchmarks, and integration layers that maximize GPU utilization for compute-intensive workloads in AI-driven hardware design. Working closely with fellow researchers and engineers, you will help establish Voltai as the leading organization in AI and semiconductor research. Your kernels and tools will also be released as contributions to the open-source AI and HPC ecosystems.
You might excel in this role if you have experience in:
Writing and optimizing CUDA kernels for large-scale AI applications (e.g., attention mechanisms, routing, graph-based operations, and physics-inspired operators).
Profiling and improving GPU performance for specialized compute-bound or memory-bound workloads.
Integrating custom kernels into state-of-the-art training and inference frameworks (including PyTorch, Megatron, vLLM, and TorchTitan).
Working with the latest NVIDIA hardware and software stacks (Hopper, Blackwell, NVLink, NCCL, Triton).
Creating GPU-accelerated primitives for graph reasoning, symbolic computation, or hardware simulation tasks.