About the job
Join the dynamic world of Roblox, where millions engage daily in exploring, creating, learning, and connecting through immersive 3D experiences crafted by a global community of developers and creators.
At Roblox, we are committed to building innovative tools and platforms that empower our community to realize their creative visions. Our mission is to transform how individuals connect, regardless of geographical boundaries, and across any device. We strive to foster connections among a billion users with positivity and respect, and we are actively seeking exceptional talent to help us achieve this goal.
A career at Roblox is an opportunity to influence the future of human interaction, tackle unique technical challenges at scale, and contribute to creating safer, more respectful shared experiences for everyone.
As a vital member of our Infrastructure Foundation Hardware Engineering team, you will lead the charge in delivering a reliable, high-performance, and cost-effective infrastructure that supports the world’s play. In this specialized role, you will act as the technical lead for our GPU and AI accelerator ecosystem, managing the entire lifecycle of GPU hardware—from architectural evaluation and firmware qualification to large-scale fleet integration and performance optimization. Your expertise will ensure that Roblox's extensive rendering and machine learning workloads operate on the most efficient and stable hardware available.
Your Responsibilities Will Include:
- Architect & Prototype: Develop next-generation GPU-accelerated hardware platforms, ensuring seamless integration among high-density compute nodes, high-speed interconnects (NVLink/PCIe Gen5/6), and system firmware.
- GPU Optimization: Lead the integration, performance testing, and debugging of GPUs within our fleet, focusing on hardware-level optimizations, driver tuning, and thermal/power management.
- Validation & Certification: Create and implement comprehensive evaluation and stress-testing strategies for GPU-centric server platforms to meet Roblox's unique requirements for real-time rendering and low-latency AI inference.
- Firmware & Systems: Spearhead firmware qualification (BIOS/BMC) and troubleshooting, and implement automation systems that monitor GPU health and manage firmware updates.
- Vendor Collaboration: Collaborate with technology partners to enhance our GPU and AI infrastructure.

