About xAI
At xAI, we are on a mission to develop advanced AI systems that can deeply understand the universe and support humanity in its quest for knowledge. Our team is small yet highly driven, with an emphasis on engineering excellence and innovation. We seek individuals who relish challenges and thrive on curiosity, contributing directly to our mission in a collaborative, flat organizational structure. Initiative and a commitment to delivering outstanding results are paramount. Strong communication skills are essential so that team members can convey knowledge effectively and precisely.
About the Role
xAI has pioneered the creation of a 100k-GPU cluster on an Ethernet network, achieving this feat twice in just 92 days. We are seeking an experienced engineer proficient in RoCEv2 to scale our operations while improving performance and reliability.
Our rapid development pace with cutting-edge hardware is crucial to deepening our understanding of the universe. To achieve our next major breakthrough, we must take ownership of our network performance and availability, optimizing both for model training and customer inference queries. Your role will predominantly involve diving deep into NCCL, building metrics dashboards, and tuning configurations to maximize performance. You will play a pivotal role in designing the next generation of our back-end and front-end networks, enabling our GPU infrastructure to expand seamlessly with minimal engineering intervention.
Expect considerable travel to Memphis for capacity expansion, participation in team on-call rotations, and assistance with scaling and maintenance initiatives. This position promises to be both dynamic and rewarding.

