About the job
Our Mission
At Reflection AI, our mission is to develop open superintelligence and make it available to everyone.
We are creating open-weight models that cater to individuals, agents, enterprises, and even nations. Our skilled team of AI researchers and innovators hails from leading organizations such as DeepMind, OpenAI, Google Brain, Meta, Character.AI, and Anthropic.
About the Role
The Compute Platform team at Reflection AI focuses on ensuring our compute layer is robust and highly available. Our K8s-based platform spans multiple neo-clouds, tackling complex systems challenges related to multi-cloud scheduling, node health, and performance debugging. You will collaborate closely with our training teams to design strategies for fault tolerance, health checks, and remediation processes.
Key Responsibilities
Cluster Management: Develop and maintain tools for automatic remediation, topology-aware scheduling, capacity planning, and expedited hardware debugging.
Platform Engineering: Design and refine our cluster management stack to efficiently handle workloads across extensive multi-GPU fleets.
Monitoring & Observability: Build comprehensive, cluster-wide monitoring with an emphasis on reliability and proactive performance benchmarking.
Roadmap Execution: Prepare the infrastructure for next-generation GPU deployments and larger cluster sizes. Over the longer term, you will contribute to managing multi-cloud storage, petabyte-scale data replication, and GPU-to-GPU network performance optimization.

