Reflection AI

Technical Staff Member - Compute Platform

On-site · Full-time




Qualifications

  • Proven systems-level engineering experience with a focus on cluster-wide behavior and maintenance.

  • Strong programming skills, with an emphasis on systems or GPU infrastructure.

  • In-depth knowledge of GPU hardware beyond standard Kubernetes, including familiarity with NCCL.

  • Alignment with a K8s-first architecture.

  • Expertise in cloud storage, specifically managing high-performance storage platforms (such as VAST) across multiple data centers, connecting storage environments, and handling large datasets and checkpointing.

About the job

Our Mission

At Reflection AI, our mission is to develop open superintelligence and make it available to everyone.

We are creating open-weight models that cater to individuals, agents, enterprises, and even nations. Our skilled team of AI researchers and innovators hails from leading organizations such as DeepMind, OpenAI, Google Brain, Meta, Character.AI, and Anthropic.

About the Role

The Compute Platform team at Reflection AI focuses on ensuring our compute layer is robust and highly available. Our K8s-based platform spans multiple neo-clouds, tackling complex systems challenges related to multi-cloud scheduling, node health, and performance debugging. You will collaborate closely with our training teams to design strategies for fault tolerance, health checks, and remediation processes.

Key Responsibilities

  • Cluster Management: Develop and maintain tools for automatic remediation, topology-aware scheduling, capacity planning, and expedited hardware debugging.

  • Platform Engineering: Design and refine our cluster management stack to efficiently handle workloads across extensive multi-GPU fleets.

  • Monitoring & Observability: Build comprehensive cluster-wide monitoring, with an emphasis on durability and active performance benchmarking.

  • Roadmap Execution: Prepare the infrastructure for next-gen GPU deployments and larger cluster sizes. In the long run, you will contribute to managing multi-cloud storage, petabyte-scale data replication, and optimizing GPU-to-GPU network performance.
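The automatic-remediation work described in the bullets above can be sketched as a policy that maps per-node health signals to an action. Everything below (the `NodeHealth` fields, the bandwidth threshold, the action names) is a hypothetical illustration of the pattern, not Reflection AI's actual tooling or schema:

```python
from dataclasses import dataclass

# Hypothetical health signals for one GPU node; field names and
# thresholds are illustrative, not an actual production schema.
@dataclass
class NodeHealth:
    xid_errors: int        # fatal NVIDIA XID error count since last reset
    nccl_bw_gbps: float    # measured NCCL all-reduce bus bandwidth
    ecc_dbe: int           # uncorrectable (double-bit) ECC error count

MIN_NCCL_BW_GBPS = 300.0   # example floor for a healthy NVLink-connected node

def remediation_action(h: NodeHealth) -> str:
    """Map health signals to an action: 'ok', 'cordon', or 'replace'."""
    if h.ecc_dbe > 0:
        return "replace"   # uncorrectable memory errors: drain and swap hardware
    if h.xid_errors > 0:
        return "cordon"    # keep workloads off until deeper diagnostics run
    if h.nccl_bw_gbps < MIN_NCCL_BW_GBPS:
        return "cordon"    # a degraded interconnect drags down the whole job
    return "ok"
```

In practice a policy like this would run inside a controller that cordons the node through the Kubernetes API, reschedules the affected training job, and escalates hardware replacement to the cloud provider.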

About Reflection AI

Reflection AI is dedicated to building an open superintelligence platform that is accessible to everyone. We are at the forefront of AI research and development, driven by a diverse and talented team with backgrounds from top tech companies.
