About the job
Our Vision
At Reflection AI, we are dedicated to the mission of creating open superintelligence and ensuring it is accessible to everyone. Our team is committed to developing open-weight models that cater to individuals, agents, enterprises, and even nation-states. Our diverse group of AI researchers and innovators hails from esteemed organizations such as DeepMind, OpenAI, Google Brain, Meta, Character.AI, and Anthropic.
Foundational Objectives
Vision:
We aim to establish and manage a company-wide foundational platform that enhances every team’s productivity by delivering dependable, scalable developer infrastructure, site reliability engineering capabilities, and high-throughput data ingestion tools, so that Reflection AI can move quickly as it expands.
Team Responsibilities
Our team is responsible for constructing and managing essential shared services that fuel our research, training, and production environments. These systems form the backbone that supports various teams in model development, deployment, and evaluation, integrating data, compute, and workflow management while facilitating rapid experimentation and robust production systems.
Design and manage shared services that multiple teams utilize across research and production workflows.
Establish and maintain reliability targets through Service Level Indicators (SLIs), Service Level Objectives (SLOs), and effective on-call practices.
Ensure operational readiness through comprehensive runbooks, incident playbooks, and capacity planning.
Guarantee correctness and performance under load, addressing issues like consistency, tail latency, and failure modes.
Create APIs, SDKs, and internal platforms that support high-velocity experimentation and iteration.
Minimize operational burden through enhanced tooling, standardization, and scalable platform patterns across teams.
Technologies You'll Engage With
Container Abstractions: Containers-as-a-Service, Kubernetes abstraction layers, container orchestration, reproducible environments, multi-tenant isolation.
Distributed Systems Architecture: Sharding, replication, coordination services, high-concurrency systems, concurrency control.
Service Development Stack: gRPC, Protobuf, Go, Rust, C++.
Reliability & Performance: Idempotency, retries, backpressure, SLI/SLO design, tail latency optimization, service reliability engineering.
Your Profile
We are looking for a talented individual with a solid background in distributed systems and a passion for building scalable solutions.

