
Technical Staff Member - Distributed Systems Engineer

On-site Full-time



Qualifications

To be successful in this role, candidates should possess:

  • A strong understanding of distributed systems and their architecture.

  • Experience with container orchestration and managing services in cloud environments.

  • Proficiency in programming languages such as Go, Rust, or C++.

  • Knowledge of reliability engineering principles and practices.

About the job

Our Vision

At Reflection AI, we are dedicated to the mission of creating open superintelligence and ensuring it is accessible to everyone. Our team is committed to developing open-weight models that serve individuals, agents, enterprises, and even nation-states. Our diverse group of AI researchers and innovators hails from esteemed organizations such as DeepMind, OpenAI, Google Brain, Meta, Character.AI, and Anthropic.

Foundational Objectives

Vision:

We aim to establish and manage a company-wide foundational platform that enhances every team’s productivity by delivering dependable, scalable developer infrastructure, site reliability engineering capabilities, and high-throughput data ingestion tools—empowering Reflection AI to progress swiftly as we expand.

Team Responsibilities

Our team is responsible for constructing and managing essential shared services that fuel our research, training, and production environments. These systems form the backbone that supports various teams in model development, deployment, and evaluation, integrating data, compute, and workflow management while facilitating rapid experimentation and robust production systems.

  • Design and manage shared services that multiple teams utilize across research and production workflows.

  • Establish and maintain reliability targets through Service Level Indicators (SLIs), Service Level Objectives (SLOs), and effective on-call practices.

  • Ensure operational readiness through comprehensive runbooks, incident playbooks, and capacity planning.

  • Guarantee correctness and performance under load, addressing issues like consistency, tail latency, and failure modes.

  • Create APIs, SDKs, and internal platforms that support high-velocity experimentation and iteration.

  • Minimize operational burden through enhanced tooling, standardization, and scalable platform patterns across teams.
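The reliability-target work described above (SLIs, SLOs, error budgets) can be sketched in a few lines. This is an illustrative example only: the function names and the 99.9% availability target are assumptions chosen for the sketch, not Reflection AI's actual services or targets.

```go
package main

import "fmt"

// availabilitySLI computes a simple availability SLI as the fraction
// of successful requests over all requests in a measurement window.
func availabilitySLI(success, total float64) float64 {
	if total == 0 {
		return 1.0 // no traffic: treat the window as fully available
	}
	return success / total
}

// errorBudgetRemaining reports the fraction of the error budget still
// unspent, given a measured SLI and an SLO target (e.g. 0.999 = 99.9%).
func errorBudgetRemaining(sli, slo float64) float64 {
	budget := 1.0 - slo // allowed failure fraction
	burned := 1.0 - sli // observed failure fraction
	if budget == 0 {
		return 0
	}
	return (budget - burned) / budget
}

func main() {
	// Hypothetical window: 999,500 successes out of 1,000,000 requests.
	sli := availabilitySLI(999500, 1000000)
	fmt.Printf("SLI: %.4f\n", sli)
	// Against a 99.9% SLO, half the error budget is still unspent.
	fmt.Printf("error budget remaining: %.2f\n", errorBudgetRemaining(sli, 0.999))
}
```

In practice an on-call rotation would alert on the *burn rate* of this budget rather than its raw value, but the arithmetic reduces to the two functions above.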

Technologies You'll Engage With

  • Container Abstractions: Containers-as-a-Service, Kubernetes abstraction layers, container orchestration, reproducible environments, multi-tenant isolation.

  • Distributed Systems Architecture: Sharding, replication, coordination services, high-concurrency systems, concurrency control.

  • Service Development Stack: gRPC, Protobuf, Go, Rust, C++.

  • Reliability & Performance: Idempotency, retries, backpressure, SLI/SLO design, tail latency optimization, service reliability engineering.

Your Profile

  • We are looking for an engineer with a strong background in distributed systems and a passion for building scalable, reliable infrastructure.

About Reflection AI

Reflection AI is at the forefront of AI research and development, striving to democratize superintelligence through open model architectures and innovative solutions.
