About the job
Our Mission
At Reflection AI, we are dedicated to creating open superintelligence and making it universally accessible.
We are pioneering open-weight models designed for individuals, agents, enterprises, and even nations. Our talented team consists of AI researchers and innovators from leading organizations such as DeepMind, OpenAI, Google Brain, Meta, Character.AI, and Anthropic.
Role Overview
Data is playing an increasingly central role in AI progress. Many recent breakthroughs have come from better data rather than from new architectures.
As a vital member of the Data Team, you will be primarily responsible for ensuring that the data used to train our models meets the highest standards of quality, reliability, and impact. You will have a direct influence on our models' performance in essential capabilities.
Collaborating with exceptional researchers on our pre-training teams, you will help transform abstract concepts of "good data" into specific, quantifiable standards applicable across extensive data campaigns. We are seeking engineers who possess robust engineering skills combined with a profound curiosity about data quality and its relevance to model performance.
In close partnership with our pre-training teams, you will:
Take ownership of upstream data quality for LLM pre-training, functioning as either a specialist or generalist across various languages and modalities.
Collaborate with research and pre-training teams to convert requirements into measurable quality signals, providing actionable feedback to external data vendors.
Incorporate human-in-the-loop processes while designing, validating, and scaling automated QA methods to consistently measure data quality across large-scale campaigns.
Create reusable QA pipelines that ensure the delivery of high-quality data to pre-training teams for model training.
Continuously monitor and report on data quality, driving ongoing improvements in quality standards, processes, and acceptance criteria.
Candidate Profile
Strong engineering background with experience in building data pipelines, QA systems, or evaluation workflows for pre-training data.
Detail-oriented with an analytical mindset, capable of identifying failure modes, inconsistencies, and nuanced issues affecting data quality.
Solid understanding of the influence of data quality on pre-training, with the capacity to translate quality concerns into tangible signals, decisions, and feedback.

