About the job
Reflection AI develops open-weight models with the goal of making superintelligence broadly accessible. The team draws on backgrounds from DeepMind, OpenAI, Google Brain, Meta, and Anthropic, and serves a wide range of users including individuals, enterprises, and government organizations.
Role overview
This Machine Learning Engineer position focuses on post-training and evaluation within the Applied AI group in San Francisco. The main responsibility is to fine-tune and evaluate Reflection AI’s open-weight models for enterprise customers, adapting them to specific domains and tasks using real customer data. The work covers the entire process: preparing and cleaning datasets, running fine-tuning workflows, building evaluation systems, and deploying models into production. Collaboration is central, both with clients to understand their needs and with research colleagues to advance model capabilities.
What you will do
- Fine-tune open-weight models for customer use cases, including dataset preparation, configuring training (such as supervised fine-tuning (SFT), preference optimization, and reinforcement fine-tuning), and iterating based on evaluation feedback.
- Design and maintain evaluation infrastructure: create evaluation suites, curate test sets, set baselines, and measure improvements on key customer tasks.
- Prepare training data from raw customer sources by assessing data quality, cleaning and formatting, identifying noisy or adversarial samples, and building reproducible data pipelines.
- Troubleshoot training and inference by analyzing loss curves, diagnosing data issues, and identifying problematic training dynamics.
- Deploy fine-tuned models in hybrid environments (public cloud, VPC, on-premises) to ensure reliable, high-performance inference in production.
- Help develop playbooks, evaluation benchmarks, and best practices for fine-tuning and evaluation as the team's approach matures.
Requirements
- Hands-on experience in applied machine learning, especially fine-tuning language models. This includes preparing datasets, running training loops, evaluating results, and deploying models. Familiarity with supervised fine-tuning (SFT), direct preference optimization (DPO), reinforcement learning from human feedback (RLHF), or related techniques is required.
- Strong understanding of evaluation methods, with the ability to design evaluations, interpret training metrics, and accurately assess model performance.
Location
San Francisco