About the job
About Us
At Preference Model, we are pioneering the next generation of training data to fuel the evolution of AI technology. Although today's models demonstrate significant capabilities, they often fall short in diverse applications due to many tasks being out of distribution. We create reinforcement learning (RL) environments where models face research and engineering challenges, allowing them to iterate and learn from realistic feedback loops.
Our founding team boasts experience from Anthropic’s data division, where we built data infrastructure, tokenizers, and datasets for Claude. Collaborating with leading AI labs, we aim to bring AI closer to its transformative potential, supported by a16z.
About the Role
Every RL environment we deploy must withstand a model actively attempting to exploit it. A task with a weak evaluation or an easily exploitable reward signal is counterproductive; it teaches the model to cheat instead of reason. We seek an individual dedicated to identifying these vulnerabilities before the model does.
We have learned that domain knowledge alone does not make an effective reviewer. The ideal candidate is someone who has engaged in adversarial thinking: designing challenging problems that are difficult to exploit, dismantling others’ tasks, or directly researching reward hacking.
Your Responsibilities
- Review RL environments and training tasks for accuracy, robustness, and resistance to reward hacking.
- Identify potential ways a model could exploit grading systems, manipulate evaluation criteria, or bypass intended reasoning.
- Collaborate with environment authors to enhance grading systems, rectify reward signals, and redesign ineffective tasks.
- Develop and maintain review standards and checklists as we scale from hundreds to thousands of tasks monthly.
- Provide guidance on grader design during the planning phase of environments, ensuring quality before task construction.
Who We Are Looking For
You think like an attacker and have spent considerable time crafting problems that are challenging to exploit or deconstructing seemingly solid issues. A fundamental understanding of machine learning is essential, enabling you to anticipate model strategies, combined with enough engineering insight to assess whether a grader effectively tests its criteria.

