company

Reinforcement Learning Environment Reviewer

Preference ModelSan Francisco
On-site Full-time

Clicking Apply Now takes you to AutoApply where you can tailor your resume and apply.


Unlock Your Potential

Generate Job-Optimized Resume

One Click And Our AI Optimizes Your Resume to Match The Job Description.

Is Your Resume Optimized For This Role?

Find Out If You're Highlighting The Right Skills And Fix What's Missing

Experience Level

Experience

Qualifications

Required QualificationsProven experience in adversarial or constructive problem design: such as authoring competitive programming problems (ICPC, Codeforces, etc.), designing CTF challenges, or similar. Familiarity with reinforcement learning, reward mechanisms, and evaluation strategies. Ability to think critically and creatively to identify vulnerabilities in existing tasks.

About the job

About Us

At Preference Model, we are pioneering the next generation of training data to fuel the evolution of AI technology. Although today's models demonstrate significant capabilities, they often fall short in diverse applications due to many tasks being out of distribution. We create reinforcement learning (RL) environments where models face research and engineering challenges, allowing them to iterate and learn from realistic feedback loops.

Our founding team boasts experience from Anthropic’s data division, where we built data infrastructure, tokenizers, and datasets for Claude. Collaborating with leading AI labs, we aim to bring AI closer to its transformative potential, supported by a16z.

About the Role

Every RL environment we deploy must withstand a model actively attempting to exploit it. A task with a weak evaluation or an easily exploitable reward signal is counterproductive; it teaches the model to cheat instead of reason. We seek an individual dedicated to identifying these vulnerabilities before the model does.

We have learned that domain knowledge alone does not make an effective reviewer. The ideal candidate is someone who has engaged in adversarial thinking: designing challenging problems that are difficult to exploit, dismantling others’ tasks, or directly researching reward hacking.

Your Responsibilities

  • Review RL environments and training tasks for accuracy, robustness, and resistance to reward hacking.
  • Identify potential ways a model could exploit grading systems, manipulate evaluation criteria, or bypass intended reasoning.
  • Collaborate with environment authors to enhance grading systems, rectify reward signals, and redesign ineffective tasks.
  • Develop and maintain review standards and checklists as we scale from hundreds to thousands of tasks monthly.
  • Provide guidance on grader design during the planning phase of environments, ensuring quality before task construction.

Who We Are Looking For

You think like an attacker and have spent considerable time crafting problems that are challenging to exploit or deconstructing seemingly solid issues. A fundamental understanding of machine learning is essential, enabling you to anticipate model strategies, combined with enough engineering insight to assess whether a grader effectively tests its criteria.

About Preference Model

Preference Model is at the forefront of AI training data innovation, developing environments that facilitate learning and adaptive feedback for models through realistic challenges.

Similar jobs

Tailoring 0 resumes

We'll move completed jobs to Ready to Apply automatically.