
Freelance AI Evaluation Engineer

Mindrift — Rome, Metropolitan City of Rome Capital, Italy
Remote · Part-time · $30/hr


Qualifications

Ideal Candidate Profile

This role is well suited to seasoned developers, software engineers, or test automation specialists open to part-time, non-permanent projects. The ideal candidate will have:

  • A degree in Computer Science, Software Engineering, or a related field.
  • Over 5 years of experience in software development, with a strong emphasis on Python (FastAPI, pytest, async/await, subprocess, file operations).
  • A background in full-stack development, particularly in building React-based interfaces (JavaScript/TypeScript) and resilient back-end systems.
  • Experience writing tests (functional, integration), not just executing them.
  • Familiarity with Docker containers and infrastructure tools (Postgres, Kafka, Redis).
  • Understanding of CI/CD processes (GitHub Actions as a user: triggers, labels, reading results).
  • Proficiency in English at a suitable level...

About the job

Please submit your CV in English and specify your English proficiency level.

Mindrift connects skilled professionals with contract-based AI projects from leading technology companies. This freelance role centers on evaluating, testing, and refining AI systems. It is a contract position only, not a path to permanent employment.

Role overview

This project focuses on building a dataset to assess how well AI coding agents perform real-world software development tasks. The work requires designing realistic challenges and evaluation methods within simulated developer environments.

Key responsibilities

  • Create virtual companies from high-level plans, building out codebases, infrastructure, and context such as conversations, documentation, and tickets to simulate authentic development histories.
  • Develop and refine tasks for different stages of the virtual company, including writing prompts, setting evaluation criteria, and ensuring tasks are solvable and fairly assessed.
  • Design assignments in isolated environments that closely resemble a developer’s workstation, including a Linux machine with development tools (terminal, CLI), MCP servers (repository, task tracker, messenger, documentation), and a live web application codebase.
  • Implement tests that accept all correct solutions and reject incorrect ones, carefully balancing strictness and leniency.
  • Collaborate with an AI agent during test runs to confirm that tests catch real issues, flagging faulty solutions without penalizing correct ones.
  • Review code generated by AI agents, analyze reasons for success or failure, and design edge cases and adversarial scenarios to expose weaknesses.
  • Iterate on your work based on feedback from expert QA reviewers who will check your outputs against quality standards.
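To illustrate the test-design responsibility above, here is a minimal pytest sketch of the strictness/leniency balance. All names (`slugify`, the test cases) are hypothetical examples, not part of any actual project: the idea is to pin down behavior the task specification fixes exactly, while checking only invariants where multiple correct solutions exist, plus an adversarial edge case that naive solutions tend to fail.

```python
import re


def slugify(title: str) -> str:
    # Stand-in for the solution an AI agent might submit; the tests
    # below should pass for ANY correct implementation, not just this one.
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def test_exact_output_where_spec_is_strict():
    # The spec fully determines this output, so assert it exactly.
    assert slugify("Hello World") == "hello-world"


def test_punctuation_is_replaced():
    assert slugify("C'est la vie!") == "c-est-la-vie"


def test_invariants_where_spec_is_open():
    # Lenient check: any whitespace-handling strategy passes as long as
    # the result has no leading or trailing separators.
    result = slugify("  spaced  out  ")
    assert not result.startswith("-")
    assert not result.endswith("-")


def test_adversarial_all_punctuation_input():
    # Edge case designed to expose naive implementations that emit "-".
    assert slugify("!!!") == ""
```

Tests written this way reject faulty solutions (the edge case) without penalizing correct ones (the invariant-only check), which is the balance the role calls for.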

What this role does not involve

  • Data labeling
  • Prompt engineering
  • Writing code from scratch (the AI agent generates most of the code; your focus is on guiding and evaluating)

Much of the work involves close collaboration with AI models, as developing challenging tasks for advanced systems requires working directly with them.

About Mindrift

Mindrift connects skilled professionals with project-based AI opportunities from top technology companies, focusing on testing, evaluating, and refining AI systems.
