About the job
Gramian Consultancy seeks an AI Evaluation Engineer specializing in data analysis and multi-agent systems. This remote contractor position is open to candidates based in Bangladesh, Brazil, Colombia, Egypt, Ghana, India, Indonesia, Kenya, Nigeria, Turkey, and Vietnam.
Role overview
This role centers on designing and implementing benchmark tasks that reflect real-world analytical challenges for AI systems. The focus is on building scenarios where multi-agent systems analyze large, complex datasets from various sources, assign tasks to specialized agents, and generate clear, verifiable results. The contract requires a full-time commitment of 8 hours per day, with at least 4 hours overlapping with Pacific Standard Time. The minimum contract length is 4 weeks. No medical or paid leave is provided, as this is a contractor role. The interview process includes a 60-minute take-home assessment.
What you will do
- Design and implement benchmark tasks for multi-agent systems, emphasizing complex data analysis workflows
- Create or select realistic datasets in formats such as CSV, JSON, logs, reports, and financial or operational data
- Develop tasks that require cross-referencing multiple data sources, identifying anomalies and contradictions, and performing statistical analysis
- Define strategies for distributing tasks among specialized sub-agents (for example, financial, technical, or operational analysis)
- Develop verification logic to ensure analytical outputs are precise and not generic
- Build evaluation pipelines using Python and SQL
- Create reproducible environments with Docker
- Review and refine tasks for clarity, complexity, and scoring accuracy
Requirements
- Minimum 5 years of experience in data analysis or analytics-focused positions
- Advanced skills in Python (including pandas and NumPy) and SQL
- Experience working with real-world, messy datasets (CSV, JSON, logs, reports)
- Ability to design analytical problems with verifiable answers
- Strong understanding of statistics, including distributions, correlations, and outlier detection
- Familiarity with AI benchmarks or evaluation tools (e.g., SWE-bench)
- Hands-on experience with Docker, including writing Dockerfiles, building images, and troubleshooting
