
Evaluation Engineer jobs in San Francisco

Open roles matching “Evaluation Engineer” that list San Francisco as a location. 5,187 active listings on RoboApply Jobs.

5,187 jobs found

1 - 20 of 5,187 Jobs
Braintrust
Full-time|Remote|San Francisco

Join our dynamic team as an Evaluation Engineer at Braintrust, a leading talent network that empowers companies to harness the expertise of top talent. In this role, you will be responsible for developing and implementing evaluation frameworks to assess various projects and initiatives. You will work closely with cross-functional teams to ensure alignment with our strategic objectives and contribute to data-driven decision-making processes.

Mar 13, 2026
distyl
Full-time|Remote|San Francisco

distyl seeks an AI Evaluation Engineer based in San Francisco. This position centers on assessing artificial intelligence systems, measuring how well models perform, and guiding the process for testing and refining products.

Role overview
The main focus is to evaluate AI models for accuracy and reliability. The role involves shaping and maintaining testing protocols for both new and existing systems. Collaboration is key, as you will work with teams across the company to help ensure that AI outputs consistently meet quality standards.

What you will do
- Assess AI models to determine their accuracy and reliability
- Create and update testing protocols for a range of systems
- Partner with teams throughout the organization to uphold quality benchmarks for AI outputs

Requirements
- Keen attention to detail
- Interest in artificial intelligence and its real-world uses
- Comfort working with colleagues from diverse backgrounds

Apr 23, 2026
Exa
Full-time|On-site|San Francisco, California

At Exa, we are pioneering the next generation of search engines designed for the era of artificial intelligence, starting from the foundational silicon architecture. Our ambitious indexing operation is unparalleled, allowing us to crawl the vast open web at an extraordinary scale. We harness cutting-edge embedding models to comprehend this data and utilize our high-performance Rust-based vector database alongside a $5M H200 GPU cluster, which powers tens of thousands of machines simultaneously.

The Machine Learning (ML) division is central to this mission, focusing on the training of foundational models that enhance search capabilities. Our vision is to create systems capable of swiftly filtering the world’s knowledge to deliver precisely what you need, regardless of the complexity of your inquiry, effectively transforming the web into a robust, searchable database. To achieve this ambitious goal, we must define what constitutes “effective search”. This is where your expertise will play a crucial role.

We are seeking a talented Machine Learning Evaluations Engineer to develop and implement our evaluation framework at Exa. This position entails exploring methodologies to assess search engines in a world dominated by large language models (LLMs) and crafting the most thorough, innovative, and impactful evaluation suite. Your decisions will influence the future of search optimization and directly affect the research team’s focus, shaping the company’s strategic direction.

Oct 15, 2025
Anthropic
Full-time|Remote-Friendly (Travel Required)|San Francisco, CA|New York City, NY

Anthropic is looking for a Research Engineer focused on model evaluations. This position involves research and development to assess and strengthen the performance of AI models. Teams are based in San Francisco and New York City, and the role supports remote work with required travel.

Key responsibilities
- Design and implement evaluations for Anthropic's AI models
- Collaborate with team members to enhance model performance
- Contribute to research that pushes the boundaries of AI systems

Location
Remote-friendly (travel required); San Francisco, CA; New York City, NY

Apr 28, 2026
gleanwork
Full-time|Remote|San Francisco Bay Area

Join gleanwork as a Machine Learning Engineer specializing in LLM evaluations and observability. In this role, you will be instrumental in developing cutting-edge machine learning systems that deepen our understanding of large language models and improve their effectiveness. You will collaborate with cross-functional teams to drive the integration of advanced analytics and machine learning solutions.

Mar 16, 2026
Waymo LLC
Full-time|$170K/yr - $216K/yr|Hybrid|Mountain View, CA, USA; San Francisco, CA, USA

Waymo is at the forefront of autonomous driving technology, dedicated to becoming the world's most trusted driver. Originating from the Google Self-Driving Car Project in 2009, we have consistently focused on developing the Waymo Driver (The World’s Most Experienced Driver™) to expand access to mobility and prevent the countless deaths caused by traffic accidents. The Waymo Driver powers our fully autonomous ride-hailing service and is adaptable to various vehicle platforms and applications. With over ten million rider-only trips completed, we have driven autonomously over 100 million miles on public roads and conducted simulations totaling tens of billions of miles across 15+ U.S. states.

The Planner Evaluation team tackles a pivotal challenge in autonomous driving: assessing and enhancing the quality of the software that operates the vehicle. We seek passionate and experienced software engineers and data scientists who are data-driven and eager to refine how we assess and characterize modifications to our onboard software stack (including Planner and Perception). If you are enthusiastic about autonomous vehicles and adept at utilizing complex data to influence decision-making, we invite you to apply for this exciting opportunity!

Feb 10, 2026
Scale AI
Full-time|$216.3K/yr - $300.3K/yr|On-site|San Francisco, CA; St. Louis, MO; New York, NY; Washington, DC

Senior Machine Learning Engineer - Model Evaluations for the Public Sector

The Public Sector Machine Learning team at Scale AI pioneers the deployment of cutting-edge AI systems, including Large Language Models (LLMs), agentic models, and comprehensive multimodal pipelines, within critical government operations. We establish robust evaluation frameworks that ensure these models function reliably, safely, and effectively in real-world scenarios. As a Senior Machine Learning Engineer, you will architect, implement, and enhance automated evaluation pipelines that empower our clients to trust and effectively utilize advanced AI systems in defense, intelligence, and federal missions.

Your responsibilities include:
- Creating and maintaining automated evaluation pipelines for machine learning models, focusing on functional, performance, robustness, and safety metrics, including evaluations based on LLM judges.
- Designing test datasets and benchmarks to assess generalization, bias, explainability, and potential failure modes.
- Building evaluation frameworks for LLM agents, which includes the infrastructure for scenario-based and environment-based testing.
- Conducting comparative analyses of model architectures, training procedures, and evaluation results.
- Implementing tools for continuous monitoring, regression testing, and quality assurance of machine learning systems.
- Designing and executing stress tests and red-teaming workflows to identify vulnerabilities and edge cases.
- Collaborating with operations teams and subject matter experts to generate high-quality evaluation datasets.

This position requires an active security clearance or the ability to obtain one.

Mar 26, 2026
Scale AI
Full-time|$280K/yr - $380K/yr|On-site|San Francisco, CA; Seattle, WA; New York, NY

At Scale AI, we are the premier partner for data and evaluation in the rapidly evolving field of artificial intelligence. Our commitment to advancing the assessment and benchmarking of large language models (LLMs) positions us at the forefront of AI innovation. We are dedicated to creating leading-edge LLM evaluation methodologies that set new benchmarks for model performance. Our research teams collaborate with the top AI laboratories in the industry to provide high-quality data, accelerate progress in generative AI research, and inform what excellence looks like in this domain. As a Staff Machine Learning Research Scientist on our LLM Evals team, you will spearhead the creation of novel evaluation methodologies, metrics, and benchmarks to assess the strengths and weaknesses of cutting-edge LLMs. Your work will shape our internal strategies and influence the broader AI research community, making this role essential for establishing best practices in data-driven AI development.

Mar 26, 2026
aiedu
Full-time|On-site|San Francisco, United States

Join aiedu as a Senior Lead in Research & Evaluation, where you will drive impactful research initiatives that shape educational practices and policies. In this role, you will lead a team of researchers in designing and executing comprehensive evaluations that inform our strategic direction. Your expertise will be critical in analyzing data, generating insights, and communicating findings to stakeholders.

Mar 13, 2026
Anthropic
Full-time|Remote-Friendly (Travel Required)|San Francisco, CA|Washington, DC|New York City, NY

Join Anthropic as a Safeguards Enforcement Analyst, where you will play a pivotal role in the safety evaluation of our AI systems. This role focuses on analyzing compliance with safeguards and developing strategies to enhance safety protocols. You will collaborate with cross-functional teams to assess risks and implement robust solutions that align with our commitment to responsible AI.

Mar 12, 2026
Reflection AI
Full-time|On-site|SF

Our Mission
At Reflection AI, we are dedicated to creating accessible open superintelligence for everyone. Our team is composed of top-tier AI researchers and innovators from prestigious organizations like DeepMind, OpenAI, Google Brain, Meta, Character.AI, Anthropic, and more. We are committed to building open-weight models for individuals, enterprises, and even nation states.

About the Role
- Perform essential comparative analyses to deepen our insights into model capabilities.
- Design and enhance evaluation systems and processes that establish robust feedback loops between data, evaluations, and model behavior.
- Create generalizable evaluation frameworks that effectively capture reasoning, alignment, and practical usefulness.
- Collaborate closely with pre-training, post-training, and applied teams to translate insights into tangible model improvements.
- Expand the boundaries of measurable metrics, utilizing synthetic evaluations, human feedback, and real-world interaction data.

About You
- Proficient in statistical analysis and experimental design, with the ability to rigorously measure model advancements.
- Knowledgeable in LLM evaluation methodologies, including static benchmarks, human preference evaluations, and agentic tasks.
- Possess a high degree of agency and thrive in a fast-paced startup atmosphere, prioritizing impact over rigid processes.
- Eager to work in a pioneering lab, shaping how we measure and accelerate the development of more capable models.
- Collaborative, detail-oriented, and driven by the desire to create effective feedback loops that enhance model performance.

What We Offer
We believe in building superintelligence that is genuinely open, starting from the ground up. Joining Reflection means you will be part of a small, talent-dense team where you will help shape our future and push the boundaries of open foundational models. You will have the opportunity to engage in the most impactful work of your career, knowing that you and your loved ones are well-supported.
- Competitive Compensation: Salary and equity structured to attract and retain top global talent.
- Health & Wellness: Comprehensive medical, dental, vision, life, and disability insurance.

Dec 17, 2025
Scale AI, Inc.
Full-time|$197.4K/yr - $246.8K/yr|On-site|San Francisco, CA; New York, NY

Join Scale AI as a Research Scientist - Frontier Risk Evaluations

At Scale AI, we are at the forefront of data and evaluation services for pioneering AI technologies. Our mission is to ensure the safe and effective deployment of AI systems by bridging the gap between advanced AI research and global policy frameworks. With the launch of Scale Labs, we are assembling a dedicated team focused on policy research to empower governments and industry leaders with scientific insights regarding AI risks and functionalities. This team addresses complex challenges in agent robustness, AI control mechanisms, and risk assessments to facilitate a comprehensive understanding of AI risks, while promoting its responsible adoption across various sectors. We are eager to welcome skilled researchers who are passionate about shaping the future of AI.

As a Research Scientist specializing in Frontier Risk Evaluations, you will be responsible for designing evaluation metrics, harnesses, and datasets to assess the risks associated with cutting-edge AI systems. Your role may involve:
- Developing harnesses to evaluate AI models for potential security vulnerabilities and other high-risk behaviors.
- Collaborating with government entities and research labs to design evaluations that mitigate risks posed by advanced AI technologies.
- Publishing evaluation methodologies and drafting technical reports aimed at informing policymakers.

Mar 26, 2026
dstaff
Full-time|On-site|San Francisco

We are seeking a motivated and detail-oriented CRA Evaluation & Impact Measurement Analyst to join our dynamic team at dstaff. In this role, you will be responsible for assessing the effectiveness of our programs and initiatives, utilizing data analysis to drive decision-making and improve outcomes.

Your expertise in evaluation methodologies and impact measurement will be critical in shaping our strategic direction. You will collaborate with various stakeholders to collect, analyze, and interpret data, ensuring our projects align with our mission and deliver tangible results.

May 3, 2015
Scale AI
Full-time|$179.4K/yr - $224.3K/yr|On-site|San Francisco, CA; New York, NY

Join Scale AI as a passionate and technically adept AI Research Engineer on our Enterprise Evaluations team. This pivotal role is integral to our goal of providing the industry's leading Generative AI Evaluation Suite. You will contribute directly to the foundational systems that guarantee the safety, dependability, and ongoing improvement of LLM-driven workflows and agents for enterprise clients. The ideal candidate will have a strong understanding of large language models, a passion for solving difficult evaluation problems, and the ability to thrive in a fast-evolving research environment. We seek an engineer who can innovate, stays current with the latest research in AI evaluation, and is enthusiastic about incorporating cutting-edge research concepts into our workflows to build top-tier evaluation systems.

Mar 26, 2026
Scale AI
Full-time|$154.4K/yr - $257K/yr|On-site|San Francisco, CA; St. Louis, MO; New York, NY; Washington, DC

Scale AI develops AI systems that support critical decision-making for organizations around the world. The Public Sector team partners with government agencies to bring AI solutions to important public missions.

Role overview
The Product Manager for Public Sector GenAI Test & Evaluation leads the vision and planning for Scale AI’s evaluation capabilities. This position manages the T&E technology stack, which is central to measuring, improving, and validating the performance of AI applications. The work often requires handling high-stakes and unpredictable scenarios.

What you will do
- Set the direction and oversee development of the test and evaluation infrastructure for public sector AI projects.
- Collaborate with engineering teams to identify technical issues and translate them into clear action plans.
- Work closely with both commercial and government teams to define requirements, ensuring evaluation services meet strict standards.
- Continuously refine the technology stack so machine learning teams can enhance performance and reliability.
- Share key performance insights with stakeholders throughout the organization.

Locations
San Francisco, CA; St. Louis, MO; New York, NY; Washington, DC

Apr 24, 2026
Reducto
Full-time|On-site|San Francisco Office

Join Reducto as a Machine Learning Evaluation Engineer where you will play a critical role in assessing and enhancing machine learning models. You will collaborate closely with data scientists and engineers to ensure our systems are efficient and accurate, bringing innovative solutions to challenging problems in the machine learning space.

Mar 16, 2026
Arena Intelligence
Analytics Engineer

Full-time|Remote|Bay Area

Join Arena Intelligence as an Analytics Engineer!

Arena Intelligence is a pioneering platform devoted to assessing the performance of AI models in real-world scenarios. Founded by talented researchers from UC Berkeley's SkyLab, our mission is to advance AI applications through rigorous evaluation and transparency. Each month, millions engage with Arena Intelligence to comprehend the effectiveness of cutting-edge AI systems, and we leverage our community’s insights to foster trustworthy and user-centric model evaluations. Top enterprises and AI laboratories depend on our assessments to gauge real-world reliability, alignment, and impact. Our leaderboards set the standard for AI performance, earning trust from industry leaders and steering global discussions on model reliability.

Our dynamic team consists of researchers, engineers, and innovators with backgrounds from prestigious institutions such as UC Berkeley, Google, and Stanford. We prioritize truth, speed, and craftsmanship while maintaining an inclusive environment where passionate minds from diverse backgrounds can excel. Our workplace embodies excellence, energy, and concentration.

Role Overview
We are looking for an experienced Analytics Engineer to build and manage the data infrastructure that supports real-world AI evaluations. You will be responsible for designing and implementing analytics-ready data models, pipelines, and metrics that transform raw user data and votes into valuable insights for our stakeholders. This role is positioned at the intersection of data engineering, analytics, and product development. You will collaborate closely with researchers, product managers, and engineers to define schemas, standardize metrics, and ensure the accuracy and scalability of our evaluation data. Your contributions will significantly influence how AI performance is assessed and interpreted across the sector.

We seek someone who is passionate about crafting clean, robust data systems, values data integrity, and desires to see their work drive product decisions and benefit our external clients.

Feb 3, 2026
Cartesia
Full-time|On-site|HQ - San Francisco, CA

About Cartesia
At Cartesia, we are on a mission to revolutionize artificial intelligence by creating interactive, ubiquitous intelligence that operates seamlessly wherever you are. Current AI models struggle to continuously process and reason over extensive streams of data, including a year’s worth of audio, video, and text. Our innovative team is developing advanced model architectures to overcome these challenges. Founded by PhDs from the Stanford AI Lab who pioneered State Space Models, we blend deep expertise in model innovation with a design-focused engineering approach. With backing from top-tier investors such as Index Ventures and Lightspeed Venture Partners, along with a network of industry-leading advisors, we are pushing the boundaries of AI.

About the Role
Join our New Horizons Evaluations team as the Evaluations Lead, where you will redefine how we measure progress in interactive machine intelligence. You will create evaluation frameworks that assess not only what models know but also how they reason, remember, and engage over time. This multifaceted role bridges research, product development, and infrastructure to establish metrics and systems that articulate the essence of “intelligence” in the next wave of AI. Ideal candidates will possess a blend of scientific rigor and technical prowess, alongside a genuine curiosity about user interactions with intelligent systems. Your contributions will be pivotal in shaping Cartesia’s model development, focusing on deeper qualities such as understanding, naturalness, and adaptability in real-world applications.

Your Impact
- Define and identify essential model capabilities and behaviors for next-generation evaluations.
- Develop and implement comprehensive evaluation pipelines with robust statistical analysis and transparent reporting.
- Collaborate closely with model training and research teams to integrate evaluation systems into the model development process.
- Design and prototype user studies and behavioral experiments to ground evaluations in practical use.

Oct 21, 2025
Anthropic
Full-time|On-site|San Francisco, CA | New York City, NY

Join Anthropic as an Engineering Manager, where you will lead a talented team focused on developing innovative agent prompts and evaluations. In this role, you will drive the engineering direction and collaborate closely with cross-functional teams to enhance our product offerings. Your expertise will help shape our technology and influence our long-term vision.

Mar 26, 2026
Dane Street
Contract|On-site|San Francisco, California, United States

Dane Street is actively expanding our network of expert physician reviewers in California! This is an exciting in-person opportunity for physicians seeking supplemental income while working on a caseload that fits their individual schedules. Our physician panel consists of independent contract reviewers (1099), compensated on a per-case basis. We are looking for candidates who possess a valid California medical license and are Board Certified in Physical Medicine and Rehabilitation, Internal Medicine, Orthopedic Surgery, Occupational Medicine, Neurology, or Psychiatry. We welcome applications from providers across California.

Role Overview
As a Physician Reviewer/Advisor, you will leverage your clinical expertise to review insurance appeals and evaluate prospective and retrospective claims. Your role will involve assessing the medical necessity of services provided by other healthcare professionals, ensuring compliance with client-specific policies, nationally recognized evidence-based guidelines, and standards of care.

Key Responsibilities
- Thoroughly review medical records and respond to each inquiry posed by the client using client-specific criteria or other nationally recognized guidelines.
- Ensure that your rationale is clear, concise, and supported by adequate documentation to substantiate your decisions.
- Identify and utilize current criteria and resources such as national, state, and professional association guidelines and peer-reviewed literature to support objective decision-making; avoid reliance on case studies and cohorts due to their limited sample sizes.
- Provide timely copies of any criteria utilized in your reviews alongside your reports.
- Return cases by the specified due date and time.
- Conduct necessary telephone communications as required by state regulations or client specifications.
- Maintain proper credentialing, state licenses, and any special certifications required for the role.
- Attend all required orientation and training sessions.
- Complete additional duties as assigned, including addressing quality assurance issues, complaints, regulatory matters, depositions, court appearances, or audits.

Important Note: In our commitment to security, Dane Street will never conduct interviews via text or request checks from candidates for any purpose, including the purchase of equipment.

Dec 18, 2025
