About the job
At Thinking Machines Lab, our mission is to empower humanity by advancing collaborative general intelligence. We envision a future where everyone has access to the knowledge and tools necessary to harness AI for their unique needs and goals.
Our team comprises scientists, engineers, and builders who have developed some of the most widely used AI products, such as ChatGPT and Character.ai, open-weight models like Mistral, and popular open-source projects like PyTorch, OpenAI Gym, Fairseq, and Segment Anything.
About the Position
We are seeking an Infrastructure Research Engineer to design and build the foundational systems that enable scalable, efficient training of large models for both deployment and research. Your primary goal will be to streamline experimentation and training at Thinking Machines, letting our research teams focus on scientific advances rather than system limitations.
This role is an excellent fit for someone who combines deep systems expertise with a keen interest in machine learning at scale. You will own the training stack end to end, ensuring that every GPU cycle contributes to scientific progress.
Note: This is an evergreen role that we keep open continuously so candidates can express interest. We receive numerous applications, and there may not always be an immediate role that aligns perfectly with your experience and skills. However, we encourage you to apply. We regularly review applications and reach out to candidates as new opportunities arise. Feel free to reapply as you gain more experience, but please avoid applying more than once every six months. We may also post specific roles for individual projects or team needs, in which case you are welcome to apply directly alongside this evergreen role.
Key Responsibilities
- Design, implement, and optimize distributed training systems that scale across thousands of GPUs and nodes for large-scale training workloads.
- Develop performance optimizations that maximize training throughput and efficiency.
- Create reusable frameworks and libraries that enhance training reproducibility, reliability, and scalability for new model architectures.
- Establish standards for reliability, maintainability, and security, ensuring systems remain robust under rapid iterations.
- Collaborate with researchers and engineers to construct scalable infrastructure.
- Publish and share findings through internal documentation, open-source libraries, or technical reports that advance the state of scalable AI infrastructure.