
Infrastructure Research Engineer at Thinking Machines | San Francisco

Thinking Machines Lab | San Francisco

On-site Full-time $350K/yr - $475K/yr



Qualifications

Minimum qualifications: Bachelor’s degree or equivalent experience in computer science, electrical engineering, statistics, machine learning, or a related field. Familiarity with distributed systems and experience in developing scalable infrastructure. Strong programming skills in languages such as Python, Go, or similar. Understanding of machine learning frameworks and GPU resource management.

About the job

Our team comprises scientists, engineers, and builders who have developed some of the most utilized AI products, including ChatGPT and Character.ai, as well as open-weight models like Mistral. We also contribute to notable open-source projects such as PyTorch, OpenAI Gym, Fairseq, and Segment Anything.

About the Role

We are seeking a talented Infrastructure Research Engineer to enhance, scale, and fortify the systems supporting Tinker. This role will enable our internal teams and external clients to fine-tune models seamlessly, reliably, and cost-effectively. You will work at the intersection of large-scale training systems and product infrastructure, creating multi-tenant scheduling, storage, observability, and reliability features within a developer-friendly API.

Your contributions will allow all Tinker users to concentrate on research and development without the burden of infrastructure concerns.

Note: This is an evergreen position that we keep open for ongoing interest. We receive numerous applications, and there may not always be a role that aligns perfectly with your skills and experience. We encourage you to apply, as we continuously review applications and will reach out as new opportunities arise. You are welcome to reapply after gaining more experience, but please refrain from applying more than once every 6 months. We also post specific roles for unique project or team needs, and you are welcome to apply directly to those in addition to this evergreen listing.

What You’ll Do

Design and implement distributed job orchestration, placement, preemption, and fair-share scheduling to enhance Tinker for multi-tenant workloads.
Optimize GPU utilization, throughput, and reliability across clusters (including autoscaling, bin-packing, and quotas).
Develop reusable frameworks and libraries to enhance Tinker’s transparency, reproducibility, and performance.
Collaborate with researchers and developer experience engineers to transform fine-tuning challenges into product features.
Publish and disseminate insights through internal documentation, open-source libraries, or technical reports to advance the field of scalable AI infrastructure.

About Thinking Machines Lab

Thinking Machines is a pioneering AI lab dedicated to the advancement of collaborative general intelligence. Our innovative team has produced some of the most utilized AI solutions globally, ensuring that technology serves humanity’s diverse needs.

Similar jobs


Infrastructure Research Engineer at Thinking Machines | San Francisco

Thinking Machines Lab

Full-time|$350K/yr - $475K/yr|On-site|San Francisco

At Thinking Machines Lab, we are committed to empowering humanity by advancing collaborative general intelligence. Our vision is to create a future where everyone has access to the knowledge and tools necessary to harness AI for their unique needs and aspirations.

Nov 27, 2025


Software Engineer, Research Acceleration at Thinking Machines | San Francisco

Thinking Machines Lab

Full-time|$350K/yr - $475K/yr|On-site|San Francisco

At Thinking Machines Lab, our mission is to empower humanity by advancing collaborative general intelligence. We aspire to create a future where everyone can access the knowledge and tools necessary to harness AI for their individual needs and aspirations.

Our team consists of scientists, engineers, and innovators who have developed some of the most renowned AI products, including ChatGPT and Character.ai, as well as open-weight models such as Mistral. We are also contributors to popular open-source initiatives like PyTorch, OpenAI Gym, Fairseq, and Segment Anything.

About the Role

We are seeking talented engineers to develop the libraries and tools that will expedite research at Thinking Machines. You will take charge of our internal infrastructure, which includes evaluation libraries, reinforcement learning training libraries, and experiment tracking platforms, all aimed at enhancing research velocity over time.

This position emphasizes collaboration; you will engage directly with researchers to pinpoint bottlenecks and challenges. Your success will be measured by the trust researchers place in your systems and their enjoyment of using them.

What You'll Do

Design, develop, and manage research infrastructure, including evaluation frameworks, RL training systems, experiment tracking platforms, visualization tools, and shared utilities.
Create high-throughput, scalable pipelines for distributed evaluation, reward modeling, and multimodal assessments.
Establish systems for reproducibility, traceability, and stringent quality control throughout research experiments and model training processes. Implement monitoring and observability.
Collaborate closely with researchers to identify obstacles and unlock new capabilities. Manage research tools like a product manager, actively seeking feedback and tracking user adoption.
Work alongside infrastructure, data, and product teams to ensure seamless integration of tools across the technical stack.

Feb 3, 2026


Full Stack Software Engineer at Thinking Machines | San Francisco

Thinking Machines Lab

Full-time|$350K/yr - $475K/yr|On-site|San Francisco

Nov 27, 2025


Infrastructure Research Engineer - Kernels at Thinking Machines | San Francisco

Thinking Machines Lab

Full-time|$350K/yr - $475K/yr|On-site|San Francisco

At Thinking Machines Lab, our ambition is to enhance human potential by advancing collaborative general intelligence. We envision a future where individuals have the tools and knowledge to harness AI for their distinct requirements and aspirations.

Our team comprises dedicated scientists, engineers, and innovators who have contributed to some of the most renowned AI products, including ChatGPT and Character.ai, along with open-weight models like Mistral, and influential open-source projects such as PyTorch, OpenAI Gym, Fairseq, and Segment Anything.

About the Role

We are seeking an Infrastructure Research Engineer to architect, optimize, and sustain the computational frameworks that facilitate large-scale language model training. You will create high-performance machine learning kernels (e.g., CUDA, CuTe, Triton), enable effective low-precision arithmetic operations, and enhance the distributed computing infrastructure essential for training expansive models.

This position is ideal for an engineer who thrives in close collaboration with hardware and research disciplines. You will partner with researchers and systems architects to merge algorithmic design with hardware efficiency. Your responsibilities will include prototyping new kernel implementations, evaluating performance across various hardware generations, and helping to establish the numerical and parallelism strategies crucial for scaling next-generation AI systems.

Note: This is an evergreen role that remains open continuously for expressions of interest. We receive numerous applications, and there may not always be an immediate opportunity that aligns with your qualifications. However, we encourage you to apply, as we regularly assess applications and will reach out as new positions become available. You are also welcome to reapply after gaining additional experience, but please refrain from applying more than once every six months. Additionally, you may notice postings for specific roles catering to particular projects or team needs. In such cases, you are encouraged to apply directly alongside this evergreen listing.

What You’ll Do

Design and develop custom ML kernels (e.g., CUDA, CuTe, Triton) for key LLM operations such as attention, matrix multiplication, gating, and normalization, optimized for contemporary GPU and accelerator architectures.
Conceptualize compute primitives aimed at alleviating memory bandwidth bottlenecks and enhancing kernel compute efficiency.
Collaborate with research teams to synchronize kernel-level optimizations with model architecture and algorithmic objectives.
Create and maintain a library of reusable kernels and performance benchmarks that serve as the foundation for internal model training.
Contribute to the stability and scalability of our infrastructure, ensuring it meets the growing demands of AI development.

Nov 27, 2025


Infrastructure Engineer - Security at Thinking Machines | San Francisco

Thinking Machines Lab

Full-time|$200K/yr - $475K/yr|On-site|San Francisco

At Thinking Machines Lab, our mission is to empower humanity by advancing collaborative general intelligence. We are dedicated to building a future where everyone can access the knowledge and tools necessary to harness AI for their unique needs and objectives.

We are a team of scientists, engineers, and builders who have developed some of the most widely used AI products, including ChatGPT and Character.ai, and contributed to open-weight models like Mistral, along with popular open-source projects such as PyTorch, OpenAI Gym, Fairseq, and Segment Anything.

About the Role

We are seeking an Infrastructure Engineer to take charge of evolving the security infrastructure that supports our foundational models. In this pivotal role, you will collaborate across computing, storage, networking, and data platforms to ensure our systems remain secure, reliable, and scalable. You will design controls, architecture, and tooling that embed security into the platform's core functionalities. Working closely with research and product teams, you will enable them to operate swiftly while safeguarding our models, data, and environments.

Note: This is an "evergreen role" that we maintain for ongoing interest. While we receive numerous applications, there may not always be an immediate position that perfectly matches your skills and experience. We encourage you to apply, as we continuously assess applications and reach out to candidates when new opportunities arise. Feel free to reapply if you gain more experience, but please refrain from applying more than once every six months. Additionally, we occasionally post openings for specific roles to meet project or team-specific needs, and in those cases, you are welcome to apply directly in conjunction with this evergreen role.

What You’ll Do

Design security patterns for platforms and services, including network segmentation, service-to-service authentication, RBAC, and policy enforcement in Kubernetes and cloud environments.
Oversee identity, access, and secrets management for users and services: workload and cross-cloud identity, least-privilege IAM, and secrets management.
Create secure platforms for data ingestion, processing, and curation, encompassing classification, encryption, access controls, and safe sharing practices across teams.
Develop threat models and review designs with researchers and engineers to facilitate safe and scalable feature launches.
Automate security checks and implement guardrails: policy-as-code, secure infrastructure baselines, CI/CD validation, and tools that streamline secure operations.

Dec 2, 2025


Software Engineer - Supercomputing at Thinking Machines | San Francisco

Thinking Machines Lab

Full-time|$350K/yr - $475K/yr|On-site|San Francisco

At Thinking Machines Lab, our vision is to enhance human potential by advancing collaborative general intelligence. We are dedicated to creating an inclusive future where everyone can harness AI's capabilities tailored to their unique aspirations.

Our team comprises scientists, engineers, and innovators behind some of the most impactful AI solutions, including ChatGPT and Character.ai, as well as open-source projects like PyTorch and Segment Anything.

About the Role

We are seeking a talented Software Engineer to architect, develop, and maintain the GPU supercomputing infrastructure essential for large-scale AI training and inference. Your contributions will ensure high-performance, reliable, and cost-effective computing resources, enabling our users and researchers to achieve rapid advancements at scale.

This is an "evergreen role," open for ongoing interest. We receive numerous applications, and while an immediate fit may not always be available, we encourage you to apply. We actively review applications and reach out when new opportunities arise. Reapplications are welcome after six months, and we also post specific roles for unique projects or teams.

What You’ll Do

Automate and manage large GPU clusters, handling provisioning, imaging, and capacity strategy.
Develop software that simplifies cluster management, providing a cohesive interface for training and inference tasks.
Enhance scheduling and orchestration frameworks (Kubernetes, Slurm, or similar) for optimized resource allocation, preemption, and multi-tenancy management.
Monitor and improve operational efficiency, focusing on speed, reliability, and error recovery mechanisms.
Design robust storage solutions for datasets, checkpoints, and logs, ensuring clear data retention and lineage.
Collaborate with researchers to facilitate large-scale experiments, offering guidance on parallelism and performance considerations.

Nov 27, 2025


Research Infrastructure Engineer at Thinking Machines | San Francisco

Thinking Machines Lab

Full-time|$350K/yr - $475K/yr|On-site|San Francisco

At Thinking Machines Lab, we are on a mission to empower humanity by advancing collaborative general intelligence. Our vision is to create a future where everyone has access to the knowledge and tools necessary to harness AI for their unique needs and objectives.

We are a diverse team of scientists, engineers, and builders responsible for developing some of the most influential AI products on the market, such as ChatGPT and Character.ai. Our contributions extend to open-weight models like Mistral and popular open-source projects including PyTorch, OpenAI Gym, Fairseq, and Segment Anything.

About the Role

We are seeking talented engineers to join our team and develop the libraries and tools that will accelerate research efforts at Thinking Machines. You will take charge of our internal infrastructure, creating evaluation libraries, reinforcement learning training libraries, and experiment tracking platforms, while building systems that enhance research velocity over time.

This position emphasizes collaboration. You will work closely with researchers to identify bottlenecks and pain points, ensuring that they trust your systems to function seamlessly and find them enjoyable to use.

What You'll Do

Design, build, and manage research infrastructure, including evaluation frameworks, RL training systems, experiment tracking platforms, visualization tools, and shared utilities.
Develop high-throughput, scalable pipelines for distributed evaluation, reward modeling, and multimodal assessment.
Establish systems for reproducibility, traceability, and robust quality control across research experiments and model training runs, implementing effective monitoring and observability.
Collaborate directly with researchers to identify bottlenecks and unlock new capabilities, managing research tools like a product manager by proactively seeking feedback and tracking adoption.
Work alongside infrastructure, data, and product teams to integrate tools across the technical stack.

Feb 3, 2026


Security-Focused Software Engineer at Thinking Machines | San Francisco

Thinking Machines Lab

Full-time|$350K/yr - $475K/yr|On-site|San Francisco

At Thinking Machines Lab, our mission is to enhance human capabilities through the development of collaborative general intelligence. We are dedicated to creating a future where everyone can utilize AI tailored to their specific needs and aspirations.

Our team consists of accomplished scientists, engineers, and innovators responsible for some of the most popular AI applications, including ChatGPT and Character.ai, along with renowned open-weight models like Mistral and influential open-source projects such as PyTorch, OpenAI Gym, Fairseq, and Segment Anything.

About the Role

We are on the lookout for a passionate Software Engineer with a focus on security to ensure our products are secure by design while facilitating rapid and ambitious product development. You will collaborate closely with product and research teams to integrate security measures into the design and development processes, and create tools and automation to maintain system safety at scale.

Note: This is an ongoing opportunity, and we encourage you to express your interest. While we receive numerous applications and there may not always be an immediate match for your skills, we encourage you to apply. We consistently review applications and will reach out as new roles become available. You may reapply if you gain additional experience, but please limit applications to once every six months. We also post specific roles for particular projects or teams, and you are welcome to apply for those as well.

What You’ll Do

Collaborate with product and research teams to integrate security into the development lifecycle: threat modeling, design reviews, and establishing secure defaults for new features.
Design and implement security controls throughout our product stack (authentication, authorization, session management, input validation, etc.).
Create and maintain security tooling and automation for engineers: secure frameworks and templates, CI/CD checks, dependency management, and vulnerability detection.
Work alongside researchers to identify and address AI-specific product risks, such as model abuse, prompt injection, data leakage, or misuse of capabilities.
Enhance observability and detection for security-related events: access anomalies, abuse patterns, and suspicious behavior in production.

Nov 27, 2025


Research Engineer, Infrastructure & Inference

Thinking Machines Lab

Full-time|$350K/yr - $475K/yr|On-site|San Francisco

At Thinking Machines Lab, we are dedicated to empowering humanity by advancing collaborative general intelligence. Our vision is to create a future where everyone can leverage AI to meet their unique needs and aspirations.

Our talented team comprises scientists, engineers, and innovators who have developed some of the most widely recognized AI products, including ChatGPT and Character.ai, alongside open-weight models like Mistral and popular open-source projects such as PyTorch, OpenAI Gym, Fairseq, and Segment Anything.

About the Position

We are seeking a motivated Infrastructure Research Engineer to design, enhance, and scale the systems that underpin large AI models. Your contributions will significantly improve inference speed, cost-effectiveness, reliability, and reproducibility, allowing our teams to concentrate on enhancing model capabilities rather than dealing with bottlenecks.

Our mission centers on delivering high-performance and efficient model inference to support real-world applications and accelerate research efforts. In this role, you will be responsible for the infrastructure that guarantees smooth operation for every experiment, evaluation, and deployment at scale.

Note: This is an evergreen role, kept open continuously for expressions of interest. We receive numerous applications and may not always have an immediate opening that aligns perfectly with your skills and experience. However, we encourage you to apply. We regularly review applications and reach out to candidates as new opportunities arise. Feel free to reapply as you gain more experience, but we kindly ask that you avoid applying more than once every six months. You may also notice postings for specific roles related to particular projects or teams, in which case you are welcome to apply directly in addition to this evergreen role.

What You Will Do

Collaborate with researchers and engineers to transition cutting-edge AI models into production.
Partner with research teams to ensure high-performance inference for innovative architectures.
Design and implement new techniques, tools, and architectures that enhance performance, latency, throughput, and efficiency.
Optimize our codebase and computing resources (e.g., GPUs) to maximize hardware FLOPs, bandwidth, and memory usage.
Extend orchestration frameworks (e.g., Kubernetes, Ray, SLURM) for distributed inference, evaluation, and large-batch serving.
Establish standards for reliability, observability, and reproducibility throughout the inference stack.
Publish and share insights through internal documentation, open-source libraries, or technical reports that further the field of scalable AI infrastructure.

Nov 27, 2025


Infrastructure Research Engineer - Training Systems

Thinking Machines Lab

Full-time|$350K/yr - $475K/yr|On-site|San Francisco

At Thinking Machines Lab, our mission is to empower humanity by advancing collaborative general intelligence. We envision a future where everyone has access to the knowledge and tools necessary to harness AI for their unique needs and goals.

Our team comprises scientists, engineers, and builders who have developed some of the most widely utilized AI products, such as ChatGPT and Character.ai, alongside open-weight models like Mistral, and popular open-source initiatives like PyTorch, OpenAI Gym, Fairseq, and Segment Anything.

About the Position

We are seeking an Infrastructure Research Engineer to design and construct the foundational systems that facilitate the scalable and efficient training of large models for both deployment and research purposes. Your primary objective will be to streamline experimentation and training at Thinking Machines, enabling our research teams to concentrate on scientific advancements rather than system limitations.

This role is a perfect match for an individual who possesses a strong blend of deep systems expertise and a keen interest in machine learning at scale. You will take full ownership of the training stack, ensuring that every GPU cycle contributes to scientific progress.

Note: This is an evergreen role that we keep open continuously for expressions of interest. We receive numerous applications, and there may not always be an immediate role that aligns perfectly with your experience and skills. However, we encourage you to apply. We regularly review applications and reach out to candidates as new opportunities arise. Feel free to reapply as you gain more experience, but please avoid applying more than once every six months. We may also post specific roles for individual projects or team needs, in which case you are welcome to apply directly alongside this evergreen role.

Key Responsibilities

Design, implement, and optimize distributed training systems that scale across thousands of GPUs and nodes for extensive training workloads.
Develop high-performance optimizations to maximize throughput and efficiency.
Create reusable frameworks and libraries that enhance training reproducibility, reliability, and scalability for new model architectures.
Establish standards for reliability, maintainability, and security, ensuring systems remain robust under rapid iterations.
Collaborate with researchers and engineers to construct scalable infrastructure.
Publish and disseminate findings through internal documentation, open-source libraries, or technical reports that contribute to the advancement of scalable AI infrastructure.

Nov 27, 2025


Infrastructure Research Engineer - Numerics at Thinking Machines | San Francisco

Thinking Machines Lab

Full-time|$350K/yr - $475K/yr|On-site|San Francisco

At Thinking Machines Lab, our mission is to empower humanity by advancing collaborative general intelligence. We envision a future where everyone has access to the knowledge and tools necessary to make AI work for their individual needs and goals.

Our team comprises scientists, engineers, and innovators who have developed some of the most widely adopted AI products, including ChatGPT and Character.ai, alongside open-weight models like Mistral, as well as popular open-source initiatives such as PyTorch, OpenAI Gym, Fairseq, and Segment Anything.

About the Role

We are seeking a highly skilled infrastructure research engineer to architect and develop core systems that facilitate efficient large-scale model training, with a strong emphasis on numerics. You will enhance the numerical foundations of our distributed training stack, focusing on precision formats, kernel optimizations, and communication frameworks to ensure that training trillion-parameter models is stable, scalable, and fast.

This position is perfect for an individual who excels at the intersection of research and systems engineering: a creator who comprehends both the mathematics of optimization and the practicalities of distributed computing.

Note: This is an "evergreen role" that remains open for ongoing expressions of interest. While we receive numerous applications and there may not always be an immediate opening that perfectly matches your skills and experience, we encourage you to apply. We continuously review applications and will contact applicants as new opportunities arise. You are welcome to reapply if you gain additional experience, but please refrain from applying more than once every six months. You may also notice postings for specific roles related to particular projects or teams; in those instances, you are welcome to apply for those positions in addition to the evergreen role.

What You’ll Do

Design and optimize distributed training infrastructure for large-scale LLMs, ensuring performance, stability, and reproducibility in multi-GPU and multi-node environments.
Implement and assess low-precision numerics (e.g., BF16, MXFP8, NVFP4) to enhance efficiency while maintaining model quality.
Develop kernels and communication primitives that leverage hardware-level support for mixed and low-precision arithmetic.
Collaborate with research teams to co-design model architectures and training methodologies that align with new numeric formats and stability requirements.
Prototype and benchmark scaling strategies, including data, tensor, and pipeline parallelism that integrate precision-adaptive computation and quantized communication.
Contribute to the design of our internal orchestration and monitoring frameworks.

Nov 27, 2025


Infrastructure Research Engineer - Reinforcement Learning Systems

Thinking Machines Lab

Full-time|$350K/yr - $475K/yr|On-site|San Francisco

At Thinking Machines Lab, our mission is to empower humanity by advancing collaborative general intelligence. We're dedicated to crafting a future where everyone can harness the power of AI to meet their unique needs and aspirations.

Our team comprises scientists, engineers, and innovators who have developed some of the most widely utilized AI products, including ChatGPT and Character.ai, as well as open-weight models like Mistral, in addition to renowned open-source projects such as PyTorch, OpenAI Gym, Fairseq, and Segment Anything.

About the Role

We are seeking a talented Infrastructure Research Engineer to architect and develop the foundational systems that facilitate the scalable and efficient training of large models using reinforcement learning.

This position exists at the crossroads of research and large-scale systems engineering, requiring a professional who not only comprehends the algorithms behind reinforcement learning but also appreciates the practicalities of distributed training and inference at scale. You will have a diverse set of responsibilities, from optimizing rollout and reward pipelines to enhancing the reliability, observability, and orchestration of systems. Collaboration with researchers and infrastructure teams will be essential to ensure reinforcement learning is stable, rapid, and production-ready.

Note: This is an evergreen role that we maintain on an ongoing basis for expressions of interest. Due to the high volume of applications we receive, there may not always be an immediate position that aligns perfectly with your skills and experience. We encourage you to apply, as we continuously review applications and reach out to candidates when new opportunities arise. You may reapply after gaining more experience, but please refrain from applying more than once every six months. Additionally, you may notice postings for specific roles that cater to unique project or team needs; in those circumstances, you are welcome to apply directly alongside this evergreen role.

What You’ll Do

Design, implement, and optimize the infrastructure that supports large-scale reinforcement learning and post-training workloads.
Enhance the reliability and scalability of the RL training pipeline, including distributed RL workloads and training throughput.
Create shared monitoring and observability tools to ensure high uptime, debuggability, and reproducibility of RL systems.
Work closely with researchers to translate algorithmic concepts into production-quality training pipelines.
Develop evaluation and benchmarking infrastructure to assess model performance based on helpfulness, safety, and factual accuracy.
Publish and disseminate insights through internal documentation, open-source libraries, or technical reports that contribute to the advancement of scalable AI infrastructure.

Nov 27, 2025

Research Product Manager at Thinking Machines | San Francisco

Thinking Machines Lab

Full-time|$175K/yr - $475K/yr|On-site|San Francisco

At Thinking Machines Lab, we strive to empower humanity by advancing collaborative general intelligence. Our vision is to create a future where everyone can access the knowledge and tools necessary to harness AI for their specific needs and aspirations.

Our team comprises scientists, engineers, and innovators who have developed some of the most widely used AI products, such as ChatGPT and Character.ai, along with notable open-weight models like Mistral and prominent open-source projects including PyTorch, OpenAI Gym, Fairseq, and Segment Anything.

About the Role

As a Research Product Manager (RPM) at Thinking Machines Lab, you will play a pivotal role in driving complex, high-impact technical products and programs that encompass research, infrastructure, and applied initiatives. You will help turn ambitious concepts into reality by driving cross-functional collaboration, keeping projects moving, and fostering clarity in fast-paced, ambiguous settings.

Your contributions will connect people, ideas, and systems, ensuring that our critical research initiatives remain aligned, well-defined, and progressing efficiently. This position is ideal for someone who excels in technical discussions, understands the intricacies of research, and can think at a high level while also delving into the details, with the ultimate aim of helping the company execute at scale.

Note: This is an "evergreen role" that we keep open on an ongoing basis for expressions of interest. We receive numerous applications, and there may not always be an immediate role that aligns perfectly with your experience and skills. Nevertheless, we encourage you to apply. We continuously review applications and reach out to applicants as new opportunities arise. You are welcome to reapply if you gain more experience, but please refrain from applying more than once every six months. You may also find that we post job openings for specific roles related to separate projects or team needs. In those cases, you are welcome to apply directly in addition to this evergreen role.

What You’ll Do

Drive and coordinate large-scale research products and programs, ensuring that complex projects are executed efficiently, transparently, and with scientific rigor.
Translate technical ideas into actionable, well-scoped plans, defining milestones and ensuring team alignment across model development, data campaigns, infrastructure, and product integration.
Collaborate across disciplines, from research and ML infrastructure to legal and business development, quickly ramping up on new domains as necessary.
Create and maintain compute and resource roadmaps, identifying bottlenecks and solutions to optimize project flow.

Nov 28, 2025

Research Engineer, Infrastructure

Cognition

Full-time|On-site|San Francisco Bay Area

Join our dynamic team at Cognition as a Research Engineer specializing in Infrastructure. In this role, you will be at the forefront of cutting-edge research, contributing to innovative solutions that shape the future of our infrastructure projects.

Your responsibilities will include conducting thorough research, analyzing data, and collaborating with cross-functional teams to implement effective strategies. We are looking for an individual who is passionate about technology and infrastructure, eager to solve complex problems, and ready to drive impactful results.

Apr 8, 2026

Forward-Deployed Research Engineer at primeintellect | San Francisco

Prime Intellect

Full-time|On-site|San Francisco

Be Your Own Lab

At Prime Intellect, we are dedicated to constructing the foundational infrastructure that leading AI laboratories utilize internally, making it accessible to all. Our advanced platform, Lab, integrates environments, evaluations, sandboxes, and high-performance training into a cohesive full-stack system for post-training at the forefront of AI development. From Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT) to tool utilization and agent workflows, we ensure every aspect is validated through our own rigorous testing, training cutting-edge models on the same robust stack we offer to our users. We seek individuals who are passionate about contributing at the intersection of pioneering research and tangible infrastructure.

Recently, we secured $15 million in funding (with a total of $20 million raised) led by Founders Fund, along with contributions from Menlo Ventures and esteemed investors such as Andrej Karpathy (Eureka AI, Tesla, OpenAI), Tri Dao (Chief Scientific Officer of Together AI), Dylan Patel (SemiAnalysis), Clem Delangue (Huggingface), Emad Mostaque (Stability AI), and many others.

About the Role

We are in search of a Forward-Deployed Research Engineer (FDRE) who will act as the key technical liaison between Prime Intellect and our most valued clients: AI companies, research institutions, and enterprises implementing post-training and agentic RL on our platform.

This role transcends traditional research; you will primarily engage directly with customers to gain insights into their models, workflows, and objectives. Your responsibility will be to convert these insights into actionable training runs, environment designs, evaluation harnesses, and deployment strategies using the Lab stack. You will be the catalyst for making our platform operate effectively for real-world applications.

Collaboration with our research, product, and infrastructure teams will be essential, as you will provide valuable field insights to inform future developments, ensuring we align our offerings with actual customer needs.

What You'll Do

Customer Engagement & Technical Delivery

Work directly with key customers to comprehend their agent architectures, identify failure modes, and clarify product goals.
Create and develop tailored RL environments, evaluation tools, and verification methods that define success for each specific domain.
Design agent scaffolding (including tool usage, multi-step reasoning, memory functions, and sandbox execution) customized to match client workflows.
Set up and initiate training sessions on Lab, refining reward functions, rollout strategies, and evaluation standards.
Lead technical engagements from inception to deployment, ensuring seamless integration and functionality.

Feb 20, 2026

Infrastructure Engineer at Chalk | San Francisco

Chalk

Full-time|On-site|SF

About Chalk

At Chalk, we are revolutionizing the data platform that drives the future of machine learning applications. Our mission is to eliminate the complexity, latency, and scalability issues that have historically limited ML capabilities. Our platform seamlessly integrates Rust-speed performance with user-friendly tools that developers adore. Renowned companies trust Chalk to combat fraudulent credit card transactions, verify identities, and enhance clean energy utilization. Recently, we secured a $50 million Series A funding, spearheaded by Felicis.

About the Role

We are on the lookout for talented engineers to join our Infrastructure team. This is a unique opportunity to become one of our early hires and significantly impact a fast-growing startup. You will have the autonomy to solve complex engineering challenges and take ownership of your projects.

We seek a platform engineer with a solid background in infrastructure engineering. At Chalk, we are tackling problems related to DBMS query planning, optimization, compilers, and distributed analytical data processing systems.

Chalk employs dynamic and static analysis of Python code to optimize arbitrary user Python code, orchestrate the necessary infrastructure implied by that code, and track metadata regarding data flow through our systems.

Our team works in the office five days a week. We are flexible with unavoidable conflicts, but this is not a hybrid position.

What You Will Do

Develop code to automate the orchestration and provisioning of infrastructure to implement Chalk technology for our customers and prospects.
Create a robust platform for managing our hosted services and deploying Chalk into customer-owned cloud environments across AWS and GCP.
Collaborate closely with our Engineering and Sales teams.
Contribute to interviewing and expanding the Engineering team.

What We’re Looking For

Minimum of 2 years of experience in software development for automated infrastructure management.
Proficiency in Python, Go, and/or Terraform.
Hands-on experience with AWS and/or GCP.
Strong collaborative skills in both technical and non-technical teams.

Dec 18, 2023

Research Engineer at Mercor | San Francisco

Mercor

Full-time|On-site|San Francisco

About Mercor

Mercor sits at the forefront of labor markets and artificial intelligence research, collaborating with premier AI laboratories and enterprises to harness the human intelligence crucial for AI evolution.

Our expansive talent network empowers the training of cutting-edge AI models, akin to how educators impart knowledge to students: sharing insights, experiences, and contexts that transcend mere code. Currently, our network comprises over 30,000 experts, generating collective earnings exceeding $2 million daily.

At Mercor, we are pioneering a unique category of work where expertise fuels AI progress. Realizing this vision necessitates a bold, fast-paced, and deeply dedicated team. You will collaborate with researchers, operators, and AI firms that are at the vanguard of transforming systems that redefine society.

As a profitable Series C company, Mercor is valued at $10 billion and maintains an in-office presence five days a week at our new headquarters in San Francisco.

About the Role

In your capacity as a Research Engineer at Mercor, you will operate at the intersection of engineering and applied AI research. You will play a pivotal role in post-training and Reinforcement Learning with Verifiable Rewards (RLVR), synthetic data generation, and large-scale evaluation workflows essential for advancing frontier language models.

Your contributions will help train large language models to adeptly utilize tools, exhibit agentic behavior, and engage in real-world reasoning within production environments. You will be instrumental in shaping rewards, conducting post-training experiments, and constructing scalable systems to enhance model performance. Your responsibilities will also include designing and evaluating datasets, creating scalable data augmentation pipelines, and developing rubrics and evaluators that expand the learning potential of LLMs.

Dec 29, 2025

Infrastructure Engineer at rowspace | San Francisco

rowspace

Full-time|On-site|San Francisco

The Opportunity

Join rowspace as an Infrastructure Engineer and play a pivotal role in constructing and safeguarding the core of our cutting-edge AI data platform. In this position, you'll engineer systems capable of managing extensive volumes of sensitive financial information while adhering to rigorous security and compliance standards. Your work will involve real-time integration of public data with private, tenant-isolated customer data at scale.

Key Responsibilities

Design and implement scalable infrastructure to support our AI-driven knowledge engine that processes both structured and unstructured financial data.
Establish a security-first architecture for private cloud environments, ensuring data governance aligns with financial services regulations.
Create resilient data ingestion pipelines that accommodate a variety of data sources, from CapIQ feeds (structured data) to internal SharePoint documents (unstructured data).
Develop comprehensive monitoring and alerting systems for our BYOC platform.
Enforce access controls and maintain audit trails to ensure that AI interactions can be traced back to primary sources.
Collaborate with our AI Research and Product teams to enhance infrastructure for LLM inference and training workloads, as well as agent infrastructure development.
Establish CI/CD practices and infrastructure-as-code for swift, reliable deployments across multiple cloud providers.

Feb 4, 2026

Software Engineer, Data Infrastructure

Thinking Machines Lab

Full-time|$350K/yr - $475K/yr|On-site|San Francisco

At Thinking Machines Lab, our vision is to enhance human potential by advancing collaborative general intelligence. We are dedicated to creating a future where individuals have the resources and knowledge to harness AI for their specific objectives and aspirations.

Our team comprises scientists, engineers, and innovators who have developed some of the most popular AI products, including ChatGPT and Character.ai, as well as influential open-weight models like Mistral, along with highly regarded open-source projects such as PyTorch, OpenAI Gym, Fairseq, and Segment Anything.

About the Role

We are seeking a talented engineer to enhance our data infrastructure. You will become part of a dynamic, high-impact team tasked with designing and scaling the foundational infrastructure for distributed training pipelines, multimodal data catalogs, and sophisticated processing systems that manage petabytes of data.

Our infrastructure is pivotal; it serves as the foundation for every groundbreaking achievement. You will collaborate directly with researchers to expedite experiments, develop novel datasets, optimize infrastructure efficiency, and derive essential insights from our data repositories.

If you are passionate about distributed systems, large-scale data mining, and open-source tools such as Spark, Kafka, Beam, Ray, and Delta Lake, and enjoy building innovative solutions from scratch, we encourage you to apply.

Note: This is an evergreen role that we keep open continuously for expressions of interest. We receive a high volume of applications, and while there may not always be an immediate position that aligns perfectly with your skills and experience, we encourage you to apply. We regularly review applications and reach out as new opportunities arise. You are welcome to reapply after gaining more experience, but please refrain from applying more than once every six months. We may also post for specific roles for particular projects or team needs, and in those cases, you are welcome to apply directly in addition to this evergreen role.

Nov 27, 2025

Research Engineer at HUD | San Francisco

HUD

Full-time|On-site|San Francisco

HUD builds infrastructure for generating and evaluating reinforcement learning (RL) training data for advanced AI agents. The team is also developing a marketplace to connect leading labs with high-quality training data. HUD's platform serves frontier labs, Fortune 500 companies, and startups. The company is backed by $15M in funding from top venture capital firms and is part of Y Combinator's W25 cohort.

Role overview

HUD is seeking Research Engineers in San Francisco to help strengthen quality assurance for training data produced by partner organizations. This position centers on building systems that maintain and improve data quality as demand increases.

What you will do

Set and uphold quality standards for training datasets.
Develop tools and workflows for auditing datasets from suppliers, including sampling methods, validation pipelines (using rules and models), and feedback systems.
Assess and refine human-in-the-loop review processes to support quality assurance.
Collaborate with data vendors to resolve quality issues, share insights, and encourage better data generation practices.
Integrate QA findings into internal tools and the data vendor portal to reduce anomalies, inconsistencies, and edge cases.

Requirements

Strong skills in Python, Docker, and Linux environments.
Experience working with large datasets.
Ability to learn quickly and adapt in technical contexts, such as programming competitions.
Background in early-stage tech startups and ability to work independently.
Familiarity with modern AI tools and large language models (LLMs).
Clear communication skills for collaborating remotely across time zones.

Preferred qualifications

Understanding of common issues in training data.
Background in building data validation pipelines or human-in-the-loop review systems.
Strong attention to detail, with the ability to identify subtle data inconsistencies or edge cases.
Experience designing metrics, experiments, and QA processes, not just executing them.

Apr 24, 2026

1 - 20 of 11,490 Jobs

Infrastructure Research Engineer at thinkingmachines | San Francisco

Thinking Machines Lab

Full-time|$350K/yr - $475K/yr|On-site|San Francisco

Nov 27, 2025

Software Engineer, Research Acceleration at thinkingmachines | San Francisco

Thinking Machines Lab

Full-time|$350K/yr - $475K/yr|On-site|San Francisco

Feb 3, 2026

Full Stack Software Engineer at thinkingmachines | San Francisco

Thinking Machines Lab

Full-time|$350K/yr - $475K/yr|On-site|San Francisco

Nov 27, 2025

Infrastructure Research Engineer - Kernels at Thinking Machines | San Francisco

Thinking Machines Lab

Full-time|$350K/yr - $475K/yr|On-site|San Francisco

Nov 27, 2025

Infrastructure Engineer - Security at thinkingmachines | San Francisco

Thinking Machines Lab

Full-time|$200K/yr - $475K/yr|On-site|San Francisco

Dec 2, 2025

Software Engineer - Supercomputing at thinkingmachines | San Francisco

Thinking Machines Lab

Full-time|$350K/yr - $475K/yr|On-site|San Francisco

Nov 27, 2025

Research Infrastructure Engineer at Thinking Machines | San Francisco

Thinking Machines Lab

Full-time|$350K/yr - $475K/yr|On-site|San Francisco

Feb 3, 2026

Security-Focused Software Engineer at thinkingmachines | San Francisco

Thinking Machines Lab

Full-time|$350K/yr - $475K/yr|On-site|San Francisco

Nov 27, 2025

Research Engineer, Infrastructure & Inference

Thinking Machines Lab

Full-time|$350K/yr - $475K/yr|On-site|San Francisco

Nov 27, 2025

Infrastructure Research Engineer - Training Systems

Thinking Machines Lab

Full-time|$350K/yr - $475K/yr|On-site|San Francisco

Nov 27, 2025

Infrastructure Research Engineer - Numerics at Thinking Machines | San Francisco

Thinking Machines Lab

Full-time|$350K/yr - $475K/yr|On-site|San Francisco

At Thinking Machines Lab, our mission is to empower humanity by advancing collaborative general intelligence. We envision a future where everyone has access to the knowledge and tools necessary to make AI work for their individual needs and goals. Our team comprises scientists, engineers, and innovators who have developed some of the most widely adopted AI products, including ChatGPT and Character.ai, alongside open-weight models like Mistral, as well as popular open-source initiatives such as PyTorch, OpenAI Gym, Fairseq, and Segment Anything.

About the Role

We are seeking a highly skilled infrastructure research engineer to architect and develop core systems that facilitate efficient large-scale model training, with a strong emphasis on numerics. You will enhance the numerical foundations of our distributed training stack, focusing on precision formats, kernel optimizations, and communication frameworks to ensure that training trillion-parameter models is stable, scalable, and fast.

This position is perfect for an individual who excels at the intersection of research and systems engineering: a creator who comprehends both the mathematics of optimization and the practicalities of distributed computing.

Note: This is an "evergreen role" that remains open for ongoing expressions of interest. While we receive numerous applications and there may not always be an immediate opening that perfectly matches your skills and experience, we encourage you to apply. We continuously review applications and will contact applicants as new opportunities arise. You are welcome to reapply if you gain additional experience, but please refrain from applying more than once every six months. You may also notice postings for specific roles related to particular projects or teams; in those instances, you are welcome to apply for those positions in addition to the evergreen role.

What You’ll Do

Design and optimize distributed training infrastructure for large-scale LLMs, ensuring performance, stability, and reproducibility in multi-GPU and multi-node environments.
Implement and assess low-precision numerics (e.g., BF16, MXFP8, NVFP4) to enhance efficiency while maintaining model quality.
Develop kernels and communication primitives that leverage hardware-level support for mixed and low-precision arithmetic.
Collaborate with research teams to co-design model architectures and training methodologies that align with new numeric formats and stability requirements.
Prototype and benchmark scaling strategies, including data, tensor, and pipeline parallelism that integrate precision-adaptive computation and quantized communication.
Contribute to the design of our internal orchestration and monitoring frameworks.

Nov 27, 2025

Infrastructure Research Engineer - Reinforcement Learning Systems

Thinking Machines Lab

Full-time|$350K/yr - $475K/yr|On-site|San Francisco

Nov 27, 2025

Research Product Manager at Thinking Machines | San Francisco

Thinking Machines Lab

Full-time|$175K/yr - $475K/yr|On-site|San Francisco

Nov 28, 2025
