About the job
At Thinking Machines Lab, our mission is to empower people by advancing collaborative general intelligence. We envision a future where everyone has the tools and knowledge to make AI work for their own needs and goals.
Our team is made up of scientists, engineers, and builders behind some of the most widely used AI products, including ChatGPT and Character.ai, open-weight models like Mistral, and popular open-source projects such as PyTorch, OpenAI Gym, Fairseq, and Segment Anything.
About the Role
We are seeking an Infrastructure Research Engineer to design, optimize, and maintain the compute infrastructure that powers large-scale language model training. You will write high-performance machine learning kernels (e.g., CUDA, CuTe, Triton), enable efficient low-precision arithmetic, and improve the distributed systems needed to train large models.
This role is ideal for an engineer who thrives at the intersection of hardware and research. You will work closely with researchers and systems architects to bridge algorithmic design and hardware efficiency. Your responsibilities will include prototyping new kernel implementations, evaluating performance across hardware generations, and helping to set the numerical and parallelism strategies needed to scale next-generation AI systems.
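To give candidates a concrete flavor of the low-precision work described above, here is a minimal, hypothetical Triton sketch (illustrative only, not code from our stack): an elementwise affine transform that keeps activations in fp16 to halve memory traffic while upcasting to fp32 for the arithmetic. The names `affine_fp16_kernel` and `affine_fp16` are assumptions made for this example.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def affine_fp16_kernel(x_ptr, out_ptr, scale, shift, n_elements,
                       BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one contiguous block of elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    # Load fp16 values and upcast: the math happens in fp32 for accuracy.
    x = tl.load(x_ptr + offsets, mask=mask).to(tl.float32)
    y = x * scale + shift
    # Downcast back to fp16 on the way out to keep memory traffic low.
    tl.store(out_ptr + offsets, y.to(tl.float16), mask=mask)


def affine_fp16(x: torch.Tensor, scale: float, shift: float) -> torch.Tensor:
    assert x.is_cuda and x.dtype == torch.float16 and x.is_contiguous()
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    affine_fp16_kernel[grid](x, out, scale, shift, n, BLOCK_SIZE=1024)
    return out
```

The design choice this toy example gestures at, choosing where precision is spent (storage vs. accumulation), is at the heart of the role.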
Note: This is an evergreen role that remains open on an ongoing basis for expressions of interest. We receive a large volume of applications, and there may not always be an immediate opening that matches your background. We still encourage you to apply: we regularly review applications and will reach out as new positions open up. You are also welcome to reapply after gaining additional experience, but please do not apply more than once every six months. You may also notice postings for specific roles tied to particular projects or team needs; in those cases, feel free to apply to them directly alongside this evergreen listing.
What You’ll Do
- Design and develop custom ML kernels (e.g., CUDA, CuTe, Triton) for key LLM operations such as attention, matrix multiplication, gating, and normalization, optimized for contemporary GPU and accelerator architectures.
- Design compute primitives that alleviate memory-bandwidth bottlenecks and improve kernel compute efficiency (see the fused-kernel sketch after this list).
- Collaborate with research teams to align kernel-level optimizations with model architecture and algorithmic goals.
- Create and maintain a library of reusable kernels and performance benchmarks that serve as the foundation for internal model training.
- Contribute to the stability and scalability of our infrastructure, ensuring it meets the growing demands of AI development.
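As a sketch of the kernel-fusion work mentioned above, here is a minimal Triton fused softmax, the textbook example of trading repeated memory passes for on-chip compute: each row is read from and written to HBM exactly once, with the max-subtraction, exponentiation, and normalization fused in registers. This is an illustrative example under simplifying assumptions (contiguous 2D fp32 input, rows that fit in one block), not our production code.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def softmax_kernel(x_ptr, out_ptr, row_stride, n_cols, BLOCK_SIZE: tl.constexpr):
    # Each program instance normalizes one row; the row stays in registers,
    # so global memory is touched once on load and once on store.
    row = tl.program_id(axis=0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    x = tl.load(x_ptr + row * row_stride + cols, mask=mask, other=-float("inf"))
    x = x - tl.max(x, axis=0)      # subtract the row max for numerical stability
    num = tl.exp(x)
    denom = tl.sum(num, axis=0)
    tl.store(out_ptr + row * row_stride + cols, num / denom, mask=mask)


def softmax(x: torch.Tensor) -> torch.Tensor:
    assert x.is_cuda and x.ndim == 2 and x.is_contiguous()
    n_rows, n_cols = x.shape
    out = torch.empty_like(x)
    # BLOCK_SIZE must be a power of two that covers the full row.
    block = triton.next_power_of_2(n_cols)
    softmax_kernel[(n_rows,)](x, out, x.stride(0), n_cols, BLOCK_SIZE=block)
    return out
```

A naive implementation would launch three separate kernels (max, exp-sum, divide) and move the data through HBM three times; fusing them is exactly the kind of bandwidth-saving primitive this role builds, at much larger scale and across attention, matmul, gating, and normalization.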

