Site Reliability Engineer At Thinking Machines San Francisco jobs in San Francisco – Browse 11,483 openings on RoboApply Jobs


Open roles matching “Site Reliability Engineer At Thinking Machines San Francisco” with location signals for San Francisco. 11,483 active listings on RoboApply Jobs.


1 - 20 of 11,483 Jobs
Thinking Machines Lab
Full-time|$350K/yr - $475K/yr|On-site|San Francisco

Thinking Machines Lab brings together scientists, engineers, and innovators who have shaped well-known AI products like ChatGPT and Character.ai, as well as open-weight models such as Mistral. The team also contributes to open-source projects including PyTorch, OpenAI Gym, Fairseq, and Segment Anything. The company’s mission centers on advancing collaborative general intelligence, aiming to make AI accessible and adaptable to individual needs. Tinker, the company’s fine-tuning API, enables researchers and developers to customize advanced AI models using their own data and algorithms. Thinking Machines manages the infrastructure, giving users the flexibility to train open-weight models while focusing on their unique requirements. As Tinker expands, the platform continues to evolve alongside its growing community.

Role overview
The Site Reliability Engineer will focus on improving the reliability and resilience of the Tinker platform. This role involves close collaboration with platform engineers and research teams to strengthen every layer of the system, from infrastructure to user-facing services.

What you will do
- Define and take ownership of end-to-end reliability, including CI/CD workflows, production observability, and incident response processes.
- Set Service Level Objectives for distributed training systems, balancing reliability, scheduling latency, and development speed.
- Design and implement monitoring and observability across the training pipeline.
- Manage incident response for Tinker, ensuring prompt recovery, thorough incident analysis, and systematic improvements to prevent recurrence.
- Enhance multi-tenant isolation and resource scheduling to support LoRA-based workload co-scheduling, maintaining both reliability and data separation.
- Collaborate with security teams to identify and address production vulnerabilities.

This position is based in San Francisco.

Apr 28, 2026
Thinking Machines Lab
Full-time|$350K/yr - $475K/yr|On-site|San Francisco

Thinking Machines Lab brings together scientists, engineers, and innovators behind widely recognized AI products such as ChatGPT and Character.ai, as well as open-source frameworks like PyTorch, OpenAI Gym, Fairseq, and Segment Anything. The team is driven by a mission to enhance humanity through collaborative general intelligence, aiming for a future where AI adapts to individual needs and goals. Tinker, the lab’s fine-tuning API, empowers researchers and developers to customize advanced AI models for their own use cases. Tinker manages the infrastructure, allowing users to train open-weight models with their chosen datasets, algorithms, and objectives. As Tinker grows its user base and features, the team is expanding to better support the community.

Role overview
The Forward Deployed Engineer acts as the main point of contact for a broad range of clients, from solo developers to large organizations. This role identifies customer challenges and requirements, then translates those insights into actionable product improvements. Both customer interaction and product development responsibilities are central to this position.

What you will do
- Triage and resolve customer issues across the full stack, including analyzing logs, reproducing failures, and tracing job executions.
- Develop tools, integrations, and automation to address recurring problems and speed up user support.
- Create and update clear documentation and practical guides based on real user experiences and implementations.
- Work closely with research and infrastructure teams to turn customer feedback into prioritized engineering tasks.
- Help shape Tinker’s product roadmap by sharing insights from daily customer interactions.

Apr 27, 2026
Thinking Machines Lab
Full-time|$350K/yr - $475K/yr|On-site|San Francisco

Thinking Machines Lab aims to advance collaborative general intelligence, making AI accessible and adaptable for individuals and organizations. The team brings together scientists, engineers, and innovators behind well-known AI solutions, including ChatGPT, Character.ai, Mistral, and open-source projects like PyTorch, OpenAI Gym, Fairseq, and Segment Anything. Tinker, the lab’s fine-tuning API, helps researchers and developers customize AI models using their own data and algorithms. By handling the infrastructure, Tinker allows users to focus on training and deploying models that suit their needs. With a growing customer base and expanding features, the team is looking for a Software Engineer, Platform to support Tinker's continued development.

Role overview
This position centers on building and maintaining the core platform systems that power Tinker. The engineer will manage billing and usage metering, permissions and access control, organizational structures, data exports, audit logging, and the administrative tools that tie these systems together. Collaboration with product and legal teams is essential, as changes to features, pricing, and enterprise agreements will involve this role.

What you will do
- Design the authorization layer for all products, including RBAC, API key scoping, organizational hierarchies, and permission boundaries.
- Oversee billing infrastructure, covering usage metering, plan management, payment processing, invoicing, and revenue recognition support.
- Develop and improve models for organizations and teams, such as seat management, SSO/SAML, workspace isolation, and invitation flows.
- Implement data export and deletion processes that align with enterprise standards and data residency requirements.
- Create audit logging systems to track user actions and decisions.

This role is based in San Francisco.

Apr 27, 2026
Drata
Full-time|$166.9K/yr - $225.9K/yr|Hybrid|San Francisco

Drata helps organizations demonstrate their commitment to security and integrity. The platform supports companies as they build and maintain trust with users, customers, partners, and prospects.

Values
- Built on Trust: Consistency shapes decisions and actions.
- Integrity: Choosing to do what is right, every time.
- Customer-Obsessed: Prioritizing customer needs above all else.
- Competitive Fire: Striving for higher standards and greater achievements.
- Diversity: Welcoming different perspectives to encourage creative solutions.
- Automation First: Pursuing efficiency by saving time and resources wherever possible.

How the Team Works
Drata blends high standards with a supportive environment focused on growth. Team members are encouraged to own their work, improve continuously, and deliver meaningful results. The company values quick, informed decisions that drive immediate impact, while always keeping the mission and customer needs at the center. The San Francisco-based team uses a hybrid work model: colleagues collaborate in the office Tuesday through Thursday, focusing on alignment and innovation, while Mondays and Fridays offer flexibility for deep work or personal needs.

Growth and Culture
Drata has expanded to over 600 professionals worldwide, recognized for a culture that values trust, speed, and continuous learning. The environment supports both personal and professional development.
- See the Speed: CEO Adam Markowitz discusses Drata’s rapid journey to $100M ARR in four years.
- Hear the Voice of the Team: Employee stories highlight collaboration and growth at Drata.

Apr 27, 2026
Thinking Machines Lab
Full-time|$350K/yr - $475K/yr|On-site|San Francisco, California

Thinking Machines Lab brings together scientists, engineers, and innovators who have contributed to well-known AI products such as ChatGPT and Character.ai, and to open-weight models like Mistral. The team’s open-source projects include PyTorch, OpenAI Gym, Fairseq, and Segment Anything. Their mission centers on advancing collaborative general intelligence and making AI tools accessible for a wide range of users and goals. The Tinker platform offers a fine-tuning API that lets researchers and developers tailor advanced AI models to their needs. By handling the underlying infrastructure, Tinker enables users to train open-weight models with custom data, algorithms, and objectives. As demand grows, the team is adding new features and supporting an expanding community.

Role overview
The Full Stack Software Engineer will play a key part in building and maintaining the products and services that Tinker users depend on. This position involves working closely with frontend, backend, and infrastructure teams to deliver the Tinker console, developer tools, and essential features.

What you will do
- Develop and enhance Tinker’s APIs and backend services using Python and Rust, focusing on areas like job submission, orchestration, billing, and usage tracking.
- Design and launch user interfaces, including the Tinker console and upcoming developer tools, using React and TypeScript.
- Refine the developer experience by improving SDK usability, error messages, API design, and onboarding processes.
- Work to increase system reliability, observability, and security in production, and participate in on-call rotations.
- Create internal tools that help research and infrastructure teams work more efficiently.

Location
This role is based in San Francisco, California.

Apr 28, 2026
Mercor
Full-time|On-site|San Francisco

Join the Mercor Team
At Mercor, we stand at the dynamic intersection of labor markets and AI research. Collaborating with premier AI labs and enterprises, we empower the human intelligence that is crucial for AI's evolution. Our expansive talent network plays a vital role in training cutting-edge AI models, akin to the way educators impart knowledge to their students—by sharing insights, experiences, and contextual understanding that code alone cannot convey. Currently, our network of over 30,000 experts generates more than $2 million daily. We are pioneering a novel category of work where expertise fuels AI progress. Achieving this vision necessitates an ambitious, fast-paced, and deeply dedicated team. You will collaborate with researchers, operators, and AI firms that are at the forefront of transforming societal structures. Mercor is a thriving Series C company with a valuation of $10 billion. We operate five days a week in-person at our new headquarters in San Francisco.

About the Role
As a Site Reliability Engineer (SRE) at Mercor, you will take ownership of production reliability for our critical systems, working closely with our infrastructure leadership. You will play a pivotal role in establishing our SRE function and defining how Mercor manages large-scale, high-availability systems.

Your Responsibilities
- Ensure the reliability and safety of production for key shared services and customer-facing systems.
- Collaborate directly with infrastructure leadership to outline SRE priorities, reliability benchmarks, and the production safety roadmap.
- Enhance the structure of our production systems to ensure stability, resource efficiency, isolation, and observability.
- Advocate for and implement modern SRE methodologies (e.g., incident management, postmortems, SLIs/SLOs) across engineering teams.
- Work alongside engineering and applied AI teams to facilitate sustainable growth.
- Promote SRE best practices internally, supporting teams in a safe, scalable, and consistent production onboarding process.

Who We Seek
The ideal candidate will have:
- Extensive experience in genuine SRE roles (not merely operations) across various positions or organizations.
- A deep understanding of SRE methodologies popularized by Google (e.g., error budgets, reliability vs. risk trade-offs, large-scale distributed systems).
- 5+ years of SRE experience; ideally, 15+ years of total experience for this inaugural SRE position.
- A proven track record of managing systems at scale, with a strong grasp of the complexities involved.

Dec 27, 2025
Superhuman
Full-time|$214K/yr - $260K/yr|Hybrid|Hub - San Francisco

At Superhuman, we embrace a vibrant hybrid work model that offers our team members the ideal blend of focused individual work and collaborative in-person interactions, fostering trust, innovation, and a robust team culture.

About Superhuman
Superhuman, the AI productivity platform, is on a transformative mission to unlock the superhuman potential within everyone. With the integration of Grammarly's writing assistance and innovative tools like Coda’s collaborative workspaces and Go, our proactive AI assistant, we empower over 40 million individuals and 50,000 organizations globally. Founded in 2009, we strive to eliminate busywork and enhance productivity. Discover more at superhuman.com and explore our values here.

The Opportunity
To meet our ambitious goals, we are seeking a Site Reliability Engineer (SRE) to join our infrastructure team. This pivotal role focuses on developing software solutions to maintain the reliability of our back-end systems while collaborating with engineering teams to strategize our future growth. You will also engage with our production engineering teams in Europe as we transition from a “you build it, you own it” approach. At Superhuman, our engineers and researchers enjoy the autonomy to innovate and drive breakthroughs, directly impacting our product roadmap. As we rapidly scale our interfaces, algorithms, and infrastructure, the complexity of our technical challenges is growing. Learn more about our technical endeavors on our technical blog.

As an SRE, your responsibilities will include:
- Scaling our Kubernetes-based control plane that processes billions of events each day.
- Enhancing our automation mechanisms to efficiently respond to workload demands.
- Deploying machine learning systems across various departments.

Jun 18, 2025
Hyperbolic Labs
Full-time|On-site|San Francisco, CA

Who We Are
At Hyperbolic Labs, we are committed to democratizing AI by removing barriers to computing power with our Open-Access AI Cloud. By aggregating global computing resources, we provide an innovative GPU marketplace and AI inference service that ensures both affordability and accessibility. As trailblazers at the convergence of AI and open-source technology, we envision a future where AI innovation is limited only by creativity, not by resource availability. We invite forward-thinking individuals who share our dedication to making AI universally accessible, secure, and affordable. Join us in crafting a platform that empowers innovators worldwide to realize their visionary AI projects. In anticipation of our growth following our Series A funding, our team — guided by co-founders with advanced degrees in AI, Mathematics, and Computer Science — is set to transform the computing landscape.

About the Role
We are in search of a skilled Site Reliability Engineer to guarantee that Hyperbolic's GPU marketplace and AI infrastructure function with outstanding reliability, performance, and security. As an aggregator of computational resources from numerous global providers, our service level objectives (SLOs), trust, and economic efficiency are critical to our product. Your key responsibilities will include:
- Defining and maintaining service level objectives.
- Developing resilient incident response protocols.
- Managing capacity across our extensive GPU network.
- Implementing secure rollout and rollback mechanisms to ensure uninterrupted platform operation around the clock.

In this influential role, you'll set the reliability benchmarks that foster customer trust in our platform, design comprehensive monitoring and alerting systems for enhanced infrastructure visibility, automate capacity management and resource allocation processes, lead incident response and post-mortem evaluations, and collaborate closely with engineering teams to bolster system resilience. Security and infrastructure hardening will be paramount, necessitating strong isolation protocols between tenants and suppliers, the implementation of effective key management systems, and the establishment of compliance frameworks. This high-impact position will directly affect our ability to deliver on our commitment to providing affordable, accessible AI compute at scale.

Mar 26, 2026
Superhuman
Full-time|$214K/yr - $260K/yr|Hybrid|San Francisco, CA

At Superhuman, we embrace a flexible hybrid working model that combines focused work time with in-person collaboration, fostering trust, innovation, and a vibrant team culture.

About Superhuman
Superhuman, now part of Grammarly, is an AI productivity platform dedicated to unlocking the superhuman potential in everyone. Our suite of applications integrates AI with over 1 million tools and websites, offering innovative solutions such as Grammarly's writing assistance, Coda's collaborative workspaces, Mail's inbox management, and Go, our proactive AI assistant. Since our inception in 2009, we have empowered over 40 million individuals and 50,000 organizations worldwide, enabling them to eliminate busywork and focus on what truly matters. Discover more at superhuman.com and explore our values here.

The Opportunity
In pursuit of our ambitious goals, we are seeking a Site Reliability Engineer to enhance our infrastructure team. This pivotal role involves building software that ensures the reliability of our back-end systems while collaborating closely with our engineering teams. You will also help plan for our future growth as we shift from a “you build it, you own it” model. Our engineers and researchers enjoy the freedom to innovate and influence our product roadmap, tackling increasingly complex technical challenges as we scale our systems. Learn more about our technical endeavors on our technical blog.

As a Site Reliability Engineer, your responsibilities will include:
- Scaling our Kubernetes-based control plane, processing billions of events daily.
- Enhancing our automation mechanisms in response to workload demands.
- Deploying machine learning systems across the organization.

Mar 18, 2026
Thinking Machines Lab
Full-time|$350K/yr - $475K/yr|On-site|San Francisco

Thinking Machines Lab brings together scientists, engineers, and innovators who have contributed to well-known AI products such as ChatGPT and Character.ai, and to open-source frameworks like PyTorch, OpenAI Gym, Fairseq, and Segment Anything. The team's mission centers on advancing collaborative general intelligence, aiming to make AI accessible for people to address their own needs and ambitions. The Tinker platform offers a fine-tuning API that lets researchers and developers tailor advanced AI models to their specific requirements. Tinker provides the infrastructure, while users maintain the flexibility to train open-weight models with their own data and algorithms. As Tinker grows its features and user base, the team is expanding to support the platform's evolution.

Role overview
This Full Stack Software Engineer role focuses on designing, building, and maintaining the products and services that Tinker users rely on. The work covers frontend, backend, and infrastructure, with an emphasis on the Tinker console, developer tools, and meeting the changing needs of the Tinker community.

What you will do
- Develop and improve Tinker’s APIs and backend services using Python and Rust, including systems for job submission, orchestration, billing, and usage tracking.
- Build user-facing interfaces such as the Tinker console and future developer tools with React and TypeScript.
- Enhance the developer experience by refining SDK usability, error messages, API design, and onboarding workflows.
- Increase system reliability, observability, and security in Tinker’s production environment, and participate in on-call rotations.
- Create internal tools to support the research and infrastructure teams working on Tinker.

This position is based in San Francisco.

Apr 27, 2026
Thinking Machines Lab
Full-time|$175K/yr - $475K/yr|On-site|San Francisco

At Thinking Machines Lab, we strive to empower humanity by advancing collaborative general intelligence. Our vision is to create a future where everyone can access the knowledge and tools necessary to harness AI for their specific needs and aspirations. Our team comprises scientists, engineers, and innovators who have developed some of the most widely utilized AI products, such as ChatGPT and Character.ai, along with notable open-weight models like Mistral, as well as prominent open-source projects including PyTorch, OpenAI Gym, Fairseq, and Segment Anything.

About the Role
As a Research Product Manager (RPM) at Thinking Machines Lab, you will play a pivotal role in driving complex, high-impact technical products and programs that encompass research, infrastructure, and applied initiatives. You will facilitate the transformation of ambitious concepts into reality by propelling cross-functional collaboration, ensuring projects maintain momentum, and fostering clarity in fast-paced, ambiguous settings. Your contributions will connect people, ideas, and systems, guaranteeing that our critical research initiatives remain aligned, well-defined, and progressing efficiently. This position is ideal for someone who excels in technical discussions, comprehends the intricacies of research, and can conceptualize at a high level while also delving into detailed aspects, ultimately aiming to help the company execute at scale.

Note: This is an "evergreen role" that we keep open on an ongoing basis to express interest. We receive numerous applications, and there may not always be an immediate role that aligns perfectly with your experience and skills. Nevertheless, we encourage you to apply. We continuously review applications and reach out to applicants as new opportunities arise. You are welcome to reapply if you gain more experience, but please refrain from applying more than once every six months. You may also find that we post job openings for specific roles related to separate projects or team needs. In those cases, you are welcome to apply directly in addition to this evergreen role.

What You’ll Do
- Drive and coordinate large-scale research products and programs, ensuring that complex projects are executed efficiently, transparently, and with scientific rigor.
- Translate technical ideas into actionable, well-scoped plans, defining milestones and ensuring team alignment across model development, data campaigns, infrastructure, and product integration.
- Collaborate across disciplines—from research and ML infrastructure to legal and business development—quickly ramping up on new domains as necessary.
- Create and maintain compute and resource roadmaps, identifying bottlenecks and solutions to optimize project flow.

Nov 28, 2025
Prosper
Full-time|On-site|San Francisco, CA

Role overview
The Senior Site Reliability Engineer at Prosper plays a key role in maintaining and improving the reliability and performance of the company’s core systems. Collaboration with teams across the organization is essential to ensure services remain stable and efficient.

What you will do
- Design and set up monitoring tools to track the health and performance of systems.
- Automate routine operational tasks to minimize manual intervention and boost efficiency.
- Diagnose and resolve complex technical problems that impact infrastructure or services.
- Support projects aimed at strengthening infrastructure stability and preparing for future growth.

Location
This role is located in San Francisco, CA.

Apr 27, 2026
Blaxel
Full-time|On-site|San Francisco

Join Our Team as a Site Reliability Engineer
Blaxel is seeking a highly skilled Site Reliability Engineer to enhance the reliability, performance, and scalability of our cutting-edge AI infrastructure platform. In this role, you will develop and manage the essential systems that support scalable agentic AI. Your primary goal: maintain our ultra-low-latency, stateful, serverless compute engine, ensuring it remains robust as we handle billions of agent requests from the world's most advanced AI teams.

This position is deeply technical and execution-oriented. You will take charge of our reliability framework, encompassing observability, performance optimization, incident management, infrastructure health, and the automation processes that ensure seamless operations. We are looking for innovators who can design new reliability systems, advance automation capabilities, and continuously adapt the platform to accommodate next-generation AI workloads. If you are a builder who excels in managing critical infrastructure at scale, we want to hear from you.

Your Responsibilities
Working closely with our founders, infrastructure team, and development team—leveraging AI for maximum efficiency—you will architect and manage the systems that keep Blaxel fast, resilient, and secure.
- Design, operate, and iteratively enhance the core infrastructure that drives our 25ms cold-start compute engine.
- Develop and refine our observability stack (metrics, traces, logs), ensuring proactive issue detection.
- Establish, monitor, and drive SLOs/SLIs across vital system components to ensure world-class reliability.
- Lead incident response with precision: conduct root cause analyses and post-mortems, and implement systemic solutions.
- Design and deploy self-healing, automated operational systems to minimize manual work and scale operations.
- Collaborate across compute, networking, storage, and sandboxed execution layers to optimize performance under intense workloads.
- Create automation tools—often utilizing AI agents—to enhance operations, debugging, capacity planning, and failure predictions.
- Test and stress our systems to their limits: engage in load testing, chaos engineering, and performance benchmarking.
- Champion security best practices at the infrastructure level, from sandboxed compute to network isolation.
- Collaborate with platform engineers to ensure reliability is an integral part of new features from inception.

Who You Are
Extensive technical expertise in site reliability engineering, with a passion for building scalable systems.

Mar 3, 2026
Carta
Full-time|On-site|San Francisco, California; Santa Clara, California; Seattle, WA

Join Carta as a Senior Site Reliability Engineer, where you will play a pivotal role in enhancing our infrastructure and ensuring the reliability of our platforms. You will work collaboratively with cross-functional teams to implement innovative solutions that drive operational excellence and scalability.

Apr 3, 2026
Thinking Machines Lab
Full-time|$200K/yr - $250K/yr|On-site|San Francisco, CA

At Thinking Machines Lab, we are on a mission to enhance humanity through the advancement of collaborative general intelligence. Our vision is to create a future where everyone has the opportunity to leverage AI tailored to their individual needs and aspirations. Our team comprises scientists, engineers, and innovators who have developed some of the most renowned AI products in the industry, such as ChatGPT and Character.ai, as well as open-weight models like Mistral and popular open-source projects including PyTorch, OpenAI Gym, Fairseq, and Segment Anything.

About the Role
We are seeking an Executive Business Partner to provide vital support to several technical leaders from our San Francisco office. Your role will be crucial in ensuring our team remains focused and organized by managing personal logistics and handling tasks that may otherwise be overlooked. This position is unique, requiring creativity and flexibility to adapt to various work styles and the dynamic challenges of a fast-paced startup environment. You will enjoy significant autonomy in decision-making without extensive supervision.

What You’ll Do
- Manage calendars, schedule meetings, and coordinate travel for 3-4 technical leaders.
- Act as the primary liaison between your supported leaders and other departments within the company.
- Assist with recruiting coordination efforts.
- Monitor projects and commitments to ensure nothing is overlooked.

Mar 19, 2026
EngFlow
Full-time|On-site|San Francisco

Join Our Team at EngFlow
EngFlow is revolutionizing the software development process by enabling developers to save valuable time in their build and test cycles. Our innovative cloud-based distributed service optimizes workflows through advanced remote execution and caching, significantly enhancing efficiency, productivity, and product quality.

Supported by esteemed investors, EngFlow is at the forefront of transforming how organizations develop software and deliver thoroughly tested products. Our solutions can accelerate builds by tenfold or more, and our observability platform provides crucial insights for ongoing optimization. Founded by leading contributors to Bazel, we create tools that empower engineering teams, from startups to Fortune 500 companies, to boost developer velocity and build performance. Discover more about our mission, culture, and team: EngFlow | Watch Our Video

We are seeking a talented and experienced Site Reliability Engineer to join our dynamic engineering team. In this pivotal role, you will bridge the gap between software engineering and systems operations, ensuring our distributed infrastructure is highly available, performant, and scalable, thereby allowing our engineers to work swiftly and with confidence.

Jan 27, 2026
Thinking Machines Lab
Full-time|$190K/yr - $300K/yr|On-site|San Francisco, California

At Thinking Machines Lab, our mission is to empower humanity by advancing collaborative general intelligence. We envision a future where everyone has access to the knowledge and tools necessary to leverage AI for their unique goals. Our team consists of scientists, engineers, and builders who have developed some of the most utilized AI products, such as ChatGPT and Character.ai, alongside open-weight models like Mistral and popular open-source projects including PyTorch, OpenAI Gym, Fairseq, and Segment Anything.

HR Business Partner
The HR Business Partner role is essential in empowering our team to thrive as we scale. You will be pivotal in coaching our leaders and designing people systems that align with our mission. As the HR Business Partner, you will facilitate leadership coaching and the design of performance management systems that foster growth and collaboration. You will support managers in enhancing team dynamics and personal development while building a scalable people infrastructure that includes performance feedback systems, compensation structures, and career frameworks.

What You’ll Do
- Provide coaching to managers by observing their leadership styles, identifying strengths and areas for growth, and promoting continuous improvement.
- Advise leadership on organizational strategies, including team structure, succession planning, and strategic people decisions that influence our operational effectiveness.
- Develop compensation frameworks that attract top-tier machine learning talent while ensuring alignment with our core values and principles.
- Create career progression frameworks tailored for a research environment where growth often transcends traditional management roles and where contributions such as mentorship and expertise are valued.
- Establish feedback and evaluation mechanisms that prioritize personal improvement over mere assessment.

Feb 2, 2026
Latent
Full-time|On-site|San Francisco

Site Reliability Engineer
Location: San Francisco, CA (5 Days In-Office)

As a Site Reliability Engineer at Latent, you will be the backbone of our infrastructure, ensuring the exceptional stability and performance of our cutting-edge clinical AI platform that serves major health systems. Your role is pivotal in enhancing operational excellence, directly impacting patient access to critical treatments.

What Makes a Great Engineer at Latent
We seek individuals who are not just technically skilled but also passionate about ownership and high standards. You will thrive in our dynamic, in-office culture where teamwork and a winning mentality are key.
- Tool Proficiency: You are highly adept with your tools, fluent in command line operations, and skilled in keyboard shortcuts.
- Ownership: You take pride in managing complex systems and have a successful history of scaling mission-critical deployments.
- Automation Drive: You have a passion for automation, consistently seeking innovative methods to enhance efficiency and establish operational excellence.
- Problem Solver: You proactively address challenges, stepping in to resolve issues without waiting for others.

Your Responsibilities
As our SRE, you will take full ownership of the production environment and enhance the developer experience:
- Infrastructure Ownership: Design, implement, and maintain a robust production environment, having experience with over 500 machine deployments.
- Kubernetes Mastery: Utilize your expertise in Kubernetes and Helm to manage our containerized infrastructure, ensuring optimal deployment, scalability, and operational health.
- CI/CD & Deployment Optimization: Streamline the deployment pipelines for TypeScript and Python/ML, supporting rapid feature releases while upholding top-notch reliability.
- DevX Support: Enhance developer workflows by supporting Developer Experience (DevX) initiatives to improve tool proficiency and CI/CD systems.
- Infrastructure as Code (IaC): Manage infrastructure definitions using Terraform.

Dec 5, 2025
Apply
Thinking Machines Lab
Full-time|$350K/yr - $475K/yr|On-site|San Francisco

At Thinking Machines Lab, we are on a mission to empower humanity by advancing collaborative general intelligence. Our vision is a future where everyone has access to the knowledge and tools necessary to harness AI for their unique needs and objectives.

We are a diverse team of scientists, engineers, and builders responsible for developing some of the most influential AI products on the market, such as ChatGPT and Character.ai. Our contributions extend to open-weight models like Mistral and popular open-source projects including PyTorch, OpenAI Gym, Fairseq, and Segment Anything.

About the Role
We are seeking talented engineers to join our team and develop the libraries and tools that accelerate research at Thinking Machines. You will take charge of our internal infrastructure, creating evaluation libraries, reinforcement learning training libraries, and experiment tracking platforms, while building systems that enhance research velocity over time.

This position emphasizes collaboration. You will work closely with researchers to identify bottlenecks and pain points, ensuring that they trust your systems to function seamlessly and find them enjoyable to use.

What You'll Do
Design, build, and manage research infrastructure, including evaluation frameworks, RL training systems, experiment tracking platforms, visualization tools, and shared utilities.
Develop high-throughput, scalable pipelines for distributed evaluation, reward modeling, and multimodal assessment.
Establish systems for reproducibility, traceability, and robust quality control across research experiments and model training runs, implementing effective monitoring and observability.
Collaborate directly with researchers to identify bottlenecks and unlock new capabilities, managing research tools like a product manager: proactively seek feedback and track adoption.
Work alongside infrastructure, data, and product teams to integrate tools across the technical stack.

Feb 3, 2026
Apply
Thinking Machines Lab
Full-time|$175K/yr - $300K/yr|On-site|San Francisco, California

Thinking Machines Lab brings together scientists, engineers, and innovators with a track record of developing widely used AI products and open-source projects. The team has contributed to tools like ChatGPT, Character.ai, Mistral, PyTorch, OpenAI Gym, Fairseq, and Segment Anything. The company’s mission centers on advancing collaborative general intelligence to help people achieve more with AI tailored to their needs.

Tinker, the company’s fine-tuning API, enables researchers and developers to adapt advanced AI models to their own data and algorithms. By handling the infrastructure, Tinker allows users to focus on customization, opening up capabilities that were once limited to a few specialized labs. As Tinker’s customer base and feature set grow, the team is focused on building a scalable platform and supporting an expanding community.

Role overview
The GTM Strategy & Operations Lead will build and refine the commercial structure for Tinker, designing strategies and processes that turn organic product adoption into a consistent, scalable revenue stream. The role involves shaping how Tinker’s fine-tuning capabilities are packaged, priced, launched, and sold across different customer segments. Collaboration with product, engineering, and research teams is central to the work. Tinker is designed for technically sophisticated users, so the GTM lead must be comfortable discussing training infrastructure and understand how developers evaluate and adopt new tools.

What you will do
Develop and execute commercialization strategies for Tinker, including pricing, packaging, and launch plans grounded in market and competitor analysis.
Create go-to-market approaches tailored to different types of customers.
Manage partnerships to expand Tinker’s reach and open new channels for demand.
Design and oversee customer pilots, onboarding, and expansion playbooks to move accounts from testing to production use.
Produce commercial playbooks to help customer-facing engineers and FDEs position and sell Tinker effectively.
Set and track success metrics for launches and GTM projects, running experiments to test assumptions about pricing and product packaging.

Apr 27, 2026
