Pinterest
San Francisco, CA, US; Palo Alto, CA, US; Seattle, WA, US
On-site Full-time $189.7K/yr - $332K/yr
Experience Level
Senior
Qualifications
What You Will Contribute:
Develop state-of-the-art technology leveraging the latest advancements in deep learning and machine learning to enhance user personalization on Pinterest.
Collaborate closely with cross-functional teams to experiment and refine ML models across diverse product surfaces (Homefeed, Ads, Growth, Shopping, and Search), gaining insights into the application of ML in various domains.
Utilize data-driven methodologies and harness the distinctive attributes of our data to optimize candidate retrieval.
Engage in a dynamic environment characterized by rapid experimentation and product launches.
Stay abreast of industry developments in recommendation systems.
Incorporate LLMs to improve content comprehension.
What We Seek:
Minimum of 2 years of professional experience implementing machine learning techniques (e.g., user modeling, personalization, recommender systems, search algorithms, ranking, natural language processing, reinforcement learning, and graph representation learning).
A degree in computer science, statistics, or a related field; or equivalent practical experience.
Hands-on experience in developing end-to-end data processing pipelines.
About the job
About Pinterest:
At Pinterest, we inspire millions globally to explore creative ideas and envision lasting memories. Our mission is to empower everyone to create a life they love, and that journey begins with our talented team.
Join a workplace where innovation thrives, passion drives growth, and every unique experience is celebrated. With a culture that embraces flexibility, we enable you to do your best work and build a fulfilling career.
As a part of Pinterest's vibrant community of over 500 million users and 300 billion ideas saved, our Machine Learning engineers play a pivotal role in crafting personalized experiences that empower Pinners to create their ideal lives. With a nimble team of just over 4,000 employees worldwide, we provide unparalleled access to extensive data and opportunities to contribute to large-scale recommendation systems.
Within the Monetization ML Engineering team, you will bridge the gap between the aspirations of Pinners and the offerings of our partners. In this pivotal role, you will lead the vision for advancing the machine learning technology stack in our Ads division.
Similar jobs
Overview
Pluralis Research is at the forefront of innovation in Protocol Learning, specializing in the collaborative training of foundational models. Our approach ensures that no single participant ever has, or can obtain, a complete version of the model. This initiative aims to create community-driven, collectively owned frontier models that operate on self-sustaining economic principles.
We are seeking experienced Senior or Staff Machine Learning Engineers with over 5 years of expertise in distributed systems and large-scale machine learning training. In this role, you will design and implement a groundbreaking substrate for training distributed ML models that function effectively over consumer-grade internet connections.
Full-time|$218.4K/yr - $273K/yr|On-site|San Francisco, CA; Seattle, WA; New York, NY
Join Scale AI's ML platform team (RLXF) as a Machine Learning Research Engineer, where you will play a pivotal role in developing our advanced distributed framework for training and inference of large language models. This platform is vital for enabling machine learning engineers, researchers, data scientists, and operators to conduct rapid and automated training, as well as evaluation of LLMs and data quality.
At Scale, we occupy a unique position in the AI landscape, serving as an essential provider of training and evaluation data along with comprehensive solutions for the entire ML lifecycle. You will collaborate closely with Scale's ML teams and researchers to enhance the foundational platform that underpins our ML research and development initiatives. Your contributions will be crucial in optimizing the platform to support the next generation of LLM training, inference, and data curation.
If you are passionate about driving the future of AI through groundbreaking innovations, we want to hear from you!
Full-time|$227.2K/yr - $417K/yr|Hybrid|San Francisco, CA; Los Angeles, CA; New York, NY (Hybrid); USA - Remote
About the Role:
Join our dynamic ML Infrastructure team as a Software Engineer, where you'll collaborate closely with the Machine Learning and Product teams to construct top-tier machine learning inference platforms. These cutting-edge platforms drive vital services such as personalized recommendations, search functionalities, and content comprehension at Tubi.
Your primary focus will be on the development and maintenance of low-latency ML model serving systems that cater to Deep Learning, LLM, and Search models. This will include the creation of self-service infrastructure and critical components such as the inference engine, feature store, vector store, and experimentation engine.
In this role, you'll enhance our service deployment and operational processes, with opportunities to contribute to open-source projects. Enjoy architectural freedom to explore innovative frameworks, spearhead significant cross-functional projects, and elevate the capabilities of our ML and Product teams.
We are currently hiring for two positions: Staff Software Engineer and Principal Software Engineer.
Additional Details: As a Principal Engineer, you will serve as a technical leader and visionary, guiding the advancement of our machine learning platform. You'll address complex technical challenges, shape architectural decisions, and mentor senior engineers, fostering a culture of excellence and continuous improvement. Your contributions will impact millions of users.
About Our Team
Join the innovative Sora team at OpenAI, where we are at the forefront of developing multimodal capabilities for our foundation models. Our hybrid research and product team is dedicated to seamlessly integrating multimodal functionalities into our AI solutions, ensuring they are dependable, user-centric, and aligned with our vision of benefiting society at large.
Role Overview
As a Machine Learning Engineer specializing in Distributed Data Systems, you will be instrumental in designing and scaling the infrastructure that facilitates large-scale multimodal training and evaluation at OpenAI. Your role will involve managing complex distributed data pipelines, collaborating closely with researchers to convert their requirements into robust, production-ready systems, and enhancing pipelines that are essential for Sora's rapid iteration cycles.
We are seeking detail-oriented engineers with extensive experience in distributed systems who thrive in high-stakes environments and excel in building resilient infrastructure.
This position is located in San Francisco, CA, and follows a hybrid work model, requiring three days in the office each week.
We also provide relocation assistance for new team members.
Key Responsibilities:
Design, implement, and maintain data infrastructure systems, including distributed computing, data orchestration, distributed storage, streaming infrastructure, and machine learning systems, with a focus on scalability, reliability, and security.
Ensure our data platform can scale exponentially while maintaining high reliability and efficiency.
Collaborate with researchers to gain a deep understanding of their requirements, translating them into production-ready systems.
Strengthen, optimize, and manage critical data infrastructure systems that support multimodal training and evaluation.
You Will Excel in This Role If You:
Possess strong experience with distributed systems and large-scale infrastructure, coupled with a keen interest in data.
Exhibit meticulous attention to detail and a commitment to building and maintaining reliable systems.
Demonstrate solid software engineering fundamentals and effective organizational skills.
Thrive in environments characterized by ambiguity and rapid change.
About OpenAI
OpenAI is a trailblazing AI research and deployment organization committed to ensuring that general-purpose artificial intelligence serves humanity. We continuously push the boundaries of AI capabilities and strive to create technology that benefits everyone.
Why Join Achira?
Become part of an exceptional team of scientists, ML researchers, and engineers dedicated to transforming the landscape of drug discovery.
Engage with cutting-edge machine learning infrastructure at an unprecedented scale, leveraging extensive computing resources, vast datasets, and ambitious goals.
Take ownership of significant projects from conception through to architecture and deployment on large-scale infrastructures.
Thrive in a culture that values thoroughness, speed, and a proactive, builder-oriented mindset.
About the Role
At Achira, we are developing state-of-the-art foundation models that address the most complex challenges in simulation for drug discovery and beyond. Our atomistic foundation simulation models (FSMs) serve as comprehensive representations of the physical microcosm, encompassing machine learning interaction potentials (MLIPs), neural network potentials (NNPs), and various generative model classes.
We are looking for a Software Engineer who is enthusiastic about distributed computing and its applications in machine learning. You will play a pivotal role in designing and constructing the infrastructure for our ML data generation pipelines, model training, and fine-tuning workflows across large-scale distributed systems.
Your expertise will be crucial in ensuring our compute clusters are efficient, observable, cost-effective, and dependable, enabling us to advance the frontiers of ML development. If you are passionate about distributed systems, performance optimization, and cloud cost efficiency, we encourage you to apply.
You will be empowered to conceptualize and manage complex workloads across multiple vendors worldwide. Achira's mission revolves around computation, and providing seamless access to our uniquely tailored workloads at the lowest possible cost is critical to our success.
About Us
Sieve is a pioneering AI research lab dedicated solely to video data. We harness exabyte-scale video infrastructure and innovative video understanding techniques, along with a multitude of data sources, to create datasets that advance the field of video modeling. Given that video constitutes 80% of internet traffic, it serves as a vital medium that fuels creativity, communication, gaming, AR/VR, and robotics. Our mission is to tackle the most significant challenge in the development of these applications: acquiring high-quality training data.
With a small yet highly skilled team of just 15 members, we have formed strategic partnerships with leading AI labs and achieved $XXM in revenue last quarter alone. Our Series A funding round last year was backed by prestigious firms, including Matrix Partners, Swift Ventures, Y Combinator, and AI Grant.
About the Role
As a Distributed Systems Engineer at Sieve, you will be responsible for designing and implementing systems that efficiently manage the compute, scheduling, and orchestration of complex machine learning and ETL pipelines. Your work will ensure these systems operate quickly, reliably, and cost-effectively while processing large volumes of video data.
You will thrive in this role if you are passionate about optimizing system uptime, have experience with cloud technologies, and enjoy working with high-performance distributed systems involving thousands of GPUs. Additionally, you will play a key role in developing excellent internal tools and CI/CD pipelines to facilitate rapid iteration.
Saris AI develops applied AI solutions for the banking sector, with teams in San Francisco, Montreal, and Toronto. The company builds automation tools that handle complex, long-context reasoning and agent-driven decision-making. Reliability and compliance shape every product, and Saris AI's agents already manage real customer workflows in production. As revenue grows, the engineering team is expanding to enhance current offerings and explore new directions. The Senior Machine Learning Engineer role is based in San Francisco and sits within the core engineering group. The team works in a collaborative, early-stage setting, balancing infrastructure needs with the delivery of features that serve customers directly.
What you will do
Build and maintain machine learning infrastructure, such as evaluation frameworks, prompt management systems, and tools for model observability.
Develop new AI features for customers while supporting and improving the underlying infrastructure.
Shape strategies for evaluation, LLM routing, prompt engineering, and model selection.
Set practical standards to boost quality without slowing down development.
Guide technical direction by clarifying trade-offs and architectural choices.
Requirements
Minimum 4 years of experience in machine learning or AI engineering, including production deployment of ML systems.
Direct experience with large language models, prompt engineering, evaluation techniques, and model routing.
Background in building tools and systems that deliver value to users.
Comfort making pragmatic trade-offs and recognizing when a solution is sufficient.
Ability to navigate ambiguity, define problems, and deliver results independently.
Strong focus on end users and understanding the impact of ML decisions on customer experience.
Willingness to support team growth through code reviews, collaboration, and clear technical communication.
Bonus
Experience in regulated industries, especially banking.
Overview
Pluralis Research is at the forefront of Protocol Learning, innovating a decentralized approach to train and deploy AI models that democratizes access beyond just well-funded corporations. By aggregating computational resources from diverse participants, we incentivize collaboration while safeguarding against centralized control of model weights, paving the way for a truly open and cooperative environment for advanced AI.
We are seeking a talented Machine Learning Training Platform Engineer to design, develop, and scale the core infrastructure that powers our decentralized ML training platform. In this role, you will have ownership over essential systems including infrastructure orchestration, distributed computing, and service integration, facilitating ongoing experimentation and large-scale model training.
Responsibilities
Multi-Cloud Infrastructure: Create resource management systems that provision and orchestrate computing resources across AWS, GCP, and Azure using infrastructure-as-code tools like Pulumi or Terraform. Manage dynamic scaling, state synchronization, and concurrent operations across hundreds of diverse nodes.
Distributed Training Systems: Design fault-tolerant infrastructure for distributed machine learning, including GPU clusters, the NVIDIA runtime, S3 checkpointing, large dataset management and streaming, health monitoring, and resilient retry strategies.
Real-World Networking: Develop systems that simulate and manage real-world network conditions, such as bandwidth shaping, latency injection, and packet loss, while accommodating dynamic node churn and ensuring efficient data flow across workers with varying connectivity, as our training occurs on consumer nodes and non-co-located infrastructure.
Highlight is pioneering a shared intelligence framework for the contemporary workforce. Our solution integrates context across every member and tool in your team, effectively eliminating information silos. As your organization evolves, Highlight adapts by intelligently routing knowledge and reliably automating workflows.
The Role
We are seeking a Senior Machine Learning Engineer to contribute to the development of the AI systems that drive Highlight's operations. This position involves working across the entire ML stack, including data pipelines, model training, retrieval, ranking, evaluations, and deployment. You will deliver features that create a seamless experience for our users. This is a hands-on individual contributor role where you will manage critical systems from inception to deployment, collaborating closely with our Head of AI Engineering to enhance our ML capabilities.
You will excel in this role if you are passionate about delivering production-ready ML systems, maintaining high standards of output quality, and tackling complex problems that have a tangible impact on users.
We prioritize speed, uphold rigorous standards, and believe in the merit of ideas, regardless of hierarchy.
Note: This is an on-site position, requiring five days a week at our San Francisco or New York City offices.
Responsibilities
Design, develop, and refine ML pipelines for context retrieval, action detection, and output generation.
Implement and iterate on Retrieval-Augmented Generation (RAG) systems, ranking models, and prompt engineering to enhance quality.
Establish comprehensive evaluations and monitoring frameworks to assess and enhance ML system performance.
Explore alternative models and fine-tuning strategies for our substantial background processing workloads, creating improved systems that enhance quality while optimizing cost efficiency.
Work in collaboration with Backend and Product Engineering teams to deliver AI-powered features.
Influence ML infrastructure decisions and establish best practices across the engineering organization.
Stay updated on advancements in LLMs, retrieval techniques, and ML tools; share insights with the team.
Candidate Profile
We are looking for an individual who is excited about crafting the AI user experience of the future. We value extreme ownership, accountability, and proactivity as we strive to exceed expectations. The ideal candidate is a skilled craftsman who enjoys developing exceptional products utilizing LLMs at scale.
You are a suitable candidate if you possess:
4+ years of experience in ML/AI engineering, with substantial involvement in deploying production systems.
Join Our Team at Macroscope
At Macroscope, we are dedicated to being the definitive source of truth for any software development company. Our mission is to empower leaders with clarity and provide engineers with the time they need to innovate.
We enable leaders to gain insights into the evolution of their products and codebases, tracking changes, understanding team contributions, and identifying progress, all grounded in the ultimate source of truth: the code itself.
Founded by experienced entrepreneurs who have successfully built and sold multiple companies, and held executive positions in public tech firms, we are backed by top-tier venture capital firms such as Lightspeed Venture Partners, Thrive Capital, Google Ventures, and Adverb.
The Role
We are seeking a Senior Applied Machine Learning Engineer who will be responsible for designing, developing, and optimizing the ML and AI systems that drive our core offerings. You will have full ownership of the systems, overseeing everything from data collection and evaluation to model experimentation and large-scale production deployment.
This cross-functional position entails leading the ML/AI lifecycle for one of our most vital features: AI Code Review. Collaborating closely with our co-founders, you will make pivotal decisions that shape our product's development, ranging from building high-quality datasets to interpreting experimental results and enhancing model performance architecture. Additionally, you will play a significant role in crafting and implementing software that seamlessly integrates our models with our backend applications and user experience, offering a unique opportunity to influence our product's evolution significantly.
Technology Stack: Typescript/React (frontend), Golang (backend), Temporal, Google Cloud (GCP), Postgres, Terraform, and custom-built AST "code walkers" in several programming languages including Golang, Typescript, Swift, Python, and Rust.
Full-time|$268K/yr - $368.5K/yr|On-site|San Francisco, CA
About Faire
Faire is a transformative online wholesale marketplace, driven by the conviction that local businesses are the future. Independent retailers around the globe generate more revenue than massive corporations like Walmart and Amazon combined, yet individually, they remain small. At Faire, we harness technology, data, and machine learning to connect this vibrant community of entrepreneurs. Think of your favorite local boutique: we empower them to discover and sell the best products from around the world. With our innovative tools and insights, we aim to level the playing field, enabling small businesses to thrive against larger competitors.
By championing the growth of independent businesses, Faire positively impacts local economies on a global scale. We're in search of intelligent, resourceful, and passionate individuals to join us in fueling the shop local movement. If you value community, we invite you to be part of ours.
About this Role
As the Senior Staff Machine Learning Platform Engineer, you will spearhead the technical vision and evolution of Faire's ML platform. You will establish standards, influence organization-wide architecture, and lead intricate, cross-functional initiatives that enhance data science velocity at scale. This position is crucial for adapting ML workflows to leverage modern AI productivity tools. You will not only develop models but also design the systems that enable those models to empower tens of thousands of small retailers in competing and growing their local businesses.
Full-time|$162.8K/yr - $203.5K/yr|On-site|San Francisco, CA
At Lyft, our mission is to connect and serve communities by fostering an inclusive work environment where every team member feels valued and empowered to excel.
With over half a billion rides completed, Lyft is tackling complex challenges in a fast-evolving landscape filled with extensive data and innovative solutions across various domains including Marketplace, Mapping, Fraud Prevention, Trust & Safety, and Growth. As we redefine transportation with our next-generation platform, we utilize advanced machine learning techniques that process petabyte-scale data to create low-cost, ultra-immersive transportation solutions that enhance lives. Our dedicated Machine Learning Engineers are at the forefront of these efforts, crafting solutions that significantly influence our core business operations.
If you are a critical thinker with a robust background in machine learning workflows, passionate about leveraging data to solve business challenges, and eager to thrive in a dynamic, collaborative setting, we want to hear from you!
As a Senior Machine Learning Engineer, you will design and implement algorithms that drive the core services and influential products of our platform. The range of challenges we tackle is remarkably diverse, spanning transportation, economics, forecasting, mapping, safety, personalization, and adaptive control. We are eager to welcome motivated experts in these fields who are excited about developing reliable ML systems and solving problems through data in an innovative and fast-paced environment.
About Gridware
Gridware is an innovative technology firm based in San Francisco, committed to safeguarding and enhancing the electrical grid. We have pioneered an advanced class of grid management known as Active Grid Response (AGR), which focuses on monitoring the electrical, physical, and environmental aspects of the grid to improve reliability and safety. Our cutting-edge AGR platform utilizes high-precision sensors to identify potential issues early, enabling proactive maintenance and fault mitigation. This all-encompassing strategy enhances safety, minimizes outages, and promotes efficient grid operations. Supported by climate-tech and Silicon Valley investors, we are at the forefront of transforming grid management. For further details, visit www.Gridware.io.
Role Overview
In the role of Senior Machine Learning Infrastructure Engineer, you will collaborate closely with the Automation organization and the core ML, Operations, and Analytics teams to enhance and develop the infrastructure surrounding model deployment and monitoring. This position is crucial for amplifying the time-saving benefits that Gridware provides to its customers.
Full-time|$126K/yr - $196K/yr|Hybrid|San Francisco
About Scribd:
At Scribd Inc. (pronounced 'scribbed'), we're on a mission to ignite human curiosity. Join our innovative team as we craft a diverse world of stories and knowledge, democratizing the exchange of ideas and empowering collective intelligence through our four flagship products: Everand, Scribd, Slideshare, and Fable.
We foster a culture where authenticity and boldness thrive, facilitating open debates and commitments as we embrace the unexpected. Every team member is empowered to take initiative, prioritizing the needs of our customers.
In terms of workplace structure, we prioritize a balance between personal flexibility and communal connections. Our Scribd Flex initiative allows employees, in collaboration with their managers, to determine the daily work styles that best suit their individual needs while promoting intentional in-person interactions to enhance collaboration and company culture. Therefore, occasional in-person attendance is mandatory for all employees, regardless of their location.
What do we seek in our new team members? We value 'GRIT': the intersection of passion and perseverance toward long-term goals. At Scribd Inc., we believe in harnessing the potential that GRIT unlocks and encourage each employee to adopt a GRIT-driven approach to their work. This means we are looking for individuals who can set and achieve Goals, deliver Results in their responsibilities, contribute Innovative ideas, and positively impact the broader Team through collaboration and a positive attitude.
About Our Machine Learning Team:
Our Machine Learning team is pivotal in developing the platform and product applications that drive personalized discovery, recommendations, and generative AI functionalities across Scribd, Slideshare, and Everand.
The ML team operates on the Orion ML Platform, providing essential ML infrastructure such as a feature store, model registry, model inference systems, and embedding-based retrieval (EBR). Our Machine Learning Engineers collaborate closely with the Product team to integrate machine learning into user-facing features, including real-time personalization and AskAI LLM-powered experiences.
Full-time|Remote|San Francisco, CA or remote within the U.S.
At Philo, we are a dedicated team of technology and product enthusiasts committed to reshaping the television landscape. We blend cutting-edge technology with the captivating medium of television to create the ultimate viewing experience. Our mission is to enhance streaming capabilities through innovative cloud delivery and sophisticated machine learning algorithms that personalize content discovery. As a Senior Machine Learning Engineer specializing in Recommendation Systems, you will be at the forefront of our content personalization initiatives, significantly enhancing user engagement and satisfaction. Your expertise will help ensure that every time users open the Philo app, they find something they want to watch. In this pivotal role, you will spearhead the development of advanced algorithms and large-scale systems that drive Philo's recommendation engine. Collaborating closely with data science, product, infrastructure, and backend engineering teams, you will tackle complex machine learning challenges and develop innovative, data-driven solutions that enhance content discovery and foster user retention.
About Krea
Krea is at the forefront of developing advanced AI creative tools designed to enhance and empower human creativity. Our mission is to create intuitive and controllable AI solutions that allow creatives to express themselves across various formats including text, images, video, sound, and 3D.
About the Position
We are seeking a talented Machine Learning Engineer to lead the design and implementation of Krea's personalization and recommendation systems from the ground up. You will take full ownership of how we comprehend user preferences, curate engaging content, and customize generative models to reflect individual aesthetics.
This role sits at the exciting intersection of recommendation systems, representation learning, and generative imaging and video technologies.
Your Responsibilities
Lead the architecture and development of Krea's personalization and recommendation framework, overseeing the technical direction from inception to deployment.
Craft algorithms that effectively model user preferences and tastes, enabling our systems to adapt to individual styles and aesthetics.
Develop high-quality, curated feeds that strike a balance between exploration, personalization, and aesthetic coherence.
Collaborate closely with our model and research teams to co-create personalization mechanisms that shape how our generative models learn, adapt, and express creative styles.
Contribute to research in personalized image generation, with a focus on style, taste, and subjective quality.
Work in tandem with product, design, and research teams to define what "good personalization" means in a creative context.
Take systems from initial research and prototyping stages through to production, ongoing iteration, and enhancement.
Full-time|$155.6K/yr - $320.3K/yr|Remote|San Francisco, CA, US; Remote, US
About tvScientific
tvScientific is the premier CTV advertising platform exclusively tailored for performance marketers. Our innovative approach harnesses vast data and state-of-the-art science to automate and enhance TV advertising, ultimately driving impactful business results. Our platform seamlessly integrates media buying, optimization, measurement, and attribution into one powerful, efficient solution. Developed by industry veterans with extensive backgrounds in programmatic advertising, digital media, and ad verification, our CTV performance platform is designed to help advertisers confidently scale their business.
We are currently seeking a Senior MLOps Engineer to join our dynamic, distributed engineering team focused on our Connected TV ad-buying platform, as we expand our Machine Learning capabilities. Having successfully optimized TV ad campaigns, we are poised for massive growth, and we need your expertise to ensure our scalability is both sustainable and effective.
As a proud member of Idealab, tvScientific was co-founded by leaders deeply rooted in programmatic advertising and digital media. We empower our clients to purchase ads across the expansive CTV landscape, including platforms such as Hulu, PlutoTV, and the ad-supported tiers of Disney+ and HBO Max. Following our acquisition by Pinterest, we are intensifying our focus on CTV to enhance the performance of search and social advertising.
About Our Team
The Training Runtime team is at the forefront of developing a sophisticated distributed machine-learning training runtime that supports everything from initial research prototypes to cutting-edge model deployments. Our mission is twofold: to enhance the capabilities of researchers and to facilitate large-scale model training. We are creating a cohesive and flexible runtime environment that evolves with researchers as they scale their projects.

Our initiatives revolve around three key pillars:
- optimizing high-performance, asynchronous, zero-copy tensor and optimizer-state-aware data movement;
- constructing resilient, fault-tolerant training frameworks (including robust training loops, effective state management, resilient checkpointing, and comprehensive observability);
- managing distributed processes for long-duration, job-specific uses.

By embedding established large-scale functionalities into a user-friendly runtime, we empower teams to iterate rapidly and operate reliably at any scale, working closely with model-stack, research, and platform teams. Our success is measured in terms of both training throughput (the speed at which models are trained) and researcher efficiency (the speed at which concepts transform into experiments and products).

About the Position
As a Machine Learning Framework Engineer on our Training team, you will be pivotal in enhancing the training throughput of our internal framework while empowering researchers to explore innovative ideas. This role demands exceptional engineering skills, including the design, implementation, and optimization of state-of-the-art AI models, as well as writing clean, efficient machine learning code, a task that is often more challenging than it seems. A deep understanding of supercomputer performance metrics will also be critical.
Ultimately, every project you undertake will aim to advance the field of machine learning.

We seek individuals who are passionate about performance optimization, have a solid grasp of distributed systems, and have an aversion to bugs in their code. Given that our training framework is utilized for extensive runs involving numerous GPUs, any performance enhancement will significantly impact our operations.

This position is based in San Francisco, CA, and adheres to a hybrid work model requiring three days in the office each week. We also provide relocation assistance for new hires.

Key Responsibilities:
- Implement advanced techniques within our internal training framework to maximize hardware efficiency during training sessions.
- Conduct profiling and optimization of our training framework to enhance performance.
- Collaborate with researchers to facilitate the development of next-generation machine learning models.

You Will Excel in This Role If You:
- Possess a strong passion for optimizing system performance.
- Have a profound understanding of distributed systems and their complexities.
- Demonstrate meticulous attention to detail, especially in code quality and debugging.
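The "resilient checkpointing" pillar mentioned above usually comes down to one invariant: a crash mid-save must never corrupt the last good checkpoint. A minimal, framework-agnostic sketch of that write-temp-then-atomic-rename pattern follows; it uses plain pickle for brevity, whereas a real training runtime would serialize tensor and optimizer state and fan out across ranks.

```python
import os
import pickle
import tempfile

def save_checkpoint(state, path):
    """Write `state` atomically: serialize to a temp file in the same
    directory, fsync, then rename over the target. If the process dies
    mid-write, the previous checkpoint at `path` stays intact."""
    d = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=d, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())    # force bytes to disk before the rename
        os.replace(tmp, path)       # atomic on both POSIX and Windows
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)          # never leave partial files behind
        raise

def load_checkpoint(path, default=None):
    """Resume from the last complete checkpoint, or start fresh."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)

ckpt = os.path.join(tempfile.gettempdir(), "demo_ckpt.pkl")
save_checkpoint({"step": 100, "loss": 0.42}, ckpt)
state = load_checkpoint(ckpt)
```

The same-directory temp file matters: `os.replace` is only atomic within a filesystem, so writing the temp file elsewhere (e.g. `/tmp` when the checkpoint lives on a network mount) would silently lose the guarantee.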
Full-time|$200K/yr - $240K/yr|On-site|San Francisco, CA
Join Us in Building a Safer World.
At TRM Labs, we specialize in blockchain analytics and AI solutions aimed at assisting law enforcement, national security agencies, financial institutions, and cryptocurrency businesses in identifying, investigating, and preventing crypto-related fraud and financial crime. Our platforms leverage blockchain intelligence and AI technology to trace funds, detect illicit activity, and construct comprehensive threat profiles. Trusted by leading organizations worldwide, TRM Labs is committed to enabling a safer and more secure environment for all.

Our AI Engineering Team is dedicated to pioneering next-generation AI applications, particularly in the realm of Large Language Models (LLMs) and agentic systems. Our goal is to develop resilient pipelines and high-performance infrastructure that facilitate the swift, safe, and scalable deployment of AI systems. We manage extensive petabyte-scale pipelines, ensuring model serving with millisecond latency while providing the observability and governance needed to make AI production-ready. Our team actively evaluates and integrates leading-edge tools in the LLM and agent space, including open-source stacks, vector databases, evaluation frameworks, and orchestration tools, to accelerate TRM's pace of innovation.

As a Senior or Staff ML Systems Engineer – LLM, you will play a pivotal role in constructing and scaling our technical infrastructure for AI/ML systems.
Your responsibilities will include:
- Creating reusable CI/CD workflows for model training, evaluation, and deployment, integrating tools such as Langfuse, GitHub Actions, and experiment tracking.
- Automating model versioning, approval processes, and compliance checks across various environments.
- Developing a modular and scalable AI infrastructure stack that encompasses vector databases, feature stores, model registries, and observability tools.
- Collaborating with engineering and data science teams to embed AI models and agents into real-time applications and workflows.
- Continuously assessing and incorporating state-of-the-art AI tools (e.g., LangChain, LlamaIndex, vLLM, MLflow, BentoML).
- Promoting AI reliability and governance while enabling experimentation, ensuring compliance, security, and continuous uptime.
- Enhancing AI/ML model performance and ensuring data accuracy and consistency, leading to improved model training and inference.
- Implementing infrastructure to facilitate both offline and online evaluation of LLMs and agents.
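The last responsibility, offline evaluation of LLMs and agents, typically starts as a batch harness that replays a labeled prompt set against a model and scores the outputs, with the scores logged to a tracker such as Langfuse or MLflow. The sketch below shows only the shape of that loop; the stub model and the exact-match metric are placeholders, not TRM's stack.

```python
def evaluate_offline(model_fn, dataset):
    """Run a model callable over a labeled dataset and report exact-match.

    model_fn: callable prompt -> completion (a stub here; in production
              this would call a deployed LLM endpoint).
    dataset:  list of {"prompt": ..., "expected": ...} records.
    """
    results = []
    for row in dataset:
        output = model_fn(row["prompt"])
        results.append({
            "prompt": row["prompt"],
            "output": output,
            "pass": output.strip() == row["expected"].strip(),
        })
    score = sum(r["pass"] for r in results) / len(results)
    return {"exact_match": score, "results": results}

# Stub "model" standing in for a real LLM call.
def toy_model(prompt):
    return {"2+2?": "4", "Capital of France?": "Paris"}.get(prompt, "unknown")

report = evaluate_offline(toy_model, [
    {"prompt": "2+2?", "expected": "4"},
    {"prompt": "Capital of France?", "expected": "Paris"},
    {"prompt": "Largest ocean?", "expected": "Pacific"},
])
```

Real harnesses replace exact-match with task-appropriate scorers (semantic similarity, rubric-based LLM judges) and version both the dataset and the metric so runs stay comparable across model releases.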
Mar 12, 2026