Staff Software Engineer Distributed Systems jobs in San Francisco – Browse 5,945 openings on RoboApply Jobs

Staff Software Engineer - Distributed Data Systems

DatabricksSan Francisco, California

On-site Full-time $192K/yr - $260K/yr

Clicking Apply Now takes you to AutoApply where you can tailor your resume and apply.

Unlock Your Potential

Generate Job-Optimized Resume

One Click And Our AI Optimizes Your Resume to Match The Job Description.

Is Your Resume Optimized For This Role?

Find Out If You're Highlighting The Right Skills And Fix What's Missing

Experience Level

Mid to Senior

Qualifications

We are looking for software engineers who possess: Strong programming skills in languages such as Scala, Java, or Python. Experience with distributed systems and cloud computing. Familiarity with big data technologies and architectures. Problem-solving skills and the ability to work collaboratively in a dynamic environment. Knowledge of machine learning concepts is a plus.

About the job

P-186

At Databricks, we are passionate about empowering data teams to tackle some of the world’s most challenging problems, from security threat detection to cancer drug development. Our mission is to build and operate the leading data and AI infrastructure platform, enabling our customers to concentrate on the high-value challenges that are integral to their own objectives.

Founded in 2013 by the original creators of Apache Spark™, Databricks has rapidly evolved from a small office in Berkeley, California, to a global powerhouse with over 1000 employees. Trusted by thousands of organizations, from startups to Fortune 100 companies, we are recognized as one of the fastest-growing SaaS companies worldwide.

Our engineering teams create highly sophisticated products that address significant needs in the industry. We continuously push the limits of data and AI technology while maintaining the resilience, security, and scalability essential for our customers' success on our platform.

We manage one of the largest-scale software platforms, consisting of millions of virtual machines that generate terabytes of logs and process exabytes of data daily. At this scale, we frequently encounter cloud hardware, network, and operating system faults, and our software must effectively shield our customers from these challenges.

Modern data analysis leverages advanced techniques, such as machine learning, that far exceed the capabilities of traditional SQL query engines. As a Software Engineer on the Runtime team at Databricks, you will be instrumental in developing the next generation of distributed data storage and processing systems that outshine specialized SQL query engines in relational query performance, while providing the flexibility and programming abstractions to support a variety of workloads, from ETL to data science.

Examples of projects you may work on include:

Apache Spark™: Contributing to the de facto open-source framework for big data.
Data Plane Storage: Developing reliable, high-performance services and client libraries for storing and accessing vast amounts of data on cloud storage backends like AWS S3 and Azure Blob Store.
Delta Lake: A storage management system that merges the scalability and cost-effectiveness of data lakes with the performance and reliability of data warehouses, featuring low latency streaming. Its higher-level abstractions and guarantees, including ACID transactions and time travel, significantly reduce the complexity of real-world data engineering architectures.
Delta Pipelines: Aiming to simplify the management of data engineering pipelines.

About Databricks

Databricks is a leading data and AI platform company founded by the original creators of Apache Spark™. With a commitment to innovation and excellence, we help organizations solve complex data challenges.

Similar jobs

1 - 20 of 5,945 Jobs

Select all on this page (20)

Apply

Staff Software Engineer - Distributed Data Systems

Databricks

Full-time|$192K/yr - $260K/yr|On-site|San Francisco, California

P-186 At Databricks, we are passionate about empowering data teams to tackle some of the world’s most challenging problems, from security threat detection to cancer drug development. Our mission is to build and operate the leading data and AI infrastructure platform, enabling our customers to concentrate on the high-value challenges that are integral to their own objectives. Founded in 2013 by the original creators of Apache Spark™, Databricks has rapidly evolved from a small office in Berkeley, California, to a global powerhouse with over 1000 employees. Trusted by thousands of organizations, from startups to Fortune 100 companies, we are recognized as one of the fastest-growing SaaS companies worldwide. Our engineering teams create highly sophisticated products that address significant needs in the industry. We continuously push the limits of data and AI technology while maintaining the resilience, security, and scalability essential for our customers' success on our platform. We manage one of the largest-scale software platforms, consisting of millions of virtual machines that generate terabytes of logs and process exabytes of data daily. At this scale, we frequently encounter cloud hardware, network, and operating system faults, and our software must effectively shield our customers from these challenges. Modern data analysis leverages advanced techniques, such as machine learning, that far exceed the capabilities of traditional SQL query engines. As a Software Engineer on the Runtime team at Databricks, you will be instrumental in developing the next generation of distributed data storage and processing systems that outshine specialized SQL query engines in relational query performance, while providing the flexibility and programming abstractions to support a variety of workloads, from ETL to data science. Examples of projects you may work on include: Apache Spark™: Contributing to the de facto open-source framework for big data. Data Plane Storage: Developing reliable, high-performance services and client libraries for storing and accessing vast amounts of data on cloud storage backends like AWS S3 and Azure Blob Store. Delta Lake: A storage management system that merges the scalability and cost-effectiveness of data lakes with the performance and reliability of data warehouses, featuring low latency streaming. Its higher-level abstractions and guarantees, including ACID transactions and time travel, significantly reduce the complexity of real-world data engineering architectures. Delta Pipelines: Aiming to simplify the management of data engineering pipelines.

Jan 30, 2026

Apply

Staff Software Engineer, Distributed Systems

Ambience Healthcare

Full-time|$250K/yr - $300K/yr|Hybrid|San Francisco

About Us:At Ambience Healthcare, we are not just another scribe; we are pioneering an AI intelligence platform that reintegrates humanity into healthcare, delivering significant ROI for health systems nationwide.Our innovative technology empowers providers to concentrate on delivering exceptional care by alleviating the administrative burdens that distract them from their patients and essential duties. Ambience offers real-time, coding-aware documentation and clinical workflow support across various healthcare settings at the leading health systems in North America.Our teams operate with unwavering dedication and extreme ownership to develop optimal solutions for our healthcare partners. We cherish transparency, positivity, and deep contemplation, holding each other to high standards because we recognize that the challenges we tackle are of utmost importance.Recognized as the leader in enhancing clinician experience by KLAS Research in their Emerging Solutions Top 20 Report, honored by Fast Company as one of the Next Big Things in Tech, acknowledged by Inc. as one of the best AI companies in healthcare, and selected as a LinkedIn Top Startup in 2024 and 2025. We're proudly supported by Oak HC/FT, Andreessen Horowitz (a16z), OpenAI Startup Fund, and Kleiner Perkins — and we're just beginning our journey.The Role:Ambience is responsible for processing millions of patient encounters across the largest health systems in the country. These organizations rely on us for real-time clinical workflows where latency and reliability significantly influence patient care. A delay during a patient visit is not merely a negative metric; it can lead to a physician abandoning the tool.In this position, you will oversee the core systems that enable Ambience to scale with reliability: database architecture, caching, multi-tenancy, and performance optimization that influences the user experience for clinicians. You will design database architectures that accommodate our growth, construct caching systems that prevent EHR API latency from affecting critical processes, and develop multi-tenant infrastructures that protect customer data while enhancing performance.Your ultimate goal will be to create infrastructure that other teams rely on effortlessly.Our engineering roles are hybrid, requiring presence in our San Francisco office three times a week.

Feb 2, 2026

Apply

Software Engineer - Distributed Systems

Achira

Full-time|On-site|San Francisco Office

Why Join Achira?Become part of an exceptional team comprised of scientists, ML researchers, and engineers dedicated to transforming the landscape of drug discovery.Engage with cutting-edge machine learning infrastructure at an unprecedented scale, leveraging extensive computing resources, vast datasets, and ambitious goals.Take ownership of significant projects from conception through to architecture and deployment on large-scale infrastructures.Thrive in a culture that values thoroughness, speed, and a proactive, builder-oriented mindset.About the RoleAt Achira, we are developing state-of-the-art foundation models that address the most complex challenges in simulation for drug discovery and beyond. Our atomistic foundation simulation models (FSMs) serve as comprehensive representations of the physical microcosm, encompassing machine learning interaction potentials (MLIPs), neural network potentials (NNPs), and various generative model classes.We are looking for a Software Engineer who is enthusiastic about distributed computing and its applications in machine learning. You will play a pivotal role in designing and constructing the infrastructure for our ML data generation pipelines, model training, and fine-tuning workflows across large-scale distributed systems.Your expertise will be crucial in ensuring our compute clusters are efficient, observable, cost-effective, and dependable, enabling us to advance the frontiers of ML development. If you are passionate about distributed systems, performance optimization, and cloud cost efficiency, we encourage you to apply.You will be empowered to conceptualize and manage complex workloads across multiple vendors worldwide. Achira's mission revolves around computation, and providing seamless access to our uniquely tailored workloads at the lowest possible cost is critical to our success.

Oct 7, 2025

Apply

Software Engineer - Distributed Systems (Infrastructure)

Cloudflare, Inc.

Full-time|Hybrid|Hybrid

Join Cloudflare as a Software Engineer specializing in Distributed Systems and Infrastructure. In this role, you will be responsible for designing, implementing, and optimizing scalable systems that enhance the performance and reliability of our services. You will collaborate closely with cross-functional teams to develop innovative solutions that support our mission to help build a better Internet.

Mar 4, 2026

Apply

Staff Software Engineer, Stream Compute

Stripe, Inc.

Full-time|On-site|San Francisco, Seattle, New York, Toronto

Join Stripe as a Staff Software Engineer in our Stream Compute team, where you will play a pivotal role in building scalable solutions that power the financial infrastructure of the internet. As a member of our innovative engineering team, you will leverage your expertise to design and implement robust software solutions that enhance the performance and reliability of our streaming data capabilities.

Apr 1, 2026

Apply

Software Engineer specializing in Distributed Systems

Browserbase

Full-time|On-site|San Francisco

At Browserbase, we revolutionize web browsing for AI agents and applications. Our innovative headless browser infrastructure automates interactions with websites, simplifies form filling, and replicates user actions seamlessly.Having successfully raised a $40M Series B last year, we are on an accelerated growth trajectory. Supported by esteemed investors such as Kleiner Perkins, CRV, and Notable Capital, our dynamic team is committed to realizing our CEO's vision for empowering the best AI tools and transforming web automation.Our Core Infrastructure team is essential for maintaining the efficiency of our operations. This group tackles significant distributed systems challenges, ensuring our platform's speed, reliability, and scalability.

Mar 25, 2025

Apply

Software Engineer in GPU Networking & Distributed Systems

Baseten

Full-time|On-site|San Francisco

Join Baseten as a Software Engineer focusing on GPU Networking and Distributed Systems. In this pivotal role, you'll collaborate with talented engineers and researchers to develop cutting-edge solutions that leverage GPU technology for high-performance networking operations. Your contributions will be instrumental in shaping the future of distributed systems, enhancing performance, scalability, and reliability.

Feb 23, 2026

Apply

Software Engineer, Machine Learning Infrastructure & Distributed Systems (Staff & Principal)

Tubi, Inc.

Full-time|$227.2K/yr - $417K/yr|Hybrid|San Francisco, CA; Los Angeles, CA; New York, NY (Hybrid); USA - Remote

About the Role:Join our dynamic ML Infrastructure team as a Software Engineer, where you'll collaborate intimately with the Machine Learning and Product teams to construct top-tier machine learning inference platforms. These cutting-edge platforms drive vital services such as personalized recommendations, search functionalities, and content comprehension at Tubi.Your primary focus will be on the development and maintenance of low-latency ML model serving systems that cater to Deep Learning, LLM, and Search models. This will include the creation of self-service infrastructure and critical components such as the inference engine, feature store, vector store, and experimentation engine.In this role, you'll enhance our service deployment and operational processes, with opportunities to contribute to open-source projects. Enjoy architectural freedom to explore innovative frameworks, spearhead significant cross-functional projects, and elevate the capabilities of our ML and Product teams.We are currently hiring for two positions:Staff Software EngineerPrincipal Software EngineerAdditional Details: As a Principal Engineer, you will serve as a technical leader and visionary, guiding the advancement of our machine learning platform. You'll address complex technical challenges, shape architectural decisions, and mentor senior engineers, fostering a culture of excellence and continuous improvement. Your contributions will impact millions of users.

Mar 23, 2026

Apply

Software Engineer, Distributed Data Systems (Sora)

OpenAI

Full-time|Hybrid|San Francisco

About Our TeamJoin the innovative Sora team at OpenAI, where we are at the forefront of developing multimodal capabilities for our foundation models. As a dynamic hybrid of research and product development, we focus on seamlessly integrating advanced multimodal functionalities into our AI offerings, ensuring they are not only reliable and user-friendly but also aligned with our mission to foster broad societal benefits.About the PositionWe are seeking a dedicated Software Engineer specializing in Distributed Data Systems to architect and enhance the infrastructure that supports large-scale multimodal training and evaluation at OpenAI. In this role, you will oversee distributed data pipelines and collaborate closely with our researchers to translate their requirements into robust, high-performance systems. You will play a crucial role in fortifying the pipelines that underpin Sora’s rapid innovation cycles.We are looking for engineers with a keen eye for detail, substantial experience with distributed systems, and a proven track record of building reliable infrastructures in high-stakes environments.This position is based in San Francisco, CA, and follows a hybrid work model requiring three days in the office each week. We also provide relocation assistance to new team members.Key Responsibilities:Design, build, and maintain data infrastructure systems including distributed computing, data orchestration, distributed storage, streaming infrastructure, and machine learning infrastructure, ensuring they are scalable, reliable, and secure.Ensure our data platform can scale dramatically while maintaining high levels of reliability and efficiency.Collaborate with researchers to deeply understand their needs and translate them into production-ready systems.Harden, optimize, and maintain vital data infrastructure systems that drive multimodal training and evaluation.Ideal Candidates Will Have:Extensive experience with distributed systems and large-scale infrastructure, coupled with a strong passion for data.A detail-oriented mindset and a commitment to building and maintaining dependable systems.Solid software engineering fundamentals and exceptional organizational skills.Comfort with ambiguity and rapid changes in a fast-paced environment.About OpenAIOpenAI is a pioneering AI research and deployment organization dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We strive to advance digital intelligence in a way that is safe and beneficial, pushing the boundaries of innovation and technology.

Nov 14, 2025

Apply

Senior Software Engineer - Distributed Data Systems

Databricks

Full-time|$166K/yr - $225K/yr|On-site|San Francisco, California

At Databricks, we are driven by a passion for empowering data teams to tackle the world’s most challenging problems — from transforming transportation to accelerating medical innovations. We achieve this by creating and maintaining the leading data and AI infrastructure platform, enabling our clients to leverage profound data insights for business enhancement. Founded by engineers with a customer-first mentality, we eagerly embrace every opportunity to tackle complex technical challenges, ranging from the design of next-generation UI/UX for data interactions to scaling our services across millions of virtual machines. Our journey has just begun.As a member of the Runtime team at Databricks, you will be instrumental in developing the next generation of distributed data storage and processing systems. These systems will surpass specialized SQL query engines in relational query performance while offering the programming abstractions necessary to support a variety of workloads, from ETL to data science.Example projects include:Apache Spark™: Contribute to the de facto open-source standard framework for big data.Data Plane Storage: Develop reliable and high-performance services and client libraries for managing vast amounts of data within cloud storage backends like AWS S3 and Azure Blob Store.Delta Lake: Design a storage management system that merges the scalability and cost-effectiveness of data lakes with the performance and reliability of data warehouses, providing features like ACID transactions and time travel.Delta Pipelines: Simplify the orchestration and operation of numerous data pipelines, enabling clients to deploy, test, and upgrade pipelines effortlessly.Performance Engineering: Create the next-generation query optimizer and execution engine that is fast, scalable, and robust.

Jan 30, 2026

Apply

Senior Software Engineer Specializing in Python and Distributed Systems

Scribd, Inc.

Full-time|$120K/yr - $228K/yr|Hybrid|San Francisco

At Scribd, Inc., we are dedicated to enhancing human understanding through our suite of innovative products, including Scribd®, Slideshare®, Everand™, and Fable. Our mission revolves around transforming access into deeper insights and expertise for billions globally.Our CultureWe foster a culture where authenticity and boldness are encouraged; where constructive debates lead to commitment, and where every team member is empowered to prioritize customer needs.We believe that exceptional work emerges from harmonizing individual flexibility with a strong sense of community. Our Scribd Flex program allows employees to select their preferred work style and location, while also emphasizing the importance of intentional in-person interactions to enhance collaboration and culture. All employees are expected to participate in occasional in-person meetings, regardless of their location.We look for team members who embody “GRIT”—the intersection of passion and perseverance towards long-term goals. GRIT serves as a framework for our operations: setting and achieving Goals, delivering impactful Results, contributing Innovative ideas, and building a strong Team through collaboration.Join us at Scribd (pronounced “scribbed”) as we ignite human curiosity and create a world filled with stories and knowledge, democratizing the exchange of ideas and empowering collective expertise.The TeamOur ML Data Engineering team is responsible for powering metadata extraction, enrichment, and content understanding across our platforms.

Nov 17, 2025

Apply

Technical Staff Member - Distributed Systems

Gimlet Labs

Full-time|On-site|San Francisco

At Gimlet Labs, we are pioneering the first heterogeneous neocloud tailored for AI workloads. As AI technology evolves, the industry confronts critical limitations in power, capacity, and cost linked to the traditional homogeneous, vertically integrated infrastructure. Gimlet addresses these challenges by decoupling AI workloads from the fundamental hardware, intelligently partitioning them into components and orchestrating each to the hardware that best meets its performance and efficiency needs. This innovative approach facilitates heterogeneous systems across diverse vendors and generations of hardware, including the latest emerging accelerators, resulting in significant improvements in performance and cost efficiency at scale.Building upon this platform, Gimlet is developing a production-grade neocloud for agentic workloads. Our customers can deploy and manage their workloads through stable, production-ready APIs without the complexities of hardware selection, placement, or low-level performance optimization.Gimlet collaborates with foundational labs, hyperscalers, and AI-native companies to enable real production workloads designed to scale to gigawatt-class AI datacenters.We are currently in search of a Technical Staff Member specializing in distributed systems. In this role, you will be instrumental in developing the core platform responsible for scheduling, routing, and managing AI workloads reliably at production scale. You will engage with systems that coordinate execution across thousands of nodes, provide stable production APIs, and guarantee predictable workload performance under real-world conditions of load and failure.This position is ideal for engineers passionate about building foundational infrastructure, grasping end-to-end systems, and operating at scale.

Mar 10, 2026

Apply

Distributed Systems Engineer - Platform

Inngest

Full-time|Remote|US/Remote

At Inngest, our Systems Engineers are the architects behind the backbone of our platform, crafting a robust execution layer, an efficient queueing system, and scalable state stores that interlink seamlessly. This role presents an exciting opportunity to tackle complex technical challenges while deriving immense satisfaction from your contributions.About Us: Inngest is pioneering innovative solutions to long-standing challenges faced by developers. Our mission is to create first-of-its-kind tools that enhance the daily workflow of developers, prioritizing user experience and performance. A strong product-centric mindset and a passion for developer tools are essential for success in this role.The Role: A successful Systems Engineer at Inngest must possess a blend of generalist and specialist skills. You will collaborate with our team to enhance the functionality of our queueing system (including debounce and concurrency mechanisms), manage vast amounts of data in our state store, and refine the API layers that facilitate user interactions. Your work will have a direct impact on millions of developers, and you will engage closely with designers, engineers, and founders to optimize user experience.Note: This position requires overlapping working hours with US PST. While residing in the San Francisco Bay Area is preferred, exceptional candidates from anywhere in the United States will be considered. Our engineering team operates in-person several days a week in San Francisco.

Apr 15, 2025

Apply

Senior Software Engineer (Python + Distributed systems)

Scribd Inc.

Full-time|$120K/yr - $228K/yr|On-site|San Francisco

About Scribd:At Scribd Inc. (pronounced “scribbed”), we ignite human curiosity by fostering a world rich in stories and knowledge. Join our innovative team as we democratize the flow of ideas and information, empowering collective expertise through our diverse product offerings: Everand, Scribd, Slideshare, and Fable.This job posting represents an open and approved position within our organization.We cultivate a culture where authenticity and boldness thrive; where we engage in vibrant discussions and embrace the unexpected; and where every employee is empowered to take meaningful actions with a firm customer focus.Our work structure emphasizes a balance between individual flexibility and community engagement. Through our Scribd Flex program, employees can collaborate with their managers to determine the most effective work style that meets their personal needs. A core principle of Scribd Flex is prioritizing intentional in-person interactions that foster collaboration, culture, and connection. Therefore, occasional in-person attendance is required for all Scribd employees, irrespective of their location.What do we seek in new team members? We hire for “GRIT.” GRIT embodies the blend of passion and perseverance towards long-term goals. At Scribd Inc., we believe in the possibilities this can unlock and encourage our employees to adopt a GRIT-driven approach to their work. In practical terms, GRIT represents our standards: we look for individuals who can set and accomplish Goals, deliver Results in their roles, contribute Innovative ideas and solutions, and positively impact the Team through collaboration and attitude.

Nov 17, 2025

Apply

Staff Software Engineer with a Focus on Systems Engineering

Crusoe

Full-time|On-site|San Francisco, CA - US

Join our innovative team at Crusoe as a Staff Software Engineer, where you will leverage your expertise in systems engineering to develop cutting-edge software solutions. In this dynamic role, you will collaborate with cross-functional teams to design, implement, and optimize systems that drive our mission forward. Your contributions will be pivotal in enhancing our technology stack and ensuring the seamless operation of our systems.

Apr 10, 2026

Apply

Software Engineer, Database Systems

OpenAI

Full-time|On-site|San Francisco

About Our Team:Join the innovative Database Systems team at OpenAI, where we specialize in high-performance distributed databases. We are the architects behind Rockset, a cutting-edge real-time search, analytics, and vector database that powers all vector search and retrieval augmented generation (RAG) at OpenAI. Rockset underpins core functionalities across all OpenAI product lines and supports various critical internal applications.About the Role:We are in search of engineers who are passionate about distributed systems, performance optimization at a low level (with our core engine developed in C++), and constructing scalable database infrastructures from scratch. As a member of the Database Systems team, you will play a key role in enhancing the core database engine, making significant contributions to ingestion, query execution, indexing, and storage improvements. You will collaborate with multiple teams across OpenAI to unlock new product capabilities and ensure the reliability and scalability of our online database as usage expands exponentially.Your Responsibilities Will Include:Design, develop, and maintain high-performance distributed systems.Identify and address performance bottlenecks to elevate infrastructure capabilities.Define and guide the long-term technical vision and evolution of the system.Collaborate with product, engineering, and research teams to deliver robust and scalable infrastructure.Investigate complex production issues across the entire technology stack.Contribute to incident response, retrospective analyses, and establishing best practices for system reliability.You Will Excel In This Role If You:Possess substantial experience in building, scaling, and optimizing distributed systems.Exhibit a keen interest in database internals, storage engines, or low-latency query systems.Enjoy tackling complex performance challenges in high-throughput systems.Have experience managing and operating production clusters at scale (e.g., Kubernetes or similar orchestration tools).Approach scalability, correctness, and reliability with a rigorous mindset.Thrive in a fast-paced environment where you can make a significant impact.Qualifications:4+ years of relevant industry experience with a focus on distributed systems.Proficiency in C++ or similar low-level programming languages.Strong problem-solving skills and attention to detail.Experience with performance monitoring and optimization tools.Excellent collaboration and communication skills.

Jul 29, 2025

Apply

Distributed Systems Engineer - Data Platform - Analytics and Alerts

Cloudflare, Inc.

Full-time|Hybrid|Hybrid

Join Cloudflare as a Distributed Systems Engineer within our dynamic Data Platform team, focusing on Analytics and Alerts. In this position, you will play a pivotal role in building and optimizing distributed systems that power our data analytics capabilities, providing real-time insights and alerts to enhance our customer experience.

Mar 4, 2026

Apply

Distributed Training Engineer - Member of Technical Staff

Liquid AI

Full-time|On-site|San Francisco

About Liquid AIOriginating from MIT CSAIL, Liquid AI specializes in the development of general-purpose AI systems designed to operate seamlessly across various platforms, including data center accelerators and on-device hardware. Our focus is on delivering low latency, efficient memory usage, privacy, and reliability. We collaborate with organizations in diverse sectors such as consumer electronics, automotive, life sciences, and financial services. As we experience rapid growth, we seek outstanding talent to join our mission.The OpportunityThe Training Infrastructure team is at the forefront of building the distributed systems that empower our next-generation Liquid Foundation Models. As our operations expand, we aim to innovate, implement, and enhance the infrastructure crucial for large-scale training.This role is centered around high ownership of training systems, emphasizing runtime, performance, and reliability rather than a typical platform or SRE function. You will collaborate within a small, agile team, creating vital systems from the ground up instead of working with pre-existing infrastructure.While San Francisco and Boston are preferred, we are open to other locations.What We're Looking ForWe are seeking an individual who:Embraces the complexity of distributed systems: Our team is dedicated to maintaining stability during extensive training runs, troubleshooting training failures across GPU clusters, and enhancing overall performance.Is passionate about building: We value team members who take pride in developing robust, efficient, and reliable infrastructure.Excels in uncertain environments: Our systems are designed to support evolving model architectures. You will be making decisions based on incomplete information and rapidly iterating.Aligns with team goals and delivers results: The best engineers on our team align with collective priorities while providing data-driven feedback when challenges arise.The WorkDesign and develop core systems that ensure quick and reliable large training runs.Create scalable distributed training infrastructure for GPU clusters.Implement and refine parallelism and sharding strategies for evolving architectures.Optimize distributed efficiency through topology-aware collectives, communication/compute overlap, and straggler mitigation.Develop data loading systems to eliminate I/O bottlenecks for multimodal datasets.

Jul 29, 2025

Apply

Distributed Systems Engineer

Sieve

Full-time|On-site|San Francisco

About UsSieve is a pioneering AI research lab dedicated solely to video data. We harness exabyte-scale video infrastructure and innovative video understanding techniques, along with a multitude of data sources, to create datasets that advance the field of video modeling. Given that video constitutes 80% of internet traffic, it serves as a vital medium that fuels creativity, communication, gaming, AR/VR, and robotics. Our mission is to tackle the most significant challenge in the development of these applications: acquiring high-quality training data.With a small yet highly skilled team of just 15 members, we have formed strategic partnerships with leading AI labs and achieved $XXM in revenue last quarter alone. Our Series A funding round last year was backed by prestigious firms, including Matrix Partners, Swift Ventures, Y Combinator, and AI Grant.About the RoleAs a Distributed Systems Engineer at Sieve, you will be responsible for designing and implementing systems that efficiently manage the compute, scheduling, and orchestration of complex machine learning and ETL pipelines. Your work will ensure these systems operate quickly, reliably, and cost-effectively while processing large volumes of video data.You will thrive in this role if you are passionate about optimizing system uptime, have experience with cloud technologies, and enjoy working with high-performance distributed systems involving thousands of GPUs. Additionally, you will play a key role in developing excellent internal tools and CI/CD pipelines to facilitate rapid iteration.

Apr 26, 2025

Apply

Distributed Systems Engineer at archil | San Francisco

Archil

Full-time|On-site|San Francisco

Role OverviewJoin our innovative team as a Distributed Systems Engineer at Archil, where you will play a pivotal role in developing cutting-edge storage solutions. You will work across the entire technology stack, tackling challenges as they arise and significantly shaping our product's technical and strategic direction.Your responsibilities will include:Being on-call for our production systems to assist customers promptly in case of issues.Innovating and implementing unprecedented features in our storage services.Designing interactions within distributed systems to ensure atomicity and idempotency.Deploying and standardizing infrastructure across various cloud environments.Navigating evolving customer requirements amidst ambiguity.

Jun 2, 2025

Create account — see all 5,945 results

1 - 20 of 5,945 Jobs

Select all on this page (20)

Apply

Staff Software Engineer - Distributed Data Systems

Databricks

Full-time|$192K/yr - $260K/yr|On-site|San Francisco, California

Jan 30, 2026

Apply

Staff Software Engineer, Distributed Systems

Ambience Healthcare