Experience Level
Entry Level
Qualifications
The ideal candidate should possess a strong foundation in distributed systems, with experience in designing and implementing scalable applications. Proficiency in programming languages such as Go, Python, or Java is essential. A background in data analytics or real-time processing systems will be a significant advantage.
About the job
Join Cloudflare as a Distributed Systems Engineer within our dynamic Data Platform team, focusing on Analytics and Alerts. In this position, you will play a pivotal role in building and optimizing distributed systems that power our data analytics capabilities, providing real-time insights and alerts to enhance our customer experience.
About Cloudflare, Inc.
Cloudflare is a global leader in web performance and security, helping to build a better Internet. We protect and accelerate any Internet application without adding hardware, installing software, or changing a line of code.
Join the Sora Team
At Sora, we are at the forefront of integrating video capabilities into OpenAI’s foundational models. Our hybrid research and product team is dedicated to expanding the boundaries of video model capabilities while ensuring their reliability and safety. We achieve this through rigorous research, experimentation, and real-world deployment, aiming to bring our advancements to a broader audience.

Your Role as a Distributed Systems/ML Engineer
In this pivotal role, you will be instrumental in enhancing the training throughput of our internal framework, empowering researchers to experiment with cutting-edge ideas. Your responsibilities will encompass designing, implementing, and optimizing state-of-the-art AI models, keeping your machine learning code bug-free, and leveraging your expertise in supercomputer performance. We seek individuals who are passionate about performance optimization, possess a deep understanding of distributed systems, and have zero tolerance for bugs in code.

This position is based in San Francisco, CA, following a hybrid work model with three days in the office each week. We also provide relocation assistance for new team members.

Key Responsibilities:
- Collaborate closely with researchers to facilitate the development of systems-efficient video models and architectures.
- Implement the latest techniques within our training framework to achieve exceptional hardware efficiency during training runs.
- Profile and optimize our training framework to ensure peak performance.

You Will Excel in This Role If You:
- Possess experience with multi-modal machine learning pipelines.
- Enjoy delving into system implementations and grasping their fundamentals to enhance performance and maintainability.
- Demonstrate strong software engineering expertise and proficiency in Python.
- Have experience understanding and optimizing training kernels.
- Are eager to explore stable training dynamics.

About OpenAI
OpenAI is a pioneering AI research and deployment organization committed to ensuring that general-purpose artificial intelligence benefits all of humanity. We continually push the boundaries of what is possible with AI, striving to create a positive impact in various fields.
Sciforium is a pioneering AI infrastructure company dedicated to developing state-of-the-art multimodal AI models and a proprietary, high-efficiency serving platform. With substantial multi-million-dollar funding and direct collaboration with AMD engineers, our team is rapidly expanding to create the complete stack that drives cutting-edge AI models and real-time applications.

About the Role
We are looking for a talented Distributed Training Engineer to develop, optimize, and maintain the essential software stack that supports our extensive AI training operations. In this role, you will work across the entire machine learning infrastructure, from low-level CUDA/ROCm runtimes to high-level frameworks such as JAX and PyTorch, ensuring that our distributed training systems are fast, scalable, stable, and efficient.

This opportunity is ideal for individuals passionate about deep systems engineering, troubleshooting complex hardware-software interactions, and enhancing performance at every level of the machine learning stack. You will contribute directly to the training and deployment of next-generation LLMs and generative AI models.

Key Responsibilities
- Software Stack Maintenance: Manage, update, and enhance critical ML libraries and frameworks, including JAX, PyTorch, CUDA, and ROCm, across various environments and hardware configurations.
- End-to-End Stack Ownership: Build, sustain, and continually refine the entire ML software stack, from ROCm/CUDA drivers to high-level JAX/PyTorch tooling.
- Distributed Training Optimization: Ensure optimal sharding, partitioning, and configuration of all model implementations for large-scale distributed training.
- System Integration: Continuously integrate and validate modules for runtime correctness, memory efficiency, and scalability across multi-node GPU/accelerator clusters.
- Profiling & Performance Analysis: Perform detailed profiling of compilation graphs, training workloads, and runtime execution to improve performance and eliminate bottlenecks.
- Debugging & Reliability: Diagnose intricate hardware-software interaction issues, including vLLM compilation failures on ROCm, CUDA memory leaks, distributed runtime failures, and kernel-level inconsistencies.
- Collaboration: Work with research, infrastructure, and kernel engineering teams to enhance system throughput, stability, and developer experience.
About Our Team
Join the Sora team at OpenAI, where we are at the forefront of developing multimodal capabilities for our foundation models. As a hybrid of research and product development, we focus on seamlessly integrating advanced multimodal functionalities into our AI offerings, ensuring they are reliable, user-friendly, and aligned with our mission to foster broad societal benefits.

About the Position
We are seeking a Software Engineer specializing in Distributed Data Systems to architect and enhance the infrastructure that supports large-scale multimodal training and evaluation at OpenAI. In this role, you will oversee distributed data pipelines and collaborate closely with our researchers to translate their requirements into robust, high-performance systems. You will play a crucial role in hardening the pipelines that underpin Sora’s rapid innovation cycles.

We are looking for engineers with a keen eye for detail, substantial experience with distributed systems, and a proven track record of building reliable infrastructure in high-stakes environments.

This position is based in San Francisco, CA, and follows a hybrid work model requiring three days in the office each week. We also provide relocation assistance to new team members.

Key Responsibilities:
- Design, build, and maintain data infrastructure systems, including distributed computing, data orchestration, distributed storage, streaming infrastructure, and machine learning infrastructure, ensuring they are scalable, reliable, and secure.
- Ensure our data platform can scale dramatically while maintaining high levels of reliability and efficiency.
- Collaborate with researchers to deeply understand their needs and translate them into production-ready systems.
- Harden, optimize, and maintain the vital data infrastructure systems that drive multimodal training and evaluation.

Ideal Candidates Will Have:
- Extensive experience with distributed systems and large-scale infrastructure, coupled with a strong passion for data.
- A detail-oriented mindset and a commitment to building and maintaining dependable systems.
- Solid software engineering fundamentals and exceptional organizational skills.
- Comfort with ambiguity and rapid change in a fast-paced environment.

About OpenAI
OpenAI is a pioneering AI research and deployment organization dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We strive to advance digital intelligence in a way that is safe and beneficial, pushing the boundaries of innovation and technology.
About Liquid AI
Originating from MIT CSAIL, Liquid AI develops general-purpose AI systems designed to operate seamlessly across platforms ranging from data-center accelerators to on-device hardware, with a focus on low latency, efficient memory usage, privacy, and reliability. We collaborate with organizations in diverse sectors such as consumer electronics, automotive, life sciences, and financial services. As we experience rapid growth, we seek outstanding talent to join our mission.

The Opportunity
The Training Infrastructure team builds the distributed systems that power our next-generation Liquid Foundation Models. As our operations expand, we aim to design, implement, and enhance the infrastructure crucial for large-scale training.

This role centers on high ownership of training systems, emphasizing runtime, performance, and reliability rather than a typical platform or SRE function. You will work within a small, agile team, creating vital systems from the ground up rather than maintaining pre-existing infrastructure.

While San Francisco and Boston are preferred, we are open to other locations.

What We're Looking For
We are seeking an individual who:
- Embraces the complexity of distributed systems: Our team keeps extensive training runs stable, troubleshoots training failures across GPU clusters, and improves overall performance.
- Is passionate about building: We value team members who take pride in developing robust, efficient, and reliable infrastructure.
- Excels in uncertain environments: Our systems support evolving model architectures, so you will make decisions based on incomplete information and iterate rapidly.
- Aligns with team goals and delivers results: The best engineers on our team align with collective priorities while providing data-driven feedback when challenges arise.

The Work
- Design and develop core systems that keep large training runs fast and reliable.
- Create scalable distributed training infrastructure for GPU clusters.
- Implement and refine parallelism and sharding strategies for evolving architectures.
- Optimize distributed efficiency through topology-aware collectives, communication/compute overlap, and straggler mitigation.
- Develop data loading systems to eliminate I/O bottlenecks for multimodal datasets.
About Our Team
The Training Runtime team develops a sophisticated distributed machine-learning training runtime that supports everything from initial research prototypes to cutting-edge model deployments. Our mission is twofold: to enhance the capabilities of researchers and to facilitate large-scale model training. We are creating a cohesive, flexible runtime environment that evolves with researchers as they scale their projects.

Our initiatives revolve around three key pillars: optimizing high-performance, asynchronous, zero-copy, tensor- and optimizer-state-aware data movement; constructing resilient, fault-tolerant training frameworks (including robust training loops, effective state management, resilient checkpointing, and comprehensive observability); and managing distributed processes for long-duration, job-specific uses. By embedding established large-scale functionality into a user-friendly runtime, we empower teams to iterate rapidly and operate reliably at any scale, working closely with model-stack, research, and platform teams. We measure success in both training throughput (how fast models are trained) and researcher efficiency (how fast concepts become experiments and products).

About the Position
As a Machine Learning Framework Engineer on our Training team, you will be pivotal in enhancing the training throughput of our internal framework while empowering researchers to explore innovative ideas. This role demands exceptional engineering skills, including designing, implementing, and optimizing state-of-the-art AI models, as well as writing clean, efficient machine learning code, a task that is often harder than it seems. A deep understanding of supercomputer performance metrics will also be critical. Ultimately, every project you undertake will aim to advance the field of machine learning.

We seek individuals who are passionate about performance optimization, have a solid grasp of distributed systems, and have an aversion to bugs in their code. Because our training framework is used for extensive runs involving numerous GPUs, any performance improvement has a significant impact on our operations.

This position is based in San Francisco, CA, and adheres to a hybrid work model requiring three days in the office each week. We also provide relocation assistance for new hires.

Key Responsibilities:
- Implement advanced techniques within our internal training framework to maximize hardware efficiency during training runs.
- Profile and optimize our training framework to enhance performance.
- Collaborate with researchers to facilitate the development of next-generation machine learning models.

You Will Excel in This Role If You:
- Possess a strong passion for optimizing system performance.
- Have a profound understanding of distributed systems and their complexities.
- Demonstrate meticulous attention to detail, especially in code quality and debugging.
Overview
Pluralis Research is at the forefront of innovation in Protocol Learning, specializing in the collaborative training of foundational models. Our approach ensures that no single participant ever has, or can obtain, a complete version of the model. This initiative aims to create community-driven, collectively owned frontier models that operate on self-sustaining economic principles.

We are seeking experienced Senior or Staff Machine Learning Engineers with over 5 years of expertise in distributed systems and large-scale machine learning training. In this role, you will design and implement a groundbreaking substrate for training distributed ML models that function effectively over consumer-grade internet connections.
At Inngest, our Systems Engineers are the architects behind the backbone of our platform, crafting a robust execution layer, an efficient queueing system, and scalable state stores that interlink seamlessly. This role presents an exciting opportunity to tackle complex technical challenges while deriving immense satisfaction from your contributions.

About Us: Inngest is pioneering solutions to long-standing challenges faced by developers. Our mission is to create first-of-their-kind tools that enhance the daily workflow of developers, prioritizing user experience and performance. A strong product-centric mindset and a passion for developer tools are essential for success in this role.

The Role: A successful Systems Engineer at Inngest must possess a blend of generalist and specialist skills. You will collaborate with our team to enhance the functionality of our queueing system (including debounce and concurrency mechanisms), manage vast amounts of data in our state store, and refine the API layers that facilitate user interactions. Your work will have a direct impact on millions of developers, and you will engage closely with designers, engineers, and founders to optimize user experience.

Note: This position requires overlapping working hours with US PST. While residing in the San Francisco Bay Area is preferred, exceptional candidates from anywhere in the United States will be considered. Our engineering team works in person several days a week in San Francisco.
About Us
Sieve is a pioneering AI research lab dedicated solely to video data. We harness exabyte-scale video infrastructure and innovative video understanding techniques, along with a multitude of data sources, to create datasets that advance the field of video modeling. Video constitutes roughly 80% of internet traffic and is a vital medium that fuels creativity, communication, gaming, AR/VR, and robotics. Our mission is to tackle the most significant challenge in building these applications: acquiring high-quality training data.

With a small yet highly skilled team of just 15 members, we have formed strategic partnerships with leading AI labs and achieved $XXM in revenue last quarter alone. Our Series A funding round last year was backed by prestigious firms, including Matrix Partners, Swift Ventures, Y Combinator, and AI Grant.

About the Role
As a Distributed Systems Engineer at Sieve, you will design and implement systems that manage the compute, scheduling, and orchestration of complex machine learning and ETL pipelines, ensuring they operate quickly, reliably, and cost-effectively while processing large volumes of video data.

You will thrive in this role if you are passionate about system uptime, have experience with cloud technologies, and enjoy working with high-performance distributed systems involving thousands of GPUs. You will also play a key role in developing excellent internal tools and CI/CD pipelines to enable rapid iteration.
Why Join Achira?
- Become part of an exceptional team of scientists, ML researchers, and engineers dedicated to transforming the landscape of drug discovery.
- Engage with cutting-edge machine learning infrastructure at unprecedented scale, leveraging extensive computing resources, vast datasets, and ambitious goals.
- Take ownership of significant projects from conception through architecture and deployment on large-scale infrastructure.
- Thrive in a culture that values thoroughness, speed, and a proactive, builder-oriented mindset.

About the Role
At Achira, we are developing state-of-the-art foundation models that address the most complex challenges in simulation for drug discovery and beyond. Our atomistic foundation simulation models (FSMs) serve as comprehensive representations of the physical microcosm, encompassing machine learning interatomic potentials (MLIPs), neural network potentials (NNPs), and various generative model classes.

We are looking for a Software Engineer who is enthusiastic about distributed computing and its applications in machine learning. You will play a pivotal role in designing and building the infrastructure for our ML data generation pipelines, model training, and fine-tuning workflows across large-scale distributed systems.

Your expertise will be crucial in keeping our compute clusters efficient, observable, cost-effective, and dependable, enabling us to advance the frontiers of ML development. If you are passionate about distributed systems, performance optimization, and cloud cost efficiency, we encourage you to apply.

You will be empowered to design and manage complex workloads across multiple vendors worldwide. Achira's mission revolves around computation, and providing seamless access to our uniquely tailored workloads at the lowest possible cost is critical to our success.
Role Overview
Join Archil as a Senior Distributed Systems Engineer, where you will play a critical role in developing our innovative storage solutions. You'll work with technologies across the entire stack to tackle challenges and contribute to building Archil volumes, significantly influencing both technical design and product strategy.

Key Responsibilities
- Provide on-call support for our production systems, ensuring customer satisfaction when issues arise.
- Innovate and implement unprecedented capabilities within our storage services.
- Design interactions in distributed systems with a focus on atomicity and idempotency.
- Deploy and generalize infrastructure across multiple cloud environments.
- Adapt to evolving customer needs amidst ambiguity.
- Lead engineering teams through complex decisions and provide insightful PR feedback.
Join Krea's Innovative Team
At Krea, we are at the forefront of developing next-generation AI creative tools. Our commitment lies in making AI an intuitive and controllable medium for creatives. We aspire to create tools that enhance human creativity rather than replace it.

We view AI as a transformative medium that enables expression across diverse formats: text, images, video, sound, and even 3D. Our focus is on creating smarter, more adaptable tools that leverage this medium effectively.

The Role of Supercomputing and AI Infrastructure at Krea
Our team builds and manages the foundational infrastructure that supports Krea's research and inference: distributed training systems, Kubernetes clusters with over 1,000 GPUs, and petabyte-scale data pipelines. Much of our work involves creating bespoke solutions, such as custom distributed datastores, job orchestration systems, and advanced streaming pipelines, designed to handle modern AI workloads efficiently.

Key Projects You Will Contribute To:
- Distributed Data Systems: Design and implement multi-stage pipelines to transform petabytes of raw data into clean, annotated datasets; run classification models across billions of images; deploy and integrate large language models to caption extensive multimedia data.
- GPU Infrastructure: Manage distributed training and inference across Kubernetes clusters with 1,000+ GPUs; address orchestration and scaling challenges for large-scale GPU job processing; optimize research workflows across multiple datacenters.
- Distributed Training: Profile and enhance dataloaders streaming thousands of images per second; troubleshoot InfiniBand networking during extensive training runs; develop fault-tolerance systems for large-scale pretraining; collaborate with researchers to refine reinforcement learning infrastructure.
- Applied ML Pipelines: Identify clean scenes in millions of videos using distributed shot-boundary detection; tailor and train models to sift through billions of images for specific queries; construct systems that link raw cluster capacity to research outcomes.
Join Cloudflare as a Software Engineer specializing in Distributed Systems and Infrastructure. In this role, you will be responsible for designing, implementing, and optimizing scalable systems that enhance the performance and reliability of our services. You will collaborate closely with cross-functional teams to develop innovative solutions that support our mission to help build a better Internet.
Full-time | $250K/yr - $300K/yr | Hybrid | San Francisco
About Us:
At Ambience Healthcare, we are not just another scribe; we are pioneering an AI intelligence platform that brings humanity back into healthcare while delivering significant ROI for health systems nationwide.

Our technology empowers providers to concentrate on delivering exceptional care by alleviating the administrative burdens that distract them from their patients and essential duties. Ambience offers real-time, coding-aware documentation and clinical workflow support across various healthcare settings at leading health systems in North America.

Our teams operate with unwavering dedication and extreme ownership to develop optimal solutions for our healthcare partners. We value transparency, positivity, and deep contemplation, holding each other to high standards because we recognize that the challenges we tackle are of utmost importance.

We were recognized as the leader in enhancing clinician experience by KLAS Research in its Emerging Solutions Top 20 Report, honored by Fast Company as one of the Next Big Things in Tech, acknowledged by Inc. as one of the best AI companies in healthcare, and selected as a LinkedIn Top Startup in 2024 and 2025. We're proudly backed by Oak HC/FT, Andreessen Horowitz (a16z), the OpenAI Startup Fund, and Kleiner Perkins, and we're just beginning our journey.

The Role:
Ambience processes millions of patient encounters across the largest health systems in the country. These organizations rely on us for real-time clinical workflows where latency and reliability directly influence patient care. A delay during a patient visit is not merely a bad metric; it can lead a physician to abandon the tool.

In this position, you will own the core systems that enable Ambience to scale reliably: database architecture, caching, multi-tenancy, and performance optimization that shapes the user experience for clinicians. You will design database architectures that accommodate our growth, build caching systems that prevent EHR API latency from affecting critical processes, and develop multi-tenant infrastructure that protects customer data while maintaining performance.

Your ultimate goal will be to create infrastructure that other teams rely on effortlessly.

Our engineering roles are hybrid, requiring presence in our San Francisco office three days a week.
Join Cloudflare as a Distributed Systems Engineer focusing on our Data Platform, where you will play a pivotal role in developing analytics and alert systems that enhance our services. You will collaborate with a talented team to design scalable and efficient systems to manage and analyze vast amounts of data. Your work will directly impact the performance and reliability of our offerings, ensuring our customers have the best possible experience.
At Browserbase, we revolutionize web browsing for AI agents and applications. Our headless browser infrastructure automates interactions with websites, simplifies form filling, and replicates user actions seamlessly.

Having raised a $40M Series B last year, we are on an accelerated growth trajectory. Backed by esteemed investors such as Kleiner Perkins, CRV, and Notable Capital, our dynamic team is committed to realizing our CEO's vision of empowering the best AI tools and transforming web automation.

Our Core Infrastructure team keeps our operations efficient, tackling significant distributed-systems challenges to ensure the platform's speed, reliability, and scalability.
Role Overview
Join our innovative team as a Distributed Systems Engineer at Archil, where you will play a pivotal role in developing cutting-edge storage solutions. You will work across the entire technology stack, tackling challenges as they arise and significantly shaping our product's technical and strategic direction.

Your responsibilities will include:
- Being on call for our production systems to assist customers promptly when issues arise.
- Innovating and implementing unprecedented features in our storage services.
- Designing interactions within distributed systems to ensure atomicity and idempotency.
- Deploying and standardizing infrastructure across various cloud environments.
- Navigating evolving customer requirements amidst ambiguity.
Join Baseten as a Software Engineer focusing on GPU Networking and Distributed Systems. In this pivotal role, you'll collaborate with talented engineers and researchers to develop cutting-edge solutions that leverage GPU technology for high-performance networking operations. Your contributions will be instrumental in shaping the future of distributed systems, enhancing performance, scalability, and reliability.
Full-time | $166K/yr - $225K/yr | On-site | San Francisco, California
At Databricks, we are driven by a passion for empowering data teams to tackle the world’s most challenging problems, from transforming transportation to accelerating medical innovation. We do this by building and maintaining the leading data and AI infrastructure platform, enabling our clients to turn deep data insights into business improvements. Founded by engineers with a customer-first mentality, we eagerly embrace every opportunity to tackle complex technical challenges, from designing next-generation UI/UX for data interactions to scaling our services across millions of virtual machines. Our journey has just begun.

As a member of the Runtime team at Databricks, you will help develop the next generation of distributed data storage and processing systems. These systems aim to surpass specialized SQL query engines in relational query performance while offering the programming abstractions necessary to support a variety of workloads, from ETL to data science.

Example projects include:
- Apache Spark™: Contribute to the de facto open-source standard framework for big data.
- Data Plane Storage: Develop reliable, high-performance services and client libraries for managing vast amounts of data in cloud storage backends such as AWS S3 and Azure Blob Store.
- Delta Lake: Design a storage management system that merges the scalability and cost-effectiveness of data lakes with the performance and reliability of data warehouses, providing features such as ACID transactions and time travel.
- Delta Pipelines: Simplify the orchestration and operation of numerous data pipelines, enabling clients to deploy, test, and upgrade pipelines effortlessly.
- Performance Engineering: Create a next-generation query optimizer and execution engine that is fast, scalable, and robust.
Full-time | $120K/yr - $228K/yr | Hybrid | San Francisco
At Scribd, Inc., we are dedicated to enhancing human understanding through our suite of innovative products, including Scribd®, Slideshare®, Everand™, and Fable. Our mission is to transform access into deeper insight and expertise for billions of people globally.

Our Culture
We foster a culture where authenticity and boldness are encouraged; where constructive debate leads to commitment, and every team member is empowered to prioritize customer needs.

We believe that exceptional work emerges from harmonizing individual flexibility with a strong sense of community. Our Scribd Flex program allows employees to select their preferred work style and location, while emphasizing intentional in-person interactions to strengthen collaboration and culture. All employees are expected to participate in occasional in-person meetings, regardless of their location.

We look for team members who embody “GRIT”: the intersection of passion and perseverance toward long-term goals. GRIT also serves as a framework for how we operate: setting and achieving Goals, delivering impactful Results, contributing Innovative ideas, and building a strong Team through collaboration.

Join us at Scribd (pronounced “scribbed”) as we ignite human curiosity and create a world filled with stories and knowledge, democratizing the exchange of ideas and empowering collective expertise.

The Team
Our ML Data Engineering team powers metadata extraction, enrichment, and content understanding across our platforms.
Nov 17, 2025