Experience Level
Entry Level
Qualifications
We are looking for candidates with a strong background in software engineering, particularly in distributed systems. Ideal candidates should possess:
- Experience with programming languages such as Go, Java, or Python
- Understanding of distributed systems concepts and architecture
- Proficiency in cloud technologies and infrastructures
- Strong problem-solving skills and the ability to work in a fast-paced environment
- Bachelor's degree in Computer Science or a related field
About the job
Join Cloudflare as a Software Engineer specializing in Distributed Systems and Infrastructure. In this role, you will be responsible for designing, implementing, and optimizing scalable systems that enhance the performance and reliability of our services. You will collaborate closely with cross-functional teams to develop innovative solutions that support our mission to help build a better Internet.
About Cloudflare, Inc.
Cloudflare is a leader in web performance and security, providing a suite of tools and services to protect and accelerate web properties. Our mission is to help build a better Internet, and we are committed to innovation and excellence in everything we do.
Similar jobs
Join Krea's Innovative Team

At Krea, we are at the forefront of developing next-generation AI creative tools. Our commitment lies in making AI an intuitive and controllable medium for creatives. We aspire to create tools that enhance human creativity rather than replace it. We view AI as a transformative medium that enables expression across diverse formats: text, images, video, sound, and even 3D. Our focus is on creating smarter, more adaptable tools that leverage this medium effectively.

The Role of Supercomputing and AI Infrastructure at Krea

Our team is responsible for building and managing the foundational infrastructure that supports Krea's research and inference processes. This includes distributed training systems, 1,000+ GPU Kubernetes clusters, and petabyte-scale data pipelines. Much of our work involves creating bespoke solutions, such as custom distributed datastores, job orchestration systems, and advanced streaming pipelines, designed to handle modern AI workloads efficiently.

Key Projects You Will Contribute To:
- Distributed Data Systems: Design and implement multi-stage pipelines to transform petabytes of raw data into clean, annotated datasets; run classification models across billions of images; deploy and integrate large language models to caption extensive multimedia data.
- GPU Infrastructure: Manage distributed training and inference across 1,000+ GPU Kubernetes clusters; address orchestration and scaling challenges for large-scale GPU job processing; optimize research workflows across multiple datacenters.
- Distributed Training: Profile and enhance dataloaders streaming thousands of images per second; troubleshoot InfiniBand networking during extensive training runs; develop fault-tolerance systems for large-scale pretraining; collaborate with researchers to refine reinforcement learning infrastructure.
- Applied ML Pipelines: Identify clean scenes in millions of videos using distributed shot-boundary detection; tailor and train models to sift through billions of images for specific queries; construct systems that link raw cluster capacity with research outcomes.
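The distributed shot-boundary detection mentioned above reduces, at its core, to flagging frames where adjacent-frame statistics change sharply. A minimal sketch of that idea on toy data (the function, thresholds, and inputs are invented for illustration; this is not Krea's pipeline):

```python
def shot_boundaries(frames, threshold=0.5):
    """Return indices where consecutive frame 'histograms' differ sharply.

    frames: list of equal-length tuples of floats (toy stand-ins for
    per-frame color histograms). A boundary is flagged at index i when
    the L1 distance between frame i-1 and frame i exceeds `threshold`.
    """
    boundaries = []
    for i in range(1, len(frames)):
        dist = sum(abs(a - b) for a, b in zip(frames[i - 1], frames[i]))
        if dist > threshold:
            boundaries.append(i)
    return boundaries

# Two steady "shots" with a hard cut at frame 3.
frames = [(1.0, 0.0), (0.9, 0.1), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
print(shot_boundaries(frames))  # [3]
```

In a distributed setting the same comparison would be sharded across workers by video segment, with only segment edges reconciled afterward.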
At Inngest, our Systems Engineers are the architects behind the backbone of our platform, crafting a robust execution layer, an efficient queueing system, and scalable state stores that interlink seamlessly. This role presents an exciting opportunity to tackle complex technical challenges while deriving immense satisfaction from your contributions.

About Us: Inngest is pioneering innovative solutions to long-standing challenges faced by developers. Our mission is to create first-of-its-kind tools that enhance the daily workflow of developers, prioritizing user experience and performance. A strong product-centric mindset and a passion for developer tools are essential for success in this role.

The Role: A successful Systems Engineer at Inngest must possess a blend of generalist and specialist skills. You will collaborate with our team to enhance the functionality of our queueing system (including debounce and concurrency mechanisms), manage vast amounts of data in our state store, and refine the API layers that facilitate user interactions. Your work will have a direct impact on millions of developers, and you will engage closely with designers, engineers, and founders to optimize user experience.

Note: This position requires overlapping working hours with US PST. While residing in the San Francisco Bay Area is preferred, exceptional candidates from anywhere in the United States will be considered. Our engineering team operates in person several days a week in San Francisco.
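The debounce mechanism this role mentions collapses a burst of triggers for the same key into a single queued job carrying the latest payload. A hedged sketch of the pattern (class and method names are invented for illustration; this is not Inngest's API):

```python
import time

class Debouncer:
    """Toy debounce for a job queue: repeated triggers for the same key
    within `window` seconds collapse into one pending job that carries
    the most recent payload."""

    def __init__(self, window=0.05):
        self.window = window
        self.pending = {}  # key -> (deadline, payload)

    def trigger(self, key, payload, now=None):
        now = time.monotonic() if now is None else now
        # Each new trigger pushes the deadline out and replaces the payload.
        self.pending[key] = (now + self.window, payload)

    def due(self, now=None):
        """Pop and return payloads whose quiet period has elapsed."""
        now = time.monotonic() if now is None else now
        ready = [(k, p) for k, (d, p) in self.pending.items() if d <= now]
        for k, _ in ready:
            del self.pending[k]
        return ready

d = Debouncer(window=1.0)
d.trigger("user:1", {"rev": 1}, now=0.0)
d.trigger("user:1", {"rev": 2}, now=0.5)  # collapses into one job
print(d.due(now=0.9))                     # [] — still inside the window
print(d.due(now=1.6))                     # [('user:1', {'rev': 2})]
```

A production queue would persist the pending map and combine this with per-key concurrency limits so at most N jobs for a key run at once.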
About Us

Sieve is a pioneering AI research lab dedicated solely to video data. We harness exabyte-scale video infrastructure and innovative video understanding techniques, along with a multitude of data sources, to create datasets that advance the field of video modeling. Given that video constitutes 80% of internet traffic, it serves as a vital medium that fuels creativity, communication, gaming, AR/VR, and robotics. Our mission is to tackle the most significant challenge in the development of these applications: acquiring high-quality training data.

With a small yet highly skilled team of just 15 members, we have formed strategic partnerships with leading AI labs and achieved $XXM in revenue last quarter alone. Our Series A funding round last year was backed by prestigious firms, including Matrix Partners, Swift Ventures, Y Combinator, and AI Grant.

About the Role

As a Distributed Systems Engineer at Sieve, you will be responsible for designing and implementing systems that efficiently manage the compute, scheduling, and orchestration of complex machine learning and ETL pipelines. Your work will ensure these systems operate quickly, reliably, and cost-effectively while processing large volumes of video data.

You will thrive in this role if you are passionate about optimizing system uptime, have experience with cloud technologies, and enjoy working with high-performance distributed systems involving thousands of GPUs. Additionally, you will play a key role in developing excellent internal tools and CI/CD pipelines to facilitate rapid iteration.
Why Join Achira?
- Become part of an exceptional team of scientists, ML researchers, and engineers dedicated to transforming the landscape of drug discovery.
- Engage with cutting-edge machine learning infrastructure at an unprecedented scale, leveraging extensive computing resources, vast datasets, and ambitious goals.
- Take ownership of significant projects from conception through to architecture and deployment on large-scale infrastructure.
- Thrive in a culture that values thoroughness, speed, and a proactive, builder-oriented mindset.

About the Role

At Achira, we are developing state-of-the-art foundation models that address the most complex challenges in simulation for drug discovery and beyond. Our atomistic foundation simulation models (FSMs) serve as comprehensive representations of the physical microcosm, encompassing machine learning interaction potentials (MLIPs), neural network potentials (NNPs), and various generative model classes.

We are looking for a Software Engineer who is enthusiastic about distributed computing and its applications in machine learning. You will play a pivotal role in designing and constructing the infrastructure for our ML data generation pipelines, model training, and fine-tuning workflows across large-scale distributed systems.

Your expertise will be crucial in ensuring our compute clusters are efficient, observable, cost-effective, and dependable, enabling us to advance the frontiers of ML development. If you are passionate about distributed systems, performance optimization, and cloud cost efficiency, we encourage you to apply.

You will be empowered to conceptualize and manage complex workloads across multiple vendors worldwide. Achira's mission revolves around computation, and providing seamless access to our uniquely tailored workloads at the lowest possible cost is critical to our success.
Role Overview

Join Archil as a Senior Distributed Systems Engineer, where you will play a critical role in developing our innovative storage solutions. You'll engage with technologies across the entire stack to tackle challenges and contribute to building Archil volumes, significantly influencing both technical design and product strategy.

Key Responsibilities
- Provide on-call support for our production systems, ensuring customer satisfaction in case of issues.
- Innovate and implement unprecedented capabilities within our storage services.
- Design interactions in distributed systems focusing on atomicity and idempotency.
- Deploy and generalize infrastructure across multiple cloud environments.
- Adapt to evolving customer needs amidst ambiguity.
- Lead engineering teams through complex decisions and provide insightful PR feedback.
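Idempotency, one of the design goals listed above, is commonly achieved by attaching a client-chosen request ID to each write so retries apply an effect at most once. A small illustrative sketch (invented names; not Archil's design):

```python
class IdempotentStore:
    """Sketch of idempotent request handling: each write carries a
    client-chosen request ID, and replays of the same ID return the
    recorded result instead of applying the effect twice."""

    def __init__(self):
        self.balance = 0
        self.seen = {}  # request_id -> result of the first application

    def deposit(self, request_id, amount):
        if request_id in self.seen:   # retry: replay the recorded result
            return self.seen[request_id]
        self.balance += amount        # first delivery: apply exactly once
        result = self.balance
        self.seen[request_id] = result
        return result

s = IdempotentStore()
print(s.deposit("req-1", 50))  # 50
print(s.deposit("req-1", 50))  # 50 — the retry is a no-op
print(s.balance)               # 50, not 100
```

In a real storage service the "record result, then apply" step must itself be atomic (e.g. one transaction), otherwise a crash between the two leaves the retry ambiguous.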
At Physical Intelligence, we are pioneering general-purpose AI applications for the physical world. Our approach involves orchestrating thousands of accelerators across a diverse ecosystem of GPU and TPU clusters, encompassing various hardware generations, cloud platforms, and cluster configurations.

Researchers frequently encounter challenges in identifying the optimal cluster for their tasks, understanding resource availability, and configuring their workloads efficiently. This process is not scalable. To enhance productivity, we require an intelligent scheduling and compute system that can automatically determine the best job placements based on availability, hardware compatibility, cost considerations, and priority levels, allowing researchers to concentrate on their scientific endeavors.

This position encompasses complete ownership of this challenge: the development of scheduling systems, placement logic, cluster management frameworks, and the operational tools essential for seamless operations. The role is distinct from traditional cloud DevOps; it focuses on resource allocation intelligence, utilization efficiency, fault tolerance, and a smooth experience for large-scale distributed training.

About the Team

The ML Infrastructure team is dedicated to bolstering and accelerating Physical Intelligence's fundamental modeling initiatives by creating systems that ensure large-scale training is reliable, reproducible, and efficient. You will collaborate closely with the ML Infrastructure, data platform, and research teams to eliminate compute scheduling as a bottleneck.

Key Responsibilities
- Lead Intelligent Job Scheduling and Placement: Design and implement multi-tenant scheduling systems that automatically allocate training jobs to the most suitable cluster based on hardware specifications, topology, availability, cost, and priority. Facilitate equitable resource sharing across teams and projects through quota management, priority tiers, and preemption policies. Abstract away cluster discrepancies so researchers can submit jobs without needing detailed knowledge of cluster specifics.
- Enhance Multi-cluster Orchestration: Develop the control plane responsible for overseeing the job lifecycle across various clusters (including mixed GPU/TPU setups, multi-generational hardware, and both on-premises and cloud deployments), and enable effortless job migration, failover, and rescheduling.
- Optimize Accelerator Utilization and Performance: Continuously monitor and enhance GPU/TPU usage across the entire fleet. Apply priority, preemption, queuing, and fairness strategies that balance research momentum with cost efficiency.
- Guarantee Scalability and Stability: Implement fault detection, automatic recovery mechanisms, and resilience strategies for long-running multi-node training tasks. Oversee health checks, node management, and scaling strategies to ensure optimal performance.
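The placement problem described above (matching a job to a cluster on hardware, capacity, and cost) can be sketched as a filter-then-score step. All field names here are invented for illustration; real schedulers weigh far more signals, including topology, quotas, and preemption:

```python
def place_job(job, clusters):
    """Pick a cluster for a training job: filter on accelerator type and
    free capacity, then prefer the lowest cost per GPU-hour."""
    candidates = [
        c for c in clusters
        if job["accelerator"] in c["accelerators"]
        and c["free_gpus"] >= job["gpus"]
    ]
    if not candidates:
        return None  # queue the job until capacity frees up
    return min(candidates, key=lambda c: c["cost_per_gpu_hour"])["name"]

clusters = [
    {"name": "us-a100", "accelerators": {"A100"}, "free_gpus": 64, "cost_per_gpu_hour": 2.1},
    {"name": "eu-a100", "accelerators": {"A100"}, "free_gpus": 512, "cost_per_gpu_hour": 1.8},
    {"name": "us-tpu", "accelerators": {"TPUv4"}, "free_gpus": 256, "cost_per_gpu_hour": 1.2},
]
job = {"accelerator": "A100", "gpus": 128}
print(place_job(job, clusters))  # 'eu-a100'
```

The filter step encodes hard constraints (hardware compatibility, availability) and the scoring step encodes soft preferences (cost); priority tiers and fairness would enter as additional terms in the score.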
Full-time|$350K/yr - $475K/yr|On-site|San Francisco
At Thinking Machines Lab, our vision is to enhance human potential by advancing collaborative general intelligence. We are dedicated to creating an inclusive future where everyone can harness AI's capabilities tailored to their unique aspirations. Our team comprises scientists, engineers, and innovators behind some of the most impactful AI solutions, including ChatGPT and Character.ai, as well as open-source projects like PyTorch and Segment Anything.

About the Role

We are seeking a talented Software Engineer to architect, develop, and maintain the GPU supercomputing infrastructure essential for large-scale AI training and inference. Your contributions will ensure high-performance, reliable, and cost-effective computing resources, enabling our users and researchers to achieve rapid advancements at scale.

This is an "evergreen role," open for ongoing interest. We receive numerous applications, and while an immediate fit may not always be available, we encourage you to apply. We actively review applications and reach out when new opportunities arise. Reapplications are welcome after six months, and we also post specific roles for unique projects or teams.

What You'll Do
- Automate and manage large GPU clusters, handling provisioning, imaging, and capacity strategy.
- Develop software that simplifies cluster management, providing a cohesive interface for training and inference tasks.
- Enhance scheduling and orchestration frameworks (Kubernetes, Slurm, or similar) for optimized resource allocation, preemption, and multi-tenancy management.
- Monitor and improve operational efficiency, focusing on speed, reliability, and error recovery mechanisms.
- Design robust storage solutions for datasets, checkpoints, and logs, ensuring clear data retention and lineage.
- Collaborate with researchers to facilitate large-scale experiments, offering guidance on parallelism and performance considerations.
Full-time|$250K/yr - $300K/yr|Hybrid|San Francisco
About Us:

At Ambience Healthcare, we are not just another scribe; we are pioneering an AI intelligence platform that reintegrates humanity into healthcare, delivering significant ROI for health systems nationwide. Our technology empowers providers to concentrate on delivering exceptional care by alleviating the administrative burdens that distract them from their patients and essential duties. Ambience offers real-time, coding-aware documentation and clinical workflow support across various healthcare settings at the leading health systems in North America.

Our teams operate with unwavering dedication and extreme ownership to develop optimal solutions for our healthcare partners. We cherish transparency, positivity, and deep contemplation, holding each other to high standards because we recognize that the challenges we tackle are of utmost importance.

We were recognized as the leader in enhancing clinician experience by KLAS Research in their Emerging Solutions Top 20 Report, honored by Fast Company as one of the Next Big Things in Tech, acknowledged by Inc. as one of the best AI companies in healthcare, and selected as a LinkedIn Top Startup in 2024 and 2025. We're proudly supported by Oak HC/FT, Andreessen Horowitz (a16z), OpenAI Startup Fund, and Kleiner Perkins, and we're just beginning our journey.

The Role:

Ambience processes millions of patient encounters across the largest health systems in the country. These organizations rely on us for real-time clinical workflows where latency and reliability significantly influence patient care. A delay during a patient visit is not merely a negative metric; it can lead to a physician abandoning the tool.

In this position, you will oversee the core systems that enable Ambience to scale reliably: database architecture, caching, multi-tenancy, and performance optimization that shapes the user experience for clinicians. You will design database architectures that accommodate our growth, build caching systems that prevent EHR API latency from affecting critical processes, and develop multi-tenant infrastructure that protects customer data while enhancing performance. Your ultimate goal will be to create infrastructure that other teams rely on effortlessly.

Our engineering roles are hybrid, requiring presence in our San Francisco office three times a week.
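Read-through caching with a time-to-live is one standard way to keep a slow upstream API off the hot path, as described above. A minimal, hypothetical sketch (not Ambience's infrastructure; the fetch function and TTL are placeholders):

```python
import time

class TTLCache:
    """Minimal read-through cache: fresh entries are served locally,
    expired or missing entries trigger one call to the slow upstream."""

    def __init__(self, fetch, ttl=30.0):
        self.fetch = fetch      # slow upstream call (e.g. an external API)
        self.ttl = ttl
        self.entries = {}       # key -> (expires_at, value)

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        hit = self.entries.get(key)
        if hit and hit[0] > now:    # fresh: skip the upstream entirely
            return hit[1]
        value = self.fetch(key)     # miss or expired: refresh from upstream
        self.entries[key] = (now + self.ttl, value)
        return value

calls = []
cache = TTLCache(lambda k: calls.append(k) or f"record:{k}", ttl=30.0)
print(cache.get("patient-1", now=0.0))   # fetched from upstream
print(cache.get("patient-1", now=10.0))  # served from cache
print(len(calls))                        # 1 — upstream was called once
```

Real deployments add invalidation on writes and stampede protection so that many concurrent misses for one key still produce a single upstream call.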
At Browserbase, we revolutionize web browsing for AI agents and applications. Our innovative headless browser infrastructure automates interactions with websites, simplifies form filling, and replicates user actions seamlessly.

Having successfully raised a $40M Series B last year, we are on an accelerated growth trajectory. Supported by esteemed investors such as Kleiner Perkins, CRV, and Notable Capital, our dynamic team is committed to realizing our CEO's vision for empowering the best AI tools and transforming web automation.

Our Core Infrastructure team is essential for maintaining the efficiency of our operations. This group tackles significant distributed systems challenges, ensuring our platform's speed, reliability, and scalability.
Join Cloudflare as a Distributed Systems Engineer within our dynamic Data Platform team, focusing on Analytics and Alerts. In this position, you will play a pivotal role in building and optimizing distributed systems that power our data analytics capabilities, providing real-time insights and alerts to enhance our customer experience.
Join Cloudflare as a Distributed Systems Engineer focusing on our Data Platform, where you will play a pivotal role in developing analytics and alert systems that enhance our services. You will collaborate with a talented team to design scalable and efficient systems to manage and analyze vast amounts of data. Your work will directly impact the performance and reliability of our offerings, ensuring our customers have the best possible experience.
Role Overview

Join our innovative team as a Distributed Systems Engineer at Archil, where you will play a pivotal role in developing cutting-edge storage solutions. You will work across the entire technology stack, tackling challenges as they arise and significantly shaping our product's technical and strategic direction.

Your responsibilities will include:
- Being on-call for our production systems to assist customers promptly in case of issues.
- Innovating and implementing unprecedented features in our storage services.
- Designing interactions within distributed systems to ensure atomicity and idempotency.
- Deploying and standardizing infrastructure across various cloud environments.
- Navigating evolving customer requirements amidst ambiguity.
Join Baseten as a Software Engineer focusing on GPU Networking and Distributed Systems. In this pivotal role, you'll collaborate with talented engineers and researchers to develop cutting-edge solutions that leverage GPU technology for high-performance networking operations. Your contributions will be instrumental in shaping the future of distributed systems, enhancing performance, scalability, and reliability.
About Our Team

Join the innovative Sora team at OpenAI, where we are at the forefront of developing multimodal capabilities for our foundation models. As a dynamic hybrid of research and product development, we focus on seamlessly integrating advanced multimodal functionalities into our AI offerings, ensuring they are not only reliable and user-friendly but also aligned with our mission to foster broad societal benefits.

About the Position

We are seeking a dedicated Software Engineer specializing in Distributed Data Systems to architect and enhance the infrastructure that supports large-scale multimodal training and evaluation at OpenAI. In this role, you will oversee distributed data pipelines and collaborate closely with our researchers to translate their requirements into robust, high-performance systems. You will play a crucial role in fortifying the pipelines that underpin Sora's rapid innovation cycles.

We are looking for engineers with a keen eye for detail, substantial experience with distributed systems, and a proven track record of building reliable infrastructure in high-stakes environments.

This position is based in San Francisco, CA, and follows a hybrid work model requiring three days in the office each week. We also provide relocation assistance to new team members.

Key Responsibilities:
- Design, build, and maintain data infrastructure systems, including distributed computing, data orchestration, distributed storage, streaming infrastructure, and machine learning infrastructure, ensuring they are scalable, reliable, and secure.
- Ensure our data platform can scale dramatically while maintaining high levels of reliability and efficiency.
- Collaborate with researchers to deeply understand their needs and translate them into production-ready systems.
- Harden, optimize, and maintain vital data infrastructure systems that drive multimodal training and evaluation.

Ideal Candidates Will Have:
- Extensive experience with distributed systems and large-scale infrastructure, coupled with a strong passion for data.
- A detail-oriented mindset and a commitment to building and maintaining dependable systems.
- Solid software engineering fundamentals and exceptional organizational skills.
- Comfort with ambiguity and rapid changes in a fast-paced environment.

About OpenAI

OpenAI is a pioneering AI research and deployment organization dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We strive to advance digital intelligence in a way that is safe and beneficial, pushing the boundaries of innovation and technology.
About Our Team

Join the innovative Sora team at OpenAI, where we are at the forefront of developing multimodal capabilities for our foundation models. Our hybrid research and product team is dedicated to seamlessly integrating multimodal functionalities into our AI solutions, ensuring they are dependable, user-centric, and aligned with our vision of benefiting society at large.

Role Overview

As a Machine Learning Engineer specializing in Distributed Data Systems, you will be instrumental in designing and scaling the infrastructure that facilitates large-scale multimodal training and evaluation at OpenAI. Your role will involve managing complex distributed data pipelines, collaborating closely with researchers to convert their requirements into robust, production-ready systems, and enhancing pipelines that are essential for Sora's rapid iteration cycles.

We are seeking detail-oriented engineers with extensive experience in distributed systems who thrive in high-stakes environments and excel in building resilient infrastructure.

This position is located in San Francisco, CA, and follows a hybrid work model, requiring three days in the office each week. We also provide relocation assistance for new team members.

Key Responsibilities:
- Design, implement, and maintain data infrastructure systems, including distributed computing, data orchestration, distributed storage, streaming infrastructure, and machine learning systems, with a focus on scalability, reliability, and security.
- Ensure our data platform can scale exponentially while maintaining high reliability and efficiency.
- Collaborate with researchers to gain a deep understanding of their requirements, translating them into production-ready systems.
- Strengthen, optimize, and manage critical data infrastructure systems that support multimodal training and evaluation.

You Will Excel in This Role If You:
- Possess strong experience with distributed systems and large-scale infrastructure, coupled with a keen interest in data.
- Exhibit meticulous attention to detail and a commitment to building and maintaining reliable systems.
- Demonstrate solid software engineering fundamentals and effective organizational skills.
- Thrive in environments characterized by ambiguity and rapid change.

About OpenAI

OpenAI is a trailblazing AI research and deployment organization committed to ensuring that general-purpose artificial intelligence serves humanity. We continuously push the boundaries of AI capabilities and strive to create technology that benefits everyone.
Full-time|$166K/yr - $225K/yr|On-site|San Francisco, California
At Databricks, we are driven by a passion for empowering data teams to tackle the world's most challenging problems, from transforming transportation to accelerating medical innovations. We achieve this by creating and maintaining the leading data and AI infrastructure platform, enabling our clients to leverage profound data insights for business enhancement. Founded by engineers with a customer-first mentality, we eagerly embrace every opportunity to tackle complex technical challenges, ranging from the design of next-generation UI/UX for data interactions to scaling our services across millions of virtual machines. Our journey has just begun.

As a member of the Runtime team at Databricks, you will be instrumental in developing the next generation of distributed data storage and processing systems. These systems will surpass specialized SQL query engines in relational query performance while offering the programming abstractions necessary to support a variety of workloads, from ETL to data science.

Example projects include:
- Apache Spark™: Contribute to the de facto open-source standard framework for big data.
- Data Plane Storage: Develop reliable and high-performance services and client libraries for managing vast amounts of data within cloud storage backends like AWS S3 and Azure Blob Store.
- Delta Lake: Design a storage management system that merges the scalability and cost-effectiveness of data lakes with the performance and reliability of data warehouses, providing features like ACID transactions and time travel.
- Delta Pipelines: Simplify the orchestration and operation of numerous data pipelines, enabling clients to deploy, test, and upgrade pipelines effortlessly.
- Performance Engineering: Create the next-generation query optimizer and execution engine that is fast, scalable, and robust.
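The "time travel" feature mentioned for Delta Lake lets reads target any previously committed table version. The concept can be illustrated with a toy versioned store; this is a conceptual sketch only, since Delta Lake actually implements versioning with a transaction log over cloud object storage:

```python
class VersionedTable:
    """Toy illustration of time travel: every commit appends an immutable
    snapshot, and reads can target any past version."""

    def __init__(self):
        self.snapshots = [{}]  # version 0 is the empty table

    def commit(self, updates):
        nxt = dict(self.snapshots[-1])  # copy-on-write from latest snapshot
        nxt.update(updates)
        self.snapshots.append(nxt)
        return len(self.snapshots) - 1  # new version number

    def read(self, version=None):
        """Latest snapshot by default, or any committed version."""
        return self.snapshots[-1 if version is None else version]

t = VersionedTable()
t.commit({"row1": "a"})            # version 1
t.commit({"row1": "b"})            # version 2 overwrites row1
print(t.read()["row1"])            # 'b' — latest
print(t.read(version=1)["row1"])   # 'a' — time travel to version 1
```

Because old snapshots are never mutated, reproducing an earlier experiment or auditing a past state is just a read at a pinned version.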
Full-time|$120K/yr - $228K/yr|Hybrid|San Francisco
At Scribd, Inc., we are dedicated to enhancing human understanding through our suite of innovative products, including Scribd®, Slideshare®, Everand™, and Fable. Our mission revolves around transforming access into deeper insights and expertise for billions globally.

Our Culture

We foster a culture where authenticity and boldness are encouraged; where constructive debates lead to commitment, and where every team member is empowered to prioritize customer needs. We believe that exceptional work emerges from harmonizing individual flexibility with a strong sense of community. Our Scribd Flex program allows employees to select their preferred work style and location, while also emphasizing the importance of intentional in-person interactions to enhance collaboration and culture. All employees are expected to participate in occasional in-person meetings, regardless of their location.

We look for team members who embody "GRIT": the intersection of passion and perseverance towards long-term goals. GRIT serves as a framework for our operations: setting and achieving Goals, delivering impactful Results, contributing Innovative ideas, and building a strong Team through collaboration.

Join us at Scribd (pronounced "scribbed") as we ignite human curiosity and create a world filled with stories and knowledge, democratizing the exchange of ideas and empowering collective expertise.

The Team

Our ML Data Engineering team is responsible for powering metadata extraction, enrichment, and content understanding across our platforms.
Join Cloudflare as a Distributed Systems Engineer and help us build and maintain our innovative Data Platform. In this role, you'll work on our Analytical Database Platform, focusing on enhancing data processing and storage technologies to support our global client base. If you are passionate about distributed systems and enjoy solving complex problems, this is the perfect opportunity for you!
About Our Team

At OpenAI, our Hardware team specializes in developing cutting-edge silicon and system-level solutions tailored to meet the rigorous demands of advanced AI applications. We are at the forefront of creating the next generation of AI-native silicon and collaborate closely with our software and research partners to ensure our hardware is seamlessly integrated with AI models. Our mission extends beyond just delivering production-grade silicon for OpenAI's supercomputing infrastructure; we are also dedicated to innovating custom design tools and methodologies that enhance hardware optimized for AI.

About the Position

We are seeking a highly skilled Mechanical Engineer with a minimum of 7 years of experience in the design of IT hardware, encompassing everything from chip/package to system levels. In this role, you will collaborate with a team of experts across thermal, mechanical, electrical, software, and systems engineering to support the design, analysis, and validation of mechanical and thermal systems that guarantee the reliability, efficiency, and longevity of critical hardware. A strong analytical mindset, hands-on testing experience, and the ability to thrive in a fast-paced, multidisciplinary environment are essential for success in this role.

This position is based in San Francisco, CA, and follows a hybrid work model, requiring 3 days in the office per week. We also offer relocation assistance for new hires.

Key Responsibilities
- Lead mechanical designs for AI supercomputer products within data center applications.
- Collaborate with cross-functional teams to design and enhance thermal solutions for data center hardware, including chips, power modules, and system-level cooling architectures.
- Integrate thermal management strategies into hardware designs from concept through to mass production.
- Design and validate mechanical systems such as chassis, enclosures, and cooling systems while ensuring compliance with performance and reliability standards.
- Conduct 3D modeling, finite element analysis (FEA), tolerance analysis, and prototyping to ensure manufacturability and adherence to stringent quality requirements.
- Perform mechanical testing, including vibration, shock, and thermal cycling, to ensure long-term reliability under extreme operating conditions.
- Identify and assess new technologies and methodologies to enhance mechanical and thermal performance in product designs, contributing expertise in mechanical design to new product development initiatives.
Nov 20, 2025