Engineer, Supercomputing & Distributed Systems

KreaSan Francisco

On-site Full-time

Clicking Apply Now takes you to AutoApply where you can tailor your resume and apply.

Unlock Your Potential

Generate Job-Optimized Resume

One Click And Our AI Optimizes Your Resume to Match The Job Description.

Is Your Resume Optimized For This Role?

Find Out If You're Highlighting The Right Skills And Fix What's Missing

Experience Level

Entry Level

Qualifications

Qualifications:Bachelor's degree in Computer Science, Engineering, or a related field. Familiarity with AI systems and frameworks. Experience with distributed systems and data pipelines. Proficiency in programming languages such as Python, Go, or similar. Strong analytical and problem-solving skills. Ability to work collaboratively in a fast-paced environment.

About the job

Join Krea's Innovative Team

At Krea, we are at the forefront of developing next-generation AI creative tools. Our commitment lies in making AI an intuitive and controllable medium for creatives. We aspire to create tools that enhance human creativity rather than replace it.

We view AI as a transformative medium that enables expressions across diverse formats—text, images, video, sound, and even 3D. Our focus is on creating smarter, more adaptable tools that leverage this medium effectively.

The Role of Supercomputing and AI Infrastructure at Krea

Our team is responsible for building and managing the foundational infrastructure that supports Krea's research and inference processes. This includes distributed training systems, over 1000 Kubernetes GPU clusters, and extensive petabyte-scale data pipelines. Much of our work involves creating bespoke solutions, such as custom distributed datastores, job orchestration systems, and advanced streaming pipelines, which are designed to handle modern AI workloads efficiently.

Key Projects You Will Contribute To:

Distributed Data Systems: Design and implement multi-stage pipelines to transform petabytes of raw data into clean, annotated datasets; run classification models across billions of images; deploy and integrate large language models to caption extensive multimedia data.
GPU Infrastructure: Manage distributed training and inference across 1000+ GPU Kubernetes clusters; address orchestration and scaling challenges for large-scale GPU job processing; optimize research workflows across multiple datacenters.
Distributed Training: Profile and enhance dataloaders streaming thousands of images per second; troubleshoot InfiniBand networking during extensive training runs; develop fault tolerance systems for large-scale pretraining; collaborate with researchers to refine reinforcement learning infrastructure.
Applied ML Pipelines: Identify clean scenes in millions of videos utilizing distributed shot-boundary detection; tailor and train models to sift through billions of images for specific queries; construct systems that link raw cluster capacity with research outcomes.

About Krea

Krea is pioneering the development of next-generation AI creative tools aimed at empowering creatives. Our mission is to make AI a complementary force in the creative process, allowing individuals to express themselves through various mediums.

Similar jobs

1 - 20 of 5,730 Jobs

Search for Software Engineer Specializing In Distributed Systems

5,730 results

Select all on this page (20)

Apply

Software Engineer specializing in Distributed Systems

Browserbase

Full-time|On-site|San Francisco

At Browserbase, we revolutionize web browsing for AI agents and applications. Our innovative headless browser infrastructure automates interactions with websites, simplifies form filling, and replicates user actions seamlessly.Having successfully raised a $40M Series B last year, we are on an accelerated growth trajectory. Supported by esteemed investors such as Kleiner Perkins, CRV, and Notable Capital, our dynamic team is committed to realizing our CEO's vision for empowering the best AI tools and transforming web automation.Our Core Infrastructure team is essential for maintaining the efficiency of our operations. This group tackles significant distributed systems challenges, ensuring our platform's speed, reliability, and scalability.

Mar 25, 2025

Apply

Senior Software Engineer Specializing in Python and Distributed Systems

Scribd, Inc.

Full-time|$120K/yr - $228K/yr|Hybrid|San Francisco

At Scribd, Inc., we are dedicated to enhancing human understanding through our suite of innovative products, including Scribd®, Slideshare®, Everand™, and Fable. Our mission revolves around transforming access into deeper insights and expertise for billions globally.Our CultureWe foster a culture where authenticity and boldness are encouraged; where constructive debates lead to commitment, and where every team member is empowered to prioritize customer needs.We believe that exceptional work emerges from harmonizing individual flexibility with a strong sense of community. Our Scribd Flex program allows employees to select their preferred work style and location, while also emphasizing the importance of intentional in-person interactions to enhance collaboration and culture. All employees are expected to participate in occasional in-person meetings, regardless of their location.We look for team members who embody “GRIT”—the intersection of passion and perseverance towards long-term goals. GRIT serves as a framework for our operations: setting and achieving Goals, delivering impactful Results, contributing Innovative ideas, and building a strong Team through collaboration.Join us at Scribd (pronounced “scribbed”) as we ignite human curiosity and create a world filled with stories and knowledge, democratizing the exchange of ideas and empowering collective expertise.The TeamOur ML Data Engineering team is responsible for powering metadata extraction, enrichment, and content understanding across our platforms.

Nov 17, 2025

Apply

Software Engineer - Distributed Systems

Achira

Full-time|On-site|San Francisco Office

Why Join Achira?Become part of an exceptional team comprised of scientists, ML researchers, and engineers dedicated to transforming the landscape of drug discovery.Engage with cutting-edge machine learning infrastructure at an unprecedented scale, leveraging extensive computing resources, vast datasets, and ambitious goals.Take ownership of significant projects from conception through to architecture and deployment on large-scale infrastructures.Thrive in a culture that values thoroughness, speed, and a proactive, builder-oriented mindset.About the RoleAt Achira, we are developing state-of-the-art foundation models that address the most complex challenges in simulation for drug discovery and beyond. Our atomistic foundation simulation models (FSMs) serve as comprehensive representations of the physical microcosm, encompassing machine learning interaction potentials (MLIPs), neural network potentials (NNPs), and various generative model classes.We are looking for a Software Engineer who is enthusiastic about distributed computing and its applications in machine learning. You will play a pivotal role in designing and constructing the infrastructure for our ML data generation pipelines, model training, and fine-tuning workflows across large-scale distributed systems.Your expertise will be crucial in ensuring our compute clusters are efficient, observable, cost-effective, and dependable, enabling us to advance the frontiers of ML development. If you are passionate about distributed systems, performance optimization, and cloud cost efficiency, we encourage you to apply.You will be empowered to conceptualize and manage complex workloads across multiple vendors worldwide. Achira's mission revolves around computation, and providing seamless access to our uniquely tailored workloads at the lowest possible cost is critical to our success.

Oct 7, 2025

Apply

Software Engineer - Distributed Systems (Infrastructure)

Cloudflare, Inc.

Full-time|Hybrid|Hybrid

Join Cloudflare as a Software Engineer specializing in Distributed Systems and Infrastructure. In this role, you will be responsible for designing, implementing, and optimizing scalable systems that enhance the performance and reliability of our services. You will collaborate closely with cross-functional teams to develop innovative solutions that support our mission to help build a better Internet.

Mar 4, 2026

Apply

Staff Software Engineer, Distributed Systems

Ambience Healthcare

Full-time|$250K/yr - $300K/yr|Hybrid|San Francisco

About Us:At Ambience Healthcare, we are not just another scribe; we are pioneering an AI intelligence platform that reintegrates humanity into healthcare, delivering significant ROI for health systems nationwide.Our innovative technology empowers providers to concentrate on delivering exceptional care by alleviating the administrative burdens that distract them from their patients and essential duties. Ambience offers real-time, coding-aware documentation and clinical workflow support across various healthcare settings at the leading health systems in North America.Our teams operate with unwavering dedication and extreme ownership to develop optimal solutions for our healthcare partners. We cherish transparency, positivity, and deep contemplation, holding each other to high standards because we recognize that the challenges we tackle are of utmost importance.Recognized as the leader in enhancing clinician experience by KLAS Research in their Emerging Solutions Top 20 Report, honored by Fast Company as one of the Next Big Things in Tech, acknowledged by Inc. as one of the best AI companies in healthcare, and selected as a LinkedIn Top Startup in 2024 and 2025. We're proudly supported by Oak HC/FT, Andreessen Horowitz (a16z), OpenAI Startup Fund, and Kleiner Perkins — and we're just beginning our journey.The Role:Ambience is responsible for processing millions of patient encounters across the largest health systems in the country. These organizations rely on us for real-time clinical workflows where latency and reliability significantly influence patient care. A delay during a patient visit is not merely a negative metric; it can lead to a physician abandoning the tool.In this position, you will oversee the core systems that enable Ambience to scale with reliability: database architecture, caching, multi-tenancy, and performance optimization that influences the user experience for clinicians. You will design database architectures that accommodate our growth, construct caching systems that prevent EHR API latency from affecting critical processes, and develop multi-tenant infrastructures that protect customer data while enhancing performance.Your ultimate goal will be to create infrastructure that other teams rely on effortlessly.Our engineering roles are hybrid, requiring presence in our San Francisco office three times a week.

Feb 2, 2026

Apply

Software Engineer in GPU Networking & Distributed Systems

Baseten

Full-time|On-site|San Francisco

Join Baseten as a Software Engineer focusing on GPU Networking and Distributed Systems. In this pivotal role, you'll collaborate with talented engineers and researchers to develop cutting-edge solutions that leverage GPU technology for high-performance networking operations. Your contributions will be instrumental in shaping the future of distributed systems, enhancing performance, scalability, and reliability.

Feb 23, 2026

Apply

Software Engineer, Distributed Data Systems (Sora)

OpenAI

Full-time|Hybrid|San Francisco

About Our TeamJoin the innovative Sora team at OpenAI, where we are at the forefront of developing multimodal capabilities for our foundation models. As a dynamic hybrid of research and product development, we focus on seamlessly integrating advanced multimodal functionalities into our AI offerings, ensuring they are not only reliable and user-friendly but also aligned with our mission to foster broad societal benefits.About the PositionWe are seeking a dedicated Software Engineer specializing in Distributed Data Systems to architect and enhance the infrastructure that supports large-scale multimodal training and evaluation at OpenAI. In this role, you will oversee distributed data pipelines and collaborate closely with our researchers to translate their requirements into robust, high-performance systems. You will play a crucial role in fortifying the pipelines that underpin Sora’s rapid innovation cycles.We are looking for engineers with a keen eye for detail, substantial experience with distributed systems, and a proven track record of building reliable infrastructures in high-stakes environments.This position is based in San Francisco, CA, and follows a hybrid work model requiring three days in the office each week. We also provide relocation assistance to new team members.Key Responsibilities:Design, build, and maintain data infrastructure systems including distributed computing, data orchestration, distributed storage, streaming infrastructure, and machine learning infrastructure, ensuring they are scalable, reliable, and secure.Ensure our data platform can scale dramatically while maintaining high levels of reliability and efficiency.Collaborate with researchers to deeply understand their needs and translate them into production-ready systems.Harden, optimize, and maintain vital data infrastructure systems that drive multimodal training and evaluation.Ideal Candidates Will Have:Extensive experience with distributed systems and large-scale infrastructure, coupled with a strong passion for data.A detail-oriented mindset and a commitment to building and maintaining dependable systems.Solid software engineering fundamentals and exceptional organizational skills.Comfort with ambiguity and rapid changes in a fast-paced environment.About OpenAIOpenAI is a pioneering AI research and deployment organization dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We strive to advance digital intelligence in a way that is safe and beneficial, pushing the boundaries of innovation and technology.

Nov 14, 2025

Apply

Senior Software Engineer - Distributed Data Systems

Databricks

Full-time|$166K/yr - $225K/yr|On-site|San Francisco, California

At Databricks, we are driven by a passion for empowering data teams to tackle the world’s most challenging problems — from transforming transportation to accelerating medical innovations. We achieve this by creating and maintaining the leading data and AI infrastructure platform, enabling our clients to leverage profound data insights for business enhancement. Founded by engineers with a customer-first mentality, we eagerly embrace every opportunity to tackle complex technical challenges, ranging from the design of next-generation UI/UX for data interactions to scaling our services across millions of virtual machines. Our journey has just begun.As a member of the Runtime team at Databricks, you will be instrumental in developing the next generation of distributed data storage and processing systems. These systems will surpass specialized SQL query engines in relational query performance while offering the programming abstractions necessary to support a variety of workloads, from ETL to data science.Example projects include:Apache Spark™: Contribute to the de facto open-source standard framework for big data.Data Plane Storage: Develop reliable and high-performance services and client libraries for managing vast amounts of data within cloud storage backends like AWS S3 and Azure Blob Store.Delta Lake: Design a storage management system that merges the scalability and cost-effectiveness of data lakes with the performance and reliability of data warehouses, providing features like ACID transactions and time travel.Delta Pipelines: Simplify the orchestration and operation of numerous data pipelines, enabling clients to deploy, test, and upgrade pipelines effortlessly.Performance Engineering: Create the next-generation query optimizer and execution engine that is fast, scalable, and robust.

Jan 30, 2026

Apply

Staff Software Engineer - Distributed Data Systems

Databricks

Full-time|$192K/yr - $260K/yr|On-site|San Francisco, California

P-186 At Databricks, we are passionate about empowering data teams to tackle some of the world’s most challenging problems, from security threat detection to cancer drug development. Our mission is to build and operate the leading data and AI infrastructure platform, enabling our customers to concentrate on the high-value challenges that are integral to their own objectives. Founded in 2013 by the original creators of Apache Spark™, Databricks has rapidly evolved from a small office in Berkeley, California, to a global powerhouse with over 1000 employees. Trusted by thousands of organizations, from startups to Fortune 100 companies, we are recognized as one of the fastest-growing SaaS companies worldwide. Our engineering teams create highly sophisticated products that address significant needs in the industry. We continuously push the limits of data and AI technology while maintaining the resilience, security, and scalability essential for our customers' success on our platform. We manage one of the largest-scale software platforms, consisting of millions of virtual machines that generate terabytes of logs and process exabytes of data daily. At this scale, we frequently encounter cloud hardware, network, and operating system faults, and our software must effectively shield our customers from these challenges. Modern data analysis leverages advanced techniques, such as machine learning, that far exceed the capabilities of traditional SQL query engines. As a Software Engineer on the Runtime team at Databricks, you will be instrumental in developing the next generation of distributed data storage and processing systems that outshine specialized SQL query engines in relational query performance, while providing the flexibility and programming abstractions to support a variety of workloads, from ETL to data science. Examples of projects you may work on include: Apache Spark™: Contributing to the de facto open-source framework for big data. Data Plane Storage: Developing reliable, high-performance services and client libraries for storing and accessing vast amounts of data on cloud storage backends like AWS S3 and Azure Blob Store. Delta Lake: A storage management system that merges the scalability and cost-effectiveness of data lakes with the performance and reliability of data warehouses, featuring low latency streaming. Its higher-level abstractions and guarantees, including ACID transactions and time travel, significantly reduce the complexity of real-world data engineering architectures. Delta Pipelines: Aiming to simplify the management of data engineering pipelines.

Jan 30, 2026

Apply

Distributed Systems Engineer - Platform

Inngest

Full-time|Remote|US/Remote

At Inngest, our Systems Engineers are the architects behind the backbone of our platform, crafting a robust execution layer, an efficient queueing system, and scalable state stores that interlink seamlessly. This role presents an exciting opportunity to tackle complex technical challenges while deriving immense satisfaction from your contributions.About Us: Inngest is pioneering innovative solutions to long-standing challenges faced by developers. Our mission is to create first-of-its-kind tools that enhance the daily workflow of developers, prioritizing user experience and performance. A strong product-centric mindset and a passion for developer tools are essential for success in this role.The Role: A successful Systems Engineer at Inngest must possess a blend of generalist and specialist skills. You will collaborate with our team to enhance the functionality of our queueing system (including debounce and concurrency mechanisms), manage vast amounts of data in our state store, and refine the API layers that facilitate user interactions. Your work will have a direct impact on millions of developers, and you will engage closely with designers, engineers, and founders to optimize user experience.Note: This position requires overlapping working hours with US PST. While residing in the San Francisco Bay Area is preferred, exceptional candidates from anywhere in the United States will be considered. Our engineering team operates in-person several days a week in San Francisco.

Apr 15, 2025

Apply

Senior Software Engineer (Python + Distributed systems)

Scribd Inc.

Full-time|$120K/yr - $228K/yr|On-site|San Francisco

About Scribd:At Scribd Inc. (pronounced “scribbed”), we ignite human curiosity by fostering a world rich in stories and knowledge. Join our innovative team as we democratize the flow of ideas and information, empowering collective expertise through our diverse product offerings: Everand, Scribd, Slideshare, and Fable.This job posting represents an open and approved position within our organization.We cultivate a culture where authenticity and boldness thrive; where we engage in vibrant discussions and embrace the unexpected; and where every employee is empowered to take meaningful actions with a firm customer focus.Our work structure emphasizes a balance between individual flexibility and community engagement. Through our Scribd Flex program, employees can collaborate with their managers to determine the most effective work style that meets their personal needs. A core principle of Scribd Flex is prioritizing intentional in-person interactions that foster collaboration, culture, and connection. Therefore, occasional in-person attendance is required for all Scribd employees, irrespective of their location.What do we seek in new team members? We hire for “GRIT.” GRIT embodies the blend of passion and perseverance towards long-term goals. At Scribd Inc., we believe in the possibilities this can unlock and encourage our employees to adopt a GRIT-driven approach to their work. In practical terms, GRIT represents our standards: we look for individuals who can set and accomplish Goals, deliver Results in their roles, contribute Innovative ideas and solutions, and positively impact the Team through collaboration and attitude.

Nov 17, 2025

Apply

Software Engineer, Database Systems

OpenAI

Full-time|On-site|San Francisco

About Our Team:Join the innovative Database Systems team at OpenAI, where we specialize in high-performance distributed databases. We are the architects behind Rockset, a cutting-edge real-time search, analytics, and vector database that powers all vector search and retrieval augmented generation (RAG) at OpenAI. Rockset underpins core functionalities across all OpenAI product lines and supports various critical internal applications.About the Role:We are in search of engineers who are passionate about distributed systems, performance optimization at a low level (with our core engine developed in C++), and constructing scalable database infrastructures from scratch. As a member of the Database Systems team, you will play a key role in enhancing the core database engine, making significant contributions to ingestion, query execution, indexing, and storage improvements. You will collaborate with multiple teams across OpenAI to unlock new product capabilities and ensure the reliability and scalability of our online database as usage expands exponentially.Your Responsibilities Will Include:Design, develop, and maintain high-performance distributed systems.Identify and address performance bottlenecks to elevate infrastructure capabilities.Define and guide the long-term technical vision and evolution of the system.Collaborate with product, engineering, and research teams to deliver robust and scalable infrastructure.Investigate complex production issues across the entire technology stack.Contribute to incident response, retrospective analyses, and establishing best practices for system reliability.You Will Excel In This Role If You:Possess substantial experience in building, scaling, and optimizing distributed systems.Exhibit a keen interest in database internals, storage engines, or low-latency query systems.Enjoy tackling complex performance challenges in high-throughput systems.Have experience managing and operating production clusters at scale (e.g., Kubernetes or similar orchestration tools).Approach scalability, correctness, and reliability with a rigorous mindset.Thrive in a fast-paced environment where you can make a significant impact.Qualifications:4+ years of relevant industry experience with a focus on distributed systems.Proficiency in C++ or similar low-level programming languages.Strong problem-solving skills and attention to detail.Experience with performance monitoring and optimization tools.Excellent collaboration and communication skills.

Jul 29, 2025

Apply

Distributed Systems Engineer - Data Platform - Analytics and Alerts

Cloudflare, Inc.

Full-time|Hybrid|Hybrid

Join Cloudflare as a Distributed Systems Engineer within our dynamic Data Platform team, focusing on Analytics and Alerts. In this position, you will play a pivotal role in building and optimizing distributed systems that power our data analytics capabilities, providing real-time insights and alerts to enhance our customer experience.

Mar 4, 2026

Apply

Distributed Systems Engineer

Sieve

Full-time|On-site|San Francisco

About UsSieve is a pioneering AI research lab dedicated solely to video data. We harness exabyte-scale video infrastructure and innovative video understanding techniques, along with a multitude of data sources, to create datasets that advance the field of video modeling. Given that video constitutes 80% of internet traffic, it serves as a vital medium that fuels creativity, communication, gaming, AR/VR, and robotics. Our mission is to tackle the most significant challenge in the development of these applications: acquiring high-quality training data.With a small yet highly skilled team of just 15 members, we have formed strategic partnerships with leading AI labs and achieved $XXM in revenue last quarter alone. Our Series A funding round last year was backed by prestigious firms, including Matrix Partners, Swift Ventures, Y Combinator, and AI Grant.About the RoleAs a Distributed Systems Engineer at Sieve, you will be responsible for designing and implementing systems that efficiently manage the compute, scheduling, and orchestration of complex machine learning and ETL pipelines. Your work will ensure these systems operate quickly, reliably, and cost-effectively while processing large volumes of video data.You will thrive in this role if you are passionate about optimizing system uptime, have experience with cloud technologies, and enjoy working with high-performance distributed systems involving thousands of GPUs. Additionally, you will play a key role in developing excellent internal tools and CI/CD pipelines to facilitate rapid iteration.

Apr 26, 2025

Apply

Distributed Systems Engineer at archil | San Francisco

Archil

Full-time|On-site|San Francisco

Role OverviewJoin our innovative team as a Distributed Systems Engineer at Archil, where you will play a pivotal role in developing cutting-edge storage solutions. You will work across the entire technology stack, tackling challenges as they arise and significantly shaping our product's technical and strategic direction.Your responsibilities will include:Being on-call for our production systems to assist customers promptly in case of issues.Innovating and implementing unprecedented features in our storage services.Designing interactions within distributed systems to ensure atomicity and idempotency.Deploying and standardizing infrastructure across various cloud environments.Navigating evolving customer requirements amidst ambiguity.

Jun 2, 2025

Apply

Software Engineer, Machine Learning Infrastructure & Distributed Systems (Staff & Principal)

Tubi, Inc.

Full-time|$227.2K/yr - $417K/yr|Hybrid|San Francisco, CA; Los Angeles, CA; New York, NY (Hybrid); USA - Remote

About the Role:Join our dynamic ML Infrastructure team as a Software Engineer, where you'll collaborate intimately with the Machine Learning and Product teams to construct top-tier machine learning inference platforms. These cutting-edge platforms drive vital services such as personalized recommendations, search functionalities, and content comprehension at Tubi.Your primary focus will be on the development and maintenance of low-latency ML model serving systems that cater to Deep Learning, LLM, and Search models. This will include the creation of self-service infrastructure and critical components such as the inference engine, feature store, vector store, and experimentation engine.In this role, you'll enhance our service deployment and operational processes, with opportunities to contribute to open-source projects. Enjoy architectural freedom to explore innovative frameworks, spearhead significant cross-functional projects, and elevate the capabilities of our ML and Product teams.We are currently hiring for two positions:Staff Software EngineerPrincipal Software EngineerAdditional Details: As a Principal Engineer, you will serve as a technical leader and visionary, guiding the advancement of our machine learning platform. You'll address complex technical challenges, shape architectural decisions, and mentor senior engineers, fostering a culture of excellence and continuous improvement. Your contributions will impact millions of users.

Mar 23, 2026

Apply

Senior Distributed Systems Engineer

Archil

Full-time|On-site|San Francisco

Role OverviewJoin Archil as a Senior Distributed Systems Engineer, where you will play a critical role in developing our innovative storage solutions. You'll engage with technologies across the entire stack to tackle challenges and contribute to building Archil volumes, significantly influencing both technical design and product strategy.Key ResponsibilitiesProvide on-call support for our production systems, ensuring customer satisfaction in case of issues.Innovate and implement unprecedented capabilities within our storage services.Design interactions in distributed systems focusing on atomicity and idempotency.Deploy and generalize infrastructure across multiple cloud environments.Adapt to evolving customer needs amidst ambiguity.Lead engineering teams through complex decisions and provide insightful PR feedback.

Jun 17, 2025

Apply

Engineer, Supercomputing & Distributed Systems

Krea

Full-time|On-site|San Francisco

Join Krea's Innovative TeamAt Krea, we are at the forefront of developing next-generation AI creative tools. Our commitment lies in making AI an intuitive and controllable medium for creatives. We aspire to create tools that enhance human creativity rather than replace it.We view AI as a transformative medium that enables expressions across diverse formats—text, images, video, sound, and even 3D. Our focus is on creating smarter, more adaptable tools that leverage this medium effectively.The Role of Supercomputing and AI Infrastructure at KreaOur team is responsible for building and managing the foundational infrastructure that supports Krea's research and inference processes. This includes distributed training systems, over 1000 Kubernetes GPU clusters, and extensive petabyte-scale data pipelines. Much of our work involves creating bespoke solutions, such as custom distributed datastores, job orchestration systems, and advanced streaming pipelines, which are designed to handle modern AI workloads efficiently.Key Projects You Will Contribute To:Distributed Data Systems: Design and implement multi-stage pipelines to transform petabytes of raw data into clean, annotated datasets; run classification models across billions of images; deploy and integrate large language models to caption extensive multimedia data.GPU Infrastructure: Manage distributed training and inference across 1000+ GPU Kubernetes clusters; address orchestration and scaling challenges for large-scale GPU job processing; optimize research workflows across multiple datacenters.Distributed Training: Profile and enhance dataloaders streaming thousands of images per second; troubleshoot InfiniBand networking during extensive training runs; develop fault tolerance systems for large-scale pretraining; collaborate with researchers to refine reinforcement learning infrastructure.Applied ML Pipelines: Identify clean scenes in millions of videos utilizing distributed shot-boundary detection; tailor and train models to sift through billions of images for specific queries; construct systems that link raw cluster capacity with research outcomes.

Apr 3, 2026

Apply

Software Engineer, Platform Systems

OpenAI

Full-time|On-site|San Francisco

About Our TeamThe Platform Systems team at OpenAI is at the forefront of innovation, merging advanced AI technologies with large-scale distributed systems. We are tasked with creating the engineering and research infrastructure essential for training OpenAI's premier models on some of the most powerful, custom-built supercomputers globally.Our team is dedicated to developing the core software for model training, delving deep into the technological stack. This encompasses collective communication, compute efficiency, parallelism strategies, fault tolerance, failure detection, and observability. The systems we design are pivotal to enhancing OpenAI's research capabilities, facilitating reliable and efficient training at the leading edge of technology.We work in close partnership with researchers across the organization, continuously integrating insights from various OpenAI projects to advance our training platform.About the RoleAs a Software Engineer specializing in Platform Systems, you will architect and develop distributed systems that enhance visibility into large-scale training operations, ensuring their dependable operation at scale.Your responsibilities will include designing systems for failure detection, tracing, and observability that pinpoint slow or malfunctioning nodes, identify performance bottlenecks, and assist engineers in optimizing extensive distributed training tasks. This infrastructure is integral to the functionality of OpenAI's training stack and is continuously evolving to accommodate new use cases and increasingly intricate workloads.This position is central to our training infrastructure, merging systems engineering, performance analysis, and large-scale debugging.Key ResponsibilitiesDesign and develop distributed failure detection, tracing, and profiling systems tailored for large-scale AI training jobs.Create tools to identify slow, faulty, or errant nodes and deliver actionable insights into system behavior.Enhance observability, reliability, and performance across OpenAI's training platform.Troubleshoot and resolve issues within complex, high-throughput distributed systems.Collaborate effectively with systems, infrastructure, and research teams to advance platform capabilities.Adapt and expand failure detection and tracing systems to support new training paradigms and workloads.Ideal Candidate ProfilePossesses a deep passion for performance, stability, and observability in distributed systems.Demonstrates proficiency in systems engineering and performance analysis.Has experience in debugging high-throughput distributed systems.Exhibits strong collaboration skills with a track record of working with cross-functional teams.Shows adaptability and eagerness to embrace new technologies and methodologies.

Jan 23, 2026

Apply

Senior Software Engineer - Foundational Data Systems for AI

Granica

Full-time|On-site|Bay Area Office

About GranicaGranica is an innovative AI research and infrastructure firm dedicated to creating reliable and steerable representations of enterprise data.We build trust through our product Crunch, a policy-driven health layer that ensures large tabular datasets remain efficient, reliable, and reversible. On this solid foundation, we are developing Large Tabular Models—systems designed to learn cross-column and relational structures in order to provide trustworthy answers and automation with inherent provenance and governance.Our MissionAI is currently hampered not only by the design of models but also by the inefficiencies of the data that supports them. Every redundant byte, poorly organized dataset, and inefficient data pathway contributes to significant costs, latency, and energy waste as we scale.Granica aims to eliminate these inefficiencies. We merge cutting-edge research in information theory, probabilistic modeling, and distributed systems to craft self-optimizing data infrastructures: systems that consistently enhance the representation and utilization of information by AI.Our engineering team collaborates closely with the Granica Research group led by Prof. Andrea Montanari of Stanford University, bridging advancements in information theory and learning efficiency with large-scale distributed systems. Together, we firmly believe that the next major advancement in AI will stem from breakthroughs in efficient systems rather than merely larger models.Your ContributionsGlobal Metadata Substrate: Design a transactional and metadata substrate that facilitates time-travel, schema evolution, and atomic consistency across massive petabyte-scale tabular datasets.Adaptive Engines: Develop systems that autonomously reorganize data, learning from access patterns and workloads to maintain peak efficiency without the need for manual tuning.Intelligent Data Layouts: Optimize bit-level organization (including encoding, compression, and layout) to maximize signal extraction per byte read.Autonomous Compute Pipelines: Create distributed compute systems that scale predictably, adapt to dynamic loads, and ensure reliability under failure conditions.Research to Production: Apply new algorithms in compression, representation, and optimization that emerge from ongoing research. We encourage opportunities to publish and open-source your work.Latency as Intelligence: Design systems that inherently minimize latency as a measure of intelligence.

Nov 7, 2025

Create account — see all 5,730 results