Software Engineer, Caching Infrastructure

OpenAI, San Francisco

On-site Full-time

Experience Level

Mid to Senior

Qualifications

In This Role, You Will:

Design, develop, and manage OpenAI’s multi-tenant caching platform utilized across inference, identity, quota, and product experiences.
Establish the long-term vision and strategic roadmap for caching as a core infrastructure capability, effectively balancing performance, durability, and cost.
Collaborate with other infrastructure teams (e.g., networking, observability, databases) and product teams to ensure our caching platform aligns with their requirements.

You Might Thrive In This Role If You:

Possess 5+ years of experience in building and scaling distributed systems, with a strong emphasis on caching, load balancing, or storage systems.
Hold in-depth knowledge of Redis, Memcached, or similar technologies, including clustering, durability configurations, client-side connection patterns, and performance optimization.
Have practical experience with Kubernetes, service meshes (e.g., Envoy), and autoscaling solutions.
Approach design with a keen focus on latency, reliability, throughput, and cost-efficiency.
Excel in a dynamic environment and appreciate the blend of practical engineering with a commitment to long-term technical excellence.

About the job

About the Team

At OpenAI, we are on a mission to develop safe and beneficial artificial general intelligence. Our models are integrated into innovative products such as ChatGPT and various APIs. To ensure these systems are swift, reliable, and economically viable, we require top-tier infrastructure that stands out in the industry.

The Caching Infrastructure team plays a pivotal role by creating a robust caching layer that supports numerous critical applications at OpenAI. Our goal is to deliver a high-availability, multi-tenant caching platform capable of auto-scaling with workload demands, reducing tail latency, and accommodating a wide array of use cases.

We seek an experienced engineer who can design and scale this essential infrastructure. The ideal candidate will possess extensive experience in distributed caching systems (e.g., Redis, Memcached), a solid understanding of networking fundamentals, and expertise in Kubernetes-based service orchestration.

About OpenAI

OpenAI is a pioneering company in AI research and deployment, dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We continuously push the boundaries of AI capabilities, striving to deploy them safely and effectively through our innovative products. As a powerful tool, AI must be developed with a focus on safety and the needs of people, making our mission both challenging and rewarding.

Similar jobs

Software Engineer, Infrastructure Reliability at OpenAI | San Francisco

OpenAI

Full-time|On-site|San Francisco

About Our Team

Join our dynamic Infrastructure organization at OpenAI, where we are actively seeking talented software engineers to bolster our efforts across several high-impact teams. With a variety of focus areas available, including Core Distributed Systems, Databases, Observability, and Cloud Infrastructure, you'll have the opportunity to work on projects that fascinate you. Our teams operate with a high level of autonomy and foster a deeply collaborative environment, all dedicated to enhancing safety, reliability, and operational velocity across the organization.

About the Role

As a Software Engineer focused on Infrastructure Reliability, you will play a pivotal role in scaling and fortifying the infrastructure that supports some of the world’s most widely utilized AI systems. Your work will ensure that our systems maintain high reliability, observability, performance, and security, enabling researchers to iterate rapidly and allowing products like ChatGPT and the OpenAI API to effectively serve millions of users.

This hands-on, impactful role is perfect for engineers who enjoy ownership, excel at solving complex technical challenges across the entire stack, and wish to contribute to systems that facilitate cutting-edge research deployed on a global scale. You will significantly influence technical direction, enhance system resilience, and collaborate closely with infrastructure, product, and research teams to transform intricate infrastructure into dependable platforms.

Key Responsibilities

Design, construct, and maintain reliable, high-performance systems utilized across engineering.
Identify and resolve performance bottlenecks and inefficiencies, ensuring our infrastructure scales appropriately.
Investigate and troubleshoot complex issues thoroughly.
Enhance automation to minimize manual tasks and improve internal developer tools.
Participate in incident response, postmortem analysis, and the development of best practices surrounding system reliability and scalability.

Ideal Candidate Profile

Possess a deep understanding of distributed systems principles, with a proven track record in developing and managing scalable, reliable systems.
Demonstrate a strong focus on performance and optimization, with the ability to maximize efficiency in complex, globally distributed systems.
Have experience managing orchestration systems such as Kubernetes at scale and creating abstractions over cloud platforms.
Be comfortable working within Linux environments and possess strong problem-solving skills.

Mar 19, 2026

Software Engineer for Scaled Abuse at OpenAI | San Francisco

OpenAI

Full-time|On-site|San Francisco

About the Team

The Applied team is dedicated to responsibly introducing OpenAI's groundbreaking technology to the global community. We have launched revolutionary products like ChatGPT, Plugins, DALL·E, and APIs for GPT-5, embeddings, and fine-tuning. Additionally, we manage extensive inference infrastructure to support our rapidly advancing initiatives.

Our clients leverage our APIs to create dynamic businesses with innovative product features previously thought impossible. For instance, ChatGPT exemplifies the capabilities our technology currently offers. We prioritize the responsible use of our powerful tools, ensuring that safety remains paramount over unrestricted growth.

The Fraud Engineering team operates within the Applied Engineering organization, focusing on identifying and mitigating fraudulent activities on our platform. We are seeking a software engineer with expertise in anti-fraud and abuse systems to help us design and implement next-generation solutions.

About the Role

The Scaled Abuse team safeguards OpenAI’s products and users by detecting, preventing, and addressing fraudulent and abusive activities on a large scale. We develop and maintain backend and data systems that facilitate real-time detection, investigative workflows, and enforcement, ensuring a balance between robust protection and an excellent user experience as our platform evolves.

Our work lies at the convergence of engineering and abuse prevention expertise. We collaborate closely with Trust & Safety, Security, and Product teams to identify emerging attack trends, translate complex signals into actionable system behavior, and continuously enhance our defenses. We value engineers who can navigate unfamiliar codebases swiftly, develop a deep understanding of system functionality, and suggest practical improvements to enhance overall resilience.

In This Role, You Will:

Design and develop systems for fraud detection and remediation, balancing fraud loss, implementation costs, and customer experience.
Collaborate with finance, security, product, research, and trust & safety teams to effectively combat fraudulent and abusive behaviors on our platform.
Stay updated with the latest techniques and tools to remain ahead of determined adversaries.
Utilize GPT-5 and future models to enhance our fraud and abuse mitigation efforts.

You May Thrive in This Role If You:

Possess at least 5 years of software engineering experience, particularly in backend and data systems.

Apr 6, 2026

Software Engineer - Continuous Deployment at OpenAI | San Francisco

OpenAI

Full-time|$230K/yr - $490K/yr|On-site|San Francisco

About the Role

Join the Engineering Acceleration Delivery / Continuous Deployment team at OpenAI, where we develop and maintain systems designed to securely deploy OpenAI’s infrastructure and product code into production.

Our team is responsible for the deployment platform, release pipelines, and safety mechanisms that empower engineers across OpenAI to make rapid changes while minimizing operational risks. Our goal is to streamline production deployments, enhancing speed, safety, and autonomy.

This position is a unique opportunity to work at the convergence of developer productivity, distributed systems reliability, and large-scale infrastructure orchestration.

In This Role, You Will

Architect and implement continuous deployment infrastructure that efficiently manages changes across multiple Kubernetes clusters and global regions.
Create systems for progressive delivery, incorporating techniques like canary releases, staged rollouts, and automated rollback processes.
Enhance engineering velocity by reducing friction within the release pipeline and automating operational workflows.
Collaborate with product and infrastructure teams to ensure their services are deployable, observable, and resilient at scale.
Refine and adopt deployment methodologies such as GitOps, infrastructure-as-code, and progressive delivery patterns.
Develop systems that automatically assess deployment health through metrics, logs, traces, and alerts to identify regressions and initiate safe rollbacks.
Create systems that facilitate agent-assisted or fully autonomous deployment workflows using cutting-edge AI tools.

Technologies you will work with include:

Kubernetes for large-scale container orchestration and runtime infrastructure
Python and FastAPI for internal services
Terraform for infrastructure as code
GitOps-based deployment workflows (e.g., ArgoCD, Flux, or similar systems)
Buildkite for CI orchestration

Mar 10, 2026

Rack Product Engineer, AI Rack Infrastructure at OpenAI | San Francisco

OpenAI

Full-time|On-site|San Francisco

About Our Team

At OpenAI, we are forging a global network of cutting-edge datacenters in collaboration with our technology and capital partners to meet the challenges of the most demanding AI workloads. The Industrial Compute team is dedicated to designing, manufacturing, and deploying datacenter infrastructure systems that prioritize reliability, scalability, and performance.

Our team collaborates with engineering departments, manufacturing partners, construction teams, and datacenter operations to ensure that our rack systems are efficiently built, validated, and deployed across our rapidly expanding global infrastructure.

Our responsibilities encompass product definition, design validation, manufacturing readiness, and field deployment, all aimed at ensuring that our rack infrastructure meets the performance, reliability, and operational requirements of OpenAI's compute platforms.

About This Role

We are on the lookout for a skilled Rack Product Engineer to spearhead the technical development, manufacturing readiness, and lifecycle performance of rack infrastructure utilized across OpenAI's datacenters. You will serve as the engineering subject matter expert, overseeing aspects of management, testing, quality assurance, and manufacturing engineering to scale operations and maintain on-time delivery.

This position operates at the intersection of hardware design, manufacturing, supplier engagement, and datacenter deployment. You will collaborate closely with compute, mechanical, power, and networking teams to define rack architectures that are manufacturable, scalable, and operationally reliable.

Your role will involve partnering with contract manufacturers and suppliers to ensure that rack systems are built to specifications while also driving internal design improvements, resolving field issues, and supporting rapid deployment of infrastructure on a global scale.

Travel Requirements

This position may require domestic and international travel as needed, estimated at 30%, to manufacturing sites, supplier facilities, and datacenter deployments.

Mar 16, 2026

Software Engineer - Foundations Retrieval at OpenAI | San Francisco

OpenAI

Full-time|On-site|San Francisco

About the Foundations Retrieval Team

The Foundations Research group at OpenAI explores new approaches that could shape artificial intelligence for years to come. The team focuses on improving the science and data behind model training and scaling, especially for future advanced models. Areas of focus include data utilization, scaling laws, optimization strategies, model architectures, and efficiency improvements.

Within Foundations, the Search team builds agentic search solutions. This group works closely with others to design interfaces between models and the core search stack (serving, indexing, and retrieval) so model intent leads to reliable, real-world results. The team develops large-scale systems to transform and index massive information sources, enabling models to reason over global knowledge. Close collaboration with researchers helps move new modeling ideas into production quickly, changing how intelligent systems discover and synthesize information at scale.

Role Overview

OpenAI is hiring a Software Engineer with expertise in retrieval system development and scalability for its San Francisco office. This role involves working with researchers and engineers to build infrastructure that lets models access the right information when needed. Responsibilities include designing and operating indexing systems, retrieval pipelines, and serving layers. Work in this role will directly improve retrieval capabilities across OpenAI’s research and products, with a strong influence on system performance, reliability, and scalability.

What You’ll Do

Develop and scale retrieval infrastructure, including indexing, serving, and query execution.
Build low-latency, high-throughput systems for real-time model interactions.
Work with research teams to bring embedding and retrieval methods into production.
Support dense, sparse, and hybrid retrieval pipelines.
Maintain system performance, reliability, and observability at scale.
Collaborate with Pretraining, Inference, and Product teams to deliver end-to-end retrieval solutions.
Help develop model-system interfaces for agentic workflows.

Who We’re Looking For

Experience building and scaling distributed systems.
Background in developing high-performance, low-latency systems.
Hands-on work with indexing and retrieval techniques.
Familiarity with hybrid retrieval systems.
Comfort working collaboratively across multiple teams.

Apr 14, 2026

Software Engineer for Ads Monetization at OpenAI | San Francisco

OpenAI

Full-time|On-site|San Francisco

About the Team

Join the Applied team at OpenAI, where we are dedicated to responsibly deploying groundbreaking technology. Our products, including ChatGPT, Sora, and the OpenAI API, are transforming industries by offering advanced capabilities such as GPT-5 and a robust suite of multimodal features spanning text, image, audio, and video. We manage extensive platform infrastructure and large-scale inference systems that cater to our global user base. As we continue to innovate, our influence in the tech world expands.

Our clients are leveraging our APIs to fuel rapid business growth, unlocking previously unimaginable product functionalities. With offerings like ChatGPT and Sora, we showcase the potential of our technology across diverse media experiences. As we enhance these capabilities, we remain committed to ethical practices, prioritizing responsible deployment over unregulated expansion.

As a member of the Applied Engineering team, the Ads Monetization group within Financial Engineering focuses on developing core systems that manage the financial aspects of ChatGPT Ads. These systems are designed for low-latency, high scale, and exceptional reliability, all while ensuring accuracy, auditability, and transparency in financial operations. This role integrates ad delivery, data engineering, and financial systems.

Key Responsibilities:

Design and implement the foundational monetization systems for ChatGPT Ads.
Create and manage the core services and pipelines that facilitate ads monetization from event capture and validation to pricing, metering, and generating billable outputs.
Establish and maintain the definitive source of truth for ads monetization data, including schemas, data models, and invariants to ensure consistency, transparency, and auditability.
Oversee accuracy and reconciliation: align production outputs with invoicing and finance requirements, develop controls and monitoring systems, and address discrepancies through investigative efforts.
Engage in full-stack development to create comprehensive billing solutions for our ChatGPT and API clients.
Collaborate effectively with various stakeholders, including Ads Engineering, Data Science, Product, Finance, and Go-To-Market teams, as well as fellow engineers.

Ideal Candidate Qualifications:

At least 5 years of professional software engineering experience.
Proficiency in designing scalable software systems with a focus on reliability and accuracy.
Strong collaborative skills to work effectively in a cross-functional team environment.
Experience with financial systems and data engineering principles is a plus.

Mar 2, 2026

Full Stack Software Engineer at OpenAI Edu | San Francisco

OpenAI

Full-time|On-site|San Francisco

Role Overview

OpenAI Edu is hiring a Full Stack Software Engineer in San Francisco. This role centers on building educational tools and platforms, contributing to both the front end and back end of new products.

What You Will Do

Design and develop features across the stack, from user interfaces to server-side logic.
Collaborate closely with engineers to deliver reliable, high-quality software.
Apply knowledge of multiple programming languages and frameworks to support project needs.
Troubleshoot and resolve technical issues to improve system performance and functionality.

Location

This position is based in San Francisco.

Apr 16, 2026

Site Reliability Engineer - Infrastructure for Analytics Platform

OpenAI

Full-time|On-site|San Francisco

The Scaling team at OpenAI builds and maintains the core infrastructure that supports research efforts. This group focuses on enabling rapid progress toward Artificial General Intelligence by providing the systems and tools researchers rely on every day. Their work covers everything from foundational infrastructure to specialized applications, all designed to handle increasing complexity and scale without sacrificing reliability or ease of use.

Role overview

OpenAI is seeking a Site Reliability Engineer to manage and improve the infrastructure behind its analytics platform. This position centers on supporting production systems that handle data-intensive, low-latency workloads. Key technologies include large-scale ClickHouse clusters, high-throughput Kafka pipelines, and stable integrations with Snowflake. The engineer in this role will turn ambiguous operational challenges into concrete solutions, deliver improvements quickly, and iterate based on real-world feedback. Success in this role means independently setting and raising operational standards, working closely with production systems, and collaborating across teams to ensure reliability at scale.

Key responsibilities

Manage the full lifecycle of infrastructure: provisioning, upgrades, scaling, and decommissioning using Infrastructure as Code (IaC).
Operate and scale ClickHouse clusters, including sharding, replication, capacity planning, tuning, and maintenance.
Run Kafka as the primary data ingestion layer, improving throughput, managing lag and backpressure, and ensuring robust failure recovery.
Improve latency and reliability for workloads involving heavy data serving and querying.
Develop and maintain monitoring and alerting systems, including SLIs/SLOs, dashboards, alert policies, and actionable runbooks.
Create and refine incident response protocols, on-call procedures, and postmortem practices.
Oversee backup, restore, and disaster recovery strategies, including regular drills.
Plan and execute safe rollouts across development, staging, and production environments, using canary deployments and rollback plans.
Work daily with software engineers to embed reliability into system design, implementation, and release cycles.
Set and promote standards for operational readiness and runbooks, encouraging adoption across teams.
Enhance CI/CD pipelines and improve the developer experience for greater speed and safety.

Apr 28, 2026

Infrastructure Engineering Lead, IT

OpenAI

Full-time|On-site|San Francisco

About Our Team

The Infrastructure Engineering team operates within the IT department, dedicated to the reliable construction, deployment, and management of critical on-premises and hybrid environments that empower our internal services and vital research and development projects.

This newly established team is committed to implementing rigorous Site Reliability Engineering (SRE) practices in environments where uptime, safety, recoverability, and security are paramount. We aim to replace unique, one-off infrastructure with standardized infrastructure-as-code components that enhance reliability and operational efficiency as OpenAI continues to grow.

About This Role

We are in search of an Infrastructure Engineering Lead who will architect, build, and maintain reliable, secure, and scalable infrastructure that supports identity, access, endpoint, and shared platform services throughout the organization.

You will take full ownership of infrastructure and identity systems from conceptual design and provisioning to policy enforcement, upgrades, recovery, and ongoing operations. Your goal will be to develop robust, production-grade platforms that minimize operational hurdles, enforce security by default, and empower teams to work more effectively and confidently.

This position is ideal for a senior engineer who excels in navigating ambiguity, relishes the challenge of overseeing complex systems from start to finish, and enhances reliability and security by transforming fragile implementations into standardized, repeatable infrastructure.

This role is based at our San Francisco headquarters and requires in-office attendance.

Key Responsibilities:

Define and refine infrastructure patterns for on-prem and hybrid environments, including self-hosted platforms, vendor-supported systems, and lab settings.
Establish standardized, production-grade deployment and operational models that replace custom-built solutions.
Collaborate with IT, Security, Identity, and Network teams to ensure infrastructure is designed to meet reliability, security, and access standards.
Design and enhance the production architecture for Identity and Access Management (IAM) adjacent platforms, such as Microsoft Entra, utilizing SRE principles.
Develop common management protocols and shared resources within Azure subscriptions to ensure uniformity and policy compliance in operations.

Jan 30, 2026

Field Engineer at OpenAI | San Francisco

OpenAI

Full-time|On-site|San Francisco

OpenAI is hiring a Field Engineer in San Francisco. This position centers on bringing advanced AI solutions into practical use across a range of industries.

Role overview

Field Engineers at OpenAI work directly with clients to implement AI technologies in real-world environments. The role involves hands-on problem solving and adapting solutions to meet client needs.

Key responsibilities

Collaborate with clients to understand technical requirements and goals
Troubleshoot and resolve complex technical issues during deployment
Support the rollout of new AI systems and ensure smooth integration

Team and impact

This role works closely with a group focused on bringing transformative technologies to market. Field Engineers contribute to both client success and the broader adoption of OpenAI’s products.

Apr 29, 2026

Software Engineer, ChatGPT Infrastructure

OpenAI

Full-time|On-site|San Francisco

About the Team

At ChatGPT, we are at the forefront of innovation, continuously enhancing our system with new capabilities and adapting to ever-evolving user needs. To sustain our rapid pace of development, we require a robust infrastructure capable of managing real-world production challenges, such as high concurrency and unpredictable traffic patterns.

The mission of the ChatGPT Infrastructure team is to design and maintain the foundational platforms that facilitate swift iterations without compromising on performance or reliability. We create the shared systems, data pathways, rollout procedures, and reliability measures that enable teams to deploy changes to ChatGPT efficiently and at scale.

Our focus is on high-impact infrastructure: we develop fundamental systems and streamlined processes that leverage hard-earned operational insights, ensuring that engineers do not have to repeatedly navigate similar challenges and pitfalls as they innovate.

About the Role

We are seeking experienced Senior and Staff Software Engineers to architect and construct the underlying infrastructure that supports ChatGPT, amplifying the productivity of teams working on user experience.

This role transcends mere maintenance; it is about building platforms: you will define interfaces, develop essential abstractions, and create tools that promote safe and rapid iterations. Your contributions will lead to reduced friction, fewer regressions, enhanced performance, and systems that scale seamlessly as our product grows.

Where You Can Make a Difference

As part of our team, you may engage with one or more of the following areas:

Platform Foundations & Frameworks: Craft core libraries, service frameworks, and shared components that standardize system development and integration.
Scalability & Performance Primitives: Develop patterns and infrastructure aimed at minimizing latency, boosting throughput, and maintaining cost efficiency as demand increases.
Reliability Guardrails: Implement design mechanisms to prevent outages, including rate limiting, load shedding, and safe fallbacks.
Developer Productivity via Golden Paths: Establish streamlined workflows that make common processes fast, safe, and user-friendly.
Observability & Debugging Systems: Create instrumentation and metrics models to enhance debugging capabilities.

Feb 23, 2026

IT Solutions Engineer at OpenAI | San Francisco

OpenAI

Full-time|On-site|San Francisco

About the Team

The IT Systems Operations team is crucial in connecting Security, Engineering, and Employee Technology platforms, ensuring that employee-facing systems, identity workflows, and enterprise applications operate smoothly and reliably as the organization grows.

In addition to implementation, the team creates structured operating patterns that guarantee platform changes, access models, integrations, and lifecycle workflows evolve safely and consistently throughout the enterprise environment.

About the Role

As an IT Solutions Engineer, you will take ownership of identity-connected enterprise SaaS platforms and system change controls to uphold OpenAI’s compliance standards. Your responsibilities will include ensuring that production configurations, access models, and platform changes are compliant, auditable, and operated through defined controls. You will manage and operationalize controlled system and workflow changes across various identity platforms, SaaS applications, collaboration tools, and enterprise infrastructure.

Collaboration with Security, Platform Engineering, and IT Support Operations is essential, as you will:

Ensure consistent behavior of identity and access workflows across systems
Implement structured rollout and configuration practices for enterprise applications
Enhance visibility and traceability of system changes affecting employee workflows
Translate operational requirements into durable automation and policy-aligned implementations

Success in this position requires not only strong engineering skills but also sound operational judgment, meticulous documentation practices, and effective cross-functional collaboration.

In this role, you will:

Enterprise SaaS & Identity Platform Ownership

Manage administration and operational stewardship of enterprise SaaS and identity-connected platforms, ensuring configuration integrity, access governance, and compliance with established control requirements.
Oversee onboarding, access modification, and offboarding workflows aligned to standardized entitlement models.
Guarantee reliable propagation of lifecycle events across downstream systems and maintain an accurate, auditable access state.

Mar 6, 2026

Software Engineer, Cloud Infrastructure

OpenAI

Full-time|On-site|San Francisco

Join Our Innovative Team

The Applied Engineering team at OpenAI is dedicated to bridging the gap between research, engineering, product, and design, delivering cutting-edge AI technology to consumers and businesses alike.

As a pivotal member of our team, you will manage the core infrastructure that underpins products such as ChatGPT and our API. This includes overseeing our Kubernetes clusters, infrastructure deployment, networking stack, cloud abstractions, and more.

Our mission is to learn from our deployments and ensure the responsible and safe use of AI technology. We place a higher priority on safety than on unchecked growth.

About Your Role

As a vital contributor to the cloud infrastructure team, you'll be responsible for constructing and maintaining infrastructure abstractions that facilitate swift and scalable product delivery.

This position is based in our San Francisco, CA office.

Your Responsibilities:

Architect and develop robust development and production platforms that ensure reliability and security at scale.
Optimize our infrastructure for scalability to meet future demands.
Foster a diverse, equitable, and inclusive work culture that encourages open communication and challenges conventional thinking.
Participate in an on-call rotation to maintain the reliability of the systems we build and respond to critical incidents as necessary.

You Will Excel in This Position If You:

Possess over 5 years of experience in building core infrastructure.
Have extensive experience with orchestration systems such as Kubernetes at scale.
Are skilled in creating abstractions over cloud platforms.
Take pride in developing and managing scalable, reliable, and secure systems.
Thrive in environments characterized by ambiguity and rapid change.

This role is exclusively located at our San Francisco headquarters. We offer relocation assistance to qualified candidates.

Aug 4, 2025

Software Engineer, GPU Infrastructure - HPC

OpenAI

Full-time|On-site|San Francisco

About Our Team

Join the Fleet team at OpenAI, where we empower groundbreaking research and product innovation through our advanced computing infrastructure. We manage extensive systems across data centers, GPUs, and networking, ensuring optimal performance, high availability, and efficiency. Our work is crucial in enabling OpenAI’s models to function seamlessly at scale, supporting both our internal research endeavors and external products like ChatGPT. We are committed to prioritizing safety, reliability, and the ethical deployment of AI technology.

About the Role

As a Software Engineer on the Fleet High Performance Computing (HPC) team, you will play a vital role in ensuring the reliability and uptime of OpenAI’s compute fleet. Minimizing hardware failures is essential for smooth research training progress and uninterrupted services, as even minor hardware issues can lead to significant setbacks. With the rise of large supercomputers, the stakes in maintaining efficiency and stability have never been higher.

At the cutting edge of technology, we often lead the charge in troubleshooting complex, state-of-the-art systems at scale. This is a unique opportunity for you to engage with groundbreaking technologies and create innovative solutions that enhance the health and efficiency of our supercomputing infrastructure.

Our team fosters a culture of autonomy and ownership, enabling skilled engineers to drive meaningful change. In this role, you will focus on comprehensive system investigations and develop automated solutions to enhance our operations. We seek individuals who dive deep into challenges, conduct thorough investigations, and create scalable automation for detection and remediation.

Key Responsibilities:

Develop and maintain automation systems for provisioning and managing server fleets.
Create tools to monitor server health, performance metrics, and lifecycle events.
Collaborate effectively with teams across clusters, networking, and infrastructure.
Work closely with external operators to maintain a high level of service quality.
Identify and resolve performance bottlenecks and inefficiencies in the system.
Continuously enhance automation processes to minimize manual intervention.

You Will Excel in This Role if You Have:

Experience in managing large-scale server environments.
A blend of technical skills in systems programming and infrastructure management.
Strong problem-solving abilities and a methodical approach to troubleshooting.
Familiarity with high-performance computing technologies and tools.

Feb 5, 2026


Software Engineer, Privacy Infrastructure

OpenAI

Full-time|On-site|San Francisco

About the Team
Join OpenAI's Privacy Engineering team, where we operate at the vital crossroads of Security, Privacy, Legal, and Core Infrastructure. Our mission is to develop cutting-edge data infrastructure and systems that empower our privacy, legal, and security teams to operate securely, swiftly, and at scale. We adhere to principles of defensibility by default, enabling impactful research, and fostering a robust security culture in preparation for transformative technologies.

About the Role
We are seeking a talented Software Engineer to design and implement technical systems that facilitate legal compliance workflows, including secure data processing and document review. In this role, you will collaborate closely with Legal, Security, IT, and engineering teams to translate legal processes into actionable technical workflows. This position is perfect for an engineer passionate about large-scale data challenges and who understands the meticulousness required in ensuring compliance.

Located in the vibrant city of San Francisco, we offer relocation assistance for qualified candidates.

Key Responsibilities:
- Design and maintain scalable data storage pipelines.
- Develop search and discovery services (e.g., Spark/Databricks, index layers, metadata catalogs) tailored to partner team requirements.
- Automate secure data transfers, including encryption, checksumming, and auditing exports to reviewers.
- Establish secure compute environments that balance usability with stringent security controls.
- Implement monitoring and KPIs to ensure accountability of data holds and productions.
- Work cross-functionally to document SOPs, threat models, and chain-of-custody documentation that can withstand scrutiny.

Ideal Candidates Will:
- Possess practical experience in building or operating large-scale data-lake or backup systems (Azure, AWS, GCP).
- Be proficient with Terraform or Pulumi, CI/CD processes, and capable of converting ad-hoc legal requests into repeatable pipelines.
- Be comfortable working with discovery workflows (legal holds, enterprise document collections, secure review) or eager to quickly gain expertise.
- Effectively communicate technical concepts, from storage governance to block-ID APIs, to interdisciplinary teams such as Legal and Engineering.

Apr 24, 2025


Product Manufacturing Engineer at OpenAI | San Francisco

OpenAI

Full-time|Hybrid|San Francisco

About Our Team
Join the Compute team, where we design innovative AI supercomputers. Our work spans from workload modeling to enhancing accelerator design, with a strong emphasis on system and data center co-design. We are seeking engineers who are passionate about creating cutting-edge AI supercomputer solutions for data center applications.

You will collaborate with partners to optimize hardware for our workloads, identify promising new deep learning accelerators, and facilitate the transition of these hardware platforms from concept to production.

If you are enthusiastic about the convergence of advanced deep learning, hardware systems, and data center design, this opportunity is perfect for you!

About the Role
We are in search of a Product Manufacturing Engineer to spearhead technical initiatives linked to manufacturing. You will play a vital role in guiding products from concept to launch and through mass production, with a particular focus on PCB and PCBa manufacturing processes. This role allows you to engage with a diverse range of stakeholders, including design engineering, operations teams, TPMs, and external industry vendors, ensuring that all products meet high-quality standards and are delivered on time.

This position is based in San Francisco, CA, with a hybrid work model of three days in the office per week. We also provide relocation assistance to new employees.

In this role, you will:
- Drive manufacturing and quality initiatives to ensure product success from concept to launch.
- Lead the design and manufacturing processes for next-generation AI hardware systems, collaborating with various stakeholders to guarantee that products are developed and delivered timely and to the highest quality standards.
- Establish NPI product manufacturing processes, systems, and quality controls, defining clear milestones and deliverables while driving internal process improvements across multiple teams and functions.
- Conduct hands-on product manufacturing analysis during design, development, testing, prototypes, and production phases.
- Research automation techniques and develop new tests and systems to enhance efficiency.
- Own the product manufacturing development for hardware products from L6 up to potentially L10.

Feb 19, 2026


Senior Support Engineer at OpenAI | San Francisco

OpenAI

Full-time|Hybrid|San Francisco

About Our Team
The Technical Support team plays a pivotal role in empowering developers and enterprises to create mission-critical solutions utilizing OpenAI models. Our mission is to offer technical guidance, resolve intricate issues, and assist our customers in maximizing the value and adoption of our powerful models. We collaborate closely with Technical Success, Product, Engineering, and other departments to ensure our customers receive an unparalleled experience at scale. Adopting an automation-first approach, we leverage cutting-edge AI technologies to enhance our support operations. Join the Senior Support Engineering (SSE) team at OpenAI and contribute to revolutionizing Technical Support in the AI era.

About the Position
We are in search of a Senior Support Engineer to work alongside our strategic enterprise accounts and product teams, tackling some of the most challenging problems our customers face. As a key member of the elite technical troubleshooting team at OpenAI, you will be the go-to expert for both our customers and Engineering teams when addressing complex technical issues in our environment.

In this role, you will design and manage operational processes to monitor our top strategic customers and lead a 24/7 response team. Collaborating closely with our Infrastructure and Engineering teams, you will ensure that our customers enjoy the best possible experience at scale. Engaging directly with our most strategic customers, you will be instrumental in the success of the most innovative, disruptive, and large-scale AI solutions developed using the OpenAI API platform.

This position is characterized by low volume but high complexity.

This role is based in San Francisco, CA, with a hybrid work model requiring three days in the office each week. We also offer relocation assistance to new employees.

Key Responsibilities:
- Serve as one of the leading technical and troubleshooting experts for our API platform at OpenAI, acting as the final line of defense before escalation to the core Engineering team.
- Proactively seek and implement strategies to enhance support operations through automation and advancements in AI technologies, playing a role in shaping the future of technical support in an AI-driven landscape.
- Set up and utilize advanced monitoring and alerting workflows to detect customer-impacting issues in real time.
- Collaborate with engineering teams to contribute to reliability reviews and preparation for new features, launches, or strategic customer requirements.

Feb 19, 2026


Software Engineer, Data Infrastructure

OpenAI

Full-time|Hybrid|San Francisco

About Our Team
At OpenAI, our Data Platform team is at the heart of our innovative approaches to data management, powering essential product, research, and analytics workflows. We manage some of the largest Spark compute fleets in production, architect data lakes and metadata systems on Iceberg and Delta, and envision exabyte-scale architectures. Our high-throughput streaming platforms utilize Kafka and Flink, while our orchestration is powered by Airflow. We also support machine learning feature engineering tools such as Chronon. Our mission is to provide secure, reliable, and efficient data access at scale, thereby enhancing intelligent, AI-assisted data workflows.

Join us in building and maintaining these core platforms that are foundational to OpenAI's products, research, and analytics capabilities.

We are not just scaling infrastructure; we are transforming the way people engage with data. Our vision includes intelligent interfaces and AI-powered workflows that make data interactions faster, more reliable, and intuitive.

About the Position
In this role, you will focus on constructing and managing data infrastructure that supports extensive compute fleets and storage systems optimized for high performance and scalability. You will be instrumental in designing, developing, and operating the next generation of data infrastructure at OpenAI. Your responsibilities will encompass scaling and securing big data compute and storage platforms, building and maintaining high-throughput streaming systems, ensuring low-latency data ingestion, and facilitating secure, governed data access for machine learning and analytics. You will also prioritize reliability and performance at extreme scales.

You will have complete ownership of the full lifecycle: from architecture to implementation, production operations, and on-call responsibilities.

You should be experienced with platforms such as Spark, Kafka, Flink, Airflow, Trino, or Iceberg. Familiarity with infrastructure tools like Terraform, along with expertise in debugging large-scale distributed systems, is essential. A passion for addressing data infrastructure challenges in the AI domain is a must.

This role is based in San Francisco, CA. We offer a hybrid work model requiring 3 days in the office each week and provide relocation assistance for new hires.

Responsibilities:
- Design, build, and maintain data infrastructure systems including distributed compute, data orchestration, distributed storage, streaming infrastructure, and machine learning infrastructure, ensuring they are scalable, reliable, and secure.
- Ensure our data platform can scale significantly while maintaining reliability and efficiency.
- Enhance company productivity by empowering your fellow engineers and teammates through innovative data solutions.

Jun 27, 2024


Software Engineer, Fleet Infrastructure

OpenAI

Full-time|Hybrid|San Francisco

Join the Fleet Infrastructure team at OpenAI, where you will play a pivotal role in managing and enhancing one of the world's largest and most efficient GPU fleets, dedicated to powering OpenAI's advanced model training and deployment initiatives. Your contributions will range from:
- Developing user-friendly scheduling and quota systems to maximize GPU utilization.
- Creating automated solutions for seamless Kubernetes cluster provisioning and upgrades, ensuring a robust and low-maintenance platform.
- Building service frameworks and deployment systems that support diverse research workflows.
- Enhancing model startup times through high-performance snapshot delivery, leveraging advanced blob storage and hardware caching techniques.
- And much more!

About the Role
As a Software Engineer in Fleet Infrastructure, you will design, develop, deploy, and maintain essential infrastructure systems that facilitate model training and deployment on a massive GPU fleet. This role presents an exciting opportunity to influence a critical system that supports OpenAI's mission to responsibly advance AI capabilities, all while working in a fast-paced environment with tight deadlines.

Positioned in San Francisco, CA, we embrace a hybrid work model, encouraging three days in the office each week, along with offering relocation assistance for new hires.

In this role, you will:
- Design, implement, and manage components of our compute fleet, focusing on job scheduling, cluster management, snapshot delivery, and CI/CD systems.
- Collaborate closely with research and product teams to understand and meet workload requirements effectively.
- Work alongside hardware, infrastructure, and business teams to deliver a service characterized by high utilization and reliability.

Feb 13, 2025


Software Engineer, Caching Infrastructure

OpenAI

Full-time|On-site|San Francisco

About the Team
At OpenAI, we are on a mission to develop safe and beneficial artificial general intelligence. Our models are integrated into innovative products such as ChatGPT and various APIs. To ensure these systems are swift, reliable, and economically viable, we require top-tier infrastructure that stands out in the industry.

The Caching Infrastructure team plays a pivotal role by creating a robust caching layer that supports numerous critical applications at OpenAI. Our goal is to deliver a high-availability, multi-tenant caching platform capable of auto-scaling with workload demands, reducing tail latency, and accommodating a wide array of use cases.

We seek an experienced engineer who can design and scale this essential infrastructure. The ideal candidate will possess extensive experience in distributed caching systems (e.g., Redis, Memcached), a solid understanding of networking fundamentals, and expertise in Kubernetes-based service orchestration.

Jul 18, 2025
