companyOpenAI logo

Reliability/DFX Engineer

OpenAISan Francisco
On-site Full-time

Clicking Apply Now takes you to AutoApply where you can tailor your resume and apply.


Unlock Your Potential

Generate Job-Optimized Resume

One Click And Our AI Optimizes Your Resume to Match The Job Description.

Is Your Resume Optimized For This Role?

Find Out If You're Highlighting The Right Skills And Fix What's Missing

Experience Level

Senior

About the job

Join Our Innovative Team
At OpenAI, our Hardware organization is pioneering cutting-edge silicon and system-level solutions tailored to meet the demands of advanced AI workloads. We pride ourselves on developing next-generation AI-native silicon while collaborating with software and research partners to create hardware that is intricately integrated with AI models. Our mission includes delivering high-performance silicon for OpenAI’s supercomputing infrastructure and designing custom tools and methodologies that accelerate innovations, specifically optimized for AI applications.

Your Role in Our Mission
We are on the lookout for a dynamic and experienced Reliability/DFX Engineer who possesses extensive knowledge in scaling machine learning systems. As an integral member of our hardware team, you will collaborate with chip design, platform design, hardware health, and the wider industry ecosystem to architect, implement, and deploy dependable next-generation AI accelerator systems. You will take a holistic approach to evaluate system and chip architecture, pinpointing high-ROI opportunities that enhance reliability and availability throughout the stack while translating these insights into actionable strategies and silicon features.

Key Responsibilities:

  • Lead the architecture, implementation, and execution of DFX strategies in silicon from concept to high-volume deployment, proposing impactful features to boost reliability and fault tolerance. Your focus will encompass design for testability, reliability, availability, and serviceability of high-performance AI hardware.

  • Develop system-level reliability models based on empirical data to guide the organization’s DFX and reliability strategy, necessitating a deep understanding of chip and system architecture, design, implementation, and component-level reliability.

  • Collaborate with chip and platform architecture/design teams to explore and implement DFX features, including the specification and integration of digital/mixed-signal IP, firmware/system software, and DFX methodologies.

  • Work alongside hardware health and platform design teams to enhance reliability and fault tolerance in New Product Introduction (NPI) and High-Volume Manufacturing (HVM), driving continuous, data-driven improvements across the stack through optimized operating conditions and data analysis.

  • Act as the DFX/reliability advocate, aligning the broader industry ecosystem with OpenAI’s strategic objectives and roadmap.

Qualifications:

  • Bachelor’s degree in Engineering or related field with 15+ years of experience, or a Master’s degree with 10+ years of relevant experience.

  • Proven expertise in DFX methodologies and reliability engineering for high-performance hardware.

  • Strong analytical and problem-solving skills, with a track record of improving system reliability and performance.

  • Excellent collaboration and communication abilities, capable of working effectively in a cross-functional team environment.

  • Familiarity with AI workloads and associated hardware requirements is highly desirable.

About OpenAI

OpenAI is at the forefront of AI research and development, committed to advancing digital intelligence in a way that is safe and beneficial for humanity. Our hardware team plays a crucial role in this mission, driving innovation in AI-native silicon and system solutions that support advanced AI workloads. We foster a collaborative environment where creative problem-solving and continuous learning are encouraged.

Similar jobs

Tailoring 0 resumes

We'll move completed jobs to Ready to Apply automatically.