Engineering Manager, Kernel Reliability

Cerebras Systems
Sunnyvale, CA or Toronto, Canada
On-site, Full-time

Experience Level

Manager

Qualifications

  • Proven experience in software or hardware reliability, including diagnostic tool building and failure analysis.
  • Strong leadership skills with the ability to guide and mentor a technical team.
  • Deep understanding of system architecture and performance optimization.
  • Excellent problem-solving abilities and a hands-on approach to technology.

About the job

Cerebras Systems is at the forefront of AI technology, having developed the world's largest AI chip, which is 56 times larger than traditional GPUs. Our innovative wafer-scale architecture delivers the AI computing power equivalent to dozens of GPUs on a single chip, simplifying programming to a single device. This revolutionary design enables Cerebras to provide unmatched training and inference speeds, empowering machine learning practitioners to seamlessly execute large-scale ML applications without the complexities of managing multiple GPUs or TPUs.

Our clientele includes elite model labs, global corporations, and pioneering AI-native startups. Notably, OpenAI recently entered into a multi-year partnership with Cerebras to deploy 750 megawatts of compute capacity, accelerating key workloads with ultra-high-speed inference.

Thanks to our groundbreaking wafer-scale architecture, Cerebras Inference offers the fastest Generative AI inference solution globally, achieving speeds over 10 times faster than GPU-based hyperscale cloud inference services. This substantial speed boost is transforming user experiences in AI applications by enabling real-time iterations and enhancing intelligence through additional agentic computation.

The Role

We are seeking a highly technical and hands-on Engineering Manager to lead our on-field Kernel Reliability team. You will guide a high-performing team in addressing a critical challenge: enhancing the reliability of our advanced compute clusters along with the associated inference, training, and internal production services. In this influential role, you will define the technical vision while remaining closely engaged with the code, crafting scalable solutions for our rapidly expanding system production and software service offerings. If you possess proven expertise in software or hardware reliability, diagnostic tool development, or failure analysis and debugging, we invite you to connect with us.

Responsibilities

  • Provide hands-on technical leadership, owning the technical vision and roadmap for kernel-centric reliability concerning both internal and customer-facing systems.

About Cerebras Systems

Cerebras Systems is a pioneering technology company that has developed the world's largest AI chip, leading the industry in AI compute power and performance. With a focus on innovative solutions for machine learning and AI applications, Cerebras is transforming the landscape of AI technology.
