
Software Engineer - Kernel Reliability

Cerebras Systems
Sunnyvale, CA or Toronto, Canada
On-site, Full-time

Experience Level

Entry Level

Qualifications

Responsibilities

Contribute to the technical roadmap and execution for kernel-centric reliability of our internal and customer-facing systems. Collaborate with System and Cluster Operations teams to minimize system and service downtime post-failure through tooling, analysis, and hands-on debugging support. Enhance debug tools in partnership with the Debug Team to expedite failure analysis... (additional responsibilities)

About the job

Cerebras Systems is revolutionizing the AI landscape with the world's largest AI chip, which is 56 times larger than traditional GPUs. Our innovative wafer-scale architecture delivers the computational power of multiple GPUs on a single chip, simplifying programming and enabling unparalleled training and inference speeds. This technology allows our users to run extensive machine learning applications seamlessly, eliminating the complexities associated with managing numerous GPUs or TPUs.

Our clientele includes leading model labs, global corporations, and pioneering AI startups. Recently, OpenAI announced a multi-year collaboration with Cerebras, aiming to deploy 750 megawatts of power, significantly enhancing their workloads with ultra-fast inference capabilities.

With our groundbreaking wafer-scale architecture, Cerebras Inference provides the fastest Generative AI inference solution globally, running more than ten times faster than GPU-based hyperscale cloud services. This speed is transforming user experiences in AI applications, facilitating real-time iteration and amplifying intelligence through advanced computational capabilities.

About The Role

We are in search of a highly technical and hands-on Software Engineer to join our Kernel Reliability team. In this pivotal role, you will address the crucial task of enhancing the reliability of our advanced compute clusters, along with the inference, training, and internal production services. You will work closely with the code to develop solutions that scale alongside our rapidly evolving production systems and software services. If you possess strong foundations in systems, debugging, and failure analysis and have a passion for creating tools and solving complex reliability challenges, we would love to connect with you. New graduates are encouraged to apply.

About Cerebras Systems

Cerebras Systems is at the forefront of AI technology, known for developing the world's largest AI chip. Our innovative approach and cutting-edge technology enable unparalleled computational power, making us a leader in the AI industry. We serve a variety of clients, from top-tier model labs to dynamic startups, and are committed to advancing AI capabilities globally.
