About the job
Cerebras Systems is at the forefront of AI innovation, developing the world's largest AI chip, which is 56 times larger than traditional GPUs. Our unique wafer-scale architecture delivers the AI computing power of dozens of GPUs on a single chip, with the programming simplicity of a single device. This pioneering approach enables Cerebras to achieve unmatched training and inference speeds, allowing machine learning practitioners to run large ML applications effortlessly, without the cumbersome task of managing numerous GPUs or TPUs.
Cerebras' customers include leading model labs, global corporations, and innovative AI-native startups. Recently, OpenAI announced a multi-year collaboration with Cerebras to harness 750 megawatts of compute, accelerating critical workloads with ultra-high-speed inference.
Built on our groundbreaking wafer-scale architecture, Cerebras Inference is the fastest generative AI inference solution in the world, exceeding the performance of GPU-based hyperscale cloud inference services by more than 10x. This leap in speed is transforming the user experience of AI applications, enabling real-time iteration and greater intelligence through added computation.
The Role:
In this dynamic role, you will oversee the bring-up and optimization of Cerebras' Wafer Scale Engine (WSE). The ideal candidate has a strong track record of delivering end-to-end solutions and collaborating closely with teams across chip design, system performance, software development, and productization.
Responsibilities:
- Develop and debug bring-up processes for Wafer Scale Engines, integrating well-tested, deployable optimizations into production workflows to reduce time and cost.
- Refine AI systems by navigating hardware/software design constraints such as di/dt, the V-F characterization space, and current and temperature limits to improve performance.
- Build and enhance the infrastructure that supports silicon testing with real-world workloads.
- Establish self-checking metrics and instrumentation for debugging and coverage analysis.