About the job
Cerebras Systems is at the forefront of AI technology, having developed the world's largest AI chip, 56 times the size of a traditional GPU. Our wafer-scale architecture delivers the processing power of many GPUs on a single chip, simplifying programming and improving efficiency. This lets our clients achieve unparalleled training and inference speeds, running large-scale machine learning applications without the complexity of managing fleets of GPUs or TPUs.
Cerebras serves a diverse customer base, including leading model labs, global enterprises, and pioneering AI startups. Notably, OpenAI has entered a multi-year partnership with Cerebras to harness 750 megawatts of compute power, accelerating key workloads with ultra-high-speed inference.
Our wafer-scale architecture makes Cerebras Inference the fastest generative AI inference solution available, delivering speeds more than 10 times faster than GPU-based cloud inference services. This acceleration is transforming how users interact with AI applications, enabling real-time iteration and unlocking deeper intelligence through greater computational capability.
About the Role
As a leader in large-scale AI supercomputing, Cerebras Systems deploys multi-exaflop supercomputers in some of the largest data centers worldwide. These systems are built on our Wafer-Scale Cluster technology, which links multiple Wafer Scale Engine (WSE) chips into a single system. The Cluster engineering team is responsible for delivering the end-to-end software stack for these clusters.
Responsibilities
- Automate the bare-metal configuration of networking, operating systems, and application software across large clusters of Cerebras WSEs, servers, and switches.
- Build streamlined workflows for cluster upgrades, downgrades, and security patching, with key performance metrics focused on minimizing cluster downtime.
- Develop an orchestration and scheduling system for resource allocation and job submissions in a multi-user cluster environment.
- Provide seamless support for both on-premise and cloud-based deployment and operations.
- Create a robust monitoring system that detects and remediates failures across cluster resources, ensuring high availability.