About the job
Cerebras Systems is at the forefront of AI innovation, engineering the world's largest AI chip, which is 56 times larger than traditional GPUs. Our revolutionary wafer-scale architecture delivers the computational power of dozens of GPUs on a single chip, simplifying programming and enabling users to run extensive ML applications seamlessly without managing multiple GPUs or TPUs.
We proudly serve a diverse range of customers, including leading model laboratories, global corporations, and pioneering AI startups. Recently, we established a multi-year collaboration with OpenAI that aims to scale up to 750 megawatts of compute capacity and transform key workloads with ultra-fast inference.
Built on our wafer-scale architecture, Cerebras Inference offers the world's fastest generative AI inference, running more than 10 times faster than conventional GPU-based hyperscale cloud inference services. This speed is transforming how users experience AI applications, enabling real-time iteration and boosting intelligence through advanced computation.
About the role
We are looking for a seasoned IT SRE Team Lead to establish and manage the reliability function for Cerebras' internal technology infrastructure.
As the IT SRE Team Lead, you will oversee the availability, performance, and operational quality of the systems that Cerebras employees depend on daily, including identity management, endpoint management, collaboration tools, SaaS applications, and internal networking. The ideal candidate will bring a software engineering perspective to IT operations: treating corporate infrastructure as code, defining measurable SLOs, automating remediation, and relentlessly minimizing toil.
You will build and lead a small, high-impact team of engineers responsible for developing tools, writing automation scripts, and troubleshooting issues as they arise. You will work closely with our security, networking, and infrastructure teams to ensure seamless operations.

