About the Role
We are looking for a Principal Engineer, Observability to lead the architecture, development, and operation of our Observability platform. In this role, you will shape how our customers monitor, troubleshoot, and manage their AI workloads at scale on the CoreWeave platform.
You will collaborate directly with customers and work closely with engineering leaders across various teams to ensure a cohesive Observability experience across all CoreWeave products, including metrics, logs, traces, and customer-facing insights.
What You’ll Do
- Lead the Observability strategy and roadmap, aligning it with business objectives, product direction, and performance/SLA targets.
- Design and deploy low-latency, high-scale telemetry pipelines and data storage solutions that enhance observability across all CoreWeave offerings.
- Create customer-centric experiences (dashboards, alerts, and workflows) that speed up troubleshooting and provide deep insights into AI workloads and platform health.
- Promote reliability, durability, and self-healing within the Observability stack, taking ownership of key services in production to uphold high operational standards.
- Enhance customer visibility by establishing benchmarks for metrics, SLOs, and dashboards that clearly communicate system performance and reliability.
About CoreWeave
CoreWeave is at the forefront of cloud technology built specifically for AI applications. Our team is composed of industry pioneers dedicated to delivering solutions that help organizations use AI effectively. We combine robust infrastructure with expert guidance so that our clients can pursue their AI ambitions with confidence. As a publicly traded company, we continue to innovate and adapt to a fast-moving technology landscape to meet the needs of our diverse clientele.

