About the job
About Etched
Etched is building the world's first AI inference system designed specifically for transformers, delivering over 10x the performance of conventional systems like the B200 at significantly lower cost and latency. Etched's custom ASICs enable products, such as real-time video generation models and advanced deep reasoning agents, that are unattainable on traditional GPUs. With substantial backing from prestigious investors and a team of elite engineers, Etched is transforming the infrastructure of the industry's fastest-moving field.
Job Overview
As a key leader in our organization, you will head a team that builds a comprehensive suite of optimized kernels and deploys high-performance inference stacks for cutting-edge transformer models (e.g., Llama-3, Llama-4, DeepSeek-R1, Qwen-3, Stable Diffusion 3). You will manage and grow a high-caliber team focused on novel model mapping techniques while co-designing inference-time algorithms (e.g., speculative and parallel decoding, prefill-decode disaggregation).
Key Responsibilities
Architect Superior Inference Performance: Achieve continuous batching throughput that exceeds the B200's by at least 10x on high-priority workloads.
Create High-Performance Inference Mega Kernels: Design complex, fused kernels that maximize chip utilization and minimize inference latency, validated through benchmarking and regression testing in live production environments.
Develop Model Mapping Strategies: Implement system-level enhancements that combine tensor parallelism and expert parallelism to maximize performance.
Innovate Hardware-Software Co-design: Create and ship production-ready inference-time algorithmic improvements (e.g., speculative decoding, prefill-decode disaggregation, KV cache offloading).
Build a Scalable Team: Recruit and retain a team of exceptional inference optimization engineers.
Align Cross-Functional Performance Goals: Ensure that the inference stack and its performance targets stay aligned with the software infrastructure teams' work (e.g., runtime support, scheduling).