About the job
About Etched
Etched is pioneering the development of the world's first AI inference system explicitly engineered for transformers, providing over 10x greater performance while significantly reducing costs and latency compared to traditional solutions. Our Etched ASICs enable the creation of products that were previously unattainable with GPUs, such as real-time video generation models and deeply parallel chain-of-thought reasoning agents. With substantial backing from leading investors and a team of top-tier engineers, Etched is reshaping the infrastructure for one of the fastest-growing industries in history.
Job Summary
We invite you to join our team as a Software Engineer - Performance Tools. In this role, you will lead the development of a performance analysis tool built specifically for Sohu, our transformer accelerator. You will build the tools that help our ML engineers and clients understand workload behavior, pinpoint performance bottlenecks, and get the full benefit of Sohu when accelerating the most demanding ML applications. This is an opportunity to shape performance analysis for new hardware from its inception.
Key Responsibilities
Tool Architecture & Design: Lead the architecture and design of a robust performance analysis suite, incorporating data collection mechanisms, processing pipelines, analysis engines, and user interfaces (CLI and/or GUI).
Low-Level Data Collection: Create reliable methods to gather performance data directly from our custom ML accelerator hardware (e.g., hardware performance counters, execution unit status, memory access patterns) through driver interfaces or other means.
Host & System Tracing: Implement tracing for host-side API calls (runtime libraries, driver communication) and system-level events (CPU activity, PCIe traffic, memory usage, network contention) associated with Sohu workloads.
Data Correlation & Synchronization: Develop and implement methods for accurately correlating performance events across host CPUs, device drivers, PCIe buses, multiple accelerators, and multiple hosts, with precise time synchronization between them.
Performance Analysis Engine: Build analysis modules that automatically interpret the collected trace and counter data and identify the limiting factor of a workload (e.g., compute-bound, memory-bandwidth-bound, latency-bound, PCIe-bound).

