About Etched
Etched is building the world’s first AI inference system designed specifically for transformers, delivering over 10x higher performance with significantly lower cost and latency than traditional GPUs. Our ASIC technology makes possible products such as real-time video generation models and advanced reasoning agents with deep, parallel processing capabilities. Backed by substantial investment from leading venture capitalists and staffed by top-tier engineers, Etched is reshaping the infrastructure of the rapidly evolving AI sector.
Job Summary
We are looking for enthusiastic and talented Supercomputing Engineers (Network) to join our team. The role covers the development, qualification, and optimization of high-performance networking solutions for large-scale inference workloads. As a Pod Software Engineer, you will build and validate the software that lets Sohu inference nodes communicate across multi-rack clusters. You will work closely with the kernel, platform, and telemetry teams to maximize the efficiency of peer-to-peer RDMA communication.
Key Responsibilities
High-Performance Peer-to-Peer Networking: Design, develop, and implement RDMA-based networking solutions that enable high-bandwidth, low-latency communication across PCIe nodes, both within and between racks. The work spans operating systems, kernel drivers, embedded software, and system software.
Test Development: Create and implement tests that validate the performance of host processors (x86), NICs, top-of-rack (ToR) switches, and device network interfaces.
Burn-in Integration: Provide burn-in teams with testing frameworks that simulate real-world use cases and workloads for device-to-device networking, including extreme-load stress testing.
Performance/Health Telemetry Design: Define the key metrics that system software should collect to ensure high availability and performance under demanding communication workloads.
Representative Projects
Analyze performance deviations, refine network stack configurations, and recommend kernel tuning parameters for low-latency, high-bandwidth inference workloads.
Design and execute automated qualification tests for RDMA NICs and interconnects across a variety of server configurations.
Identify and troubleshoot network-related issues to enhance overall system performance.

