Supercomputing Engineer (Network) at Etched | San Jose

Etched, San Jose
On-site, Full-time, $150K/yr - $275K/yr

Qualifications

  • Bachelor's degree in Computer Science, Electrical Engineering, or a related field.

  • Experience with RDMA technologies and networking protocols.

  • Proficiency in programming languages such as C, C++, or Python.

  • Strong analytical and troubleshooting skills.

  • Excellent collaboration and communication abilities.

About the job

About Etched

Etched is building the world’s first AI inference system designed specifically for transformers, with performance more than 10x that of standard models and significantly lower cost and latency than traditional GPUs. Our ASIC technology makes possible products such as real-time video generation models and advanced reasoning agents with deep and parallel processing capabilities. Backed by substantial investment from leading venture capitalists and staffed by top-tier engineers, Etched is working to transform infrastructure in the rapidly evolving AI sector.

Job Summary

We are looking for talented Supercomputing Engineers (Network) to join our team. This role involves developing, qualifying, and optimizing high-performance networking solutions for large-scale inference workloads. As a Pod Software Engineer, you will create and validate software that lets Sohu inference nodes communicate across multi-rack clusters. You will work closely with the kernel, platform, and telemetry teams to maximize the efficiency of peer-to-peer RDMA communication.

Key Responsibilities

  • High-Performance Peer-to-Peer Networking: Design, develop, and implement RDMA-based networking solutions that enable high-bandwidth, low-latency communication across PCIe nodes, both within and between racks. This work spans operating systems, kernel drivers, embedded software, and system software.

  • Test Development: Create and implement tests to validate host processors (x86), NICs, TORs, and device network interfaces for optimal performance.

  • Burn-in Integration: Provide burn-in teams with testing frameworks that simulate real-world use cases and workloads for device-to-device networking, including extreme-load stress testing.

  • Performance/Health Telemetry Design: Establish key metrics that system software should gather to ensure high availability and performance under demanding communication workloads.

Representative Projects

  • Evaluate performance deviations, refine network stack configurations, and suggest kernel tuning parameters for low-latency, high-bandwidth inference workloads.

  • Design and execute automated qualification tests for RDMA NICs and interconnects across a variety of server configurations.

  • Identify and troubleshoot network-related issues to enhance overall system performance.

About Etched

Etched is revolutionizing the AI landscape with innovative technology that enables unprecedented performance and efficiency for inference workloads. Join us and be a part of the future of AI infrastructure.