About Etched
Etched is at the forefront of innovation, creating the world’s first AI inference system specifically designed for transformers. Our technology delivers over 10x the performance and significantly reduces cost and latency compared to traditional systems like the B200. With our advanced ASICs, we empower the development of groundbreaking products including real-time video generation models and highly sophisticated chain-of-thought reasoning agents. Supported by substantial investment from top-tier VCs and staffed by a team of elite engineers, Etched is reshaping the infrastructure for the fastest-growing industry in history.
Job Summary
We are looking for a driven, detail-oriented Supercomputing Engineer (Test) to join our team. This position is central to maintaining the reliability and stability of our high-performance inference server hardware and software. In this role, you will design, develop, and execute comprehensive burn-in test suites, analyze test results, and collaborate closely with hardware and software engineering teams at Etched and at our ODM partners to quickly identify and fix issues. You will play a vital role in ensuring that our server products meet the highest quality standards before they reach our customers.
Key Responsibilities
- Test Development: Design, develop, and implement automated burn-in test suites using common scripting languages (Python, Go, Bash) and testing frameworks, covering all facets of system operation, including boot sequences, root-of-trust, system management, workload deployment, and performance.
- Test Execution: Conduct burn-in tests on server hardware, monitor system performance and health, and interpret test results.
- Failure Analysis: Investigate and troubleshoot hardware and software failures uncovered during testing, delivering detailed reports and mitigation strategies.
- Collaboration: Engage with both internal and external hardware and software engineering teams to pinpoint root causes of failures and implement corrective measures.
- Test Infrastructure: Help build and maintain the burn-in testing infrastructure, including portable test environments and automation tools that can run in any setting.
- Documentation: Generate and maintain thorough documentation for test plans, test cases, and results.
- Performance Analysis: Evaluate system performance metrics to identify areas for enhancement.

