
Distributed Training Engineer - Member of Technical Staff

Liquid AI · San Francisco
On-site · Full-time


Experience Level

Entry Level

Qualifications

Must possess a strong understanding of distributed systems, experience with GPU clusters, and a passion for building scalable infrastructure. Familiarity with performance optimization and debugging in complex environments is essential.

About the job

About Liquid AI

Originating from MIT CSAIL, Liquid AI specializes in the development of general-purpose AI systems designed to operate seamlessly across various platforms, including data center accelerators and on-device hardware. Our focus is on delivering low latency, efficient memory usage, privacy, and reliability. We collaborate with organizations in diverse sectors such as consumer electronics, automotive, life sciences, and financial services. As we experience rapid growth, we seek outstanding talent to join our mission.

The Opportunity

The Training Infrastructure team is at the forefront of building the distributed systems that empower our next-generation Liquid Foundation Models. As our operations expand, we aim to innovate, implement, and enhance the infrastructure crucial for large-scale training.

This role is centered around high ownership of training systems, emphasizing runtime, performance, and reliability rather than a typical platform or SRE function. You will collaborate within a small, agile team, creating vital systems from the ground up instead of working with pre-existing infrastructure.

While San Francisco and Boston are preferred, we are open to other locations.

What We're Looking For

We are seeking an individual who:

  • Embraces the complexity of distributed systems: Our team is dedicated to maintaining stability during extensive training runs, troubleshooting training failures across GPU clusters, and enhancing overall performance.
  • Is passionate about building: We value team members who take pride in developing robust, efficient, and reliable infrastructure.
  • Excels in uncertain environments: Our systems are designed to support evolving model architectures. You will be making decisions based on incomplete information and rapidly iterating.
  • Aligns with team goals and delivers results: The best engineers on our team align with collective priorities while providing data-driven feedback when challenges arise.

The Work

  • Design and develop the core systems that keep large training runs fast and reliable.
  • Create scalable distributed training infrastructure for GPU clusters.
  • Implement and refine parallelism and sharding strategies for evolving architectures.
  • Optimize distributed efficiency through topology-aware collectives, communication/compute overlap, and straggler mitigation.
  • Develop data loading systems to eliminate I/O bottlenecks for multimodal datasets.
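
To make the communication/compute overlap bullet concrete, here is a minimal, illustrative sketch (not Liquid AI's actual stack) of the idea: while the gradient collective for one layer is in flight, backward compute for the next layer proceeds. The `all_reduce` and `backward` functions are hypothetical stand-ins for an async collective (e.g. NCCL) and per-layer backward compute.

```python
# Hedged sketch of communication/compute overlap in distributed training.
# `all_reduce` and `backward` are toy stand-ins, not real framework APIs.
from concurrent.futures import ThreadPoolExecutor
import time

def all_reduce(grad):
    # Stand-in for an asynchronous collective; pretend a 2-rank sum.
    time.sleep(0.01)  # simulated network latency
    return [g * 2 for g in grad]

def backward(layer):
    # Stand-in for per-layer backward compute, producing a toy gradient.
    time.sleep(0.01)
    return [float(layer)] * 4

def backward_with_overlap(num_layers):
    # Launch the collective for layer i on a background thread while
    # computing gradients for layer i-1, hiding communication latency.
    reduced = []
    with ThreadPoolExecutor(max_workers=1) as comm:
        pending = None
        for layer in reversed(range(num_layers)):
            grad = backward(layer)                 # compute this layer
            if pending is not None:
                reduced.append(pending.result())   # collect previous reduce
            pending = comm.submit(all_reduce, grad)  # overlap comm with next compute
        reduced.append(pending.result())
    return reduced

grads = backward_with_overlap(4)
```

Real systems (e.g. gradient bucketing in data-parallel frameworks) apply the same pattern with asynchronous collectives instead of threads.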

