
Distributed Training Engineer

Sciforium · San Francisco
On-site · Full-time




Qualifications

To be successful in this role, candidates should have a strong background in computer science or a related field, with proven experience in systems engineering and machine learning frameworks. Familiarity with CUDA and ROCm is essential, alongside a knack for troubleshooting and optimizing complex systems.

About the job

Sciforium is a pioneering AI infrastructure company dedicated to developing state-of-the-art multimodal AI models and a proprietary, high-efficiency serving platform. Backed by substantial multi-million-dollar funding and working in direct collaboration with AMD engineers, our team is rapidly expanding to build the complete stack that powers cutting-edge AI models and real-time applications.

About the Role

We are on the lookout for a talented Distributed Training Engineer to develop, optimize, and maintain the essential software stack that supports our extensive AI training operations. In this role, you will engage with the entire machine learning infrastructure, ranging from low-level CUDA/ROCm runtimes to high-level frameworks such as JAX and PyTorch, ensuring that our distributed training systems are swift, scalable, stable, and efficient.

This opportunity is perfect for individuals passionate about deep systems engineering, troubleshooting complex hardware-software interactions, and enhancing performance at every level of the machine learning stack. You will significantly contribute to the training and deployment of next-generation LLMs and generative AI models.

Key Responsibilities

  • Software Stack Maintenance: Manage, update, and enhance critical ML libraries and frameworks, including JAX, PyTorch, CUDA, and ROCm across various environments and hardware configurations.
  • End-to-End Stack Ownership: Construct, sustain, and continually refine the entire ML software stack, from ROCm/CUDA drivers to high-level JAX/PyTorch tooling.
  • Distributed Training Optimization: Ensure optimal sharding, partitioning, and configuration of all model implementations for large-scale distributed training.
  • System Integration: Consistently integrate and validate modules for runtime correctness, memory efficiency, and scalability across multi-node GPU/accelerator clusters.
  • Profiling & Performance Analysis: Perform detailed profiling of compilation graphs, training workloads, and runtime execution to enhance performance and eliminate bottlenecks.
  • Debugging & Reliability: Diagnose intricate hardware-software interaction issues, including vLLM compilation failures on ROCm, CUDA memory leaks, distributed runtime failures, and kernel-level inconsistencies.
  • Cross-Team Collaboration: Work with research, infrastructure, and kernel engineering teams to improve system throughput, stability, and developer experience.
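The sharding and partitioning responsibility above starts with simple bookkeeping: deciding how a global batch (or a parameter tensor) is split across devices before any framework-level work happens. A minimal sketch in plain Python of that arithmetic; `shard_batch` is a hypothetical helper for illustration, not part of JAX, PyTorch, or Sciforium's stack:

```python
def shard_batch(global_batch_size: int, num_devices: int) -> list[int]:
    """Split a global batch across devices as evenly as possible.

    Devices listed first absorb the remainder, so per-device sizes
    differ by at most one. Getting this split (and the matching
    gradient reduction) right is the simplest form of the partitioning
    work a data-parallel training setup requires.
    """
    if num_devices <= 0:
        raise ValueError("num_devices must be positive")
    base, remainder = divmod(global_batch_size, num_devices)
    return [base + 1 if i < remainder else base for i in range(num_devices)]

# Even split: a batch of 1024 across 8 GPUs
print(shard_batch(1024, 8))  # [128, 128, 128, 128, 128, 128, 128, 128]

# Uneven split: the first devices take one extra sample each
print(shard_batch(10, 3))  # [4, 3, 3]
```

In real distributed training the same idea extends to tensor and pipeline parallelism, where layers or weight matrices, rather than batch rows, are partitioned across the cluster.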

About Sciforium

Sciforium is at the forefront of AI infrastructure, committed to building the next generation of multimodal AI models and innovative serving platforms. Our collaboration with AMD not only brings invaluable resources but also a wealth of expertise that accelerates our growth and capability in the AI landscape.
