About the job
Sciforium is a pioneering AI infrastructure company dedicated to developing state-of-the-art multimodal AI models and a proprietary, high-efficiency serving platform. With substantial multi-million-dollar funding and direct collaboration with AMD engineers, our team is rapidly expanding to build the complete stack that powers cutting-edge AI models and real-time applications.
About the Role
We are looking for a talented Distributed Training Engineer to develop, optimize, and maintain the essential software stack that supports our large-scale AI training operations. In this role, you will work across the entire machine learning infrastructure, from low-level CUDA/ROCm runtimes to high-level frameworks such as JAX and PyTorch, ensuring that our distributed training systems are fast, scalable, stable, and efficient.
This role is ideal for individuals passionate about deep systems engineering, troubleshooting complex hardware-software interactions, and improving performance at every level of the machine learning stack. You will contribute directly to the training and deployment of next-generation LLMs and generative AI models.
Key Responsibilities
- Software Stack Maintenance: Manage, update, and enhance critical ML libraries and frameworks, including JAX, PyTorch, CUDA, and ROCm across various environments and hardware configurations.
- End-to-End Stack Ownership: Construct, sustain, and continually refine the entire ML software stack, from ROCm/CUDA drivers to high-level JAX/PyTorch tooling.
- Distributed Training Optimization: Ensure optimal sharding, partitioning, and configuration of all model implementations for large-scale distributed training.
- System Integration: Consistently integrate and validate modules for runtime correctness, memory efficiency, and scalability across multi-node GPU/accelerator clusters.
- Profiling & Performance Analysis: Perform detailed profiling of compilation graphs, training workloads, and runtime execution to enhance performance and eliminate bottlenecks.
- Debugging & Reliability: Diagnose intricate hardware-software interaction issues, including vLLM compilation failures on ROCm, CUDA memory leaks, distributed runtime failures, and kernel-level inconsistencies.
- Cross-Team Collaboration: Work with research, infrastructure, and kernel engineering teams to improve system throughput, stability, and developer experience.