
High Performance Computing Software Engineer - Supercomputing

On-site, full-time


Responsibilities

- Design and implement high-performance, distributed software solutions for large-scale AI/ML training.
- Optimize low-level system components, including the Linux kernel, GPU/accelerator kernels, and interconnects.
- Develop and refine communication libraries such as NCCL, RCCL, MPI, and UCX, as well as RDMA-based systems.
- Collaborate with ML researchers and engineers to support frameworks like PyTorch, Megatron-LM, and DeepSpeed in large-scale production environments.
- Contribute to our scheduling, orchestration, and job management systems, including Slurm and Kubernetes.
- Debug and resolve complex issues across the stack, from kernel to container to model.
- Work closely with hardware vendors, upstream open-source communities, and internal teams to enhance performance and reliability.

About the job

Join Our Innovative Team at the Institute of Foundation Models
At IFM, we are pioneers in developing, understanding, and managing foundation models. Our mission is to advance research, cultivate the next generation of AI innovators, and contribute significantly to a knowledge-driven economy.
 
As a member of our team, you will work at the forefront of foundation model training, collaborating with top-tier researchers, data scientists, and engineers. Together, we will address the most significant challenges in AI development. You will play a crucial role in creating AI solutions that have the potential to transform entire industries. Your strategic and innovative problem-solving will be essential in establishing MBZUAI as a global leader in high-performance computing for deep learning, enabling discoveries that will inspire future AI pioneers.
 
The Role
 
IFM is developing the foundational compute infrastructure that will drive future breakthroughs in AI and computational science. We are seeking a High Performance Computing Software Engineer to help design, develop, and operate the software systems that manage our large-scale AI workloads.
 
In this position, you will work at the intersection of high-performance computing and machine learning. You will be part of a dedicated team building the software stack that supports training advanced ML models on more than 1,000 GPUs, while keeping our infrastructure robust, efficient, and user-friendly.

About Institute of Foundation Models

The Institute of Foundation Models is a research lab committed to building, understanding, and leveraging foundation models to enhance AI development. Our focus is on advancing research, nurturing future AI innovators, and fostering contributions to a knowledge-driven economy.
