About the job
Join our innovative team at Pinely as a Machine Learning Performance Engineer. We are on a mission to accelerate large-scale model training by optimizing our internal infrastructure and compute stack. In this pivotal role, you will work across the entire training pipeline, from GPU kernels to system-wide throughput, using profiling, CUDA-level tuning, and advanced distributed systems techniques. Your contributions will be vital to reducing training time, speeding up iteration, and maximizing computational efficiency.
As a key member of our growing team, you will help cultivate deep technical expertise in ML training systems.
Responsibilities:
- Enhance our model training pipeline to increase speed and reliability, facilitating quicker and more effective experimentation.
- Apply GPU optimization techniques with tools such as JAX, Triton, and low-level CUDA to improve training performance and efficiency at scale.
- Diagnose and rectify performance bottlenecks throughout the ML pipeline—from data loading and preprocessing to CUDA kernels.
- Develop tools and expand our internal infrastructure to enable scalable, reproducible, and high-performance training workflows.
- Guide and mentor engineers and researchers in implementing performance best practices across the team.
- Help build the team's GPU and systems-level expertise, contributing to a culture of engineering excellence and rapid experimentation.
Requirements:
- Proven experience optimizing neural network training in production or large-scale research environments, such as reducing training time, enhancing hardware utilization, or expediting feedback cycles for ML researchers.
- Extensive hands-on experience with ML frameworks like PyTorch or JAX.
- Practical experience training and optimizing deep learning architectures, including LSTM and Transformer-based models with various attention mechanisms.
- Familiarity with CUDA, Triton, or other low-level GPU technologies for performance tuning.
- Expertise in profiling and debugging training pipelines using tools like Nsight, cProfile, cuda-gdb, or the PyTorch profiler.
- Understanding of distributed training concepts, including data, model, tensor, sequence, pipeline, and context parallelism, as well as memory-compute trade-offs.
- A collaborative and proactive approach, coupled with strong communication skills and the ability to mentor team members effectively.
- Strong proficiency in Python for developing infrastructure-level tools, debugging training systems, and integrating with ML frameworks and profiling tools.
What We Offer:
- Competitive salary and comprehensive social benefits.
- Attractive bonus structure; we are flexible in discussions regarding salary and employment conditions.
- Access to state-of-the-art hardware and software in production, alongside a highly skilled technical team.

