Qualifications
Key Responsibilities:
- Design and conduct large-scale training runs on our clusters.
- Develop and enhance distributed training infrastructure across extensive multi-node systems.
- Implement post-training pipelines efficiently at scale.
- Create data pipelines that process and filter trillions of tokens for pre-training.
- Research and integrate architectural advancements, scaling laws, and training optimizations.
- Troubleshoot training instabilities, loss fluctuations, and convergence challenges in prolonged jobs.
- Develop tools for cluster utilization, fault tolerance, and checkpoint management.
- Write custom CUDA/Triton kernels to optimize essential training operations (including attention, normalization, and activations).
- Collaborate on pioneering research that pushes the boundaries of foundation model training.

Ideal Candidate:
- Proven experience in pre-training or post-training foundation models on large clusters.
- High proficiency in Python and ML frameworks (including PyTorch, JAX, and Torchtitan).
- Solid systems skills: experience with distributed training, FSDP/ZeRO, tensor parallelism, and pipeline parallelism.
- Experience in writing efficient CUDA or Triton kernels for ML workloads.
- Demonstrated history of executing stable multi-week training jobs and resolving distributed training issues.
- Understanding of cluster scheduling, networking bottlenecks, and GPU/TPU performance optimization.
About the job
Tzafon is at the forefront of machine intelligence, operating as a cutting-edge foundation model lab dedicated to building scalable computing systems. With offices in San Francisco, Zurich, and Tel Aviv, we have secured over $12 million in funding to propel our mission of expanding the boundaries of machine intelligence.
Our talented team comprises engineers and scientists with extensive expertise in ML infrastructure and research, founded by distinguished IOI and IMO medalists, PhD holders, and alumni from top tech firms such as Google DeepMind, Character, and NVIDIA. We specialize in training models and constructing infrastructure for swarms of agents to automate tasks across real-world environments.
In this role, you'll work with both our product and post-training teams to deploy Large Action Models that deliver results. Your responsibilities will include building evaluations, benchmarks, and fine-tuning pipelines, as well as defining optimal model behavior and achieving it at scale.
About Tzafon
Tzafon is a pioneering foundation model lab focused on advancing machine intelligence through scalable computing systems. Our mission is to break new ground in ML infrastructure and research, driven by a team of experts from top tech companies.