Qualifications
Responsibilities
Distributed Training Architecture & Optimization
Design and implement large-scale distributed training systems tailored for diverse hardware, optimizing performance over low-bandwidth, high-latency networks.
Innovate and refine parallel training techniques (data, tensor, and pipeline parallelism), using custom sharding methods to reduce communication overhead.
Enhance GPU utilization, memory efficiency, and computational performance across distributed nodes.
Establish robust checkpointing, state synchronization, and recovery strategies for long-duration, fault-prone training processes.
Develop monitoring and metrics systems to assess training progress, verify model integrity, and identify system bottlenecks.
Decentralized Networking & Resilience
Design resilient training architectures that can withstand node failures, network partitions, and dynamic participant changes.
Create and optimize peer-to-peer topologies for decentralized coordination among geographically dispersed nodes.
Implement NAT traversal, peer discovery, dynamic routing, and connection lifecycle management.
Analyze and optimize communication patterns to reduce latency and bandwidth usage in multi-participant setups.
About the job
Pluralis Research is at the forefront of innovation in Protocol Learning, specializing in the collaborative training of foundational models. Our approach ensures that no single participant ever has or can obtain a complete version of the model. This initiative aims to create community-driven, collectively owned frontier models that operate on self-sustaining economic principles.
We are seeking experienced Senior or Staff Machine Learning Engineers with over 5 years of experience in distributed systems and large-scale machine learning training. In this role, you will design and implement a groundbreaking substrate for distributed training of ML models that functions effectively over consumer-grade internet connections.
About Pluralis Research
Pluralis Research is dedicated to pioneering advancements in machine learning through innovative Protocol Learning techniques. Our mission is to empower communities with the tools and models to collaboratively train and own next-generation AI technologies, ensuring equitable access and sustainable economic frameworks.