Senior Machine Learning Engineer - Distributed ML Systems

Pluralis Research, San Francisco
On-site Full-time

Experience Level

Senior

Qualifications

Responsibilities

Distributed Training Architecture & Optimization

- Design and build large-scale distributed training systems for heterogeneous hardware, optimizing performance over low-bandwidth, high-latency networks.
- Develop and refine parallel training techniques (data, tensor, and pipeline parallelism), using custom sharding methods to reduce communication overhead.
- Improve GPU utilization, memory efficiency, and computational performance across distributed nodes.
- Establish robust checkpointing, state synchronization, and recovery strategies for long-running, fault-prone training processes.
- Develop monitoring and metrics systems to track training progress, verify model integrity, and identify system bottlenecks.

Decentralized Networking & Resilience

- Design resilient training architectures that can withstand node failures, network partitions, and dynamic participant changes.
- Create and optimize peer-to-peer topologies for decentralized coordination among geographically dispersed nodes.
- Implement NAT traversal, peer discovery, dynamic routing, and connection lifecycle management.
- Analyze and optimize communication patterns to reduce latency and bandwidth usage in multi-participant setups.

About the job

Pluralis Research works on Protocol Learning, the collaborative training of foundation models. Our approach ensures that no single participant ever has, or can obtain, a complete version of the model. The goal is to create community-driven, collectively owned frontier models that operate on self-sustaining economic principles.

We are seeking Senior or Staff Machine Learning Engineers with more than 5 years of experience in distributed systems and large-scale machine learning training. In this role, you will design and implement a substrate for distributed training of ML models that works effectively over consumer-grade internet connections.

About Pluralis Research

Pluralis Research is dedicated to pioneering advancements in machine learning through innovative Protocol Learning techniques. Our mission is to empower communities with the tools and models to collaboratively train and own next-generation AI technologies, ensuring equitable access and sustainable economic frameworks.
