Machine Learning Engineer - Decentralized ML Training Platform

Pluralis Research, San Francisco
On-site, Full-time

Qualifications

What You’ll Bring

We prefer candidates with over 5 years of experience and deep expertise in the following areas:

  • Infrastructure & Platform Engineering: Demonstrable production experience with infrastructure-as-code tools (Pulumi/Terraform/CloudFormation) for managing multi-cloud deployments, lifecycle orchestration, self-healing systems, and Docker/Kubernetes (EKS), including GPU workloads and heterogeneous clusters at scale.

  • Distributed Systems & ML Infrastructure: A thorough understanding of distributed training workflows, checkpointing, data sharding, model versioning, long-running job orchestration, and decentralized networking (P2P, NAT traversal, traffic shaping).

About the job

Pluralis Research is at the forefront of Protocol Learning, a decentralized approach to training and deploying AI models that opens access beyond well-funded corporations. By aggregating computational resources from diverse participants, we incentivize collaboration while safeguarding against centralized control of model weights, paving the way for a truly open and cooperative environment for advanced AI.

We are seeking a talented Machine Learning Training Platform Engineer to design, develop, and scale the core infrastructure that powers our decentralized ML training platform. In this role, you will own essential systems, including infrastructure orchestration, distributed computing, and service integration, that support ongoing experimentation and large-scale model training.

Responsibilities

  • Multi-Cloud Infrastructure: Create resource management systems that provision and orchestrate computing resources across AWS, GCP, and Azure using infrastructure-as-code tools like Pulumi or Terraform. Manage dynamic scaling, state synchronization, and concurrent operations across hundreds of diverse nodes.

  • Distributed Training Systems: Design fault-tolerant infrastructure for distributed machine learning, including GPU clusters, NVIDIA runtime, S3 checkpointing, large dataset management and streaming, health monitoring, and resilient retry strategies.

  • Real-World Networking: Develop systems that simulate and manage real-world network conditions (bandwidth shaping, latency injection, and packet loss) while accommodating dynamic node churn and keeping data flowing efficiently across workers with varying connectivity. This matters because our training runs on consumer nodes and non-co-located infrastructure.

About Pluralis Research

Pluralis Research is revolutionizing the AI landscape through cutting-edge innovations in decentralized model training. Our commitment to fostering collaboration and accessibility empowers individuals to contribute to frontier-scale AI developments without the constraints of corporate monopolies.
