Principal Machine Learning Operations Engineer

Pragmatike, Cambridge
Remote, Full-time

Experience Level

Senior

Qualifications

We seek candidates with deep technical expertise and a proven track record of designing and maintaining production ML systems at scale, a strong foundation in ML Ops principles and practices, and experience with distributed systems and cloud infrastructure.

About the job

Location: Cambridge, MA (Eastern Time / UTC -4). Relocation support available, or remote considered for out-of-state candidates.
Start date: ASAP
Languages: English (required)

About Pragmatike

Pragmatike is a fast-growing AI startup, recognized by GTM Capital as a Top 10 GenAI company. Founded by researchers from MIT CSAIL, the team focuses on developing advanced AI systems for real-world impact.

Role Overview: Principal Machine Learning Operations Engineer

This senior role shapes the architecture and scaling of Pragmatike’s machine learning infrastructure. The Principal ML Ops Engineer leads the design, implementation, and optimization of production AI systems, managing the full lifecycle from model training and evaluation to deployment, monitoring, and ongoing improvement.

Expect close collaboration with AI researchers, GPU systems engineers, backend developers, and product teams. The work centers on building reliable, efficient, and automated ML platforms that support large-scale AI deployments.

Key Responsibilities

  • Architect, build, and improve the end-to-end ML Ops pipeline: training, fine-tuning, evaluation, rollout, and monitoring.
  • Design and maintain infrastructure for model deployment, version control, reproducibility, and orchestration across cloud and on-premises GPU clusters.
  • Optimize distributed systems for computational efficiency, including Kubernetes, autoscaling, caching, GPU allocation, and checkpointing workflows.
  • Establish observability for ML systems: monitor model drift, performance, throughput, reliability, and operational costs.
  • Automate workflows for dataset curation, labeling, feature engineering, evaluation, and CI/CD for ML models.
  • Work with researchers to bring models into production and improve training and inference pipelines.
  • Set internal ML Ops standards and best practices, and develop cross-team tools.
  • Mentor engineers and provide architectural guidance across the AI platform.

Requirements

  • Significant hands-on experience designing and operating production ML systems at scale (Staff or Principal level).
  • Deep knowledge of ML Ops, distributed systems, and cloud infrastructure.
