Principal Machine Learning Operations Engineer

Pragmatike, Cambridge
Remote, Full-time

Experience Level

Senior

Qualifications

We seek candidates with deep technical expertise and a proven track record in:

  • Designing and maintaining production ML systems at scale.
  • A strong foundation in ML Ops principles and practices.
  • Experience with distributed systems and cloud infrastructure.

About the job

Location: Cambridge, MA (Eastern Time / UTC -4). Relocation support is available. Remote work may be considered for candidates based outside Massachusetts.
Start date: ASAP
Languages: English (required)

Pragmatike is an AI startup founded by MIT CSAIL researchers, recognized by GTM Capital as a Top 10 GenAI company. The team develops advanced AI systems with a focus on real-world applications.

Role overview

The Principal Machine Learning Operations Engineer shapes the architecture and scaling of Pragmatike’s machine learning infrastructure. This senior position leads the design, implementation, and optimization of production AI systems, overseeing the full lifecycle from model training and evaluation to deployment, monitoring, and ongoing improvement.

Collaboration is central in this role. The Principal ML Ops Engineer works closely with AI researchers, GPU systems engineers, backend developers, and product teams. The main focus is building reliable, efficient, and automated ML platforms to support large-scale AI deployments.

Key responsibilities

  • Architect, build, and improve the end-to-end ML Ops pipeline, including training, fine-tuning, evaluation, rollout, and monitoring.
  • Design and maintain infrastructure for model deployment, version control, reproducibility, and orchestration across both cloud and on-premises GPU clusters.
  • Optimize distributed systems for computational efficiency, covering Kubernetes, autoscaling, caching, GPU allocation, and checkpointing workflows.
  • Establish observability for ML systems by monitoring model drift, performance, throughput, reliability, and operational costs.
  • Automate workflows for dataset curation, labeling, feature engineering, evaluation, and CI/CD for ML models.
  • Collaborate with researchers to bring models into production and improve training and inference pipelines.
  • Set internal ML Ops standards and best practices, and develop tools that support cross-team collaboration.
  • Mentor engineers and provide architectural guidance across the AI platform.

Requirements

  • Extensive hands-on experience designing and operating production ML systems at scale, ideally at the Staff or Principal level.
  • Deep knowledge of ML Ops, distributed systems, and cloud infrastructure.

About Pragmatike

Pragmatike is a pioneering AI startup that leverages advanced research from MIT CSAIL. Our commitment to innovation has positioned us among the top companies in the Generative AI sector, making us an exciting place to drive impactful change in the AI landscape.
