About the job
Location: Cambridge, MA (Eastern Time; UTC-4 during daylight saving time, UTC-5 otherwise). Relocation support is available, and remote work is considered for out-of-state candidates.
Start date: ASAP
Languages: English (required)
About Pragmatike
Pragmatike is a fast-growing AI startup, recognized by GTM Capital as a Top 10 GenAI company. Founded by researchers from MIT CSAIL, the team focuses on developing advanced AI systems for real-world impact.
Role Overview: Principal Machine Learning Operations Engineer
This senior role shapes the architecture and scaling of Pragmatike’s machine learning infrastructure. The Principal ML Ops Engineer leads the design, implementation, and optimization of production AI systems, managing the full lifecycle from model training and evaluation to deployment, monitoring, and ongoing improvement.
Expect close collaboration with AI researchers, GPU systems engineers, backend developers, and product teams. The work centers on building reliable, efficient, and automated ML platforms that support large-scale AI deployments.
Key Responsibilities
- Architect, build, and improve the end-to-end ML Ops pipeline: training, fine-tuning, evaluation, rollout, and monitoring.
- Design and maintain infrastructure for model deployment, version control, reproducibility, and orchestration across cloud and on-premises GPU clusters.
- Optimize distributed systems for computational efficiency, including Kubernetes, autoscaling, caching, GPU allocation, and checkpointing workflows.
- Establish observability for ML systems: monitor model drift, performance, throughput, reliability, and operational costs.
- Automate workflows for dataset curation, labeling, feature engineering, evaluation, and CI/CD for ML models.
- Work with researchers to bring models into production and improve training and inference pipelines.
- Set internal ML Ops standards and best practices, and develop cross-team tools.
- Mentor engineers and provide architectural guidance across the AI platform.
Requirements
- Significant hands-on experience designing and operating production ML systems at scale (Staff or Principal level).
- Deep knowledge of ML Ops, distributed systems, and cloud infrastructure.

