About the job
Location: Cambridge, MA (Eastern Time / UTC -4). Relocation support is available. Remote work may be considered for candidates based outside Massachusetts.
Start date: ASAP
Languages: English (required)
Pragmatike is an AI startup founded by MIT CSAIL researchers, recognized by GTM Capital as a Top 10 GenAI company. The team develops advanced AI systems with a focus on real-world applications.
Role overview
The Principal Machine Learning Operations Engineer shapes the architecture and scaling of Pragmatike’s machine learning infrastructure. This senior position leads the design, implementation, and optimization of production AI systems, overseeing the full lifecycle from model training and evaluation to deployment, monitoring, and ongoing improvement.
Collaboration is central to this role. The Principal ML Ops Engineer works closely with AI researchers, GPU systems engineers, backend developers, and product teams. The main focus is building reliable, efficient, and automated ML platforms to support large-scale AI deployments.
Key responsibilities
- Architect, build, and improve the end-to-end ML Ops pipeline, including training, fine-tuning, evaluation, rollout, and monitoring.
- Design and maintain infrastructure for model deployment, version control, reproducibility, and orchestration across both cloud and on-premises GPU clusters.
- Optimize distributed systems for computational efficiency, including Kubernetes, autoscaling, caching, GPU allocation, and checkpointing workflows.
- Establish observability for ML systems by monitoring model drift, performance, throughput, reliability, and operational costs.
- Automate workflows for dataset curation, labeling, feature engineering, evaluation, and CI/CD for ML models.
- Collaborate with researchers to bring models into production and improve training and inference pipelines.
- Set internal ML Ops standards and best practices, and develop tools that support cross-team collaboration.
- Mentor engineers and provide architectural guidance across the AI platform.
Requirements
- Extensive hands-on experience designing and operating production ML systems at scale, ideally at the Staff or Principal level.
- Deep knowledge of ML Ops, distributed systems, and cloud infrastructure.

