About the job
Join Us in Building a Safer World
At TRM Labs, we specialize in blockchain analytics and AI solutions designed to empower law enforcement, national security agencies, financial institutions, and cryptocurrency businesses to combat fraud and financial crime. Our platforms harness blockchain intelligence and AI to trace financial flows, identify suspicious activity, and provide comprehensive threat assessments. Trusted by leading organizations globally, TRM is dedicated to creating a safer and more secure environment for all.
Our AI Engineering Team focuses on pioneering next-generation AI applications, particularly in the realm of Large Language Models (LLMs) and agentic systems. Our goal is to develop resilient pipelines and high-performance infrastructures that facilitate the swift, safe, and scalable deployment of AI systems.
We manage vast data pipelines, ensure rapid model serving, and maintain the observability and governance essential for making AI production-ready. Our team is actively engaged in evaluating and integrating state-of-the-art tools within the LLM and agent ecosystem, including open-source frameworks, vector databases, and orchestration tools that enhance TRM’s innovative capabilities.
As a Staff MLOps Engineer concentrating on LLMOps, you will play a pivotal role in constructing and scaling the technical infrastructure for our AI/ML systems. Your responsibilities will include:
- Developing reusable CI/CD workflows for model training, evaluation, and deployment, incorporating tools such as Langfuse and GitHub Actions alongside experiment tracking.
- Automating model versioning, approval workflows, and compliance checks across various environments.
- Building modular and scalable AI infrastructure stacks, including vector databases, feature stores, model registries, and observability tools.
- Collaborating with engineering and data science teams to integrate AI models and agents into real-time applications and workflows.
- Continuously assessing and adopting cutting-edge AI tools (e.g., LangChain, LlamaIndex, vLLM, MLflow, BentoML).
- Enhancing AI reliability and governance to enable experimentation while ensuring compliance, security, and operational uptime.
- Improving AI/ML model performance by ensuring data accuracy, consistency, and reliability for model training and inference.
- Deploying infrastructure to support both offline and online evaluations of LLMs.
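To give candidates a concrete sense of the last responsibility above, here is a minimal sketch of an offline LLM evaluation harness. Everything in it (the function names, the exact-match scoring rule, the stub model) is hypothetical and illustrative, not TRM's actual pipeline: an offline eval replays a frozen "golden" dataset through the model and aggregates a deterministic score, whereas an online eval samples live production traffic.

```python
# Minimal offline LLM evaluation harness (hypothetical sketch, not TRM's pipeline).
# Offline evals replay a fixed golden dataset; online evals sample live traffic.

def exact_match(prediction: str, reference: str) -> bool:
    """Score a single example with normalized exact match."""
    return prediction.strip().lower() == reference.strip().lower()

def run_offline_eval(model_fn, golden_set):
    """Replay every (prompt, reference) pair through model_fn; return accuracy."""
    results = [exact_match(model_fn(prompt), ref) for prompt, ref in golden_set]
    return sum(results) / len(results)

if __name__ == "__main__":
    # Stub standing in for a real LLM endpoint.
    canned = {"2+2?": "4", "capital of France?": "Paris"}
    model_fn = lambda prompt: canned.get(prompt, "")
    golden = [("2+2?", "4"), ("capital of France?", "paris"), ("1+1?", "2")]
    print(f"accuracy={run_offline_eval(model_fn, golden):.2f}")  # 2 of 3 correct
```

In a production setting the stub would be replaced by a model-serving client, scores would be logged to an observability tool such as Langfuse, and the same harness could gate deployments from CI.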

