About the job
Join Us in Building a Safer World.
At TRM Labs, we specialize in blockchain analytics and AI solutions that help law enforcement, national security agencies, financial institutions, and cryptocurrency businesses identify, investigate, and prevent crypto-related fraud and financial crime. Our platforms use blockchain intelligence and AI technology to trace funds, detect illicit activity, and construct comprehensive threat profiles. Trusted by leading organizations worldwide, TRM Labs is committed to enabling a safer and more secure environment for all.
Our AI Engineering Team builds next-generation AI applications, particularly Large Language Models (LLMs) and agentic systems. Our goal is to develop resilient pipelines and high-performance infrastructure that enable the swift, safe, and scalable deployment of AI systems.
We operate petabyte-scale pipelines and serve models at millisecond latency, with the observability and governance needed to make AI production-ready. Our team actively evaluates and integrates leading-edge tools in the LLM and agent space, including open-source stacks, vector databases, evaluation frameworks, and orchestration tools, to accelerate TRM’s pace of innovation.
As a Senior or Staff ML Systems Engineer – LLM, you will play a pivotal role in constructing and scaling our technical infrastructure for AI/ML systems. Your responsibilities will include:
Creating reusable CI/CD workflows for model training, evaluation, and deployment, integrating tools such as Langfuse and GitHub Actions along with experiment-tracking systems.
Automating model versioning, approval processes, and compliance checks across various environments.
Developing a modular and scalable AI infrastructure stack that encompasses vector databases, feature stores, model registries, and observability tools.
Collaborating with engineering and data science teams to embed AI models and agents into real-time applications and workflows.
Continuously assessing and incorporating state-of-the-art AI tools (e.g., LangChain, LlamaIndex, vLLM, MLflow, BentoML).
Promoting AI reliability and governance, ensuring compliance, security, and continuous uptime while still enabling experimentation.
Enhancing AI/ML model performance and ensuring data accuracy and consistency to improve model training and inference.
Implementing infrastructure to facilitate both offline and online evaluation of LLMs and agents.

