About the job
At Chakra Labs, we are dedicated to creating innovative environments for AI agents, focusing on systems that enhance and measure their productive capabilities.
Key Responsibilities
Agent Orchestration at Scale: Manage hundreds of simultaneous agent executions, each with its own stateful environment. You'll oversee the dispatch layer, including SQS, concurrency management, and failure recovery.
Environment and Task Design: Develop realistic environments and challenging scenarios that push agents to their limits. Your role will involve constructing new evaluations and designing meaningful tasks that assess critical performance metrics.
Exploring New Frontiers: Stay on the cutting edge of agent evaluation by supporting new environment modalities and integrating external orchestration frameworks.
Observability: Instrument services with Prometheus and OpenTelemetry, create Grafana dashboards, and manage structured logging.
Qualifications
Container Orchestration: Proficient in managing Kubernetes or similar technologies in production environments, including auto-scaling, pod lifecycle management, persistent storage, and networking.
Distributed Systems: Experience in building or maintaining message-driven architectures (e.g., SQS, Kafka). You understand how to manage job flows, implement retries without duplication, and handle failures gracefully.
LLM Infrastructure: Familiarity with running LLM workloads at scale, including token instrumentation, rate limit management, prompt caching, and multi-provider routing.
Experience: Approximately 3-5 years of relevant experience, though we are open to candidates outside that range who possess the required skills and knowledge.
Why Join Us?
Unique Focus: While this role is centered on infrastructure, the workload is AI agents: you'll monitor model behavior alongside pod health and analyze token throughput alongside network performance.
Engaging Clients: Collaborate directly with AI researchers and labs, contributing to the advancement of agent capabilities and building the foundational infrastructure they rely on.
Dynamic Team Environment: Take ownership of entire systems rather than just tasks, with opportunities to impact various projects and initiatives.

