About Us:
At Cohere, we are dedicated to scaling intelligence to enhance human experience. We specialize in training and deploying cutting-edge AI models for developers and businesses, empowering them to create extraordinary applications such as content generation, semantic search, retrieval-augmented generation (RAG), and intelligent agents. We believe our innovative work is pivotal in driving the adoption of AI across various sectors.
Our team is passionate and meticulous about what we create. Every team member plays a crucial role in enhancing our models' capabilities and the value they deliver to our clients. We prioritize hard work and agility to serve our customers effectively.
Cohere comprises a diverse team of researchers, engineers, designers, and industry experts, all committed to excellence in their respective fields. We understand that a variety of perspectives is essential for developing outstanding products.
Join us in our mission to shape the future of AI!
Why This Position Matters:
If you thrive on building high-performance, scalable, and reliable machine learning systems, and you are excited about defining the future of AI platforms that power advanced NLP applications, we want you on our Model Serving team at Cohere. As a Site Reliability Engineer, you will be instrumental in developing, deploying, and managing our AI platform, which delivers Cohere's large language models via user-friendly API endpoints. You will collaborate with multiple teams to deploy optimized NLP models in environments that demand low latency, high throughput, and high availability. This role also offers the chance to engage with customers and create tailored deployments that address their unique requirements.
Your Responsibilities:
Design and build self-service systems that streamline the management, deployment, and operation of services.
Develop custom Kubernetes operators that facilitate language model deployments.
Automate observability and resilience within the environment, empowering developers to troubleshoot and resolve issues efficiently.
Uphold defined Service Level Objectives (SLOs), including by participating in an on-call rotation.
Foster strong relationships with internal developers and help guide the Infrastructure team’s roadmap based on their feedback.
Contribute to the development of our team through knowledge sharing and an active review process.

