Site Reliability Engineer, Inference Infrastructure

Cohere, Toronto
On-site, Full-time

Qualifications

Ideal Candidate Profile:

  • Minimum of 5 years of engineering experience in product development or operational environments.

  • Proficiency in cloud services and container orchestration, especially Kubernetes.

  • Strong understanding of site reliability principles and practices.

  • Experience with monitoring, logging, and observability tools.

  • Excellent problem-solving skills and a collaborative mindset.

About the job

About Us:

At Cohere, we are dedicated to scaling intelligence to enhance human experience. We specialize in training and deploying cutting-edge AI models for developers and businesses, empowering them to create extraordinary applications such as content generation, semantic search, retrieval-augmented generation (RAG), and intelligent agents. We believe our innovative work is pivotal in driving the adoption of AI across various sectors.

Our team is passionate and meticulous about what we create. Every team member plays a crucial role in enhancing our models' capabilities and the value they deliver to our clients. We prioritize hard work and agility to serve our customers effectively.

Cohere comprises a diverse team of researchers, engineers, designers, and industry experts, all committed to excellence in their respective fields. We understand that a variety of perspectives is essential for developing outstanding products.

Join us in our mission to shape the future of AI!

Why This Position Matters:

If you thrive on building high-performance, scalable, and reliable machine learning systems, and you are excited about defining the future of AI platforms that power advanced NLP applications, we want you on our Model Serving team at Cohere. As a Site Reliability Engineer, you will be instrumental in developing, deploying, and managing our AI platform, which delivers Cohere's large language models via user-friendly API endpoints. You will collaborate with multiple teams to deploy optimized NLP models in environments characterized by low latency, high throughput, and high availability. This role also offers the chance to engage with customers and create tailored deployments that address their unique requirements.

Your Responsibilities:

  • Design and build self-service systems that streamline the management, deployment, and operation of services.

  • Develop custom Kubernetes operators that facilitate language model deployments.

  • Automate observability and resilience within the environment, empowering developers to troubleshoot and resolve issues efficiently.

  • Ensure adherence to defined Service Level Objectives (SLOs), which includes participating in an on-call rotation.

  • Foster strong relationships with internal developers and help guide the Infrastructure team’s roadmap based on their feedback.

  • Contribute to the development of our team through knowledge sharing and an active review process.

About Cohere

Cohere is a pioneering organization that focuses on advancing AI technology for developers and enterprises. Our commitment to building transformative AI solutions is driven by a team that values creativity, innovation, and a diverse range of perspectives. We believe in the power of collaboration and continuous improvement to achieve our mission of making AI accessible and impactful for all.
