companyCognition logo

Site Reliability Engineer

CognitionSan Francisco Bay Area
On-site Full-time

Clicking Apply Now takes you to AutoApply where you can tailor your resume and apply.


Unlock Your Potential

Generate Job-Optimized Resume

One Click And Our AI Optimizes Your Resume to Match The Job Description.

Is Your Resume Optimized For This Role?

Find Out If You're Highlighting The Right Skills And Fix What's Missing

Experience Level

Entry Level

Qualifications

QualificationsWe are looking for candidates who possess a strong foundation in software engineering and system administration. Ideal candidates should have experience with:Strong understanding of SRE principles and practices. Experience with CI/CD pipelines and deployment automation. Familiarity with cloud infrastructures, preferably AWS or GCP. Proficient in monitoring tools and incident response strategies. Excellent problem-solving skills and the ability to thrive in high-pressure situations.

About the job

Join Our Team

At Cognition, we are at the forefront of applied AI innovation, developing cutting-edge software agents that redefine the engineering landscape. Our flagship products, Devin, the pioneering AI software engineer, and Windsurf, an AI-native IDE, embody our commitment to creating AI that collaborates with engineers as a true partner.

Our team is composed of elite talent including competitive programming champions, visionary founders, and researchers from top AI institutions such as Scale AI, Palantir, Cursor, Google DeepMind, and more.

Your Mission

As a Site Reliability Engineer, you will play a crucial role in ensuring the reliability of our user-focused products, which are utilized by hundreds of thousands of developers daily. Your mission is to preemptively address potential issues and swiftly resolve any incidents that may arise, maintaining a seamless experience for our users.

You will be responsible for overseeing production reliability and enhancing our platform engineering practices, encompassing SLOs, incident response, and on-call duties, alongside CI/CD pipelines, deployment infrastructure, and developer tools. At Cognition, we believe in integrating reliability into our systems rather than treating it as an afterthought, and we strive to cultivate a culture that reflects this philosophy.

Your Achievements

  • Production Reliability: Establish and manage SLOs, SLIs, and error budgets for our products. Develop robust monitoring, alerting, and observability systems to maintain a transparent view of service health.

  • Incident Management: Spearhead incident response with precision and promptness. Conduct blameless postmortems to derive actionable insights from outages, and create effective runbooks and tools to enhance on-call sustainability.

  • Platform Engineering: Oversee deployment pipelines and internal developer tools, ensuring rapid, reliable shipping of code while minimizing unnecessary toil for engineers.

  • Infrastructure as Code: Manage cloud infrastructure via code, creating reproducible, auditable environments that can scale with product demands and mitigate configuration drift.

  • Capacity Planning: Analyze growth trends, anticipate resource requirements, and ensure our infrastructure is always ahead of user demand, optimizing system performance proactively.

  • Security and Reliability: Integrate security protocols with reliability practices to create a robust framework that safeguards our infrastructure.

About Cognition

Cognition is a leading applied AI lab focused on creating innovative software solutions that empower engineers. Our team is made up of top talents in the field, committed to pioneering advancements in AI technology.

Similar jobs

Tailoring 0 resumes

We'll move completed jobs to Ready to Apply automatically.