About the Role
Peloton Interactive, Inc. is committed to building a platform that matches the quality and ambition of its products. The platform supports rapid development and continuous learning, freeing engineers to deliver new features and improvements. With a strong focus on data, the team identifies where to invest effort for the greatest impact on members. The platform spans hardware, firmware, web, mobile, backend, data, messaging, content, streaming, and machine learning, serving millions of users worldwide.
The Site Reliability Engineer (SRE) will join a growing team in New York, working closely with colleagues across disciplines. The main focus is to support and develop a monitorable, reliable, and highly scalable deployment platform. The team manages thousands of nodes and pods across many deployments, addressing large-scale operational challenges every day.
What You Will Do
- Implement rapid auto-scaling for live rides and major events
- Maintain infrastructure to deliver a seamless experience for members across tens of thousands of pods in multiple clusters
- Support a platform that enables machine learning and other complex workloads, helping developers move quickly
- Promote best practices for building and running reliable systems
- Act as a subject matter expert in observability and monitoring
- Advise on system design to meet reliability and capacity goals
- Automate processes, from infrastructure management to daily operations
- Lead post-mortem analysis after infrastructure incidents
- Support operational security and compliance efforts
- Identify and address potential security and reliability risks
- Work with tools such as Amazon Web Services, Chef, Python, Ubuntu, Nginx, Jenkins, and Terraform
Location
This role is based in New York, New York.

