Experience Level
Senior Level Manager
Qualifications
We are seeking candidates with a strong background in software engineering and site reliability practices. Ideal qualifications include:
Proven experience in managing engineering teams.
Strong knowledge of cloud infrastructure, particularly AWS or similar platforms.
Experience with monitoring and observability tools.
Excellent problem-solving skills and a passion for operational excellence.
Effective communication and collaboration abilities.
About the job
About the Role
Betterment is looking for a Senior Engineering Manager to guide the Site Reliability Engineering (SRE) team at our New York City headquarters. This leader will oversee a skilled group focused on keeping services reliable, high-performing, and able to scale as we grow.
What You'll Do
Lead and support engineers dedicated to service reliability and performance.
Collaborate with teams across the company to improve infrastructure and operations.
Promote a culture that values technical excellence and continuous improvement.
Location
This position is based at Betterment HQ in New York City.
About Betterment
Betterment is a leading digital investment platform that empowers individuals to make better financial decisions. With a focus on technology and customer-centric solutions, we strive to create a more inclusive and accessible investing experience. Join our team to help redefine the future of finance.
Full-time|$180K/yr - $200K/yr|On-site|New York, New York
About Us: At Parabola, we empower teams to transform and streamline complex data workflows with ease. Our innovative workflow builder allows users to automate tasks that were previously manual, including data from PDFs, emails, and spreadsheets. Forward-thinking companies such as Brooklinen, On Running, and Flexport leverage Parabola to enhance their productivity and tackle ambitious projects. Our platform enables teams to automate processes, saving valuable time and resources, all without requiring extensive engineering support. Supported by prominent investors like OpenView Partners, Matrix Partners, and Thrive Capital, we are committed to continuous innovation and growth. About the Role: As a Senior Site Reliability Engineer, you will be integral to our dynamic team, focusing on measuring and enhancing the performance of our software systems. Your role will encompass critical aspects such as performance optimization, security compliance, and foundational architecture. With our compact team structure, your contributions will significantly impact our operations and customer satisfaction. What You Will Do: Oversee the monitoring of our core software systems, both during on-call scenarios and in routine operations, to establish and refine service-level objectives (SLOs) and agreements (SLAs). Enhance and monitor our infrastructure stack, ensuring it meets the demands of our business-critical services. Maintain a comprehensive mental and documented model of our systems to effectively assess risks, plan projects, and troubleshoot issues. Engage with the engineering team and leadership, providing insights from your expertise in site reliability and advocating for best practices. Contribute to the development of our core orchestration logic, which supports the efficient execution of thousands of workflows concurrently, utilizing our new orchestration built on Temporal. Support multiple backend engineering projects during planning and execution phases, joining as needed for hands-on development. Focus on optimizing service scalability, stability, and observability.
Role overview: Ro is looking for a Senior Site Reliability Engineer based in New York, NY. This role focuses on maintaining and improving the reliability, availability, and performance of our cloud infrastructure and applications. The position supports ongoing enhancements and encourages a culture of continuous improvement across the engineering team.
Full-time|$238.5K/yr - $293K/yr|On-site|New York, New York
ABOUT THE ROLE At Peloton, we are dedicated to crafting an unparalleled experience for our members. To ensure this, our internal systems—including Finance, HR, Supply Chain, and Legal—must operate with the same finesse as our premium fitness content. As the Senior Manager of Site Reliability Engineering (SRE) for Internal Systems, you will spearhead a team focused on the critical lifecycles of 'Order-to-Cash', 'Procure-to-Pay', and 'Record-to-Report'. You will not simply manage infrastructure; you will be the architect of our business continuity strategy. Your leadership will guide a team of skilled SREs to uphold the resilience, observability, and scalability of our global SaaS ecosystem (NetSuite, Coupa, Workday) and the supporting network infrastructure.
As a Cloud Site Reliability Engineer, you will be responsible for deploying innovative solutions within the public cloud environment, specifically utilizing AWS services. You will create and manage configuration templates designed for scalable infrastructure, including AWS components like EFS, EC2, and RDS. Collaborating closely with the Scrum Master, you will ensure the project requirements are met within an agile development setting.
Key Responsibilities:
• Contribute to architectural design to enhance system consistency, security, maintainability, and flexibility.
• Assist architects in creating highly scalable and automated deployments for diverse applications.
• Develop configuration templates using established architectural blueprints.
• Ensure the development of robust and scalable services across public cloud platforms, including AWS and GCP.
• Monitor and assess system performance to ensure optimal operation.
About Chalkboard: Chalkboard is pioneering the next generation of sports gaming. Our mission is to seamlessly merge watching and playing by transforming real-money sports gaming into a dynamic, social experience designed for fans eager to win. We are redefining how sports enthusiasts connect with the games they cherish. At our essence, we are a team of passionate, sports-loving innovators who prioritize transparency, equity, and the excitement of empowering fans to turn insights into actionable strategies. The Role: We are on the lookout for a Principal Site Reliability Engineer to join our ranks at Chalkboard, contributing to the creation of a platform that is not only reliable and scalable but also user-friendly for our development teams. In this pivotal role, you will collaborate with Engineering, Product, and Data teams, significantly impacting how millions of fans engage with sports in real time. If you thrive in a fast-paced environment, love to build robust solutions from the ground up, and aim to achieve team success rather than individual accolades, we want to hear from you! Your Game Plan: Take ownership of platform reliability from start to finish, proactively identifying and mitigating risks before they affect users. Develop and enhance observability (metrics, logs, tracing) to facilitate rapid issue detection, diagnosis, and resolution. Anticipate infrastructure needs by identifying bottlenecks and implementing sustainable architectural improvements. Minimize developer friction by refining CI/CD pipelines, deployment workflows, and internal tools. Lead incident responses and root cause analyses, focusing on systemic solutions rather than temporary fixes. Establish and uphold best practices for infrastructure, deployments, and system reliability. Create reusable, self-service infrastructure that empowers teams to deploy quickly and securely. Continuously enhance systems through automation and Infrastructure-as-Code methodologies. What You Bring to the Team: Experience with Cloud Infrastructure (preferably GCP): including networking, IAM, databases, and storage. Proficiency in Kubernetes: managing cluster operations and workloads. Skilled in Infrastructure as Code tools: Terraform, Helm. Familiarity with CI/CD practices: using GitHub Actions or similar tools. Knowledge of observability practices: metrics, logging, tracing, and alerting.
About Legora Legora develops technology for the legal sector, working directly with legal professionals to ensure practical, relevant solutions. The company’s AI-native workspace helps users streamline workflows, ask better questions, and uncover new insights. Clients include major global law firms such as Cleary Gottlieb, Goodwin, Bird & Bird, and Linklaters, spanning over 40 countries. Legora’s team values collaboration and aims to build tools that genuinely improve the way lawyers work. Role Overview: Senior Site Reliability Engineer This Senior Site Reliability Engineer role sits within Legora’s core SRE team at the New York City engineering hub. The position focuses on building and maintaining reliable services, partnering with engineering teams both locally and in Stockholm. The team’s goal: raise reliability standards across Legora’s platform and ensure smooth, dependable operations. Location requirement: This is a full-time, in-office role based in New York City. Attendance is required five days a week to support close collaboration and innovation. What You Will Do Design, deploy, and manage essential platform services, taking full responsibility for their reliability. Build and maintain observability systems (metrics, logs, traces) to generate actionable insights. Set and improve service level indicators (SLIs), service level objectives (SLOs), alerting, and reliability metrics for key systems. Refine on-call procedures and incident response, including escalation processes and post-incident analysis. Drive ongoing improvements to system reliability and performance.
Location: NYC Global HQ (Hybrid: 3 days in office) DoubleVerify delivers digital performance solutions for advertisers and agencies, enabling independent verification, campaign optimization, and measurement of business impact. Since 2008, DV has partnered with Fortune 500 brands, agencies, publishers, and digital ad platforms to bring greater transparency and improved outcomes to digital advertising. More details are available at www.doubleverify.com. Role overview The Senior Site Reliability Engineer I will focus on strengthening the reliability, scalability, and performance of DoubleVerify's digital media measurement platforms. This hybrid position is based at the NYC Global HQ, with an expectation of three days per week in the office. What you will do Enhance reliability, scalability, and performance for digital media measurement systems. Establish and refine observability practices, including setting up metrics, dashboards, and alerting to enable proactive reliability improvements. Reduce Mean Time to Recovery (MTTR) for critical incidents by automating processes, improving observability, and advancing monitoring capabilities. Lead incident response for high-severity (Sev1 and Sev2) events and drive resolutions. Maintain high availability across infrastructure and services in GCP, AWS, OCI, and on-premises environments. Guide technical projects from planning through deployment, collaborating with teams and keeping stakeholders informed. Design and deploy automation tools to reduce manual work and improve efficiency in deployment workflows, validation scripts, and self-service tooling. Utilize AI-assisted development tools for faster automation and troubleshooting. Build integrations and Monitoring Control Plane (MCP) servers to support monitoring platforms and AI-driven analysis. Apply Infrastructure-as-Code practices using Terraform, Helm charts, Python scripts, and configuration management tools for consistent, version-controlled deployments. 
Develop and maintain documentation, runbooks, and Standard Operating Procedures (SOPs) in Confluence to support consistent incident response.
Full-time|$133.1K/yr - $148K/yr|Remote|New York City, NY
Site Reliability Engineer Overview: Join Weedmaps as a Site Reliability Engineer and collaborate across departments, including application, infrastructure, and quality teams, to elevate the performance, reliability, resilience, and scalability of our web services at Weedmaps.com. As a cloud-native organization, we run 100% of our services in Docker on Kubernetes within AWS's public cloud. Our operations utilize observability, monitoring, CI/CD automation, and custom tooling, enabling us to deploy multiple production releases daily. Your daily responsibilities will focus on applying your engineering expertise to enhance system monitoring, minimize developer toil, configure CI workflows, and optimize our deployment pipelines. You will serve as a knowledge reference for development teams, ensuring they utilize consistent tools for metrics, logging, building, and deployment. Collaborating closely with both development and infrastructure teams, you will identify critical service-specific metrics that require monitoring, and you will help application development teams create libraries for seamless service instrumentation. The impact you'll make: Collaborate with stakeholders to establish and promote best practices for monitoring and CI/CD pipelines. Troubleshoot issues related to deployment within our CI pipeline. Actively promote the DevOps culture at Weedmaps. Identify opportunities for automation and advocate for the codification of processes. Promote best practices regarding collaboration, reliability, security, and performance across all partner teams. Take ownership of application configuration and scaling for specified services, ensuring adherence to organizational practices. Develop and optimize synthetic monitoring flows. What you've accomplished: A minimum of 2 years of development experience in startup or mid-sized environments. Proficiency in programming languages such as Python, Go, Node, Ruby, or Elixir. 
Knowledge of containerization technologies, particularly Docker (Kubernetes experience is a plus). Strong communication skills, a positive demeanor, and the ability to provide and receive constructive feedback. Professional experience with cloud-native observability standards including OpenMetrics, OpenTracing, and OpenCensus. Expertise in using and configuring modern CI/CD workflows. Deep understanding of SLIs, SLOs, and SLAs at both service and business levels. Familiarity with golden signals and their significance in monitoring.
Full-time|$170K/yr - $200K/yr|Hybrid|VEG Headquarters, White Plains, NY
ABOUT VETERINARY EMERGENCY GROUP Founded in 2014, the Veterinary Emergency Group (VEG) is dedicated to transforming the emergency care experience for pets and their owners. With a vision to redefine norms and improve the ER experience, we have rapidly expanded our network of hospitals that operate 24/7/365 across the nation. Our commitment to understanding the needs of pets and their families drives our continuous innovation. We prioritize not only the wellbeing of our patients but also of our team members (VEGgies), empowering them to achieve greatness and fostering a culture of growth and belonging. At VEG, we are reimagining emergency care in every aspect—from hospital operations to the support systems for our teams. Our headquarters team is pivotal in this transformation, whether it's through developing innovative technology to enhance hospital efficiency, recruiting exceptional talent, or effectively showcasing our brand through marketing. Our headquarters team ensures that our hospitals are equipped with the necessary resources to deliver outstanding care to pets and their families. VEG has been recognized as a Great Place to Work® for 2025 and 2026. THE ROLE We are seeking a Senior Site Reliability Engineer who recognizes the critical importance of reliability at VEG; our proprietary platform, DogByte, is essential to the survival of pets. As the primary architect of our platform's resilience, you will engineer our infrastructure to be self-healing, enabling our medical teams to provide life-saving care around the clock. Your role will be a blend of high-level architectural strategy and hands-on technical execution, ensuring our engineering teams can rapidly develop while maintaining a solid foundation. Your efforts will focus on evolving and enhancing existing systems to support VEG’s hospital expansion, ensuring that our infrastructure is never a limiting factor in our ability to open new hospitals or deliver medical care. 
You will take ownership of DogByte's ongoing stability, scaling it into a robust enterprise platform where individual hospital traffic is isolated to prevent impact on others. This position offers the flexibility to work at our headquarters in White Plains or remotely. KEY RESPONSIBILITIES Develop short- and long-term strategies to ensure DogByte can handle increasing volume year-over-year, particularly addressing traffic isolation between hospitals. Collaborate with engineering teams to ensure that data flows—from client to API to database—are optimized for high availability and performance.
About the Role: Join Hopper's dynamic Cloud FinOps team as a Senior Site Reliability Engineer. We oversee an extensive infrastructure within Google Cloud, empowering hundreds of engineers to deliver exceptional experiences to millions of users globally. If you are enthusiastic about automation and optimizing systems for performance and reliability, we want to hear from you. You will focus on building scalable, secure, and optimized infrastructure while solving practical problems with straightforward, cost-effective solutions. Daily Responsibilities: Engage in projects that enhance cost efficiency, such as: Minimizing network egress costs by eliminating unnecessary headers. Optimizing data storage solutions based on usage patterns, such as implementing cold storage for infrequently accessed data. Ensuring optimal autoscaling configurations for databases and compute resources. Enhance current cost attribution processes to provide transparency for all teams regarding their expenditures. Participate in incident support, including on-call rotation for platform incidents, collaborating with teams across the Americas and Europe to ensure continuous support. Contribute to a small but highly efficient team of SREs.
Role overview Medal seeks a Site Reliability Engineer - Infrastructure Specialist in New York City. The focus is on strengthening the company’s infrastructure and ensuring the stability of Medal’s systems. This role works within a collaborative team to design, build, and maintain the technical foundation that enables the company’s growth and efficiency. What you will do Design and implement infrastructure solutions that can scale as demand increases Maintain and improve system reliability to help minimize downtime Monitor and optimize system performance to keep applications running smoothly Collaborate with team members to address ongoing infrastructure requirements
Join Tabs as a Staff Site Reliability Engineer to lead the charge in enhancing our systems for maximum reliability and performance. In this pivotal role, you will collaborate with cross-functional teams to design, implement, and maintain robust infrastructure solutions. You will ensure our systems are scalable, secure, and efficient, ultimately providing an unparalleled experience for our users. Your expertise in cloud technologies and automation will be vital as you drive initiatives to improve operational efficiency and system resilience. If you are passionate about creating reliable systems and thrive in a fast-paced environment, we want to hear from you!
Join CoreWeave as a Senior Site Reliability Engineer specializing in Data Infrastructure. In this pivotal role, you will ensure the reliability and sustainability of our data systems, working closely with our development teams to optimize performance and availability. You will be instrumental in enhancing our infrastructure to support the growing needs of our clients.
Role overview The Site Reliability Engineer at Mistral plays a key part in keeping systems stable, available, and performing well. This position requires close collaboration with teams throughout the company to support and improve the infrastructure that powers Mistral’s services. What you will do Maintain and improve system reliability and uptime Partner with other teams to design and build scalable infrastructure Implement monitoring tools, automation, and incident response processes Location This role is based in New York, NY.
Join Spotify as a Senior Site Reliability Engineer, where you'll play a crucial role in maintaining the reliability and performance of our services. This position involves collaborating with cross-functional teams to enhance our infrastructure and ensure a seamless experience for our users. As a key member of our engineering team, you will be responsible for monitoring system health, implementing automation processes, and troubleshooting issues to improve system performance. Your expertise will be instrumental in driving our mission to deliver an exceptional streaming service.
Full-time|$111K/yr - $218K/yr|Hybrid|New York City
The Site Reliability Engineering team at MongoDB supports the infrastructure behind the MongoDB Atlas platform. With Atlas serving customers worldwide, the team addresses the demands of delivering fast, reliable service across multiple regions while meeting data sovereignty requirements. Role overview This Site Reliability Engineer 3 position centers on designing and maintaining scalable systems. The work involves reducing manual tasks, improving monitoring, and increasing visibility into system health. Infrastructure-as-code is a key principle, and the team invests in automation and self-healing systems to minimize disruptions. Collaboration Teamwork is essential in this role. Site Reliability Engineers regularly partner with other engineering groups, sharing responsibilities and working together to achieve common objectives. Location This role is based in New York City and follows a hybrid work schedule.
Full-time|$127K/yr - $249K/yr|Hybrid|New York City; United States
About the Atlas SRE Team The Atlas team at MongoDB, Inc. is based at our New York City headquarters, with options for hybrid work or fully remote arrangements from the Eastern or Central time zones. The group focuses on building, maintaining, and scaling the Atlas platform, which supports customers' most important workloads. Role Overview This senior-level Site Reliability Engineer (SRE) position calls for deep experience in designing and building complex systems. The role offers significant autonomy and expects ownership from start to finish. The work is hands-on and technical, with a focus on creating and improving systems that support Atlas at scale. Collaboration and Impact The SRE Atlas team works closely with multiple Atlas software engineering groups. Responsibilities include: Managing large-scale systems Developing new tools and automation Performing essential maintenance for the Atlas fleet Efforts in this role have a direct effect on the reliability and performance of Atlas for customers across the globe.
Kontakt.io is revolutionizing care operations through innovative platform solutions. Our mission is to reduce waste, enhance efficiency, and drive profitability by optimizing throughput, asset utilization, and workforce productivity. Leveraging AI, Real-Time Location Systems (RTLS), and Electronic Health Records (EHR) data, we empower self-learning agents to automate workflows, adjust in real-time, and coordinate comprehensive care delivery operations. Efficiently deployable and scalable, our platform provides clear visibility into spaces, equipment, and personnel, effectively eliminating inefficiencies and significantly enhancing the patient experience. With a proven 10X ROI and over 20 successful use cases, Kontakt.io stands out as the preferred choice for advancing care delivery operations. We are seeking a Lead Software Engineer - SRE who possesses a robust foundation in software engineering and a strategic mindset to enhance the reliability, scalability, and performance of our platform. This pivotal role within our Infrastructure Engineering team will be instrumental in shaping the architecture and strategic direction of our Site Reliability Engineering function. The ideal candidate will have extensive knowledge of software engineering principles as applied to infrastructure. Rather than merely maintaining systems, you will lead the design and construction of these systems, focusing on developing automation, tooling, and resilient architectures that ensure high availability and fault tolerance across our entire AWS-based platform. You will engage hands-on in designing robust systems, refining deployment pipelines, and enhancing incident management practices. As a technical leader, you will also mentor junior engineers, influence technical strategy, and foster a culture of accountability, ownership, and continuous improvement throughout the organization.
Full-time|$170.1K/yr - $283.6K/yr|On-site|New York, NY, United States of America
At Block, we are more than just a company; we are a collective of diverse teams united by a common mission of economic empowerment. Our foundational teams (including People, Finance, Counsel, Hardware, Information Security, and Platform Infrastructure Engineering) collaborate across various business sectors and global time zones to create inclusive policies, provide financial forecasting, deliver legal support, secure our systems, and nurture innovative initiatives. Every challenge we face opens new opportunities, and we value diverse perspectives to uncover them. We invite you to bring yours to Block. The Role As a vital member of our Site Reliability Engineering (SRE) team, you will take on the dual responsibility of proactively enhancing and reactively managing the reliability of Block's platform and critical infrastructure. You are driven by metrics, possess a systems-oriented mindset, and are dedicated to building distributed platforms that facilitate safe, scalable product development. You will utilize and continuously refine AI-driven tools and automation to boost observability, expedite incident detection and response, and minimize operational toil. This includes applying AI techniques to incident analysis, alert tuning, and operational workflows. Your role will also involve primary platform on-call duties (12 hours a day, one week every few weeks, depending on team size), supporting Block's most critical (Tier 0) services. In this capacity, you will lead incident command, coordinate mitigation efforts, and ensure effective escalation during high-severity incidents. You Will Build and extend platforms to enhance system reliability. Collaborate on team objectives that prioritize reliability across the entire company. Standardize reliability tools across multiple platforms and departments. Triage, coordinate, and lead stabilization efforts for severity 0–1 incidents. 
Serve as the primary on-call engineer, maintaining clear escalation paths and demonstrating leadership during escalations. Drive improvements in platform-wide reliability, shared operational tools, and safe deployment patterns. Leverage AI-driven systems to enhance signal detection, reduce noise, and accelerate root cause analysis. Design and implement safe deployment strategies (including progressive delivery, automated rollback, and guardrails). You Have A strong inclination towards identifying root causes in complex systems and implementing necessary fixes. Proven technical initiative and leadership on prior projects, particularly those focused on backend/platform. Experience with AI-driven tools for observability, incident analysis, or automation. A mindset that naturally re-evaluates existing processes to drive continual improvement.
Apr 9, 2026