Experience Level
Senior
Qualifications
The ideal candidate will possess:
• Extensive experience with cloud platforms such as AWS, GCP, or Azure
• Strong proficiency in programming languages like Python, Go, or Java
• In-depth knowledge of container orchestration tools like Kubernetes
• Experience in monitoring and logging tools (e.g., Prometheus, Grafana)
• Excellent problem-solving skills and a proactive mindset
About the job
ro is looking for a Senior Site Reliability Engineer based in New York, NY. This role focuses on maintaining and improving the reliability, availability, and performance of our cloud infrastructure and applications. The position supports ongoing enhancements and encourages a culture of continuous improvement across the engineering team.
About ro
At ro, we are pioneering innovative solutions in the tech industry, committed to driving excellence and creating value for our clients. Our collaborative environment empowers our engineers to push the boundaries of technology.
Similar jobs
Search for Site Reliability Engineer At Nexthink New York
NBCUniversal Media, LLC seeks a Site Reliability Engineer to support Nexthink in New York. This position centers on keeping systems stable and performing well, which helps deliver dependable digital experiences to users.
Role overview
The Site Reliability Engineer will work to maintain and enhance the reliability of key systems. Efforts in this role contribute to consistent performance and minimize disruptions for users.
What you will do
• Monitor and improve system performance
• Address reliability issues as they arise
• Support initiatives that strengthen digital user experiences
Location
This role is based in New York.
Role overview
The Site Reliability Engineer at Mistral plays a key part in keeping systems stable, available, and performing well. This position requires close collaboration with teams throughout the company to support and improve the infrastructure that powers Mistral’s services.
What you will do
• Maintain and improve system reliability and uptime
• Partner with other teams to design and build scalable infrastructure
• Implement monitoring tools, automation, and incident response processes
Location
This role is based in New York, NY.
Full-time|$100K/yr - $250K/yr|On-site|New York Office
About Kalshi
Kalshi is pioneering a new frontier in finance with its unique prediction markets platform, empowering individuals to trade on the outcomes of various events and transform any future question into a financial opportunity. We have worked diligently to legalize prediction markets in the United States, making history as the fastest-growing financial market in the country, with a diverse range of markets spanning politics, economics, finance, weather, technology, AI, culture, and beyond.
At Kalshi, we envision a future where prediction markets emerge as the largest financial marketplace globally, enabling everyone to turn their insights into financial positions.
Our Vision: To construct the largest financial market on Earth.
Our Mission: To foster greater truth in the world through the power of markets.
Our culture thrives on attracting top-tier talent, embracing hard work, and celebrating our collective journey. We are on the lookout for exceptional and driven individuals to join our compact team as we build the future of financial markets.
Your Role
As a key member of Kalshi's engineering team, you will play a crucial role in developing the next-generation financial ecosystem, akin to establishing a new NYSE or CME from the ground up. In our agile and dynamic environment, your responsibilities will quickly expand, and the impact of your work will be highly visible. Much of our infrastructure is still in its early stages, giving you the opportunity to design, own, and evolve entire systems.
Key Responsibilities
• Enhance observability, reliability, and service availability by defining and measuring critical metrics.
• Develop automation and systems to eliminate toil and lessen operational burdens.
• Work collaboratively with core infrastructure engineers to optimize cloud deployments (Docker, Terraform, Kubernetes, EC2, etc.).
• Partner with product teams to minimize service disruptions and automate incident response.
• Identify and analyze reliability issues across the stack, implementing software solutions for substantial, long-term improvements.
• Mentor engineers and cultivate a culture where reliability is a fundamental engineering principle.
• Produce high-quality, thoroughly tested code that meets both internal and external customer requirements.
• Troubleshoot complex technical challenges to enhance system usability, operability, and diagnosability.
Join Our Team as a Site Reliability Engineer
At Claylabs, our mission is to empower organizations to transform their growth ideas into reality. We believe growth is a creative endeavor rather than a mere formula. Identifying and engaging with your ideal customers requires innovative thinking and continuous experimentation.
As artificial intelligence accelerates execution and simplifies tactics, creativity remains our unique advantage. We proudly serve thousands of clients, including industry leaders such as Anthropic, Notion, Google, and Ramp, providing them with unparalleled data, insights, and AI-driven research to successfully enter the market.
In 2025, we achieved over $100 million in revenue and successfully raised a $100 million Series C at a valuation of $5 billion, supported by esteemed investors like Sequoia, CapitalG, and First Round. We also completed our second employee tender offer and launched a community equity round for our valued customers, agency partners, and club members.
Here are some highlights about our company:
• Our community consists of over 11,000 customers, 150+ integration partners, 125+ agencies, and more than 30,000 Slack members.
• We boast a unique culture that extends beyond work; our team members include DJs, activists, writers, marathoners, and more.
• All employees have the opportunity to collaborate with world-class coaches specializing in creativity, management, and other fields.
• Our operating principles, such as negative maintenance and non-attached action, guide our work. Discover more about them here.
About Radar
Radar stands at the forefront of geolocation technology, offering cutting-edge geofencing SDKs, maps APIs, and AI-driven solutions tailored for marketing, fraud detection, and operational excellence.
Why Join Radar?
• Collaborate with some of the most respected companies globally, ranging from innovative startups to established Fortune 500 giants.
• Experience significant scale with over 1 billion API calls processed daily from hundreds of millions of devices.
• Benefit from robust resources, having secured $85.5 million in funding from top-tier investors like Accel and Insight Partners.
• Thrive in a high-performance culture, surrounded by ambitious and entrepreneurial colleagues.
• Enjoy our newly relocated office in the vibrant Flatiron district of Manhattan, NYC.
• Be part of a team recognized as one of the top 10 best workplaces in NYC by Crain's.
Despite our impressive growth, we are just getting started, and we need your expertise!
About the Role
We are seeking skilled Site Reliability Engineers to enhance our production infrastructure. Radar is a high-throughput, data-intensive application that manages over 1 billion API calls per day and supports usage from over 100 million devices globally. We operate within a multi-availability zone architecture and are actively working towards expanding our deployment capabilities to a multi-region setup.
Technology Stack:
Our infrastructure is managed using Terraform, and we deploy to AWS via EKS. We utilize MongoDB on Atlas, implement CI and deployments through CircleCI, and monitor production with tools like CloudWatch, Grafana, Pingdom, and PagerDuty. DNS management is handled by CloudFlare. Most engineering team members participate in the on-call rotation. Our primary server languages include TypeScript and Rust, while our data pipelines are powered by Airflow and Scala Spark. Additionally, we proudly sponsor OpenStreetMaps, MapLibre, and OpenAddresses.
Team Dynamics:
Our engineering team comprises former technical co-founders and exceptional interns from renowned institutions like Waterloo and CMU. Engineers at Radar typically fit one of two profiles: staff-level expertise in a specific stack or multi-stack proficiency across various technologies.
Full-time|$176.8K/yr - $209.1K/yr|On-site|New York, New York
About the Role
Peloton Interactive, Inc. is committed to building a platform that matches the quality and ambition of its products. The platform supports rapid development and continuous learning, freeing engineers to deliver new features and improvements. With a strong focus on data, the team identifies where to invest effort for the greatest impact on members. The platform spans hardware, firmware, web, mobile, backend, data, messaging, content, streaming, and machine learning, serving millions of users worldwide.
The Site Reliability Engineer (SRE) will join a growing team in New York, working closely with colleagues across disciplines. The main focus: support and develop a monitorable, reliable, and highly scalable deployment platform. The team manages thousands of nodes and pods across many deployments, addressing large-scale operational challenges every day.
What You Will Do
• Implement rapid auto-scaling for live rides and major events
• Maintain infrastructure to deliver a seamless experience for members across tens of thousands of pods in multiple clusters
• Support a platform that enables machine learning and other complex workloads, helping developers move quickly
• Promote best practices for building and running reliable systems
• Act as a subject matter expert in observability and monitoring
• Advise on system design to meet reliability and capacity goals
• Automate processes, from infrastructure management to daily operations
• Lead post-mortem analysis after infrastructure incidents
• Support operational security and compliance efforts
• Identify and address potential security and reliability risks
• Work with tools such as Amazon Web Services, Chef, Python, Ubuntu, Nginx, Jenkins, and Terraform
Location
This role is based in New York, New York.
Cloaked is an innovative privacy startup committed to restoring consumer confidence in the handling of personal data. Our mission is to build an internet that prioritizes user needs, placing individual privacy and opt-in choices at the forefront. Our flagship product serves as a virtual 'cloak' for users, enabling them to navigate websites like Facebook and Amazon while controlling the sharing of their private information according to their preferences.
Join Spotify as a Senior Site Reliability Engineer, where you'll play a crucial role in maintaining the reliability and performance of our services. This position involves collaborating with cross-functional teams to enhance our infrastructure and ensure a seamless experience for our users.
As a key member of our engineering team, you will be responsible for monitoring system health, implementing automation processes, and troubleshooting issues to improve system performance. Your expertise will be instrumental in driving our mission to deliver an exceptional streaming service.
About Legora
Legora builds AI-driven solutions for the legal sector, partnering directly with legal professionals to create tools that support better insights and decision-making. Our platform is trusted by major global firms, including Cleary Gottlieb and Goodwin, and is used in over 40 countries. We focus on continuous improvement and innovation, working closely with users to ensure our technology truly meets their needs.
Site Reliability Engineer – New York City (On-site)
Legora is looking for a Site Reliability Engineer to join the founding SRE team at our New York City engineering hub. This role is based fully on-site, five days a week. The position centers on maintaining and improving the reliability and performance of our platform as we expand. Expect to work side by side with experienced engineers, focusing on production systems, observability, incident response, and automation.
What You Will Do
• Oversee and improve production services, including deployments, monitoring, and system health.
• Develop and maintain observability tools for metrics, logs, and traces, aiming for high-quality signals and minimal noise.
• Help define Service Level Indicators (SLIs) and Service Level Objectives (SLOs), and set up alerting and reliability metrics for key services.
• Participate in on-call rotations, contribute to post-incident reviews, and help implement measures to prevent future issues.
Location Requirement
This role requires working on-site at Legora’s New York City office, Monday through Friday. In-person collaboration is core to how we work and deliver results.
Full-time|$170.1K/yr - $283.6K/yr|On-site|New York, NY, United States of America
At Block, we are more than just a company; we are a collective of diverse teams united by a common mission of economic empowerment. Our foundational teams (including People, Finance, Counsel, Hardware, Information Security, and Platform Infrastructure Engineering) collaborate across various business sectors and global time zones to create inclusive policies, provide financial forecasting, deliver legal support, secure our systems, and nurture innovative initiatives. Every challenge we face opens new opportunities, and we value diverse perspectives to uncover them. We invite you to bring yours to Block.
The Role
As a vital member of our Site Reliability Engineering (SRE) team, you will take on the dual responsibility of proactively enhancing and reactively managing the reliability of Block's platform and critical infrastructure. You are driven by metrics, possess a systems-oriented mindset, and are dedicated to building distributed platforms that facilitate safe, scalable product development. You will utilize and continuously refine AI-driven tools and automation to boost observability, expedite incident detection and response, and minimize operational toil. This includes applying AI techniques to incident analysis, alert tuning, and operational workflows. Your role will also involve primary platform on-call duties (12 hours a day, one week every few weeks, depending on team size), supporting Block's most critical (Tier 0) services. In this capacity, you will lead incident command, coordinate mitigation efforts, and ensure effective escalation during high-severity incidents.
You Will
• Build and extend platforms to enhance system reliability.
• Collaborate on team objectives that prioritize reliability across the entire company.
• Standardize reliability tools across multiple platforms and departments.
• Triage, coordinate, and lead stabilization efforts for severity 0–1 incidents.
• Serve as the primary on-call engineer, maintaining clear escalation paths and demonstrating leadership during escalations.
• Drive improvements in platform-wide reliability, shared operational tools, and safe deployment patterns.
• Leverage AI-driven systems to enhance signal detection, reduce noise, and accelerate root cause analysis.
• Design and implement safe deployment strategies (including progressive delivery, automated rollback, and guardrails).
You Have
• A strong inclination towards identifying root causes in complex systems and implementing necessary fixes.
• Proven technical initiative and leadership on prior projects, particularly those focused on backend/platform.
• Experience with AI-driven tools for observability, incident analysis, or automation.
• A mindset that naturally re-evaluates existing processes to drive continual improvement.
Join Alloy as a Site Reliability Engineer and play a crucial role in ensuring the reliability, availability, and performance of our systems. You will work closely with development teams to design and implement robust infrastructure solutions that enable seamless user experiences. Your expertise will be vital in maintaining our high standards for uptime and efficiency.
Full-time|$160K/yr - $160K/yr|On-site|New York Office
As energy companies face increasing challenges from severe weather events and the need for infrastructure modernization, Treeswift is at the forefront, enabling them to transform their field operations. Our innovative solutions utilize advanced sensors deployed on backpacks and vehicles, generating vast amounts of LiDAR and imagery data. This data is processed through sophisticated AI models, providing our clients with actionable insights via our web platform.
Since initiating our first utility pilot in June 2024, we have rapidly grown, collaborating with three of the five largest utilities in the United States and continuing to expand our client base and use cases.
Our team is composed of passionate experts from top institutions and companies in robotics and software development, and we are backed by prominent investors like Penny Pritzker’s Inspired Capital. We are headquartered in lower Manhattan with an additional office in Philadelphia, and we encourage our team members, including software engineers, to engage directly with our customers at their sites.
Join us and be part of our mission to shape the future of energy management.
Full-time|$160K/yr - $180K/yr|On-site|USA - New York, NY
Gen Digital brings together brands like Norton, Avast, LifeLock, and MoneyLion to deliver cybersecurity, privacy, identity protection, and financial wellness solutions to nearly 500 million users in over 150 countries. The company is committed to helping people protect and manage their digital and financial lives, encouraging the use of AI as a collaborative tool to achieve results. Gen Digital values open discussion, experimentation, and continuous learning. The team welcomes diverse backgrounds and perspectives, emphasizing respect and support for every member. Flexible work options, generous time off, competitive pay, and comprehensive benefits are part of the company’s approach to supporting career growth.
Role overview
As a Senior Site Reliability Engineer for Engine by MoneyLion, the focus will be on scaling the platform and ensuring high standards for security, reliability, and performance. This leadership role involves partnering with top financial institutions to deliver a broad range of personalized financial products to consumers. The position centers on guiding the evolution of DevOps and SRE architecture, establishing best practices for cloud-native infrastructure, and mentoring engineers across teams. Deep technical expertise, sound architectural judgment, and effective collaboration with colleagues around the world are essential for success in this role.
Location
New York, NY, USA
As a Cloud Site Reliability Engineer, you will be responsible for deploying innovative solutions within the public cloud environment, specifically utilizing AWS services. You will create and manage configuration templates designed for scalable infrastructure, including AWS components like EFS, EC2, and RDS. Collaborating closely with the Scrum Master, you will ensure the project requirements are met within an agile development setting.
Key Responsibilities:
• Contribute to architectural design to enhance system consistency, security, maintainability, and flexibility.
• Assist architects in creating highly scalable and automated deployments for diverse applications.
• Develop configuration templates using established architectural blueprints.
• Ensure the development of robust and scalable services across public cloud platforms, including AWS and GCP.
• Monitor and assess system performance to ensure optimal operation.
Role overview
Medal seeks a Site Reliability Engineer - Infrastructure Specialist in New York City. The focus is on strengthening the company’s infrastructure and ensuring the stability of Medal’s systems. This role works within a collaborative team to design, build, and maintain the technical foundation that enables the company’s growth and efficiency.
What you will do
• Design and implement infrastructure solutions that can scale as demand increases
• Maintain and improve system reliability to help minimize downtime
• Monitor and optimize system performance to keep applications running smoothly
• Collaborate with team members to address ongoing infrastructure requirements
Join Tabs as a Staff Site Reliability Engineer to lead the charge in enhancing our systems for maximum reliability and performance. In this pivotal role, you will collaborate with cross-functional teams to design, implement, and maintain robust infrastructure solutions. You will ensure our systems are scalable, secure, and efficient, ultimately providing an unparalleled experience for our users.
Your expertise in cloud technologies and automation will be vital as you drive initiatives to improve operational efficiency and system resilience. If you are passionate about creating reliable systems and thrive in a fast-paced environment, we want to hear from you!
About WRITER
WRITER is the premier platform where leading enterprises harness the power of AI to streamline their operations. Our mission is to enhance human potential through advanced superintelligence, demonstrating its feasibility with a trustworthy AI solution that bridges IT and business teams, facilitating transformative change across organizations. WRITER’s comprehensive platform empowers hundreds of companies, including Mars, Marriott, Uber, and Vanguard, to develop and deploy AI agents tailored to their unique datasets, supported by our enterprise-grade LLMs. With a valuation of $1.9B and support from top-tier investors such as Premji Invest, Radical Ventures, and ICONIQ Growth, WRITER is quickly establishing itself as the frontrunner in the field of enterprise generative AI.
Founded in 2020, with offices in San Francisco, New York City, Austin, Chicago, and London, we are a dynamic team focused on innovation and speed. We seek intelligent, dedicated builders and innovators to join us in shaping the future of work powered by AI.
About the Role
As a Site Reliability Engineer at WRITER, you will play a critical role in ensuring the availability, performance, and reliability of our platform, which is essential for our mission to enhance human capabilities with superintelligence. Your work will directly influence every enterprise customer reliant on our AI-powered workflows. This position goes beyond routine maintenance; it involves proactively identifying and resolving intricate systemic challenges and establishing the framework necessary for our rapid growth and the evolving needs of enterprise generative AI. You will develop resilient systems, automate processes throughout the stack, and advocate for reliability best practices, directly contributing to our ambitious product roadmap and ensuring our clients have continuous access to the powerful tools they require.
This is a hybrid role based in either our New York City or London office, reporting to the Director of Engineering.
Responsibilities
• Automate operational tasks and infrastructure management by creating robust tools and platforms using languages such as Python, Go, or similar, significantly minimizing manual workload across our production environment.
• Design and implement scalable, fault-tolerant infrastructure solutions on leading public cloud platforms (AWS, GCP, Azure) to support WRITER's swiftly growing, high-traffic AI platform.
• Take ownership of the reliability, performance, and efficiency of WRITER’s core services, establishing and maintaining rigorous Service Level Objectives (SLOs) and Error Budgets.
Full-time|$165K/yr - $225K/yr|On-site|United States, New York
Dataiku is the leading platform for AI success, serving as the enterprise orchestration layer for building, deploying, and governing AI solutions. In a unified environment, teams can design and operate analytics, machine learning, and AI agents with the transparency, collaboration, and control that enterprises demand. Dataiku integrates seamlessly with various data platforms, cloud infrastructures, and AI services, enabling businesses to execute AI strategies across diverse vendor environments while maintaining centralized governance.
The world's top companies trust Dataiku to operationalize AI, transforming it into a key driver of business performance that delivers measurable value. For more insights, explore the Dataiku blog, LinkedIn, X, and YouTube.
Why Engineering at Dataiku?
Dataiku’s platform, whether deployed on-premise, in the cloud, or as SaaS, embodies our dedication to quality and innovation by connecting various data science technologies. Our technology stack reflects our commitment to integrating the best data and AI technologies, ensuring that we select tools that genuinely enhance our product. From utilizing the latest large language models (LLMs) to supporting open-source communities, you'll work with a dynamic array of technologies and contribute to the collective knowledge of global tech innovators. Discover more about engineering at Dataiku here.
Your Impact:
As a Site Reliability Engineer (SRE) with advanced networking and security skills, you will join our Cloud team focused on developing and operating the Dataiku managed offering. Your responsibilities will encompass a wide range of tasks, including architecting and maintaining robust network security measures (such as PrivateLink and IPSec), ensuring compliance with industry standards and regulations, and monitoring and deploying our cloud offerings.
You will be tasked with building and operating a reliable, secure, and cost-efficient infrastructure to support the Dataiku SaaS offerings. This role presents a unique opportunity to engage in a project central to our company’s vision, with a strong and direct impact on our operations.
Kontakt.io is revolutionizing care operations through innovative platform solutions.
Our mission is to reduce waste, enhance efficiency, and drive profitability by optimizing throughput, asset utilization, and workforce productivity. Leveraging AI, Real-Time Location Systems (RTLS), and Electronic Health Records (EHR) data, we empower self-learning agents to automate workflows, adjust in real-time, and coordinate comprehensive care delivery operations.
Efficiently deployable and scalable, our platform provides clear visibility into spaces, equipment, and personnel, effectively eliminating inefficiencies and significantly enhancing the patient experience. With a proven 10X ROI and over 20 successful use cases, Kontakt.io stands out as the preferred choice for advancing care delivery operations.
We are seeking a Lead Software Engineer - SRE who possesses a robust foundation in software engineering and a strategic mindset to enhance the reliability, scalability, and performance of our platform. This pivotal role within our Infrastructure Engineering team will be instrumental in shaping the architecture and strategic direction of our Site Reliability Engineering function.
The ideal candidate will have extensive knowledge of software engineering principles as applied to infrastructure. Rather than merely maintaining systems, you will lead the design and construction of these systems, focusing on developing automation, tooling, and resilient architectures that ensure high availability and fault tolerance across our entire AWS-based platform.
You will engage hands-on in designing robust systems, refining deployment pipelines, and enhancing incident management practices. As a technical leader, you will also mentor junior engineers, influence technical strategy, and foster a culture of accountability, ownership, and continuous improvement throughout the organization.
Aug 25, 2025