Senior Site Reliability Engineer At Perchwell Remote Or Nyc Hybrid jobs in New York City – Browse 2,190 openings on RoboApply Jobs
Open roles matching “Senior Site Reliability Engineer At Perchwell Remote Or Nyc Hybrid” with location signals for New York City. 2,190 active listings on RoboApply Jobs.
Senior Site Reliability Engineer at Perchwell | Remote or NYC - Hybrid
Experience Level
Senior
Qualifications
The ideal candidate will possess a strong background in cloud services, automation, and system monitoring. Key qualifications include: expertise in cloud platforms (AWS, GCP, or Azure); proficiency in scripting languages (Python, Bash, etc.); experience with containerization and orchestration tools (Docker, Kubernetes); strong troubleshooting and analytical skills; and excellent communication and collaboration capabilities.
About the job
Join Perchwell as a Senior Site Reliability Engineer, where you'll play a crucial role in maintaining and enhancing our scalable infrastructure. In this position, you will collaborate with cross-functional teams to ensure the reliability and performance of our systems. Your expertise will help drive our commitment to delivering exceptional service and innovative solutions to our clients.
About Perchwell
Perchwell is a cutting-edge technology company revolutionizing the real estate industry through innovative solutions and data-driven insights. We foster a dynamic and inclusive work environment where creativity and collaboration thrive.
About Legora Legora develops technology for the legal sector, working directly with legal professionals to ensure practical, relevant solutions. The company’s AI-native workspace helps users streamline workflows, ask better questions, and uncover new insights. Clients include major global law firms such as Cleary Gottlieb, Goodwin, Bird & Bird, and Linklaters, spanning over 40 countries. Legora’s team values collaboration and aims to build tools that genuinely improve the way lawyers work. Role Overview: Senior Site Reliability Engineer This Senior Site Reliability Engineer role sits within Legora’s core SRE team at the New York City engineering hub. The position focuses on building and maintaining reliable services, partnering with engineering teams both locally and in Stockholm. The team’s goal: raise reliability standards across Legora’s platform and ensure smooth, dependable operations. Location requirement: This is a full-time, in-office role based in New York City. Attendance is required five days a week to support close collaboration and innovation. What You Will Do Design, deploy, and manage essential platform services, taking full responsibility for their reliability. Build and maintain observability systems (metrics, logs, traces) to generate actionable insights. Set and improve service level indicators (SLIs), service level objectives (SLOs), alerting, and reliability metrics for key systems. Refine on-call procedures and incident response, including escalation processes and post-incident analysis. Drive ongoing improvements to system reliability and performance.
Full-time|$133.1K/yr - $148K/yr|Remote|New York City, NY
Site Reliability Engineer Overview: Join Weedmaps as a Site Reliability Engineer and collaborate across departments, including application, infrastructure, and quality teams, to elevate the performance, reliability, resilience, and scalability of our web services at Weedmaps.com. As a cloud-native organization, we run 100% of our services in Docker on Kubernetes within AWS's public cloud. Our operations utilize observability, monitoring, CI/CD automation, and custom tooling, enabling us to deploy multiple production releases daily. Your daily responsibilities will focus on applying your engineering expertise to enhance system monitoring, minimize developer toil, configure CI workflows, and optimize our deployment pipelines. You will serve as a knowledge reference for development teams, ensuring they utilize consistent tools for metrics, logging, building, and deployment. Collaborating closely with both development and infrastructure teams, you will identify critical service-specific metrics that require monitoring, and you will help application development teams create libraries for seamless service instrumentation. The impact you'll make: Collaborate with stakeholders to establish and promote best practices for monitoring and CI/CD pipelines. Troubleshoot issues related to deployment within our CI pipeline. Actively promote the DevOps culture at Weedmaps. Identify opportunities for automation and advocate for the codification of processes. Promote best practices regarding collaboration, reliability, security, and performance across all partner teams. Take ownership of application configuration and scaling for specified services, ensuring adherence to organizational practices. Develop and optimize synthetic monitoring flows. What you've accomplished: A minimum of 2 years of development experience in startup or mid-sized environments. Proficiency in programming languages such as Python, Go, Node, Ruby, or Elixir. 
Knowledge of containerization technologies, particularly Docker (Kubernetes experience is a plus). Strong communication skills, a positive demeanor, and the ability to provide and receive constructive feedback. Professional experience with cloud-native observability standards including OpenMetrics, OpenTracing, and OpenCensus. Expertise in using and configuring modern CI/CD workflows. Deep understanding of SLIs, SLOs, and SLAs at both service and business levels. Familiarity with golden signals and their significance in monitoring.
About the Role Betterment is looking for a Senior Engineering Manager to guide the Site Reliability Engineering (SRE) team at our New York City headquarters. This leader will oversee a skilled group focused on keeping services reliable, high-performing, and able to scale as we grow. What You'll Do Lead and support engineers dedicated to service reliability and performance. Collaborate with teams across the company to improve infrastructure and operations. Promote a culture that values technical excellence and continuous improvement. Location This position is based at Betterment HQ in New York City.
Full-time|$127K/yr - $249K/yr|Hybrid|New York City; United States
About the Atlas SRE Team The Atlas team at MongoDB, Inc. is based at our New York City headquarters, with options for hybrid work or fully remote arrangements from the Eastern or Central time zones. The group focuses on building, maintaining, and scaling the Atlas platform, which supports customers' most important workloads. Role Overview This senior-level Site Reliability Engineer (SRE) position calls for deep experience in designing and building complex systems. The role offers significant autonomy and expects ownership from start to finish. The work is hands-on and technical, with a focus on creating and improving systems that support Atlas at scale. Collaboration and Impact The SRE Atlas team works closely with multiple Atlas software engineering groups. Responsibilities include managing large-scale systems, developing new tools and automation, and performing essential maintenance for the Atlas fleet. Efforts in this role have a direct effect on the reliability and performance of Atlas for customers across the globe.
About Chalkboard: Chalkboard is pioneering the next generation of sports gaming. Our mission is to seamlessly merge watching and playing by transforming real-money sports gaming into a dynamic, social experience designed for fans eager to win. We are redefining how sports enthusiasts connect with the games they cherish. At our essence, we are a team of passionate, sports-loving innovators who prioritize transparency, equity, and the excitement of empowering fans to turn insights into actionable strategies. The Role: We are on the lookout for a Principal Site Reliability Engineer to join our ranks at Chalkboard, contributing to the creation of a platform that is not only reliable and scalable but also user-friendly for our development teams. In this pivotal role, you will collaborate with Engineering, Product, and Data teams, significantly impacting how millions of fans engage with sports in real time. If you thrive in a fast-paced environment, love to build robust solutions from the ground up, and aim to achieve team success rather than individual accolades, we want to hear from you! Your Game Plan: Take ownership of platform reliability from start to finish, proactively identifying and mitigating risks before they affect users. Develop and enhance observability (metrics, logs, tracing) to facilitate rapid issue detection, diagnosis, and resolution. Anticipate infrastructure needs by identifying bottlenecks and implementing sustainable architectural improvements. Minimize developer friction by refining CI/CD pipelines, deployment workflows, and internal tools. Lead incident responses and root cause analyses, focusing on systemic solutions rather than temporary fixes. Establish and uphold best practices for infrastructure, deployments, and system reliability. Create reusable, self-service infrastructure that empowers teams to deploy quickly and securely. Continuously enhance systems through automation and Infrastructure-as-Code methodologies. What You Bring to the Team: Experience with Cloud Infrastructure (preferably GCP): including networking, IAM, databases, and storage. Proficiency in Kubernetes: managing cluster operations and workloads. Skilled in Infrastructure as Code tools: Terraform, Helm. Familiarity with CI/CD practices: using GitHub Actions or similar tools. Knowledge of observability practices: metrics, logging, tracing, and alerting.
Full-time|$111K/yr - $218K/yr|Hybrid|New York City
The Site Reliability Engineering team at MongoDB supports the infrastructure behind the MongoDB Atlas platform. With Atlas serving customers worldwide, the team addresses the demands of delivering fast, reliable service across multiple regions while meeting data sovereignty requirements. Role overview This Site Reliability Engineer 3 position centers on designing and maintaining scalable systems. The work involves reducing manual tasks, improving monitoring, and increasing visibility into system health. Infrastructure-as-code is a key principle, and the team invests in automation and self-healing systems to minimize disruptions. Collaboration Teamwork is essential in this role. Site Reliability Engineers regularly partner with other engineering groups, sharing responsibilities and working together to achieve common objectives. Location This role is based in New York City and follows a hybrid work schedule.
Location: NYC Global HQ (Hybrid: 3 days in office) DoubleVerify delivers digital performance solutions for advertisers and agencies, enabling independent verification, campaign optimization, and measurement of business impact. Since 2008, DV has partnered with Fortune 500 brands, agencies, publishers, and digital ad platforms to bring greater transparency and improved outcomes to digital advertising. More details are available at www.doubleverify.com. Role overview The Senior Site Reliability Engineer I will focus on strengthening the reliability, scalability, and performance of DoubleVerify's digital media measurement platforms. This hybrid position is based at the NYC Global HQ, with an expectation of three days per week in the office. What you will do Enhance reliability, scalability, and performance for digital media measurement systems. Establish and refine observability practices, including setting up metrics, dashboards, and alerting to enable proactive reliability improvements. Reduce Mean Time to Recovery (MTTR) for critical incidents by automating processes, improving observability, and advancing monitoring capabilities. Lead incident response for high-severity (Sev1 and Sev2) events and drive resolutions. Maintain high availability across infrastructure and services in GCP, AWS, OCI, and on-premises environments. Guide technical projects from planning through deployment, collaborating with teams and keeping stakeholders informed. Design and deploy automation tools to reduce manual work and improve efficiency in deployment workflows, validation scripts, and self-service tooling. Utilize AI-assisted development tools for faster automation and troubleshooting. Build integrations and Monitoring Control Plane (MCP) servers to support monitoring platforms and AI-driven analysis. Apply Infrastructure-as-Code practices using Terraform, Helm charts, Python scripts, and configuration management tools for consistent, version-controlled deployments. 
Develop and maintain documentation, runbooks, and Standard Operating Procedures (SOPs) in Confluence to support consistent incident response.
Role overview Medal seeks a Site Reliability Engineer - Infrastructure Specialist in New York City. The focus is on strengthening the company’s infrastructure and ensuring the stability of Medal’s systems. This role works within a collaborative team to design, build, and maintain the technical foundation that enables the company’s growth and efficiency. What you will do Design and implement infrastructure solutions that can scale as demand increases. Maintain and improve system reliability to help minimize downtime. Monitor and optimize system performance to keep applications running smoothly. Collaborate with team members to address ongoing infrastructure requirements.
Full-time|$127K/yr - $249K/yr|Remote|Boston; Miami; New York City; Pittsburgh; Raleigh; United States
Join MongoDB’s innovative Storage Layer Services (SLS) team as we redefine the MongoDB cloud storage layer. This dynamic team is at the forefront of developing high-performance, multi-tenant distributed storage solutions that not only enhance our existing Atlas storage framework but also empower our customers' workloads to operate with remarkable efficiency. In this pivotal role, you will collaborate closely with teams dedicated to building these storage services, defining Service Level Objectives (SLOs), shaping capacity plans, and ensuring the reliability, durability, and operational safety of the foundational storage layer that supports Atlas. As one of the founding members of this small but experienced team of Site Reliability Engineers (SREs), you will play a vital role in executing a multi-year vision for MongoDB’s cloud storage architecture. This position offers flexibility in location, allowing you to work from our offices in Boston, New York City, Raleigh, Miami, or Pittsburgh, or remotely from anywhere in the United States, provided you are based in the Eastern or Central time zones.
Join Tabs as a Staff Site Reliability Engineer to lead the charge in enhancing our systems for maximum reliability and performance. In this pivotal role, you will collaborate with cross-functional teams to design, implement, and maintain robust infrastructure solutions. You will ensure our systems are scalable, secure, and efficient, ultimately providing an unparalleled experience for our users. Your expertise in cloud technologies and automation will be vital as you drive initiatives to improve operational efficiency and system resilience. If you are passionate about creating reliable systems and thrive in a fast-paced environment, we want to hear from you!
Kontakt.io is revolutionizing care operations through innovative platform solutions. Our mission is to reduce waste, enhance efficiency, and drive profitability by optimizing throughput, asset utilization, and workforce productivity. Leveraging AI, Real-Time Location Systems (RTLS), and Electronic Health Records (EHR) data, we empower self-learning agents to automate workflows, adjust in real-time, and coordinate comprehensive care delivery operations. Efficiently deployable and scalable, our platform provides clear visibility into spaces, equipment, and personnel, effectively eliminating inefficiencies and significantly enhancing the patient experience. With a proven 10X ROI and over 20 successful use cases, Kontakt.io stands out as the preferred choice for advancing care delivery operations. We are seeking a Lead Software Engineer - SRE who possesses a robust foundation in software engineering and a strategic mindset to enhance the reliability, scalability, and performance of our platform. This pivotal role within our Infrastructure Engineering team will be instrumental in shaping the architecture and strategic direction of our Site Reliability Engineering function. The ideal candidate will have extensive knowledge of software engineering principles as applied to infrastructure. Rather than merely maintaining systems, you will lead the design and construction of these systems, focusing on developing automation, tooling, and resilient architectures that ensure high availability and fault tolerance across our entire AWS-based platform. You will engage hands-on in designing robust systems, refining deployment pipelines, and enhancing incident management practices. As a technical leader, you will also mentor junior engineers, influence technical strategy, and foster a culture of accountability, ownership, and continuous improvement throughout the organization.
About Legora Legora builds AI-powered tools for legal professionals, working side by side with lawyers to ensure technology fits real-world needs. The platform helps legal teams work more efficiently, ask better questions, and find new insights. Clients include leading global firms such as Cleary Gottlieb, Goodwin, Bird & Bird, and Linklaters, with Legora’s reach spanning over 40 countries. The company values rapid shipping, thoughtful iteration, and scaling with purpose. Legora’s team is committed to high standards, always aiming to deliver technology that truly empowers lawyers. The culture rewards those who want to build from scratch, work with talented colleagues, and help shape the future of legal work. Staff Site Reliability Engineer, New York City (Onsite) This role joins the founding SRE team at Legora’s new engineering hub in New York City. The Staff Site Reliability Engineer leads reliability efforts across multiple teams, sets infrastructure architecture standards, and drives operational excellence for the platform. The position works closely with colleagues in Stockholm and requires in-office presence five days a week. What You Will Do Design and manage reliability and infrastructure strategies for several teams and services. Oversee observability, capacity planning, and monitoring for distributed systems. Develop and refine SLI/SLO frameworks, error budgets, and production readiness standards. Lead incident management, create escalation protocols, and drive improvements from post-mortem analysis. Work with engineering teams to integrate reliability best practices into their workflows. Location Requirement This position is based in New York City and requires onsite work five days per week.
About WRITER WRITER is the premier platform where leading enterprises harness the power of AI to streamline their operations. Our mission is to enhance human potential through advanced superintelligence, demonstrating its feasibility with a trustworthy AI solution that bridges IT and business teams, facilitating transformative change across organizations. WRITER’s comprehensive platform empowers hundreds of companies, including Mars, Marriott, Uber, and Vanguard, to develop and deploy AI agents tailored to their unique datasets, supported by our enterprise-grade LLMs. With a valuation of $1.9B and support from top-tier investors such as Premji Invest, Radical Ventures, and ICONIQ Growth, WRITER is quickly establishing itself as the frontrunner in the field of enterprise generative AI. Founded in 2020, with offices in San Francisco, New York City, Austin, Chicago, and London, we are a dynamic team focused on innovation and speed. We seek intelligent, dedicated builders and innovators to join us in shaping the future of work powered by AI. About the Role As a Site Reliability Engineer at WRITER, you will play a critical role in ensuring the availability, performance, and reliability of our platform, which is essential for our mission to enhance human capabilities with superintelligence. Your work will directly influence every enterprise customer reliant on our AI-powered workflows. This position goes beyond routine maintenance; it involves proactively identifying and resolving intricate systemic challenges and establishing the framework necessary for our rapid growth and the evolving needs of enterprise generative AI.
You will develop resilient systems, automate processes throughout the stack, and advocate for reliability best practices, directly contributing to our ambitious product roadmap and ensuring our clients have continuous access to the powerful tools they require. This is a hybrid role based in either our New York City or London office, reporting to the Director of Engineering. Responsibilities Automate operational tasks and infrastructure management by creating robust tools and platforms using languages such as Python, Go, or similar, significantly minimizing manual workload across our production environment. Design and implement scalable, fault-tolerant infrastructure solutions on leading public cloud platforms (AWS, GCP, Azure) to support WRITER's swiftly growing, high-traffic AI platform. Take ownership of the reliability, performance, and efficiency of WRITER’s core services, establishing and maintaining rigorous Service Level Objectives (SLOs) and Error Budgets.
About Legora Legora builds AI-driven solutions for the legal sector, partnering directly with legal professionals to create tools that support better insights and decision-making. Our platform is trusted by major global firms, including Cleary Gottlieb and Goodwin, and is used in over 40 countries. We focus on continuous improvement and innovation, working closely with users to ensure our technology truly meets their needs. Site Reliability Engineer – New York City (On-site) Legora is looking for a Site Reliability Engineer to join the founding SRE team at our New York City engineering hub. This role is based fully on-site, five days a week. The position centers on maintaining and improving the reliability and performance of our platform as we expand. Expect to work side by side with experienced engineers, focusing on production systems, observability, incident response, and automation. What You Will Do Oversee and improve production services, including deployments, monitoring, and system health. Develop and maintain observability tools for metrics, logs, and traces, aiming for high-quality signals and minimal noise. Help define Service Level Indicators (SLIs) and Service Level Objectives (SLOs), and set up alerting and reliability metrics for key services. Participate in on-call rotations, contribute to post-incident reviews, and help implement measures to prevent future issues. Location Requirement This role requires working on-site at Legora’s New York City office, Monday through Friday. In-person collaboration is core to how we work and deliver results.
Join Alloy as a Site Reliability Engineer and play a crucial role in ensuring the reliability, availability, and performance of our systems. You will work closely with development teams to design and implement robust infrastructure solutions that enable seamless user experiences. Your expertise will be vital in maintaining our high standards for uptime and efficiency.
At GPTZero, we're dedicated to reinstating trust and transparency in the digital world. As the premier AI detection platform, we empower educators, students, journalists, marketers, and writers to effectively navigate the dynamic realm of AI-generated content. With millions of users and institutions placing their trust in us, we are shaping a pioneering company that stands at the intersection of AI and information integrity. Our team boasts members from high-performing engineering environments, including Meta, Perplexity, AWS, Affirm, and top-tier AI research institutions like Princeton, Caltech, and Vector Institute. What We're Looking For We are seeking a motivated Software Engineering Intern to help us develop the next-generation platform aimed at verifying the origin, quality, and accuracy of global information. The ideal candidate is an enthusiastic learner with a proven track record of building applications from scratch and adept at resolving complex challenges. You will collaborate with a fast-paced team of dedicated builders, working closely with our Machine Learning and design teams to create software that has already garnered over 2 million users worldwide.
Past intern contributions have been highlighted in demonstrations to venture capitalists and policymakers at the state level. Key Responsibilities Develop and launch high-impact, user-friendly, AI-driven web applications using React, Node.js, and Tailwind CSS. Implement top-requested features based on user feedback for our dashboard, Chrome extensions, and API. Leverage product analytics to inform data-driven product decisions. Collaborate with teams across Machine Learning, design, and business sectors to innovate new product initiatives. Adapt to various roles and work throughout the product stack. Qualifications Proficiency in building comprehensive applications from backend systems to front-end styling with CSS. A minimum of 2 years of experience with modern web frameworks such as Express, Next.js, TypeScript, and React. At least 1 year of experience working with databases like PostgreSQL and AWS RDS. Strong motivation to contribute positively to societal impact. Ability to work with a minimum of 5 hours overlap with Eastern Standard Time. Bonus: A robust open-source portfolio. Experience in an early-stage startup environment. Background as a peer-reviewed writer.
About Adaptive AdaptiveSecurity is at the forefront of cybersecurity innovation, being the only AI-focused investment from NVIDIA and OpenAI. Our mission is to combat AI-driven cyber threats effectively. Founded in December 2025, Adaptive secured an impressive $81M in Series B funding, spearheaded by NVIDIA and Bain Capital Ventures, alongside contributions from Capital One Ventures and Citi Ventures, with ongoing support from Andreessen Horowitz (a16z), the OpenAI Startup Fund, and Abstract Ventures. This funding round marked NVIDIA’s inaugural investment in AI cybersecurity. Our founders, Brian Long and Andrew Jones, are seasoned entrepreneurs with a proven track record in scaling transformative companies. They previously co-founded Attentive, which achieved over $500M in annual revenue and a valuation exceeding $10B, and TapCommerce, which was acquired by Twitter. Their extensive experience in building high-growth, product-oriented businesses drives Adaptive's ambition to create a robust security layer for the AI era. Trusted by top-tier banks, technology firms, and healthcare providers, Adaptive defends against emerging threats such as deepfakes, smishing, and AI-enhanced voice scams. With increasing enterprise adoption and a market potential exceeding $200B, we are just beginning our journey. Role Overview We are on the lookout for a Senior Software Engineer who thrives in a fast-paced startup atmosphere, where each engineer plays a pivotal role in shaping the product and the company’s future.
You will take charge of significant features from inception to deployment, make long-lasting architectural decisions, and establish practices that enhance team productivity. This position is ideal for someone eager to construct, mentor, and influence our operational methodologies. Key Responsibilities Design and implement large-scale features and systems from start to finish. Establish practices that improve team efficiency and ensure codebase maintainability. Proactively drive change without waiting for direction. Collaborate closely with product and design teams from day one, merging roles and guiding project directions. Anticipate upcoming complexities and architect scalable systems to counter evolving threats. Assist in the hiring process by conducting interviews and identifying exceptional talent. Qualifications 5+ years of experience in developing customer-facing software solutions. Bachelor's degree in Computer Science, Software Engineering, or a related field, or equivalent practical experience. Strong problem-solving skills and a proactive attitude. Experience in collaborative team environments. Ability to mentor junior engineers and foster a culture of learning.
Full-time|Hybrid|Hybrid NYC preferred, open to remote as well
iMentor is on the lookout for a talented Senior Front End Engineer with expertise in React to enhance our innovative mentoring platform. In this full-time position, you will be instrumental in developing and maintaining exceptional front-end experiences within a cutting-edge MERN stack environment. As a key member of our mission-driven technology team, you will work on tools that facilitate mentoring relationships across the U.S. You will report directly to the Senior Director of Engineering and collaborate closely with product partners and fellow engineers. The ideal candidate will possess extensive technical knowledge of React, complemented by a thoughtful approach focused on usability, stability, and long-term maintainability. About the Platform The iMentor platform is a custom-built ecosystem designed to support mentors, mentees, and program staff nationwide. Key features include: a multi-step volunteer application process; messaging and communication tools; learning interactions between mentors and students; and data-driven workflows and integrations with partner services. Utilizing React for the front end and powered by Node, Express, and MongoDB for backend services, the platform also integrates with services such as Twilio and other third-party platforms. Working alongside a diverse, multinational engineering team, the Senior React Engineer will play a crucial role in contributing to a secure, scalable platform that aligns with iMentor’s technical roadmap and program objectives.
Flagler Health is an innovative health technology company at the forefront of revolutionizing healthcare delivery. Our mission is to empower healthcare organizations through AI-driven workflow automation, enhancing remote patient engagement, and optimizing chronic care management. With a robust platform that has positively impacted over 1.5 million patients, we are trusted by healthcare providers and payers to enhance operational efficiency, reduce costs, and achieve superior patient outcomes. Positioned uniquely with a freemium model and limited competition, we are set to capture a significant portion of the $4.5 trillion U.S. healthcare market. Key Responsibilities Design, develop, and maintain scalable backend services and web applications. Engage in the development of real-time web applications for audio and SMS communication with patients. Implement and manage fault-tolerant, long-running workflows for asynchronous and background processing. Collaborate effectively with frontend, product, and infrastructure teams to deliver robust and compliant systems.
Join Perchwell as a Senior Site Reliability Engineer, where you'll play a crucial role in maintaining and enhancing our scalable infrastructure. In this position, you will collaborate with cross-functional teams to ensure the reliability and performance of our systems. Your expertise will help drive our commitment to delivering exceptional service and innovative solutions to our clients.
About Legora Legora develops technology for the legal sector, working directly with legal professionals to ensure practical, relevant solutions. The company’s AI-native workspace helps users streamline workflows, ask better questions, and uncover new insights. Clients include major global law firms such as Cleary Gottlieb, Goodwin, Bird & Bird, and Linklaters, spanning over 40 countries. Legora’s team values collaboration and aims to build tools that genuinely improve the way lawyers work. Role Overview: Senior Site Reliability Engineer This Senior Site Reliability Engineer role sits within Legora’s core SRE team at the New York City engineering hub. The position focuses on building and maintaining reliable services, partnering with engineering teams both locally and in Stockholm. The team’s goal: raise reliability standards across Legora’s platform and ensure smooth, dependable operations. Location requirement: This is a full-time, in-office role based in New York City. Attendance is required five days a week to support close collaboration and innovation. What You Will Do Design, deploy, and manage essential platform services, taking full responsibility for their reliability. Build and maintain observability systems (metrics, logs, traces) to generate actionable insights. Set and improve service level indicators (SLIs), service level objectives (SLOs), alerting, and reliability metrics for key systems. Refine on-call procedures and incident response, including escalation processes and post-incident analysis. Drive ongoing improvements to system reliability and performance.
Full-time|$133.1K/yr - $148K/yr|Remote|New York City, NY
Site Reliability Engineer Overview: Join Weedmaps as a Site Reliability Engineer and collaborate across departments, including application, infrastructure, and quality teams, to elevate the performance, reliability, resilience, and scalability of our web services at Weedmaps.com. As a cloud-native organization, we run 100% of our services in Docker on Kubernetes within AWS's public cloud. Our operations utilize observability, monitoring, CI/CD automation, and custom tooling, enabling us to deploy multiple production releases daily. Your daily responsibilities will focus on applying your engineering expertise to enhance system monitoring, minimize developer toil, configure CI workflows, and optimize our deployment pipelines. You will serve as a knowledge reference for development teams, ensuring they utilize consistent tools for metrics, logging, building, and deployment. Collaborating closely with both development and infrastructure teams, you will identify critical service-specific metrics that require monitoring, and you will help application development teams create libraries for seamless service instrumentation. The impact you'll make: Collaborate with stakeholders to establish and promote best practices for monitoring and CI/CD pipelines. Troubleshoot issues related to deployment within our CI pipeline. Actively promote the DevOps culture at Weedmaps. Identify opportunities for automation and advocate for the codification of processes. Promote best practices regarding collaboration, reliability, security, and performance across all partner teams. Take ownership of application configuration and scaling for specified services, ensuring adherence to organizational practices. Develop and optimize synthetic monitoring flows. What you've accomplished: A minimum of 2 years of development experience in startup or mid-sized environments. Proficiency in programming languages such as Python, Go, Node, Ruby, or Elixir. 
Knowledge of containerization technologies, particularly Docker (Kubernetes experience is a plus). Strong communication skills, a positive demeanor, and the ability to provide and receive constructive feedback. Professional experience with cloud-native observability standards including OpenMetrics, OpenTracing, and OpenCensus. Expertise in using and configuring modern CI/CD workflows. Deep understanding of SLIs, SLOs, and SLAs at both service and business levels. Familiarity with golden signals and their significance in monitoring.
About the Role Betterment is looking for a Senior Engineering Manager to guide the Site Reliability Engineering (SRE) team at our New York City headquarters. This leader will oversee a skilled group focused on keeping services reliable, high-performing, and able to scale as we grow. What You'll Do Lead and support engineers dedicated to service reliability and performance. Collaborate with teams across the company to improve infrastructure and operations. Promote a culture that values technical excellence and continuous improvement. Location This position is based at Betterment HQ in New York City.
Full-time|$127K/yr - $249K/yr|Hybrid|New York City; United States
About the Atlas SRE Team The Atlas team at MongoDB, Inc. is based at our New York City headquarters, with options for hybrid work or fully remote arrangements from the Eastern or Central time zones. The group focuses on building, maintaining, and scaling the Atlas platform, which supports customers' most important workloads. Role Overview This senior-level Site Reliability Engineer (SRE) position calls for deep experience in designing and building complex systems. The role offers significant autonomy and expects ownership from start to finish. The work is hands-on and technical, with a focus on creating and improving systems that support Atlas at scale. Collaboration and Impact The SRE Atlas team works closely with multiple Atlas software engineering groups. Responsibilities include: Managing large-scale systems Developing new tools and automation Performing essential maintenance for the Atlas fleet Efforts in this role have a direct effect on the reliability and performance of Atlas for customers across the globe.
About Chalkboard: Chalkboard is pioneering the next generation of sports gaming. Our mission is to seamlessly merge watching and playing by transforming real-money sports gaming into a dynamic, social experience designed for fans eager to win. We are redefining how sports enthusiasts connect with the games they cherish. At our essence, we are a team of passionate, sports-loving innovators who prioritize transparency, equity, and the excitement of empowering fans to turn insights into actionable strategies. The Role: We are on the lookout for a Principal Site Reliability Engineer to join our ranks at Chalkboard, contributing to the creation of a platform that is not only reliable and scalable but also user-friendly for our development teams. In this pivotal role, you will collaborate with Engineering, Product, and Data teams, significantly impacting how millions of fans engage with sports in real time. If you thrive in a fast-paced environment, love to build robust solutions from the ground up, and aim to achieve team success rather than individual accolades, we want to hear from you! Your Game Plan: Take ownership of platform reliability from start to finish, proactively identifying and mitigating risks before they affect users. Develop and enhance observability (metrics, logs, tracing) to facilitate rapid issue detection, diagnosis, and resolution. Anticipate infrastructure needs by identifying bottlenecks and implementing sustainable architectural improvements. Minimize developer friction by refining CI/CD pipelines, deployment workflows, and internal tools. Lead incident responses and root cause analyses, focusing on systemic solutions rather than temporary fixes. Establish and uphold best practices for infrastructure, deployments, and system reliability. Create reusable, self-service infrastructure that empowers teams to deploy quickly and securely. Continuously enhance systems through automation and Infrastructure-as-Code methodologies.
What You Bring to the Team: Experience with Cloud Infrastructure (preferably GCP): including networking, IAM, databases, and storage. Proficiency in Kubernetes: managing cluster operations and workloads. Skilled in Infrastructure as Code tools: Terraform, Helm. Familiarity with CI/CD practices: using GitHub Actions or similar tools. Knowledge of observability practices: metrics, logging, tracing, and alerting.
Full-time|$111K/yr - $218K/yr|Hybrid|New York City
The Site Reliability Engineering team at MongoDB supports the infrastructure behind the MongoDB Atlas platform. With Atlas serving customers worldwide, the team addresses the demands of delivering fast, reliable service across multiple regions while meeting data sovereignty requirements. Role overview This Site Reliability Engineer 3 position centers on designing and maintaining scalable systems. The work involves reducing manual tasks, improving monitoring, and increasing visibility into system health. Infrastructure-as-code is a key principle, and the team invests in automation and self-healing systems to minimize disruptions. Collaboration Teamwork is essential in this role. Site Reliability Engineers regularly partner with other engineering groups, sharing responsibilities and working together to achieve common objectives. Location This role is based in New York City and follows a hybrid work schedule.
Location: NYC Global HQ (Hybrid: 3 days in office) DoubleVerify delivers digital performance solutions for advertisers and agencies, enabling independent verification, campaign optimization, and measurement of business impact. Since 2008, DV has partnered with Fortune 500 brands, agencies, publishers, and digital ad platforms to bring greater transparency and improved outcomes to digital advertising. More details are available at www.doubleverify.com. Role overview The Senior Site Reliability Engineer I will focus on strengthening the reliability, scalability, and performance of DoubleVerify's digital media measurement platforms. This hybrid position is based at the NYC Global HQ, with an expectation of three days per week in the office. What you will do Enhance reliability, scalability, and performance for digital media measurement systems. Establish and refine observability practices, including setting up metrics, dashboards, and alerting to enable proactive reliability improvements. Reduce Mean Time to Recovery (MTTR) for critical incidents by automating processes, improving observability, and advancing monitoring capabilities. Lead incident response for high-severity (Sev1 and Sev2) events and drive resolutions. Maintain high availability across infrastructure and services in GCP, AWS, OCI, and on-premises environments. Guide technical projects from planning through deployment, collaborating with teams and keeping stakeholders informed. Design and deploy automation tools to reduce manual work and improve efficiency in deployment workflows, validation scripts, and self-service tooling. Utilize AI-assisted development tools for faster automation and troubleshooting. Build integrations and Monitoring Control Plane (MCP) servers to support monitoring platforms and AI-driven analysis. Apply Infrastructure-as-Code practices using Terraform, Helm charts, Python scripts, and configuration management tools for consistent, version-controlled deployments. 
Develop and maintain documentation, runbooks, and Standard Operating Procedures (SOPs) in Confluence to support consistent incident response.
Role overview Medal seeks a Site Reliability Engineer - Infrastructure Specialist in New York City. The focus is on strengthening the company’s infrastructure and ensuring the stability of Medal’s systems. This role works within a collaborative team to design, build, and maintain the technical foundation that enables the company’s growth and efficiency. What you will do Design and implement infrastructure solutions that can scale as demand increases Maintain and improve system reliability to help minimize downtime Monitor and optimize system performance to keep applications running smoothly Collaborate with team members to address ongoing infrastructure requirements
Full-time|$127K/yr - $249K/yr|Remote|Boston; Miami; New York City; Pittsburgh; Raleigh; United States
Join MongoDB’s innovative Storage Layer Services (SLS) team as we redefine the MongoDB cloud storage layer. This dynamic team is at the forefront of developing high-performance, multi-tenant distributed storage solutions that not only enhance our existing Atlas storage framework but also empower our customers' workloads to operate with remarkable efficiency. In this pivotal role, you will collaborate closely with teams dedicated to building these storage services, defining Service Level Objectives (SLOs), shaping capacity plans, and ensuring the reliability, durability, and operational safety of the foundational storage layer that supports Atlas. As one of the founding members of this small but experienced team of Site Reliability Engineers (SREs), you will play a vital role in executing a multi-year vision for MongoDB’s cloud storage architecture. This position offers flexibility in location, allowing you to work from our offices in Boston, New York City, Raleigh, Miami, or Pittsburgh, or remotely from anywhere in the United States, provided you are based in the Eastern or Central time zones.
Join Tabs as a Staff Site Reliability Engineer to lead the charge in enhancing our systems for maximum reliability and performance. In this pivotal role, you will collaborate with cross-functional teams to design, implement, and maintain robust infrastructure solutions. You will ensure our systems are scalable, secure, and efficient, ultimately providing an unparalleled experience for our users. Your expertise in cloud technologies and automation will be vital as you drive initiatives to improve operational efficiency and system resilience. If you are passionate about creating reliable systems and thrive in a fast-paced environment, we want to hear from you!
Kontakt.io is revolutionizing care operations through innovative platform solutions. Our mission is to reduce waste, enhance efficiency, and drive profitability by optimizing throughput, asset utilization, and workforce productivity. Leveraging AI, Real-Time Location Systems (RTLS), and Electronic Health Records (EHR) data, we empower self-learning agents to automate workflows, adjust in real-time, and coordinate comprehensive care delivery operations. Efficiently deployable and scalable, our platform provides clear visibility into spaces, equipment, and personnel, effectively eliminating inefficiencies and significantly enhancing the patient experience. With a proven 10X ROI and over 20 successful use cases, Kontakt.io stands out as the preferred choice for advancing care delivery operations. We are seeking a Lead Software Engineer - SRE who possesses a robust foundation in software engineering and a strategic mindset to enhance the reliability, scalability, and performance of our platform. This pivotal role within our Infrastructure Engineering team will be instrumental in shaping the architecture and strategic direction of our Site Reliability Engineering function. The ideal candidate will have extensive knowledge of software engineering principles as applied to infrastructure. Rather than merely maintaining systems, you will lead the design and construction of these systems, focusing on developing automation, tooling, and resilient architectures that ensure high availability and fault tolerance across our entire AWS-based platform. You will engage hands-on in designing robust systems, refining deployment pipelines, and enhancing incident management practices. As a technical leader, you will also mentor junior engineers, influence technical strategy, and foster a culture of accountability, ownership, and continuous improvement throughout the organization.
About Legora Legora builds AI-powered tools for legal professionals, working side by side with lawyers to ensure technology fits real-world needs. The platform helps legal teams work more efficiently, ask better questions, and find new insights. Clients include leading global firms such as Cleary Gottlieb, Goodwin, Bird & Bird, and Linklaters, with Legora’s reach spanning over 40 countries. The company values rapid shipping, thoughtful iteration, and scaling with purpose. Legora’s team is committed to high standards, always aiming to deliver technology that truly empowers lawyers. The culture rewards those who want to build from scratch, work with talented colleagues, and help shape the future of legal work. Staff Site Reliability Engineer, New York City (Onsite) This role joins the founding SRE team at Legora’s new engineering hub in New York City. The Staff Site Reliability Engineer leads reliability efforts across multiple teams, sets infrastructure architecture standards, and drives operational excellence for the platform. The position works closely with colleagues in Stockholm and requires in-office presence five days a week. What You Will Do Design and manage reliability and infrastructure strategies for several teams and services. Oversee observability, capacity planning, and monitoring for distributed systems. Develop and refine SLI/SLO frameworks, error budgets, and production readiness standards. Lead incident management, create escalation protocols, and drive improvements from post-mortem analysis. Work with engineering teams to integrate reliability best practices into their workflows. Location Requirement This position is based in New York City and requires onsite work five days per week.
About WRITER: WRITER is the premier platform where leading enterprises harness the power of AI to streamline their operations. Our mission is to enhance human potential through advanced superintelligence, demonstrating its feasibility with a trustworthy AI solution that bridges IT and business teams, facilitating transformative change across organizations. WRITER’s comprehensive platform empowers hundreds of companies, including Mars, Marriott, Uber, and Vanguard, to develop and deploy AI agents tailored to their unique datasets, supported by our enterprise-grade LLMs. With a valuation of $1.9B and support from top-tier investors such as Premji Invest, Radical Ventures, and ICONIQ Growth, WRITER is quickly establishing itself as the frontrunner in the field of enterprise generative AI. Founded in 2020, with offices in San Francisco, New York City, Austin, Chicago, and London, we are a dynamic team focused on innovation and speed. We seek intelligent, dedicated builders and innovators to join us in shaping the future of work powered by AI. About the Role: As a Site Reliability Engineer at WRITER, you will play a critical role in ensuring the availability, performance, and reliability of our platform, which is essential for our mission to enhance human capabilities with superintelligence. Your work will directly influence every enterprise customer reliant on our AI-powered workflows. This position goes beyond routine maintenance; it involves proactively identifying and resolving intricate systemic challenges and establishing the framework necessary for our rapid growth and the evolving needs of enterprise generative AI.
You will develop resilient systems, automate processes throughout the stack, and advocate for reliability best practices, directly contributing to our ambitious product roadmap and ensuring our clients have continuous access to the powerful tools they require. This is a hybrid role based in either our New York City or London office, reporting to the Director of Engineering. Responsibilities: Automate operational tasks and infrastructure management by creating robust tools and platforms using languages such as Python, Go, or similar, significantly minimizing manual workload across our production environment. Design and implement scalable, fault-tolerant infrastructure solutions on leading public cloud platforms (AWS, GCP, Azure) to support WRITER's swiftly growing, high-traffic AI platform. Take ownership of the reliability, performance, and efficiency of WRITER’s core services, establishing and maintaining rigorous Service Level Objectives (SLOs) and Error Budgets.
About Legora Legora builds AI-driven solutions for the legal sector, partnering directly with legal professionals to create tools that support better insights and decision-making. Our platform is trusted by major global firms, including Cleary Gottlieb and Goodwin, and is used in over 40 countries. We focus on continuous improvement and innovation, working closely with users to ensure our technology truly meets their needs. Site Reliability Engineer – New York City (On-site) Legora is looking for a Site Reliability Engineer to join the founding SRE team at our New York City engineering hub. This role is based fully on-site, five days a week. The position centers on maintaining and improving the reliability and performance of our platform as we expand. Expect to work side by side with experienced engineers, focusing on production systems, observability, incident response, and automation. What You Will Do Oversee and improve production services, including deployments, monitoring, and system health. Develop and maintain observability tools for metrics, logs, and traces, aiming for high-quality signals and minimal noise. Help define Service Level Indicators (SLIs) and Service Level Objectives (SLOs), and set up alerting and reliability metrics for key services. Participate in on-call rotations, contribute to post-incident reviews, and help implement measures to prevent future issues. Location Requirement This role requires working on-site at Legora’s New York City office, Monday through Friday. In-person collaboration is core to how we work and deliver results.
Join Alloy as a Site Reliability Engineer and play a crucial role in ensuring the reliability, availability, and performance of our systems. You will work closely with development teams to design and implement robust infrastructure solutions that enable seamless user experiences. Your expertise will be vital in maintaining our high standards for uptime and efficiency.
At GPTZero, we're dedicated to reinstating trust and transparency in the digital world. As the premier AI detection platform, we empower educators, students, journalists, marketers, and writers to effectively navigate the dynamic realm of AI-generated content. With millions of users and institutions placing their trust in us, we are shaping a pioneering company that stands at the intersection of AI and information integrity. Our team boasts members from high-performing engineering environments, including Meta, Perplexity, AWS, Affirm, and top-tier AI research institutions like Princeton, Caltech, and Vector Institute. What We're Looking For: We are seeking a motivated Software Engineering Intern to help us develop the next-generation platform aimed at verifying the origin, quality, and accuracy of global information. The ideal candidate is an enthusiastic learner with a proven track record of building applications from scratch and adept at resolving complex challenges. You will collaborate with a fast-paced team of dedicated builders, working closely with our Machine Learning and design teams to create software that has already garnered over 2 million users worldwide.
Past intern contributions have been highlighted in demonstrations to venture capitalists and policymakers at the state level. Key Responsibilities: Develop and launch high-impact, user-friendly, AI-driven web applications using React, Node.js, and Tailwind CSS. Implement top-requested features based on user feedback for our dashboard, Chrome extensions, and API. Leverage product analytics to inform data-driven product decisions. Collaborate with teams across Machine Learning, design, and business sectors to innovate new product initiatives. Adapt to various roles and work throughout the product stack. Qualifications: Proficiency in building comprehensive applications from backend systems to front-end styling with CSS. A minimum of 2 years of experience with modern web frameworks such as Express, Next.js, TypeScript, and React. At least 1 year of experience working with databases like PostgreSQL and AWS RDS. Strong motivation to contribute positively to societal impact. Ability to work with a minimum of 5 hours overlap with Eastern Standard Time. Bonus: A robust open-source portfolio. Experience in an early-stage startup environment. Background as a peer-reviewed writer.
About Adaptive: Adaptive Security is at the forefront of cybersecurity innovation, being the only AI-focused investment from NVIDIA and OpenAI. Our mission is to combat AI-driven cyber threats effectively. Founded in December 2025, Adaptive secured an impressive $81M in Series B funding, spearheaded by NVIDIA and Bain Capital Ventures, alongside contributions from Capital One Ventures and Citi Ventures, with ongoing support from Andreessen Horowitz (a16z), the OpenAI Startup Fund, and Abstract Ventures. This funding round marked NVIDIA’s inaugural investment in AI cybersecurity. Our founders, Brian Long and Andrew Jones, are seasoned entrepreneurs with a proven track record in scaling transformative companies. They previously co-founded Attentive, which achieved over $500M in annual revenue and a valuation exceeding $10B, and TapCommerce, which was acquired by Twitter. Their extensive experience in building high-growth, product-oriented businesses drives Adaptive's ambition to create a robust security layer for the AI era. Trusted by top-tier banks, technology firms, and healthcare providers, Adaptive defends against emerging threats such as deepfakes, smishing, and AI-enhanced voice scams. With increasing enterprise adoption and a market potential exceeding $200B, we are just beginning our journey. Role Overview: We are on the lookout for a Senior Software Engineer who thrives in a fast-paced startup atmosphere, where each engineer plays a pivotal role in shaping the product and the company’s future.
You will take charge of significant features from inception to deployment, make long-lasting architectural decisions, and establish practices that enhance team productivity. This position is ideal for someone eager to construct, mentor, and influence our operational methodologies. Key Responsibilities: Design and implement large-scale features and systems from start to finish. Establish practices that improve team efficiency and ensure codebase maintainability. Proactively drive change without waiting for direction. Collaborate closely with product and design teams from day one, merging roles and guiding project directions. Anticipate upcoming complexities and architect scalable systems to counter evolving threats. Assist in the hiring process by conducting interviews and identifying exceptional talent. Qualifications: 5+ years of experience in developing customer-facing software solutions. Bachelor's degree in Computer Science, Software Engineering, or a related field, or equivalent practical experience. Strong problem-solving skills and a proactive attitude. Experience in collaborative team environments. Ability to mentor junior engineers and foster a culture of learning.
Full-time|Hybrid|Hybrid NYC preferred, open to remote as well
iMentor is on the lookout for a talented Senior Front End Engineer with expertise in React to enhance our innovative mentoring platform. In this full-time position, you will be instrumental in developing and maintaining exceptional front-end experiences within a cutting-edge MERN stack environment. As a key member of our mission-driven technology team, you will work on tools that facilitate mentoring relationships across the U.S. You will report directly to the Senior Director of Engineering and collaborate closely with product partners and fellow engineers. The ideal candidate will possess extensive technical knowledge of React, complemented by a thoughtful approach focused on usability, stability, and long-term maintainability. About the Platform: The iMentor platform is a custom-built ecosystem designed to support mentors, mentees, and program staff nationwide. Key features include: a multi-step volunteer application process; messaging and communication tools; learning interactions between mentors and students; and data-driven workflows and integrations with partner services. Utilizing React for the front end and powered by Node, Express, and MongoDB for backend services, the platform also integrates with services such as Twilio and other third-party platforms. Working alongside a diverse, multinational engineering team, the Senior React Engineer will play a crucial role in contributing to a secure, scalable platform that aligns with iMentor’s technical roadmap and program objectives.
Dec 18, 2025