Staff Site Reliability Engineer

Tabs | New York City, NY

On-site Full-time

Experience Level

Mid to Senior

Qualifications

To succeed in this role, you should possess:
• Proven experience in Site Reliability Engineering or a related field.
• Strong proficiency in cloud platforms such as AWS, Azure, or Google Cloud.
• Expertise in containerization and orchestration technologies (e.g., Docker, Kubernetes).
• Solid understanding of scripting and programming languages (e.g., Python, Go).
• Experience with monitoring and logging tools to ensure system health.
• Excellent problem-solving skills and the ability to work independently and in a team.

About the job

Join Tabs as a Staff Site Reliability Engineer to lead the charge in enhancing our systems for maximum reliability and performance. In this pivotal role, you will collaborate with cross-functional teams to design, implement, and maintain robust infrastructure solutions. You will ensure our systems are scalable, secure, and efficient, ultimately providing an unparalleled experience for our users.

Your expertise in cloud technologies and automation will be vital as you drive initiatives to improve operational efficiency and system resilience. If you are passionate about creating reliable systems and thrive in a fast-paced environment, we want to hear from you!

About Tabs

At Tabs, we are dedicated to revolutionizing the way users interact with technology. Our innovative solutions and commitment to excellence set us apart in the industry. Join our dynamic team and help us shape the future of technology!

Similar jobs

Platform Developer - Network Architecture & Site Reliability

Judi Health

Full-time|$130K/yr - $160K/yr|On-site|New York, New York, United States

About Judi Health: Judi Health is a leading enterprise health technology company that offers a robust suite of solutions tailored for employers and health plans. Our offerings include:
• Capital Rx, a public benefit corporation that provides comprehensive pharmacy benefit management (PBM) solutions to self-insured employers,
• Judi Health™, which delivers full-service health benefit management solutions for employers, TPAs, and health plans, and
• Judi®, our proprietary Enterprise Health Platform (EHP) that streamlines all claim administration workflows within a single, scalable, and secure platform.
We are dedicated to rebuilding trust in healthcare in the U.S. alongside our clients, deploying the essential infrastructure to ensure quality care for everyone. To learn more, visit www.judi.health.
About the Team: Join our dynamic team as a Platform Engineer, specializing in network architecture and site reliability. In this pivotal role, you will take charge of designing and implementing our cloud network architecture, ensuring that our platform remains resilient, secure, and scalable across various AWS accounts, environments, and regions. You will tackle intricate networking challenges, such as hierarchical CIDR allocation across accounts and integrations, develop disaster recovery and regional redundancy strategies, and set reliability practices that are vital for maintaining our healthcare platform's operations. Collaborating closely with leadership and cross-functional teams, you will build the foundational infrastructure that supports our rapidly expanding platform while ensuring it can effectively manage failures and recover swiftly.

Mar 5, 2026

Network Architecture Specialist

Sonsoft Inc.

Contract|On-site|New York

Join Sonsoft Inc. as a Network Architecture Specialist and play a crucial role in designing and implementing innovative network solutions. You will work with a talented team to enhance our infrastructure, ensuring security, scalability, and efficiency. Your expertise will guide the deployment of next-generation networking technologies that align with our business goals.

Oct 5, 2016

Site Reliability Engineer II at Dataiku | New York

Dataiku

Full-time|$165K/yr - $225K/yr|On-site|United States, New York

Dataiku is the leading platform for AI success, serving as the enterprise orchestration layer for building, deploying, and governing AI solutions. In a unified environment, teams can design and operate analytics, machine learning, and AI agents with the transparency, collaboration, and control that enterprises demand. Dataiku integrates seamlessly with various data platforms, cloud infrastructures, and AI services, enabling businesses to execute AI strategies across diverse vendor environments while maintaining centralized governance. The world's top companies trust Dataiku to operationalize AI, transforming it into a key driver of business performance that delivers measurable value. For more insights, explore the Dataiku blog, LinkedIn, X, and YouTube.
Why Engineering at Dataiku? Dataiku’s platform, whether deployed on-premise, in the cloud, or as SaaS, embodies our dedication to quality and innovation by connecting various data science technologies. Our technology stack reflects our commitment to integrating the best data and AI technologies, ensuring that we select tools that genuinely enhance our product. From utilizing the latest large language models (LLMs) to supporting open-source communities, you'll work with a dynamic array of technologies and contribute to the collective knowledge of global tech innovators. Discover more about engineering at Dataiku here.
Your Impact: As a Site Reliability Engineer (SRE) with advanced networking and security skills, you will join our Cloud team focused on developing and operating the Dataiku managed offering. Your responsibilities will encompass a wide range of tasks, including architecting and maintaining robust network security measures (such as PrivateLink and IPSec), ensuring compliance with industry standards and regulations, and monitoring and deploying our cloud offerings. You will be tasked with building and operating a reliable, secure, and cost-efficient infrastructure to support the Dataiku SaaS offerings. This role presents a unique opportunity to engage in a project central to our company’s vision, with a strong and direct impact on our operations.

Mar 12, 2026

Director, Stadium Network Architecture

NFL Films

Full-time|$185K/yr - $215K/yr|On-site|New York, New York, United States

The National Football League (NFL) is seeking a highly skilled Director of Network Architecture for Stadium Operations to join our innovative team of engineering and architecture experts. This pivotal role requires a deep understanding of network design, engineering, architecture, and operations, complemented by expertise in Wireless Design and Spectrum Analysis. As the Senior Network Architect, you will spearhead the creation of network architecture diagrams, outline technical requirements, and act as the technical lead for football technology network infrastructure. We are looking for a candidate who excels in implementing and maintaining sophisticated network services, ensuring superior reliability, performance, and uptime. This hands-on leadership role will oversee the Infrastructure Engineering Team, addressing a variety of areas including the Football Technology Network, NFL temporary Fan Event Setups, and sideline wireless technology. This position is based in either our Manhattan or Mount Laurel, NJ offices and involves 30% to 50% travel both domestically and internationally.

Feb 5, 2026

Senior Site Reliability Engineer, Platform & Cloud FinOps

Hopper

Full-time|Remote|New York - Remote

About the Role: Join Hopper's dynamic Cloud FinOps team as a Senior Site Reliability Engineer. We oversee an extensive infrastructure within Google Cloud, empowering hundreds of engineers to deliver exceptional experiences to millions of users globally. If you are enthusiastic about automation and optimizing systems for performance and reliability, we want to hear from you. You will focus on building scalable, secure, and optimized infrastructure while solving practical problems with straightforward, cost-effective solutions.
Daily Responsibilities:
• Engage in projects that enhance cost efficiency, such as minimizing network egress costs by eliminating unnecessary headers; optimizing data storage solutions based on usage patterns, such as implementing cold storage for infrequently accessed data; and ensuring optimal autoscaling configurations for databases and compute resources.
• Enhance current cost attribution processes to provide transparency for all teams regarding their expenditures.
• Participate in incident support, including on-call rotation for platform incidents, collaborating with teams across the Americas and Europe to ensure continuous support.
• Contribute to a small but highly efficient team of SREs.

Mar 5, 2026

Principal Site Reliability Engineer

Chalkboard

Full-time|On-site|New York City

About Chalkboard: Chalkboard is pioneering the next generation of sports gaming. Our mission is to seamlessly merge watching and playing by transforming real-money sports gaming into a dynamic, social experience designed for fans eager to win. We are redefining how sports enthusiasts connect with the games they cherish. At our essence, we are a team of passionate, sports-loving innovators who prioritize transparency, equity, and the excitement of empowering fans to turn insights into actionable strategies.
The Role: We are on the lookout for a Principal Site Reliability Engineer to join our ranks at Chalkboard, contributing to the creation of a platform that is not only reliable and scalable but also user-friendly for our development teams. In this pivotal role, you will collaborate with Engineering, Product, and Data teams, significantly impacting how millions of fans engage with sports in real time. If you thrive in a fast-paced environment, love to build robust solutions from the ground up, and aim to achieve team success rather than individual accolades, we want to hear from you!
Your Game Plan:
• Take ownership of platform reliability from start to finish, proactively identifying and mitigating risks before they affect users.
• Develop and enhance observability (metrics, logs, tracing) to facilitate rapid issue detection, diagnosis, and resolution.
• Anticipate infrastructure needs by identifying bottlenecks and implementing sustainable architectural improvements.
• Minimize developer friction by refining CI/CD pipelines, deployment workflows, and internal tools.
• Lead incident responses and root cause analyses, focusing on systemic solutions rather than temporary fixes.
• Establish and uphold best practices for infrastructure, deployments, and system reliability.
• Create reusable, self-service infrastructure that empowers teams to deploy quickly and securely.
• Continuously enhance systems through automation and Infrastructure-as-Code methodologies.
What You Bring to the Team:
• Experience with Cloud Infrastructure (preferably GCP): including networking, IAM, databases, and storage.
• Proficiency in Kubernetes: managing cluster operations and workloads.
• Skilled in Infrastructure as Code tools: Terraform, Helm.
• Familiarity with CI/CD practices: using GitHub Actions or similar tools.
• Knowledge of observability practices: metrics, logging, tracing, and alerting.

Apr 6, 2026

Cloud Site Reliability Engineer

AYR Global IT Solutions

Full-time|On-site|New York

As a Cloud Site Reliability Engineer, you will be responsible for deploying innovative solutions within the public cloud environment, specifically utilizing AWS services. You will create and manage configuration templates designed for scalable infrastructure, including AWS components like EFS, EC2, and RDS. Collaborating closely with the Scrum Master, you will ensure the project requirements are met within an agile development setting.
Key Responsibilities:
• Contribute to architectural design to enhance system consistency, security, maintainability, and flexibility.
• Assist architects in creating highly scalable and automated deployments for diverse applications.
• Develop configuration templates using established architectural blueprints.
• Ensure the development of robust and scalable services across public cloud platforms, including AWS and GCP.
• Monitor and assess system performance to ensure optimal operation.

Aug 8, 2017

Senior Site Reliability Engineer

Full-time|On-site|New York, NY

Role overview: Ro is looking for a Senior Site Reliability Engineer based in New York, NY. This role focuses on maintaining and improving the reliability, availability, and performance of our cloud infrastructure and applications. The position supports ongoing enhancements and encourages a culture of continuous improvement across the engineering team.

Apr 16, 2026

Staff Site Reliability Engineer

Tabs

Full-time|On-site|New York City, NY

Join Tabs as a Staff Site Reliability Engineer to lead the charge in enhancing our systems for maximum reliability and performance. In this pivotal role, you will collaborate with cross-functional teams to design, implement, and maintain robust infrastructure solutions. You will ensure our systems are scalable, secure, and efficient, ultimately providing an unparalleled experience for our users. Your expertise in cloud technologies and automation will be vital as you drive initiatives to improve operational efficiency and system resilience. If you are passionate about creating reliable systems and thrive in a fast-paced environment, we want to hear from you!

Feb 24, 2026

Remote Site Reliability Engineer at Weedmaps

Weedmaps

Full-time|$133.1K/yr - $148K/yr|Remote|New York City, NY

Site Reliability Engineer Overview: Join Weedmaps as a Site Reliability Engineer and collaborate across departments, including application, infrastructure, and quality teams, to elevate the performance, reliability, resilience, and scalability of our web services at Weedmaps.com. As a cloud-native organization, we run 100% of our services in Docker on Kubernetes within AWS's public cloud. Our operations utilize observability, monitoring, CI/CD automation, and custom tooling, enabling us to deploy multiple production releases daily. Your daily responsibilities will focus on applying your engineering expertise to enhance system monitoring, minimize developer toil, configure CI workflows, and optimize our deployment pipelines. You will serve as a knowledge reference for development teams, ensuring they utilize consistent tools for metrics, logging, building, and deployment. Collaborating closely with both development and infrastructure teams, you will identify critical service-specific metrics that require monitoring, and you will help application development teams create libraries for seamless service instrumentation. The impact you'll make: Collaborate with stakeholders to establish and promote best practices for monitoring and CI/CD pipelines. Troubleshoot issues related to deployment within our CI pipeline. Actively promote the DevOps culture at Weedmaps. Identify opportunities for automation and advocate for the codification of processes. Promote best practices regarding collaboration, reliability, security, and performance across all partner teams. Take ownership of application configuration and scaling for specified services, ensuring adherence to organizational practices. Develop and optimize synthetic monitoring flows. What you've accomplished: A minimum of 2 years of development experience in startup or mid-sized environments. Proficiency in programming languages such as Python, Go, Node, Ruby, or Elixir. 
Knowledge of containerization technologies, particularly Docker (Kubernetes experience is a plus). Strong communication skills, a positive demeanor, and the ability to provide and receive constructive feedback. Professional experience with cloud-native observability standards including OpenMetrics, OpenTracing, and OpenCensus. Expertise in using and configuring modern CI/CD workflows. Deep understanding of SLIs, SLOs, and SLAs at both service and business levels. Familiarity with golden signals and their significance in monitoring.

Apr 3, 2026

Site Reliability Engineer - Infrastructure Specialist

Medal

Full-time|On-site|New York City

Role overview: Medal seeks a Site Reliability Engineer - Infrastructure Specialist in New York City. The focus is on strengthening the company’s infrastructure and ensuring the stability of Medal’s systems. This role works within a collaborative team to design, build, and maintain the technical foundation that enables the company’s growth and efficiency.
What you will do:
• Design and implement infrastructure solutions that can scale as demand increases
• Maintain and improve system reliability to help minimize downtime
• Monitor and optimize system performance to keep applications running smoothly
• Collaborate with team members to address ongoing infrastructure requirements

Apr 24, 2026

Staff Site Reliability Engineer

Legora

Full-time|On-site|New York City

About Legora: Legora builds AI-powered tools for legal professionals, working side by side with lawyers to ensure technology fits real-world needs. The platform helps legal teams work more efficiently, ask better questions, and find new insights. Clients include leading global firms such as Cleary Gottlieb, Goodwin, Bird & Bird, and Linklaters, with Legora’s reach spanning over 40 countries. The company values rapid shipping, thoughtful iteration, and scaling with purpose. Legora’s team is committed to high standards, always aiming to deliver technology that truly empowers lawyers. The culture rewards those who want to build from scratch, work with talented colleagues, and help shape the future of legal work.
Staff Site Reliability Engineer, New York City (Onsite): This role joins the founding SRE team at Legora’s new engineering hub in New York City. The Staff Site Reliability Engineer leads reliability efforts across multiple teams, sets infrastructure architecture standards, and drives operational excellence for the platform. The position works closely with colleagues in Stockholm and requires in-office presence five days a week.
What You Will Do:
• Design and manage reliability and infrastructure strategies for several teams and services
• Oversee observability, capacity planning, and monitoring for distributed systems
• Develop and refine SLI/SLO frameworks, error budgets, and production readiness standards
• Lead incident management, create escalation protocols, and drive improvements from post-mortem analysis
• Work with engineering teams to integrate reliability best practices into their workflows
Location Requirement: This position is based in New York City and requires onsite work five days per week.

Apr 15, 2026

Site Reliability Engineer at Mistral | New York

Mistral

Full-time|On-site|New York, NY

Role overview: The Site Reliability Engineer at Mistral plays a key part in keeping systems stable, available, and performing well. This position requires close collaboration with teams throughout the company to support and improve the infrastructure that powers Mistral’s services.
What you will do:
• Maintain and improve system reliability and uptime
• Partner with other teams to design and build scalable infrastructure
• Implement monitoring tools, automation, and incident response processes
Location: This role is based in New York, NY.

Apr 21, 2026

Site Reliability Operations Analyst - Commercial

Palantir Technologies

Full-time|On-site|New York, NY

Join a Pioneering Company: At Palantir, we create cutting-edge software that revolutionizes data-driven decision-making and operational efficiency. Our platforms enable partners to make significant advancements, from developing life-saving pharmaceuticals to predicting supply chain challenges and reuniting families with missing children.
Role Overview: The Site Reliability Operations Analyst plays a crucial role in ensuring the seamless deployment of Palantir solutions. Your mission is to design, implement, and optimize workflows that enhance operational efficiency and minimize obstacles. You will monitor and stabilize projects, proactively identify and address challenges, and anticipate client needs, allowing our engineers to dedicate their expertise to solving complex technical issues. This role demands a blend of project management, process improvement, and execution capabilities. We seek individuals who are passionate about problem-solving, embrace innovative ideas, and thrive in collaborative environments.

Nov 28, 2023

Senior Software Engineer - Site Reliability

Parabola

Full-time|$180K/yr - $200K/yr|On-site|New York, New York

About Us: At Parabola, we empower teams to transform and streamline complex data workflows with ease. Our innovative workflow builder allows users to automate tasks that were previously manual, including data from PDFs, emails, and spreadsheets. Forward-thinking companies such as Brooklinen, On Running, and Flexport leverage Parabola to enhance their productivity and tackle ambitious projects. Our platform enables teams to automate processes, saving valuable time and resources, all without requiring extensive engineering support. Supported by prominent investors like OpenView Partners, Matrix Partners, and Thrive Capital, we are committed to continuous innovation and growth.
About the Role: As a Senior Site Reliability Engineer, you will be integral to our dynamic team, focusing on measuring and enhancing the performance of our software systems. Your role will encompass critical aspects such as performance optimization, security compliance, and foundational architecture. With our compact team structure, your contributions will significantly impact our operations and customer satisfaction.
What You Will Do:
• Oversee the monitoring of our core software systems, both during on-call scenarios and in routine operations, to establish and refine service-level objectives (SLOs) and agreements (SLAs).
• Enhance and monitor our infrastructure stack, ensuring it meets the demands of our business-critical services.
• Maintain a comprehensive mental and documented model of our systems to effectively assess risks, plan projects, and troubleshoot issues.
• Engage with the engineering team and leadership, providing insights from your expertise in site reliability and advocating for best practices.
• Contribute to the development of our core orchestration logic, which supports the efficient execution of thousands of workflows concurrently, utilizing our new orchestration built on Temporal.
• Support multiple backend engineering projects during planning and execution phases, joining as needed for hands-on development.
• Focus on optimizing service scalability, stability, and observability.

Oct 28, 2025

Director of Engineering - Site Reliability and Infrastructure

FuboTV Inc.

Full-time|$225K/yr - $275K/yr|Hybrid|New York, NY

About FuboTV: FuboTV Inc. is a pioneering live TV streaming service dedicated to enhancing the viewing experience for consumers, shaping the future of television. Recognized by the Financial Times as one of The Americas’ Fastest-Growing Companies in 2025, FuboTV operates platforms such as Hulu + Live TV (entertainment), Fubo (sports), and Molotov (entertainment and sports), delivering diverse content to audiences globally. Our Mission: We strive to provide top-notch sports, news, and entertainment programming through an exceptional user experience that promises increased choice, flexibility, and value. About the Role: *This role is a hybrid position based in New York City. Candidates must be located in NYC and be prepared to work in the office three times a week (Tuesday, Wednesday, Thursday).* We are currently seeking a Director of Site Reliability to lead our global infrastructure and site reliability teams. These teams are essential for ensuring the reliability, scalability, and performance of our products worldwide. This position will facilitate the integration of software development and system operations on a global scale, guaranteeing seamless product delivery to our users. Responsibilities: Develop and implement a long-term vision for infrastructure, cloud architecture, and site reliability in a multi-cloud environment (GCP, Azure, AWS) across various continents. Manage, mentor, and expand a global team of infrastructure, SRE/DevOps, Managers, and FinOps engineers, driving the organization towards its objectives. Integrate AI-driven monitoring and incident response tools to transition from reactive alerts to predictive anomaly detection and automated remediation. Establish industry best practices in reliability, observability, and incident management. Champion platform principles across the engineering organization, overseeing projects that enable teams to innovate swiftly without compromising user experience. 
Guide the design and development of a scalable, secure, and highly available cloud infrastructure. Optimize systems for enhanced performance, scalability, and cost efficiency. Lead capacity planning and disaster recovery initiatives. Steer the incident response process and ongoing improvement efforts. Collaborate closely with engineering and product teams to enhance system reliability and developer experience.

Mar 31, 2026

Senior Site Reliability Engineer

Veterinary Emergency Group

Full-time|$170K/yr - $200K/yr|Hybrid|VEG Headquarters, White Plains, NY

ABOUT VETERINARY EMERGENCY GROUP Founded in 2014, the Veterinary Emergency Group (VEG) is dedicated to transforming the emergency care experience for pets and their owners. With a vision to redefine norms and improve the ER experience, we have rapidly expanded our network of hospitals that operate 24/7/365 across the nation. Our commitment to understanding the needs of pets and their families drives our continuous innovation. We prioritize not only the wellbeing of our patients but also of our team members (VEGgies), empowering them to achieve greatness and fostering a culture of growth and belonging. At VEG, we are reimagining emergency care in every aspect—from hospital operations to the support systems for our teams. Our headquarters team is pivotal in this transformation, whether it's through developing innovative technology to enhance hospital efficiency, recruiting exceptional talent, or effectively showcasing our brand through marketing. Our headquarters team ensures that our hospitals are equipped with the necessary resources to deliver outstanding care to pets and their families. VEG has been recognized as a Great Place to Work® for 2025 and 2026. THE ROLE We are seeking a Senior Site Reliability Engineer who recognizes the critical importance of reliability at VEG; our proprietary platform, DogByte, is essential to the survival of pets. As the primary architect of our platform's resilience, you will engineer our infrastructure to be self-healing, enabling our medical teams to provide life-saving care around the clock. Your role will be a blend of high-level architectural strategy and hands-on technical execution, ensuring our engineering teams can rapidly develop while maintaining a solid foundation. Your efforts will focus on evolving and enhancing existing systems to support VEG’s hospital expansion, ensuring that our infrastructure is never a limiting factor in our ability to open new hospitals or deliver medical care. 
You will take ownership of DogByte's ongoing stability, scaling it into a robust enterprise platform where individual hospital traffic is isolated to prevent impact on others. This position offers the flexibility to work at our headquarters in White Plains or remotely. KEY RESPONSIBILITIES Develop short- and long-term strategies to ensure DogByte can handle increasing volume year-over-year, particularly addressing traffic isolation between hospitals. Collaborate with engineering teams to ensure that data flows—from client to API to database—are optimized for high availability and performance.

Mar 23, 2026

Site Reliability Engineer 3

MongoDB, Inc.

Full-time|$111K/yr - $218K/yr|Hybrid|New York City

The Site Reliability Engineering team at MongoDB supports the infrastructure behind the MongoDB Atlas platform. With Atlas serving customers worldwide, the team addresses the demands of delivering fast, reliable service across multiple regions while meeting data sovereignty requirements.
Role overview: This Site Reliability Engineer 3 position centers on designing and maintaining scalable systems. The work involves reducing manual tasks, improving monitoring, and increasing visibility into system health. Infrastructure-as-code is a key principle, and the team invests in automation and self-healing systems to minimize disruptions.
Collaboration: Teamwork is essential in this role. Site Reliability Engineers regularly partner with other engineering groups, sharing responsibilities and working together to achieve common objectives.
Location: This role is based in New York City and follows a hybrid work schedule.

Apr 21, 2026

Network Engineer - Reliability & Observability at Fluidstack | New York, NY

Fluidstack

Full-time|$150K/yr - $250K/yr|On-site|New York, NY

Join Fluidstack as a Network Engineer! At Fluidstack, we are at the forefront of building cutting-edge infrastructure designed for abundant intelligence. Collaborating with leading AI labs, government entities, and major enterprises such as Mistral, Poolside, and Meta, we strive to deliver compute capabilities at unprecedented speeds. Our mission is to accelerate the realization of Artificial General Intelligence (AGI). We are urgently seeking passionate individuals who are committed to delivering exceptional infrastructure. At Fluidstack, we take pride in our work, treating our customer outcomes as our own. If you are driven by purpose and excellence, and ready to put in the effort necessary to shape the future of intelligence, we invite you to join us!
Position Overview: Fluidstack is on the lookout for a Network Engineer specializing in Reliability & Observability. In this pivotal role, you will act as a reliability engineer, leading the charge in developing processes, collecting data, and establishing reliability metrics aimed at enhancing the quality and dependability of AI networks throughout all operational phases. Your primary focus will be on creating systems, tools, and data pipelines to boost network quality, while also automating metrics reporting (24/7) and generating periodic reliability assessments for both internal teams and customers. This position is perfect for seasoned network operators who possess a deep passion for reliability and have experience in designing and implementing full lifecycle software, including conducting Quality Assurance audits and analyzing failure rates. A strong interest in hardware (both electronics and optics) and software development is essential, alongside a commitment to leveraging data for informed decision-making in deployment and operations. We encourage experienced Site Reliability Engineers (SREs) with a strong networking background to apply!
Key Responsibilities:
• Quality Assurance Ownership: Design and implement QA processes tailored for network hardware and networks.
• Data Pipelines: Develop and deploy both serverless and manually triggered workflows to generate network quality and reliability observability for our clients.
• Deployment and Operations Assistance: Collaborate with various teams to support full lifecycle data collection, analysis, and process enhancements aimed at meeting service level agreements (SLAs) and objectives (SLOs).
• Process Engineering: Innovate and implement process improvements to streamline deployment and operational workflows.

Feb 10, 2026

Site Reliability Engineer at WRITER | New York City

WRITER

Full-time|Hybrid|New York City, NY

About WRITER: WRITER is the premier platform where leading enterprises harness the power of AI to streamline their operations. Our mission is to enhance human potential through advanced superintelligence, demonstrating its feasibility with a trustworthy AI solution that bridges IT and business teams, facilitating transformative change across organizations. WRITER’s comprehensive platform empowers hundreds of companies, including Mars, Marriott, Uber, and Vanguard, to develop and deploy AI agents tailored to their unique datasets, supported by our enterprise-grade LLMs. With a valuation of $1.9B and support from top-tier investors such as Premji Invest, Radical Ventures, and ICONIQ Growth, WRITER is quickly establishing itself as the frontrunner in the field of enterprise generative AI. Founded in 2020, with offices in San Francisco, New York City, Austin, Chicago, and London, we are a dynamic team focused on innovation and speed. We seek intelligent, dedicated builders and innovators to join us in shaping the future of work powered by AI.
About the Role: As a Site Reliability Engineer at WRITER, you will play a critical role in ensuring the availability, performance, and reliability of our platform, which is essential for our mission to enhance human capabilities with superintelligence. Your work will directly influence every enterprise customer reliant on our AI-powered workflows. This position goes beyond routine maintenance; it involves proactively identifying and resolving intricate systemic challenges and establishing the framework necessary for our rapid growth and the evolving needs of enterprise generative AI. You will develop resilient systems, automate processes throughout the stack, and advocate for reliability best practices, directly contributing to our ambitious product roadmap and ensuring our clients have continuous access to the powerful tools they require. This is a hybrid role based in either our New York City or London office, reporting to the Director of Engineering.
Responsibilities:
• Automate operational tasks and infrastructure management by creating robust tools and platforms using languages such as Python, Go, or similar, significantly minimizing manual workload across our production environment.
• Design and implement scalable, fault-tolerant infrastructure solutions on leading public cloud platforms (AWS, GCP, Azure) to support WRITER's swiftly growing, high-traffic AI platform.
• Take ownership of the reliability, performance, and efficiency of WRITER’s core services, establishing and maintaining rigorous Service Level Objectives (SLOs) and Error Budgets.

Feb 12, 2026
