Experience Level
Manager
Qualifications
To succeed in this role, you should possess: a strong background in security systems engineering, with hands-on experience in reliability engineering; proven leadership skills, demonstrating the ability to manage and mentor a team effectively; excellent problem-solving skills and a proactive approach to system reliability; strong communication skills to convey complex technical concepts to non-technical stakeholders; and experience with incident management and response, along with a solid understanding of security protocols.
About the job
Join Cloudflare as a Security Systems Reliability Engineering Manager and lead a team dedicated to enhancing the reliability of our security systems. In this hybrid role, you will drive initiatives that ensure our security infrastructure is robust and resilient, addressing critical challenges within our operations.
As a leader, you will collaborate with cross-functional teams to enhance system performance and reliability, ensuring that our security systems meet the high standards expected by our users. Your expertise will be pivotal in maintaining the integrity and availability of our services.
About Cloudflare, Inc.
Cloudflare is a leading web performance and security company, helping to build a better Internet. With a commitment to innovation and excellence, we provide our customers with a suite of services designed to enhance the performance and security of their online assets.
Similar jobs
Full-time|On-site|Austin; San Francisco; Seattle; United States
Join MongoDB as a Senior Site Reliability Engineer specializing in Infrastructure Security. In this pivotal role, you'll be at the forefront of ensuring the reliability and security of our cloud infrastructure. Your expertise will help us to design and maintain systems that are robust, efficient, and secure, providing critical support to our engineering teams.
Your responsibilities will include monitoring system performance, implementing security protocols, and troubleshooting incidents to maintain high availability. You will collaborate with cross-functional teams to enhance our security posture, ensuring that our services are resilient and secure.
The City and County of San Francisco is seeking a dynamic and experienced Principal Information Systems Engineer with a specialization in Security. This role spans multiple departments citywide, providing the opportunity to make a significant impact on the security posture of our information systems.
In this position, you will lead initiatives to enhance cybersecurity measures, implement best practices, and ensure compliance with industry standards. You will collaborate with various stakeholders to identify risks and develop strategies to mitigate them. Your expertise will be crucial in safeguarding sensitive data and maintaining the integrity of our systems.
About Our Team
The Frontier Systems team at OpenAI is at the forefront of technological innovation, responsible for designing, deploying, and maintaining state-of-the-art supercomputers that power our most advanced model training initiatives. We transform innovative data center designs into fully functional systems and develop the necessary software to support extensive frontier model training.
Our mission is to ensure the stability and efficiency of these hyperscale supercomputers, providing an uninterrupted environment for the training of frontier models.
About the Opportunity
We are seeking passionate engineers to manage the next generation of compute clusters that fuel OpenAI’s leading-edge research. This role merges distributed systems engineering with practical infrastructure expertise across our expansive data centers. You will be tasked with scaling Kubernetes clusters to unprecedented levels, automating bare-metal deployments, and creating software solutions that simplify interactions across a multitude of nodes in various data centers.
You will operate at the confluence of hardware and software, where speed and reliability are of utmost importance. Prepare to oversee dynamic operations, swiftly diagnose and resolve critical issues, and continuously enhance automation and system uptime.
Key Responsibilities:
Deploy and scale substantial Kubernetes clusters, implementing automation for provisioning, bootstrapping, and lifecycle management.
Create software abstractions that integrate multiple clusters, delivering a seamless interface for training workloads.
Oversee node deployment from bare metal to firmware upgrades, ensuring swift and repeatable processes at scale.
Enhance operational metrics, striving to minimize cluster restart times (e.g., reducing from hours to minutes) and expedite firmware or OS upgrades.
Integrate networking and hardware health systems to ensure comprehensive reliability across servers, switches, and data center infrastructure.
Develop monitoring and observability systems that proactively identify issues and maintain cluster stability under peak loads.
Be prepared to perform at the level of a software engineer in execution and problem-solving.
You May Be a Great Fit If You:
Possess extensive experience in operating or scaling Kubernetes clusters or similar container orchestration systems.
Full-time|$165K/yr - $242K/yr|On-site|Livingston, NJ / New York, NY / Sunnyvale, CA / Bellevue, WA/ San Francisco, CA
CoreWeave is seeking a Security Engineering Manager to lead the Platform Security team. This position is based in Livingston, NJ, New York, NY, Sunnyvale, CA, Bellevue, WA, or San Francisco, CA. The team’s mission is to embed security into CoreWeave’s Kubernetes-based platform and public cloud environments, supporting high-performance infrastructure for AI and machine learning workloads.
Role overview
This manager will oversee and expand the Platform Security engineering team, reporting to the Senior Director of Security Foundations. The focus is on hands-on leadership and technical execution, with an emphasis on building and implementing security controls rather than policy development. The role requires close collaboration with Infrastructure, Platform Engineering, Site Reliability Engineering, and other security teams to ensure security measures keep pace with business growth and evolving needs.
What you will do
Lead and grow the Platform Security engineering team.
Integrate security into Kubernetes infrastructure and public cloud platforms such as AWS, GCP, and Azure.
Define and execute strategies for cloud security posture, workload isolation, platform guardrails, image integrity, and multi-cloud security.
Develop and implement security controls across CoreWeave’s infrastructure.
Work closely with other technical teams to align platform security with business needs.
The Platform Security team
The Platform Security team at CoreWeave engineers systems that enforce security at the infrastructure layer. Their work spans both CoreWeave’s own Kubernetes-based platform and third-party public cloud environments. The team supports GPU-accelerated infrastructure for demanding AI and machine learning workloads, ensuring that both customer and internal services remain secure as CoreWeave’s global presence expands.
Full-time|$227.2K/yr - $324.5K/yr|Hybrid|San Francisco, CA (Hybrid)
About the Role: At Tubi, our Site Reliability Engineering (SRE) team transcends traditional operations. We embody a software engineering ethos, leveraging a developer's toolkit to tackle the complexities of large-scale, distributed systems. Our core mission focuses on building resilience from the ground up, empowering our product teams to innovate swiftly while delivering an exceptional user experience. We oversee the availability, latency, performance, and capacity of our platform, driven by a culture of data-informed decision-making, blameless learning, and relentless automation. We are on the lookout for a seasoned and visionary Senior Manager of SRE to lead and expand our newly formed Site Reliability Engineering team. You will be more than just a people manager or tech lead; you will be the strategic architect behind our reliability roadmap. Your role will involve building and mentoring a team of skilled engineers, cultivating an environment of blameless learning and continuous improvement, while advocating for the engineering practices that balance rapid innovation with unwavering stability. You will play a pivotal role within our engineering leadership, collaborating with peers across the organization to embed reliability as a shared responsibility and a fundamental principle of our engineering culture.
Join the City and County of San Francisco as an Information Systems Engineer specializing in Security. In this pivotal role, you will be responsible for designing, implementing, and maintaining secure information systems across multiple city departments. You will collaborate with various stakeholders to ensure the integrity and confidentiality of sensitive data while adhering to best practices and regulatory requirements.
The City and County of San Francisco is seeking a Senior Information Systems Engineer with a focus on security to join our dedicated team across multiple departments citywide. This role is crucial for enhancing the security posture of our information systems, ensuring the safety and integrity of city data.
The ideal candidate will possess a robust background in information systems engineering, with a strong emphasis on security protocols, risk management, and system architecture. You will be responsible for designing, implementing, and maintaining security solutions that protect sensitive information.
Join the City and County of San Francisco as an Assistant Information Systems Engineer specializing in Security. This entry-level position offers a unique opportunity to work across multiple departments citywide, contributing to the enhancement of our IT security infrastructure.
Your role will involve assisting in the design, implementation, and maintenance of information security systems, ensuring data integrity and protection against cyber threats. If you are passionate about technology and eager to develop your skills in a supportive environment, we encourage you to apply!
About Our Team
Join our dynamic Infrastructure organization at OpenAI, where we are actively seeking talented software engineers to bolster our efforts across several high-impact teams. With a variety of focus areas available—including Core Distributed Systems, Databases, Observability, and Cloud Infrastructure—you'll have the opportunity to work on projects that fascinate you. Our teams operate with a high level of autonomy and foster a deeply collaborative environment, all dedicated to enhancing safety, reliability, and operational velocity across the organization.
About the Role
As a Software Engineer focused on Infrastructure Reliability, you will play a pivotal role in scaling and fortifying the infrastructure that supports some of the world’s most widely utilized AI systems. Your work will ensure that our systems maintain high reliability, observability, performance, and security—enabling researchers to iterate rapidly and allowing products like ChatGPT and the OpenAI API to effectively serve millions of users.
This hands-on, impactful role is perfect for engineers who enjoy ownership, excel at solving complex technical challenges across the entire stack, and wish to contribute to systems that facilitate cutting-edge research deployed on a global scale. You will significantly influence technical direction, enhance system resilience, and collaborate closely with infrastructure, product, and research teams to transform intricate infrastructure into dependable platforms.
Key Responsibilities
Design, construct, and maintain reliable, high-performance systems utilized across engineering.
Identify and resolve performance bottlenecks and inefficiencies, ensuring our infrastructure scales appropriately.
Investigate and troubleshoot complex issues thoroughly.
Enhance automation to minimize manual tasks and improve internal developer tools.
Participate in incident response, postmortem analysis, and the development of best practices surrounding system reliability and scalability.
Ideal Candidate Profile
Possess a deep understanding of distributed systems principles, with a proven track record in developing and managing scalable, reliable systems.
Demonstrate a strong focus on performance and optimization, with the ability to maximize efficiency in complex, globally distributed systems.
Have experience managing orchestration systems such as Kubernetes at scale and creating abstractions over cloud platforms.
Be comfortable working within Linux environments and possess strong problem-solving skills.
Full-time|$248K/yr - $279K/yr|On-site|San Francisco Bay Area
At Discord, we connect over 200 million users every month through our platform, primarily for one exhilarating reason: gaming. With more than 90% of our community engaged in gaming, they collectively spend an astounding 1.5 billion hours indulging in thousands of unique titles on Discord each month. Our commitment is to enhance the gaming experience, making it more enjoyable for everyone to communicate and socialize before, during, and after gameplay.
Your Role
Lead a talented team of security engineers to develop and implement robust application security tools and services, conduct secure design reviews, and perform threat modeling, while providing expert guidance on secure development practices at Discord.
Ensure the security of our code and development processes from the Integrated Development Environment (IDE) through to production.
Enhance the detection and remediation of security vulnerabilities at scale.
Collaborate with various Discord teams to minimize security risks for our users while proactively identifying and addressing security bugs prior to production deployment.
Partner with Discord’s product engineering and management teams to advocate for innovative security features that enhance user protection.
Become a vital part of the engineering teams that responsibly bring OpenAI’s transformative technologies to the world!
At OpenAI, our Applied Engineering team collaborates across research, engineering, product management, and design to deliver AI solutions to both consumers and businesses. We are committed to learning from our deployments, maximizing the benefits of AI, and ensuring that this powerful technology is utilized both safely and ethically. Our priority is safety over unchecked growth.
About the Role
As OpenAI continues to expand, we are seeking seasoned engineers who excel in problem-solving to enhance the scalability of our systems. Our achievements hinge on our ability to rapidly iterate on product development while ensuring optimal performance and reliability. You will thrive in a collaborative, fast-paced environment, playing a key role in delivering our technology to millions globally, with a focus on safety and reliability. As a reliability engineer, you will lead efforts to maintain and improve the stability, scalability, and performance of our dynamic infrastructure. You will collaborate closely with cross-functional teams, including software engineers, product managers, and data scientists, to construct and sustain robust systems capable of accommodating our growing user base and workload.
Your Responsibilities Include:
Designing and implementing solutions to scale our infrastructure to meet increasing demands effectively.
Developing and maintaining load, chaos, and synthetic testing software that enhances the reliability of systems designed by development teams.
Creating and managing automation tools to streamline repetitive tasks and bolster system reliability.
Overseeing the lifecycle management platform for CPU/storage, GPU, and network resources to foster efficiency and support dynamic optimization.
Implementing fault-tolerant and resilient design patterns to minimize service interruptions.
Establishing and maintaining service level objectives (SLOs) and service level indicators (SLIs) to ensure system reliability.
Collaborating with researchers, engineers, product managers, and designers to introduce new features and research advancements to the world.
Participating in an on-call rotation to address critical incidents and ensure 24/7 system availability.
Your Impact: Your contributions will be essential in guaranteeing the reliability and performance of our platforms as we continue to scale our operations.
ABOUT BASETEN
Baseten is at the forefront of powering mission-critical AI inference for some of the most innovative companies globally, including Cursor, Notion, OpenEvidence, Abridge, Clay, Gamma, and Writer. We integrate cutting-edge applied AI research with a flexible infrastructure and intuitive developer tools to empower companies at the leading edge of AI to deploy sophisticated models effectively. With our recent $300M Series E funding round—supported by prominent investors such as BOND, IVP, Spark Capital, Greylock, and Conviction—we are rapidly expanding. Join our dynamic team and contribute to creating an essential platform for engineers to launch AI products with ease.
THE ROLE
As a Site Reliability Engineer, you will design and implement resilient systems and processes that ensure our infrastructure is scalable, reliable, and efficient. Your responsibilities will encompass everything from automating deployments and monitoring systems to enhancing performance and managing incidents effectively.
Collaboration is key; you will work closely with our users to understand their challenges in operationalizing machine learning, facilitating their onboarding onto our platform, and leveraging these insights to inform improvements to Baseten.
EXAMPLE INITIATIVES
As part of our Infrastructure team, you will engage in exciting projects such as:
Innovative multi-cloud capacity management
Optimizing inference on B200 GPUs
Implementing multi-node inference
Utilizing fractional H100 GPUs for efficient model serving
RESPONSIBILITIES
Design and maintain scalable infrastructures to support the deployment and operational needs of machine learning models.
Establish standards and best practices to enhance reliability and performance across the infrastructure.
Proactively identify and resolve reliability issues using monitoring and alerting systems.
Collaborate with cross-functional teams to apply best practices in infrastructure management and incident response.
Create automation scripts to streamline processes and reduce manual intervention.
About Gridware
Gridware is an innovative technology firm headquartered in San Francisco, committed to safeguarding and enhancing the reliability of the electrical grid. We have pioneered a revolutionary approach to grid management known as Active Grid Response (AGR), which meticulously monitors the electrical, physical, and environmental factors influencing grid safety and reliability. Our state-of-the-art AGR platform leverages high-precision sensors to identify potential issues at an early stage, facilitating proactive maintenance and fault resolution. This holistic strategy is designed to bolster safety, minimize outages, and ensure optimal grid performance. We are proud to be supported by prominent climate-tech and Silicon Valley investors. To learn more, visit www.Gridware.io.
About the Role
We are seeking a skilled Senior Hardware Reliability Engineer to lead reliability testing, analysis, and lifetime modeling of various outdoor electronic assemblies. This pivotal role will concentrate on the electronic components of our products, collaborating closely with our mechanical-focused Reliability Engineer and engaging with the broader hardware and cross-functional teams.
Full-time|On-site|Austin; Boston; Chicago; Denver; Miami; New York City; San Francisco; Seattle; United States
Join MongoDB as a Senior Site Reliability Engineer specializing in Fleet Management. In this role, you will be pivotal in enhancing the reliability and performance of our systems, ensuring seamless operations across our platforms. You will collaborate with cross-functional teams to design, implement, and maintain infrastructure solutions that meet the needs of our growing customer base.
Your expertise will be crucial in identifying performance bottlenecks, automating processes, and orchestrating system deployments. If you are passionate about building scalable and resilient systems and thrive in a fast-paced environment, we want to hear from you!
About Unify
At Unify, we are pioneering the first AI-driven system of action for revenue teams, enabling businesses to transform their outbound strategies into high-performing growth engines. Our focus is on making go-to-market execution measurable, repeatable, and scalable. Founded in 2023 by industry veterans from Ramp and Scale AI, our talented team has diverse experience from leading organizations such as Airbnb, Meta, Waymo, and Perplexity.
In 2024, Unify achieved an impressive 8x revenue growth and serves notable clients including Perplexity, Cursor, SoFi, and Justworks. We are a dynamic, high-energy team backed by $58M in funding from Thrive, Emergence, OpenAI, and others. Join us as we shape the future of GTM!
About the Role
As the Staff SRE Tech Lead at Unify, you will be instrumental in enhancing the reliability and scalability of our platform as we handle increasing volumes of data and accommodate customers with stringent uptime requirements. You will define the technical roadmap for reliability engineering, lead a dedicated team of SREs, and collaborate closely with engineering leaders to establish systems and practices that ensure Unify remains both swift and dependable at scale.
About Multiply Labs
Multiply Labs is an innovative startup located in San Francisco, California, backed by renowned investors in technology and life sciences such as Casdin Capital, Lux Capital, and Y Combinator. Our goal is to develop the world's leading robotic systems and utilize them to make groundbreaking life-saving therapies accessible to everyone.
We are transforming the manufacturing process of cell therapies through the creation of advanced robotic systems that automate and scale the production of these crucial treatments. Our cutting-edge robots enable biopharma companies to produce cell therapies efficiently without overhauling their existing processes, thus minimizing regulatory hurdles and risks. Unlike traditional methods that are labor-intensive and costly (often exceeding $1M per patient), our robotic solutions aim to make these vital treatments more affordable and reachable for those who need them.
To discover more and view our robots in action, please visit www.multiplylabs.com and follow us on LinkedIn.
Position Overview
We are looking for a dedicated Hardware Reliability Engineer to become an essential part of Multiply Labs’ Reliability Engineering team. As a founding member, you will collaborate closely with the Hardware Product and Systems Integration teams to enhance our designs throughout the entire development lifecycle, from initial prototypes to fully deployed GMP production systems. Your contributions will directly support the delivery of life-saving therapies by ensuring our robots operate seamlessly within the high-stakes biotech environment.
Full-time|Remote|Denver, Colorado, United States; San Francisco, California, United States
Join Checkr as a Software Engineer focusing on Reliability, where your contributions will enhance our platform's robustness and performance. You will be part of a dynamic team dedicated to building and scaling systems that support our growth and ensure outstanding service delivery to our clients.
Join Our Innovative Team
At OpenAI, our Hardware organization is pioneering cutting-edge silicon and system-level solutions tailored to meet the demands of advanced AI workloads. We pride ourselves on developing next-generation AI-native silicon while collaborating with software and research partners to create hardware that is intricately integrated with AI models. Our mission includes delivering high-performance silicon for OpenAI’s supercomputing infrastructure and designing custom tools and methodologies that accelerate innovations, specifically optimized for AI applications.
Your Role in Our Mission
We are on the lookout for a dynamic and experienced Reliability/DFX Engineer who possesses extensive knowledge in scaling machine learning systems. As an integral member of our hardware team, you will collaborate with chip design, platform design, hardware health, and the wider industry ecosystem to architect, implement, and deploy dependable next-generation AI accelerator systems. You will take a holistic approach to evaluate system and chip architecture, pinpointing high-ROI opportunities that enhance reliability and availability throughout the stack while translating these insights into actionable strategies and silicon features.
Key Responsibilities:
Lead the architecture, implementation, and execution of DFX strategies in silicon from concept to high-volume deployment, proposing impactful features to boost reliability and fault tolerance. Your focus will encompass design for testability, reliability, availability, and serviceability of high-performance AI hardware.
Develop system-level reliability models based on empirical data to guide the organization’s DFX and reliability strategy, necessitating a deep understanding of chip and system architecture, design, implementation, and component-level reliability.
Collaborate with chip and platform architecture/design teams to explore and implement DFX features, including the specification and integration of digital/mixed-signal IP, firmware/system software, and DFX methodologies.
Work alongside hardware health and platform design teams to enhance reliability and fault tolerance in New Product Introduction (NPI) and High-Volume Manufacturing (HVM), driving continuous, data-driven improvements across the stack through optimized operating conditions and data analysis.
Act as the DFX/reliability advocate, aligning the broader industry ecosystem with OpenAI’s strategic objectives and roadmap.
Qualifications:
Bachelor’s degree in Engineering or related field with 15+ years of experience, or a Master’s degree with 10+ years of relevant experience.
Proven expertise in DFX methodologies and reliability engineering for high-performance hardware.
Strong analytical and problem-solving skills, with a track record of improving system reliability and performance.
Excellent collaboration and communication abilities, capable of working effectively in a cross-functional team environment.
Familiarity with AI workloads and associated hardware requirements is highly desirable.
Join Cloudflare as a Database Reliability Engineer, where you will play a crucial role in ensuring the reliability and performance of our database systems. You will work collaboratively with our engineering teams to develop, implement, and maintain robust database solutions that support our mission of making the internet safer and faster.
Your responsibilities will include monitoring database performance, troubleshooting issues, and optimizing queries to enhance system efficiency. If you are passionate about databases and eager to make an impact in a dynamic environment, we encourage you to apply!
Feb 6, 2026