Experience Level
Senior
Qualifications
The ideal candidate will possess a strong background in software engineering, with expertise in observability tools and techniques. A Bachelor's Degree in a relevant field is required, along with experience in distributed systems and cloud technologies. Proficiency in programming languages such as Java, Python, or Go is essential. You should also demonstrate excellent problem-solving skills and the ability to work effectively in a team environment.
About the job
Adyen seeks a Senior Software Engineer in San Francisco to focus on Customer Developer Observability. This position aims to enhance the tools and systems that let clients monitor and analyze their performance across the Adyen platform.
What you will do
Collaborate with cross-functional teams to design and build observability solutions.
Create and implement features that provide customers with deeper insights into their systems and data.
Help improve the customer experience by making monitoring and analysis more effective and accessible.
About Adyen
Adyen is a leading global payment company that provides businesses with a seamless payment experience across multiple channels. Our mission is to empower businesses to accept payments anywhere in the world, and we pride ourselves on our innovative technology and commitment to customer satisfaction.
Similar jobs
Search for Senior Software Engineer Observability And Reliability
Full-time|$170K/yr - $240K/yr|On-site|San Francisco, CA
About the Role Sigma Computing is growing its engineering team in San Francisco, CA. The company builds technology to help users access data with ease. As a Senior Software Engineer focused on Observability and Reliability, you will work alongside engineers who value high standards and collaboration. What You Will Do Design and build observability platforms and tools, including metrics collection, logging, distributed tracing, dashboards, alerting, and application performance monitoring. Work with technologies such as Go, OpenTelemetry, and Kubernetes to solve reliability challenges. Take part in on-call rotations to help maintain strong uptime for Sigma’s services. Create tools and processes to improve cloud incident triage and reduce downtime. Define and promote practices that make systems and services measurable and observable. Join design and code reviews with peers and stakeholders to reinforce quality and effective collaboration.
Full-time|$194K/yr - $267K/yr|On-site|San Francisco, California
Discover Okta
Okta is recognized as The World’s Identity Company, empowering individuals to securely leverage any technology across various devices and applications. Our versatile Okta Platform and Auth0 Platform provide reliable access, authentication, and automation, placing identity at the forefront of business security and expansion.
At Okta, we value diverse perspectives and experiences. We seek continuous learners and individuals who can enhance our team with their distinct backgrounds.
Join us as we create a world where identity is truly yours.
We are in search of a highly skilled Observability Site Reliability Engineer specializing in Google Cloud, to take charge of and elevate our Observability ecosystem within GCP. In this position, you will progress beyond basic monitoring to develop a world-class, comprehensive, and scalable Observability Platform that supports our SRE teams and business collaborators. You will implement infrastructure as code by employing Terraform and demonstrating strong coding skills in Go, Python, or Ruby to automate the deployment of agents and collectors across intricate distributed systems.
Key Responsibilities
Automated Infrastructure: Design, build, and maintain scalable observability infrastructure utilizing tools such as Terraform.
GCP Observability Engineering: Enhance the collection, processing, and storage of Observability data to guarantee high reliability and low latency for our Splunk and Grafana services.
Incident Response: Engage in on-call rotations and conduct post-incident reviews to foster systemic improvements and promote 'observability-driven development.'
Automation: Minimize 'toil' by automating the deployment and scaling of observability agents and collectors.
Full-time|$166K/yr - $201K/yr|On-site|San Francisco, CA - US
At Crusoe, we are on a mission to accelerate the availability of energy and intelligence. We are building the foundational technology that empowers individuals to innovate boldly with AI while maintaining speed, scale, and sustainability.
Join us in the AI revolution with sustainable technology at Crusoe, where you will lead significant innovations, make a real impact, and collaborate with a team that is pioneering responsible and transformative cloud infrastructure.
About the Role:
We are seeking a highly proficient engineer with extensive experience in designing and managing observability platforms at scale. You will be responsible for architecting, developing, and operating Crusoe’s next-generation observability stack, which will allow engineers to gain insights into the internal state of distributed systems through metrics, logs, and traces. Your contributions will guarantee reliability, performance, and actionable insights across Crusoe’s global infrastructure and cloud platform.
Key Responsibilities:
Design and manage scalable observability systems (metrics, logging, tracing) in multi-datacenter Kubernetes environments.
Architect comprehensive telemetry pipelines, covering ingestion, storage, querying, and visualization.
Enhance monitoring and alerting mechanisms with Prometheus, Alertmanager, Thanos/Cortex, Grafana, and OpenTelemetry.
Develop scalable log collection and processing pipelines utilizing Fluent Bit, Vector, Loki, or ELK/Opensearch stacks.
Implement distributed tracing platforms (Tempo, Jaeger, OpenTelemetry) and integrate with service meshes, load balancers, and APIs.
Establish and promote the adoption of SLOs, SLIs, and error budgets across various services and teams.
Automate the provisioning and scaling of observability infrastructure using Kubernetes, Terraform, and custom tools (Go, Python).
Ensure the reliability and cost-effectiveness of telemetry pipelines while supporting high-volume workloads (AI/ML, HPC clusters, GPU infrastructure).
Integrate security best practices into observability platforms, including RBAC, TLS, secret management, and multi-tenant access controls.
Collaborate with engineering teams to embed observability into applications, services, and infrastructure.
Mentor engineers and influence Crusoe’s observability strategy and technical roadmap.
Join Crusoe as a Senior Software Engineer specializing in Observability, where you will play a pivotal role in enhancing our systems and ensuring robust performance across our platforms. You will collaborate with cross-functional teams to develop innovative solutions that improve the visibility and reliability of our software applications.
Become part of the innovative engineering teams at OpenAI, where we create and deliver groundbreaking AI technologies responsibly and safely to the world!
Our Applied Engineering team collaborates across research, engineering, product, and design disciplines to deploy OpenAI's cutting-edge technology for both consumers and businesses. We are committed to learning from our deployments and ensuring that AI is utilized ethically while maximizing its benefits. To us, safety takes precedence over unchecked growth.
About the Role
We are in the process of developing OpenAI's observability product, which encompasses everything from scalable infrastructure to an intuitive, AI-enhanced user interface. Our systems process petabytes of logs and billions of time series metrics throughout our infrastructure. We are now integrating intelligence to create features like agents that summarize service events, auto-generate dashboards, and assist engineers in debugging through user-friendly notebook-like interfaces.
We are looking to hire software engineers at all levels of our stack, be it infrastructure, backend, or product. You will be part of a dynamic, resourceful team that develops both foundational infrastructure and innovative internal tools, ensuring the reliability, performance, and observability of OpenAI's production systems.
What You’ll Do
Lead the development of core observability infrastructure, focusing on distributed logging, time series, and trace storage.
Create AI-integrated tools that empower engineers to autonomously identify, comprehend, and resolve issues.
Enhance user interface experiences including dashboards, notebooking, and interactive debugging.
Work collaboratively with engineers, researchers, user operations, and various teams to craft the next generation of the observability product.
You Might Be a Fit If You:
Have experience operating large-scale distributed systems in production, particularly logging systems or time series databases.
Excel in ambiguous environments and tackle unscoped challenges head-on.
Possess full-stack development skills or a strong product sensibility; you are eager to build practical tools that users will engage with.
Demonstrate robust knowledge of systems, networking, and cloud infrastructure (Kubernetes, AWS, etc.).
Bonus: Have built or contributed to observability systems (e.g., Prometheus, OpenTelemetry, etc.).
Why This Team?
We combine infrastructure and product development to create real AI applications for in-house use.
Your contributions will directly enhance the reliability of GPT-based products at OpenAI.
Join Gusto as a Staff Software Engineer specializing in Observability, where you will play a pivotal role in enhancing our software's performance and reliability. Utilize your expertise to develop and implement monitoring solutions that provide insights into application behavior, ensuring a seamless experience for our users. Your contributions will directly impact our engineering processes and product quality. Collaborate with cross-functional teams to identify and resolve issues proactively, while also driving initiatives to improve system observability.
Full-time|On-site|San Francisco, CA • New York, NY • United States
Join Figma as a Software Engineering Manager specializing in Observability. In this pivotal role, you will lead a dynamic team of engineers in developing cutting-edge solutions that enhance visibility and performance across our platform. Your expertise will drive the design and implementation of observability tools that empower our engineering teams to optimize their workflows, ensuring the robustness and reliability of our applications.
Full-time|On-site|San Francisco, CA | New York City, NY | Seattle, WA
Join Anthropic as a Staff+ Software Engineer specializing in Observability, where you will play a crucial role in enhancing our systems to ensure high-performance and reliability. Collaborate with cross-functional teams to develop innovative solutions, implement observability metrics, and drive improvements that enable better decision-making and user experiences.
Full-time|$181.2K/yr - $217.5K/yr|On-site|Denver, CO; San Francisco, CA
At Fastly, we empower individuals to connect more effectively with the things they cherish. Our cutting-edge edge cloud platform enables customers to swiftly, securely, and reliably craft exceptional digital experiences by processing, serving, and safeguarding their applications as close to their end-users as possible, right at the edge of the Internet. Tailored for modern internet demands, our platform is programmable and supports agile software development. We proudly serve many of the world's leading companies, including GitHub, Yelp, Paramount, and JetBlue.
Join us in our mission to build a more trustworthy Internet.
Posting Open Date: Feb. 25, 2026
Anticipated Posting Close Date*: March 25, 2026
*Please note that this job posting may close early depending on the volume of applications.
Role Overview:
The Data Reliability team is seeking an experienced Senior Software Engineer to contribute to the development and support of next-generation data storage solutions at Fastly. The ideal candidate will possess expertise in backend and data services within cloud environments, proficiency with configuration and orchestration tools such as Terraform and Kubernetes, and the ability to create internal administration tools using Go and related technologies. Our team plays a vital role in ensuring the infrastructure, orchestration, and reliability of Fastly's most data-intensive applications, utilizing technologies like Terraform, Elasticsearch, ClickHouse, Prometheus, MySQL, and Redis across both cloud and hardware platforms. Your contributions will directly enhance our customers' success by providing product teams with a robust platform for efficient and consistent delivery of high-quality, high-throughput, globally distributed data systems and products. We embrace a distributed work model and value both collaborative and asynchronous communication styles.
Key Responsibilities:
Deploy, support, and maintain various critical data storage systems, scaling from gigabytes to petabytes.
Develop statistics and dashboards to track service-level objectives for these systems.
Create and manage tools for configuration, backup, and authenticated access to data systems employing peer review, CI/CD, and both daemon- and container-based deployment strategies.
Write high-performance, maintainable, and concise code, actively participating in code reviews to enhance the codebase.
Join DigitalOcean as a Senior Observability Engineer, where you will play a critical role in enhancing our monitoring and observability platforms. Your expertise will help us ensure that our systems are performant, reliable, and scalable, providing a seamless experience for our customers.
Join our dynamic team at Cloudflare as a Software Engineer focused on Workers Observability. In this pivotal role, you'll be instrumental in enhancing the observability features of our Workers platform, ensuring optimal performance and reliability for our users. You will collaborate with cross-functional teams, tackle complex technical challenges, and contribute to the advancement of our innovative cloud solutions.
Become a vital part of the engineering teams that responsibly bring OpenAI’s transformative technologies to the world!
At OpenAI, our Applied Engineering team collaborates across research, engineering, product management, and design to deliver AI solutions to both consumers and businesses. We are committed to learning from our deployments, maximizing the benefits of AI, and ensuring that this powerful technology is utilized both safely and ethically. Our priority is safety over unchecked growth.
About the Role
As OpenAI continues to expand, we are seeking seasoned engineers who excel in problem-solving to enhance the scalability of our systems. Our achievements hinge on our ability to rapidly iterate on product development while ensuring optimal performance and reliability. You will thrive in a collaborative, fast-paced environment, playing a key role in delivering our technology to millions globally, with a focus on safety and reliability. As a reliability engineer, you will lead efforts to maintain and improve the stability, scalability, and performance of our dynamic infrastructure. You will collaborate closely with cross-functional teams, including software engineers, product managers, and data scientists, to construct and sustain robust systems capable of accommodating our growing user base and workload.
Your Responsibilities Include:
Designing and implementing solutions to scale our infrastructure to meet increasing demands effectively.
Developing and maintaining load, chaos, and synthetic testing software that enhances the reliability of systems designed by development teams.
Creating and managing automation tools to streamline repetitive tasks and bolster system reliability.
Overseeing the lifecycle management platform for CPU/storage, GPU, and network resources to foster efficiency and support dynamic optimization.
Implementing fault-tolerant and resilient design patterns to minimize service interruptions.
Establishing and maintaining service level objectives (SLOs) and service level indicators (SLIs) to ensure system reliability.
Collaborating with researchers, engineers, product managers, and designers to introduce new features and research advancements to the world.
Participating in an on-call rotation to address critical incidents and ensure 24/7 system availability.
Your Impact: Your contributions will be essential in guaranteeing the reliability and performance of our platforms as we continue to scale our operations.
Full-time|Remote|Denver, Colorado, United States; San Francisco, California, United States
Join Checkr as a Software Engineer focusing on Reliability, where your contributions will enhance our platform's robustness and performance. You will be part of a dynamic team dedicated to building and scaling systems that support our growth and ensure outstanding service delivery to our clients.
Full-time|Remote|Remote with offices in San Francisco, CA / New York, NY / Minneapolis, MN
Join Dagster Labs as a Software Engineer specializing in our Observability Product. In this fully remote role, you will play a crucial part in enhancing the visibility and performance of our software solutions. Collaborate with cross-functional teams to develop and implement innovative observability features that empower our users to monitor and optimize their applications effectively.
About Abridge
Founded in 2018, Abridge is dedicated to enhancing understanding in the healthcare sector. Our innovative AI-powered platform is specifically designed to enhance medical conversations, streamlining clinical documentation while allowing healthcare providers to prioritize what matters most: their patients.
Our enterprise-grade technology revolutionizes patient-clinician dialogues by converting them into structured clinical notes in real-time, with integrated EMR functionalities. Utilizing Linked Evidence and our auditable AI, we uniquely map AI-generated summaries to verified ground truth, fostering quick trust among providers. As trailblazers in generative AI for healthcare, we are establishing industry benchmarks for the responsible integration of AI within health systems.
Our diverse team comprises practicing MDs, AI scientists, PhDs, creatives, technologists, and engineers, all united in empowering individuals and simplifying care. Our offices are situated in San Francisco's Mission District, New York's SoHo neighborhood, and East Liberty in Pittsburgh.
The Role
As part of our rapidly scaling services and engineering team, we are seeking seasoned Site Reliability Engineers (SREs) to enhance our software's performance, stability, and scalability significantly. This role focuses primarily on distributed systems, with approximately 80% dedicated to software and 20% to cloud infrastructure.
You will play a pivotal role in integrating load testing and chaos engineering into our CI pipelines. You will utilize observability and profiling tools to pinpoint and rectify performance bottlenecks, collaborate with various teams to transition their applications to more scalable infrastructures, and ensure a seamless experience as we expand our application adoption in the healthcare domain. This may include embedding with other teams for extended periods.
The platform we are developing must optimize both engineering speed and security, facing significant scale challenges and presenting numerous opportunities to exercise creativity, independence, and leadership in taking projects from inception to fruition. This is a rare chance to advance your career in a rapidly growing company that harnesses cutting-edge technologies.
What You'll Do
Utilize load testing, chaos engineering, and other testing methodologies to uncover performance and latency issues across all systems, implementing code changes to resolve them.
Lead software modifications that facilitate the migration of applications at the code level to new infrastructures (including run times, event-driven frameworks, databases, etc.).
Why Join Harvey?
At Harvey, we are revolutionizing the landscape of legal and professional services with a holistic approach. By integrating advanced AI technology, a robust enterprise platform, and extensive industry knowledge, we are redefining how essential knowledge work is conducted for years to come.
This is a unique opportunity to contribute to the foundation of a transformative company at a pivotal moment in its journey. With over 1000 clients across more than 58 countries, a solid product-market fit, and outstanding investor backing, we are rapidly expanding and creating a new category in real-time. The challenges are significant, expectations are high, and the potential for personal, professional, and financial development is unparalleled.
Our team comprises driven, intelligent individuals who are deeply passionate about our mission. We prioritize speed, intensity, and accountability in addressing challenges, from initial ideation to long-term solutions. By maintaining close relationships with our clients, from executives to engineers, we collaboratively address pressing issues with urgency and care. If you excel in uncertain environments, strive for excellence, and wish to shape the future of work alongside a team that raises the bar, we invite you to build alongside us.
At Harvey, we are currently writing the future of professional services, and we are just getting started.
Your Role
As a Senior Software Engineer on the Site Reliability team at Harvey, your mission will be to uphold the reliability, scalability, and performance of our innovative legal AI platform. You will become part of a high-impact team that operates at the crossroads of infrastructure and product, taking ownership of the systems that ensure our platform remains fast, secure, and continuously available. From scaling operations across 50+ regions to automating critical processes, your efforts will fortify Harvey's resilience as we expand. If you are enthusiastic about constructing robust systems and simplifying complexity through automation, we would love to collaborate with you.
This position is situated in San Francisco, CA, and we adhere to an in-person work model, providing relocation assistance to new employees.
Your Responsibilities
Design, implement, and oversee monitoring, alerting, and infrastructure resources (compute, storage, networking) across 50+ global regions.
Lead incident management processes, including postmortems, root cause analyses, and driving actionable enhancements.
Automate operational tasks and workflows by developing tools and processes for capacity planning, seamless rollouts, and secure data access to maintain high reliability and minimize manual intervention.
Collaborate across teams to drive solutions that enhance system performance and reliability.
Full-time|$175K/yr - $225K/yr|On-site|San Francisco, CA
About Us:
At LangChain, we are dedicated to making intelligent agents a common part of everyday technology. Our goal is to provide a robust foundation for agent engineering that empowers developers to transition from prototypes to production-ready AI agents that teams can depend on. Initially starting as a widely embraced open-source toolset, we have expanded our offerings to include a comprehensive platform for the building, evaluating, deploying, and managing of agents at scale.
Currently, our tools (LangChain, LangGraph, LangSmith, and Agent Builder) are utilized by teams developing real AI products in both startups and large enterprises. Millions of developers rely on LangChain to power AI initiatives at notable companies such as Replit, Clay, Coinbase, Workday, Lyft, Cloudflare, Harvey, Rippling, Vanta, and 35% of the Fortune 500.
Having secured $125M in Series B funding from leading investors like IVP, Sequoia, Benchmark, CapitalG, and Sapphire Ventures, we are in an exciting phase of product development and rapid growth, where every team member has a substantial impact on our projects and collaborative efforts. At LangChain, your contributions will play a crucial role in shaping how this technology manifests in the real world.
About the Role:
This position requires in-person attendance 5 days a week in San Francisco, CA, as well as options in New York and Boston.
We are seeking a seasoned frontend engineer to innovate and improve features on LangSmith, our enterprise platform designed for LLM application observability, testing, and debugging.
What You Will Do:
Create new user-facing features utilizing React and TypeScript.
Develop reusable components and front-end libraries for future projects.
Convert designs and wireframes into high-quality, maintainable code.
Optimize components for peak performance across diverse web-capable devices and browsers.
Collaborate with fullstack and backend developers as well as UX/UI designers to enhance usability and experience.
You’re a Good Fit If You Have:
Extensive frontend engineering experience, with strong command of React, JavaScript, and TypeScript.
Practical experience with frontend development tools such as Babel, Vite, Webpack, NPM, and Yarn.
Familiarity with REST APIs and experience collaborating closely with fullstack and backend developers.
About Gridware
Gridware is an innovative technology firm headquartered in San Francisco, committed to safeguarding and enhancing the reliability of the electrical grid. We have pioneered a revolutionary approach to grid management known as Active Grid Response (AGR), which meticulously monitors the electrical, physical, and environmental factors influencing grid safety and reliability. Our state-of-the-art AGR platform leverages high-precision sensors to identify potential issues at an early stage, facilitating proactive maintenance and fault resolution. This holistic strategy is designed to bolster safety, minimize outages, and ensure optimal grid performance. We are proud to be supported by prominent climate-tech and Silicon Valley investors. To learn more, visit www.Gridware.io.
About the Role
We are seeking a skilled Senior Hardware Reliability Engineer to lead reliability testing, analysis, and lifetime modeling of various outdoor electronic assemblies. This pivotal role will concentrate on the electronic components of our products, collaborating closely with our mechanical-focused Reliability Engineer and engaging with the broader hardware and cross-functional teams.
About Us
At Sierra, we are pioneering a transformative platform that empowers businesses to forge authentic customer experiences through AI technology. Headquartered in the vibrant city of San Francisco, we also boast a dynamic presence in Atlanta, New York, London, France, Singapore, and Japan.
Our operations are anchored in core values that shape our culture: Trust, Customer Obsession, Craftsmanship, Intensity, and Family. These principles guide our actions and are integral to our mission.
Our visionary founders, Bret Taylor and Clay Bavor, bring unparalleled expertise. Bret, currently the Board Chair of OpenAI, previously co-led Salesforce and served as CTO at Facebook, while Clay led numerous initiatives at Google, including AR/VR projects and Google Workspace.
Your Role
In your capacity as a Software Engineer on the Site Reliability team, you will play a crucial role in establishing and enhancing the reliability, observability, and scalability of Sierra’s AI-centric infrastructure. Collaborating closely with our engineering and product teams, your goal is to ensure our systems remain highly available, efficient, and primed for growth.
Lead the development of Sierra’s observability stack (including monitoring, alerting, logging, and tracing) to provide engineers with critical insights into system health and performance.
Collaborate with product and platform engineers to architect systems that prioritize reliability and scalability from the outset, not as an afterthought.
Design and implement robust, scalable, and secure cloud infrastructure on AWS, employing Terraform and cutting-edge DevOps tools.
Enhance the reliability and scalability of our LLM deployments, ensuring they operate efficiently and cost-effectively.
Drive improvements in deployment pipelines, CI/CD tooling, and incident management processes to minimize downtime and accelerate response times.
Define and cultivate SRE practices within Sierra, shaping culture, tooling, and best practices across the engineering organization.
Qualifications
Bachelor's degree in Computer Science or a related field, or equivalent experience.
Proven experience in Site Reliability Engineering or a similar role, with a strong understanding of cloud infrastructure (AWS).
Proficiency in Terraform and modern DevOps practices.
Experience with observability tools and techniques: monitoring, alerting, logging, and tracing.
Strong problem-solving skills with a focus on scalability and performance optimization.
Excellent collaboration and communication skills, with the ability to work effectively in a team environment.
Oct 21, 2025