1 - 20 of 1,955 Jobs

Search for Senior Site Reliability Engineer - Remote


ClickHouse
Full-time|Remote|Singapore(Remote)

About ClickHouse

Ranked among the 2025 Forbes Cloud 100, ClickHouse stands as a leading innovator in the private cloud sector. With a customer base exceeding 3,000 and an annual recurring revenue (ARR) growth of over 250% year-on-year, we excel in real-time analytics, data warehousing, observability, and AI workloads.

Our recent $400 million Series D funding round underscores our rapid growth and momentum. In just three months, renowned clients like Capital One, Lovable, Decagon, Polymarket, and Airwallex have adopted or expanded their use of our platform. They join industry giants such as Meta, Cursor, Sony, and Tesla who rely on our technology.

We invite you to join us on our mission to revolutionize the way organizations harness their data!

About the Role

As we aim to provide our customers with dependable and secure services, we are expanding our Site Reliability Engineering team. In this role, you will lead initiatives to guarantee the reliability, availability, scalability, and performance of our cloud infrastructure. Collaborating with teams across Control Plane, Data Plane, Core, Security, Support, and Operations, you will guide the design and implementation of scalable, secure, and resilient distributed systems. You will also oversee incident management, conduct post-mortem analyses, and drive continuous improvements in our Cloud services. Utilizing your software engineering skills, you will develop platforms and tools to enhance operational and engineering efficiencies in ClickHouse Cloud. This position offers a unique chance to significantly contribute to the high-performance, elastic, and limitless scale of ClickHouse Cloud.

What Will You Do?

- Work collaboratively with various engineering teams at ClickHouse to design and implement scalable, secure, and highly available systems.
- Establish and manage service level objectives (SLOs) and service level agreements (SLAs) for ClickHouse Cloud.
- Ensure comprehensive monitoring and alerting for all infrastructure components in ClickHouse Cloud, enabling timely incident detection and resolution.
- Refine incident response processes and conduct post-mortem analyses for outages, partnering with the support team to communicate effectively with affected customers.
- Continuously enhance the reliability and performance of our ClickHouse services.
- Plan and lead Chaos Engineering initiatives to identify potential vulnerabilities.

Mar 13, 2026
Full-time|On-site|Singapore

About k-ID

k-ID stands at the forefront of privacy-first compliance and age verification infrastructure, recognized as one of TIME’s Best Inventions of 2025 and a Tech Pioneer by the World Economic Forum. As a recipient of Fast Company’s accolade for the Next Big Things in Tech, we are creating the Age Layer for the internet, a vital framework that empowers digital platforms to seamlessly verify age and manage global compliance.

Our flagship platform, driven by the Compliance Development Kit (CDK) and AgeKit, serves as the trusted engine for the world’s foremost game publishers and digital ecosystems. We transform fragmented, manual compliance processes into a unified API that efficiently handles age verification, parental consent, and regulatory logic across over 200 markets. Supported by leading venture capital firms such as a16z and Lightspeed, k-ID is entering an exciting growth phase to set the benchmark for global digital safety.

About the Role

We are seeking a dynamic Senior Site Reliability Engineer to ensure k-ID's reliability at scale. This pivotal role resides within our production backbone, where you will take ownership of and enhance the systems that maintain the availability, observability, security, and resilience of our platform as we experience traffic growth and an expanding client base.

You will engage in infrastructure, tooling, deployment workflows, incident response, and systems design to ensure our scalability without compromise. This position is not about closing tickets; we seek a proactive individual who can assess systems, pinpoint vulnerabilities, and fortify them. You should possess a keen understanding of failure modes, blast radius, deployment safety, recovery time, cost efficiency, and the realities of managing production systems under pressure. Comfort with coding, automating processes, and collaborating closely with engineers to enhance reliability through improved architecture and operational practices is essential.

Apr 8, 2026
Plaud Inc.
Full-time|On-site|Singapore

Join Our Team at Plaud Inc.

Plaud is at the forefront of developing the world’s most reliable AI work companion, designed to enhance productivity through innovative note-taking solutions. Since our inception in 2023, we have gained the trust of over 1.5 million users globally. Our mission is to amplify human intelligence by constructing advanced interfaces and infrastructures that capture, extract, and utilize information from various forms of communication.

Headquartered in San Francisco and incorporated in Delaware, Plaud Inc. is pioneering the integration of human and AI intelligence through an innovative hardware-software blend. We adhere to the highest standards of data security and privacy protection, maintaining ISO 27001, ISO 27701, GDPR, SOC 2, HIPAA, and EN 18031 compliance.

Why Join Plaud?

- Experience working with a bootstrapped, rapidly growing company, achieving a remarkable $250 million revenue run rate in just three years.
- Help define the future of human-AI interaction.
- Engage with cutting-edge AI technologies and play a pivotal role in our global expansion efforts.
- Collaborate with a passionate team that values innovation and customer success.
- Advance your career in a culture that promotes continuous learning and development.

Jan 20, 2026
Full-time|On-site|Singapore

About k-ID

k-ID is a pioneer in privacy-first compliance and age verification infrastructure, setting the standard for digital safety. We were celebrated as one of TIME’s Best Inventions of 2025, recognized as a Tech Pioneer by the World Economic Forum, and featured in Fast Company’s Next Big Things in Tech. Our mission is to create the Age Layer for the internet, a crucial infrastructure enabling digital platforms to verify age and manage compliance seamlessly across global markets.

Powered by our Compliance Development Kit (CDK) and AgeKit, our core platform is the trusted backbone for the world’s leading game publishers and digital ecosystems, streamlining fragmented compliance through a unified API that efficiently handles age verification, parental consent, and regulatory requirements in over 200 markets. Supported by esteemed venture capital firms, including a16z and Lightspeed, k-ID is poised for significant growth.

About The Role

We are seeking a Lead Site Reliability Engineer and NOC Lead to spearhead production reliability and operational excellence across our platform.

In this senior position, you will be accountable for the reliability, availability, observability, and operational maturity of k-ID’s systems while leading the Network Operations Center (NOC) function. Your role extends beyond merely responding to incidents; you will build systems, processes, tools, and team standards that minimize incident frequency and severity, ensuring rapid resolution when they occur.

This role surpasses our senior NOC hires, as we need someone capable of establishing the operational model for the NOC, enhancing technical standards for incident management, collaborating closely with engineering leadership, and driving the long-term reliability roadmap for the business. You should be adept at transitioning between hands-on technical tasks, operational leadership, incident command, and team development.

Apr 8, 2026
fuku
Full-time|On-site|Singapore, Singapore, Singapore

As a Site Reliability Engineer (SRE) focused on Globalization, you will play a pivotal role in ensuring the robustness and availability of our next-generation international infrastructure. As our client, a fast-growing global consumer internet platform, scales its operations across international markets, you'll be instrumental in building a resilient architecture that supports millions of users worldwide. This role involves working on multi-region architecture, global traffic routing, and large-scale distributed systems, directly influencing the reliability and scalability of our evolving platform.

Key Responsibilities:

- Global Architecture & Disaster Recovery: Collaborate in designing and implementing a global infrastructure architecture. Own cross-region architecture, disaster recovery (DR), and high availability (HA) capabilities. Enable critical systems for multi-region deployment, disaster recovery failover, and fault isolation.
- Overseas Infrastructure Platform Deployment & Operations: Build, deploy, operate, and optimize core infrastructure platforms in overseas regions, ensuring consistency and reliability between international and domestic environments.
- Reliability Engineering & Incident Response: Develop a comprehensive reliability engineering framework for international systems, including observability systems, incident response mechanisms, and root cause analysis processes.
- Internationalization Infrastructure Enablement: Understand overseas business requirements and architectural constraints to drive the implementation of infrastructure capabilities in global environments.
- Cross-Team Collaboration & System Alignment: Work closely with domestic infrastructure, product engineering, and platform teams to ensure alignment with internal architecture standards and best practices.

Apr 9, 2026
Airwallex
Full-time|On-site|SG - Singapore

About Airwallex

Airwallex is a pioneering global payments and financial platform, uniquely designed to streamline operations for businesses around the world. With our exceptional blend of proprietary technology and software, we empower over 200,000 companies globally, including industry leaders such as Brex, Rippling, Navan, Qantas, and SHEIN, with integrated solutions that cover everything from business accounts and payment processing to spend management and treasury operations, all tailored for a global audience.

Founded in Melbourne, Airwallex boasts a diverse team of over 2,000 innovative professionals across 26 offices worldwide. With a valuation of US$8 billion and support from top-tier investors including T. Rowe Price, Visa, Mastercard, Robinhood Ventures, Sequoia Capital, Salesforce Ventures, DST Global, and Lone Pine Capital, we are at the forefront of transforming the future of global finance. If you're ready to take on the most ambitious challenges of your career, we invite you to join us.

Attributes We Value

We seek builders with entrepreneurial spirit eager to make a significant impact, accelerate their learning, and take true ownership of their work. You should possess strong expertise in your field, complemented by analytical thinking and a passion for our mission and operating principles. You are quick to act with sound judgment, driven by curiosity to explore deeply, and you make informed decisions based on foundational principles, balancing speed with thoroughness.

Collaboration and humility are vital traits; you can transform initial ideas into fully realized products and ensure tasks are completed efficiently. You leverage AI to enhance productivity and solve challenges swiftly. In this role, you will tackle intricate, high-profile challenges alongside exceptional colleagues, advancing your career while we build the future of global banking. If this resonates with you, let’s create the future together.

About the Team

The Engineering team at Airwallex comprises a vibrant mix of innovators, builders, and problem solvers committed to empowering businesses to operate without constraints. We thrive in a collaborative, fast-paced environment, relentlessly pushing the boundaries of what’s achievable in the fintech sector. Our focus is on technical excellence, continual learning, and a profound sense of ownership, all while creating scalable, reliable, and secure products that enable businesses to expand globally.

Our Site Reliability Engineering (SRE) team is paving the way for innovative engineering solutions, addressing a variety of challenges and setting a benchmark for other teams to emulate. This team is accountable for the availability, performance, and reliability of our systems, ensuring seamless operations across our platform.

Apr 7, 2026
Pinely
Full-time|On-site|Singapore, Singapore, Singapore

We are seeking a dedicated Site Reliability Engineer (SRE) to enhance and maintain the availability of our trading binary systems. This role requires you to be on duty during the European team's off-hours, ensuring uninterrupted operations.

Your Key Responsibilities:

- Overseeing operational management of trading activities, with a focus on proactive monitoring.
- Managing incidents, including rapid escalation and mitigation strategies.
- Participating in on-call duty to address critical issues.
- Performing debugging tasks using C++ and Python, along with classifying issues effectively.
- Developing observability metrics and trading analytics to support our trading systems.
- Keeping abreast of financial and technical news by reading relevant materials and monitoring exchange newsletters.

Our Ideal Candidate Will Have:

- A Bachelor’s degree in a quantitative field such as Computer Science, Engineering, Physics, or Mathematics.
- At least 5 years of experience in a Site Reliability Engineering role.
- Programming proficiency in Python or Go is preferred.
- Strong knowledge of Unix systems.
- Experience deploying, configuring, and managing Linux-based servers, including Docker, Kubernetes, and Grafana.
- Ability to identify opportunities for platform improvements within a complex technical landscape.
- Exceptional communication skills, capable of engaging with both internal teams and external clients.
- Proficiency in English at B2/Upper-Intermediate level or higher.
- A proactive approach and willingness to learn about new domains.

Nov 19, 2025
Pinely
Full-time|On-site|Singapore, Singapore, Singapore

Join our innovative team at Pinely, where we are building a top-tier Site Reliability Engineering (SRE) group from the ground up! As the SRE Lead for TradeOps, you will be instrumental in shaping our infrastructure and processes, beginning as a hands-on individual contributor from day one. As we grow, you will have the opportunity to lead a talented team dedicated to enhancing our trading operations.

Key Responsibilities:

- Oversee the operational management of trading activities, ensuring robust monitoring and availability.
- Lead incident management efforts, including rapid escalation and effective mitigation strategies.
- Participate in on-call rotations to provide immediate support when needed.
- Debug and classify issues within our trading systems, utilizing your expertise in C++ and Python.
- Develop observability metrics and analytics to enhance trading performance and reliability.
- Stay updated with financial and technical news, actively engaging with exchange newsletters.

Jan 2, 2026
fuku
Contract|On-site|Singapore, Singapore, Singapore

As a Site Reliability Engineer (SRE) and Environment Engineer in the Banking sector, you will play a pivotal role in enhancing application reliability and operational efficiency.

This contract position based in Singapore involves:

- Managing the software deployment lifecycle, from development to production, ensuring systematic release schedules.
- Overseeing multiple test environments on the Bank’s core platform, ensuring proper configurations and connectivity of satellite applications.
- Supporting IT project executions through comprehensive test executions and regressions.
- Conducting regular health checks to ensure system connectivity, consistency, and data integrity across all testing environments.
- Coordinating deployment processes for both production and test environments and provisioning environments for various testing phases.
- Maintaining proactive communication with stakeholders about environment statuses, managing expectations, and highlighting risks and issues.
- Collaborating with global teams to support environment-related changes.
- Reviewing and executing deployment instructions accurately for both production and test environments.
- Working closely with the change manager to coordinate all releases.

Feb 20, 2026
AvePoint
Full-time|On-site|Singapore

Key Responsibilities:

• Develop and implement comprehensive test plans and test cases for our infrastructure platforms.
• Create and manage automated testing suites for diverse infrastructure components.
• Conduct both manual and automated testing to ensure the quality and reliability of our systems.
• Analyze testing results and report defects with detailed reproduction steps.
• Collaborate with development and operations teams to enhance testing processes and continuously refine testing methodologies and tools.
• Document testing procedures meticulously and maintain up-to-date test documentation.
• Track and report on test coverage alongside quality metrics.
• Employ Chaos Engineering practices to uncover system vulnerabilities.
• Contribute to the formulation of Service Level Objectives (SLOs) and error budgets.

Mar 4, 2026
Pave Bank
Full-time|On-site|Singapore, Singapore

The Role

Join Pave Bank, where we are pioneering the future of programmable banking by merging traditional banking services with digital assets on a single, regulated platform. We are seeking a dynamic Site Reliability Engineer (SRE) to play a critical role in ensuring our core systems are consistently available, scalable, and high-performing as we expand.

As a Site Reliability Engineer at Pave Bank, you will collaborate closely with our Engineering, Product, Security, and Operations teams to develop robust infrastructure, automate operational tasks, and uphold reliability across all services. Your contributions will significantly influence the safety, performance, and scalability of our banking platform, enabling customers to place their trust in Pave Bank for their financial needs.

Key Responsibilities

- Oversee, maintain, and enhance the reliability, availability, and performance of our production systems and services.
- Design and sustain infrastructure as code (IaC), deployment pipelines, and automation processes to facilitate continuous delivery, scalability, and disaster recovery.
- Address incidents, conduct root-cause analyses, and lead postmortems to ensure that lessons learned are effectively implemented.
- Establish and uphold operational best practices including observability, logging, metrics, alerting, capacity planning, failover strategies, and backups.
- Collaborate with Engineering, Product, Compliance, and Operations teams to ensure that our infrastructure aligns with reliability, compliance, and security standards.
- Assist in service scaling, database operations, cloud infrastructure (preferably GCP), networking, and microservices orchestration.
- Document operational runbooks, on-call procedures, and system architecture to support maintenance, knowledge sharing, and compliance.

Qualifications

Technical Skills and Experience

- Proficient in programming or scripting languages such as Go, Python, Bash, or similar for automation and tooling.
- Hands-on experience with cloud infrastructure, preferably Google Cloud Platform (GCP).
- Familiar with containerization and orchestration technologies (Docker, Kubernetes, etc.).
- Experience with infrastructure-as-code tools (Terraform, Cloud Deployment Manager, etc.).

Jan 12, 2026
AvePoint
Full-time|On-site|Singapore

Join AvePoint as a Site Reliability Engineer (SRE) and play a crucial role in the development and management of a Whole-of-Government (WoG) runtime platform. We are looking for a dedicated engineer who is passionate about enhancing infrastructure and ensuring optimal performance.

In this role, you will design and manage robust infrastructure utilizing GitLab, AWS, and Kubernetes solutions, focusing on the stability, scalability, and performance of our platform.

Key Responsibilities:

- Toil Reduction & Automation: Identify repetitive tasks and implement automation through CI/CD pipelines to minimize manual processes and enhance operational efficiency.
- Observability & System Health: Develop comprehensive observability solutions (logs, metrics, traces, alerts) focusing on the four Golden Signals: latency, traffic, errors, and saturation. Build automation for proactive system health evaluations and self-remediation.
- Production Support & Incident Management: Engage in on-call rotations, respond swiftly to incidents to reduce MTTR, and conduct thorough post-incident analyses to bolster system resilience.
- Security & Compliance: Collaborate with security teams to design and implement secure and compliant solutions, perform regular audits, and integrate advanced vulnerability scanning tools.
- Maintenance, Optimization & Performance: Identify and rectify performance bottlenecks, define and track KPIs (e.g., MTTR, system uptime, cost efficiency), and drive continuous optimization efforts.
- Strategic Customer Engagement: Serve as a technical advisor for tenants, guiding them on containerization and best practices for cloud-native deployments while participating in strategic initiatives to enhance scalability and performance.
- Knowledge Sharing & Documentation: Create and maintain detailed playbooks, runbooks, and documentation to promote team-wide knowledge sharing and streamline incident responses.
- Continuous Learning & Innovation: Stay abreast of industry trends and innovations to enhance our operational practices and technologies.

Mar 4, 2026
AvePoint
Full-time|On-site|Singapore

Join AvePoint as a Senior Splunk Engineer focused on Automation and Reliability Engineering Projects!

Project Overview

Contribute to Automation and Reliability Engineering efforts and operations.

Key Responsibilities:

- Oversee Observability Engineering and Governance initiatives.
- Design and maintain enterprise SIEM solutions compliant with operational resilience frameworks (e.g., MAS TRM, DORA, APRA CPS 230).
- Lead the deployment, configuration, and optimization of Splunk for comprehensive visibility across infrastructure, applications, networks, and user experiences.
- Establish and uphold telemetry data governance standards (including metrics, logs, and traces) to ensure consistency, compliance, and security.
- Integrate Splunk with incident management, ITSM, and AIOps systems for predictive alerting and anomaly detection.
- Serve as the SIEM/Splunk subject matter expert (SME) for architecture reviews, upgrades, and performance enhancements.

Reliability Engineering and Automation:

- Implement and advocate for Site Reliability Engineering (SRE) frameworks and reliability practices for critical systems.
- Design and automate runbooks, alerts, and self-healing workflows using Python, Ansible, and Terraform.
- Collaborate with Application, Infrastructure, and Cyber teams to incorporate reliability principles into the delivery lifecycle.
- Conduct resilience, chaos, and capacity testing in accordance with business continuity and disaster recovery standards.
- Define and monitor error budgets, reliability scorecards, and service health indicators for production workloads.

Cloud & Platform Integration:

- Engineer SIEM solutions for cloud-native workloads in AWS and Azure, ensuring visibility across compute, storage, and network layers.
- Integrate Splunk and cloud observability tools into CI/CD pipelines and landing zones for continuous compliance.
- Implement infrastructure-as-code (IaC) models using Terraform and Ansible for consistent and auditable provisioning.
- Work alongside Cloud, DevOps, and Security teams to ensure telemetry aligns with audit, compliance, and operational risk requirements.

Operational Excellence and Collaboration:

- Drive reductions in incident recurrence, Mean Time to Recovery (MTTR), and manual intervention through observability-led automation.
- Partner with Service Delivery, Cyber, and Application teams to facilitate predictive incident prevention and root cause transparency.
- Develop and maintain executive dashboards and reports highlighting availability, reliability KPIs, and operational risk indicators.

Mar 30, 2026
Lifted, an Upwork company
Contract|Remote|Singapore

Lifted, an Upwork company, is seeking a Senior Software Engineer / Site Reliability Engineer based in Singapore. This role centers on observability, focusing on developing and enhancing tools and practices that improve system reliability and support smooth software delivery.

Key responsibilities

- Design and build monitoring solutions to deliver clear insights into system health and performance.
- Review and interpret system performance data to spot trends, identify bottlenecks, and highlight areas that need attention.
- Collaborate with cross-functional teams to detect and address issues early, aiming to prevent user impact.

Role focus

This position places a strong emphasis on observability. The work involves both hands-on engineering and close coordination with other teams to ensure reliable, efficient software delivery.

Apr 22, 2026
csit
Full-time|On-site|Singapore, Singapore

Join our innovative team as an Infrastructure Security Engineer, where your expertise will be invaluable in designing, building, and maintaining robust infrastructure security services. You will play a vital role in ensuring the reliability, availability, and security of our infrastructure platform by implementing advanced automation, effective monitoring, and rapid incident response strategies. A strong understanding of IT architecture, cybersecurity practices, and site reliability engineering (SRE) is essential, along with analytical skills to troubleshoot and resolve security incidents efficiently.

Nov 10, 2025
csit
Full-time|On-site|Singapore, Singapore

Join our innovative team at csit as a Network Reliability Engineer, where you'll be instrumental in constructing robust and resilient network infrastructures utilizing state-of-the-art technologies, including cloud-based solutions and software-defined networking such as SD-WAN, ACI, and NSX. A solid understanding of IT infrastructure systems and familiarity with the latest advancements in networking technologies and platforms are essential. We seek a collaborative team player eager to embrace new challenges and stay updated with the rapidly changing technology landscape.

Mar 3, 2025
csit
Full-time|On-site|Singapore, Singapore

Join our innovative team at csit as a Network Reliability Staff Engineer, where you will play a crucial role in developing robust network infrastructure. You will leverage advanced technologies, including cloud-based solutions and software-defined networking, such as SD-WAN, ACI, and NSX. A solid understanding of IT infrastructure systems and familiarity with the latest networking technologies is essential. As a technical expert within our team, you will be encouraged to embrace new challenges and stay updated with the rapidly changing technology landscape.

Mar 3, 2025
Squarepoint Capital
Full-time|On-site|Singapore, Montreal, London

Position Overview:

Join our Risk team as a Reliability Software Engineer, where you will be instrumental in maintaining the performance, stability, and availability of our Risk software systems. The Risk platform at Squarepoint is essential for position management, profit/loss computation, inventory management, and internal order routing. These vital systems must handle high volumes of trading data efficiently and reliably, necessitating strong software development capabilities and analytical skills.

Your primary focus will be on developing firm-wide platforms aimed at enhancing Squarepoint's observability, preventing functional and performance regressions, and automating operational processes. You will implement domain-specific logic tailored for various Risk sub-teams using these platforms. Examples of our projects include:

- Observability: Our health check platform simplifies the implementation of health checks across teams at Squarepoint. It supports generic health checks set up through configuration, as well as a 'plug-n-play' architecture for custom health checks.
- Preventing functional/performance regressions: We are creating a platform to automate benchmarking by managing job scheduling, hardware resources, metric collection, reporting results, and integrating with GitLab.
- Automation: We are developing a self-service automation platform allowing users to request system configuration changes via a Jira portal, which automatically schedules jobs to apply approved changes.

Operational continuity is vital; therefore, our responsibilities include:

- Level-2 support: Each team member participates in a daily support rotation, prioritizing incident response during business hours over project work.

Mar 10, 2026
csit
Full-time|On-site|Singapore, Singapore

Join a vibrant team dedicated to the exploration, design, management, and optimization of our on-premises cloud infrastructure platforms and services. As a Cloud Infrastructure Engineer, you will collaborate with skilled cloud infrastructure engineers to implement robust cloud networking, storage solutions, virtual machines, and security measures. A solid understanding of cloud infrastructure technologies, architectural principles, and site reliability engineering (SRE) is essential for success in this role.

Nov 10, 2025
CSIT
Full-time|On-site|Singapore, Singapore

Join our innovative team as an Infrastructure Platform Engineer, where you will play a pivotal role in researching, designing, constructing, and optimizing distributed systems and platforms that empower our internal developers and products. A solid grasp of IT architecture, systems design, application and systems integration, and site reliability engineering (SRE) is essential for success in this role.

Nov 10, 2025
