Network Reliability Staff Engineer
About csit
csit is a leading technology company focused on building resilient network infrastructure. We utilize the latest advancements in networking technology to deliver reliable services and solutions to our clients.
Similar jobs
Join AvePoint as a Senior Splunk Engineer focused on Automation and Reliability Engineering Projects!
Project Overview
Contribute to Automation and Reliability Engineering efforts and operations.
Key Responsibilities:
- Oversee Observability Engineering and Governance initiatives.
- Design and maintain enterprise SIEM solutions compliant with operational resilience frameworks (e.g., MAS TRM, DORA, APRA CPS 230).
- Lead the deployment, configuration, and optimization of Splunk for comprehensive visibility across infrastructure, applications, networks, and user experiences.
- Establish and uphold telemetry data governance standards (including metrics, logs, and traces) to ensure consistency, compliance, and security.
- Integrate Splunk with incident management, ITSM, and AIOps systems for predictive alerting and anomaly detection.
- Serve as the SIEM/Splunk subject matter expert (SME) for architecture reviews, upgrades, and performance enhancements.
Reliability Engineering and Automation:
- Implement and advocate for Site Reliability Engineering (SRE) frameworks and reliability practices for critical systems.
- Design and automate runbooks, alerts, and self-healing workflows using Python, Ansible, and Terraform.
- Collaborate with Application, Infrastructure, and Cyber teams to incorporate reliability principles into the delivery lifecycle.
- Conduct resilience, chaos, and capacity testing in accordance with business continuity and disaster recovery standards.
- Define and monitor error budgets, reliability scorecards, and service health indicators for production workloads.
Cloud & Platform Integration:
- Engineer SIEM solutions for cloud-native workloads in AWS and Azure, ensuring visibility across compute, storage, and network layers.
- Integrate Splunk and cloud observability tools into CI/CD pipelines and landing zones for continuous compliance.
- Implement infrastructure-as-code (IaC) models using Terraform and Ansible for consistent and auditable provisioning.
- Work alongside Cloud, DevOps, and Security teams to ensure telemetry aligns with audit, compliance, and operational risk requirements.
Operational Excellence and Collaboration:
- Drive reductions in incident recurrence, Mean Time to Recovery (MTTR), and manual intervention through observability-led automation.
- Partner with Service Delivery, Cyber, and Application teams to facilitate predictive incident prevention and root cause transparency.
- Develop and maintain executive dashboards and reports highlighting availability, reliability KPIs, and operational risk indicators.
As an AIOps Engineer (Splunk), you will play a crucial role in designing, building, and testing the AIOps and Observability platform. Your primary focus will be on developing AIOps use cases, operationalizing them to meet customer requirements, and significantly enhancing productivity in service delivery and operations.
Key Responsibilities:
- Architect, design, develop, deploy, and maintain the enterprise logging and observability platform utilizing Splunk or Elastic ELK.
- Contribute to the architectural design by assessing trade-offs related to scalability, resiliency, high availability, and security.
- Conduct capacity planning and solution reviews for the ELK/Splunk environments.
- Implement solutions for business data analysis and design data structures using the Elastic ELK/Splunk observability platform.
- Oversee high-volume data ingestion and real-time data flow processes.
- Collaborate with data log streaming platforms and tools for data ingestion from diverse systems and applications.
- Design and develop multi-tenant dashboard solutions.
- Establish and maintain operational best practices to ensure the effective functioning of the Elastic ELK/Splunk observability solution.
- Actively contribute to the enhancement of the Elastic ELK/Splunk observability solution.
- Optimize and fine-tune the Elastic ELK/Splunk observability solution to fulfill performance requirements.
- Work closely with developers to promote best practices for the data warehouse and analytics environment.
- Investigate emerging technologies and advancements to address customer needs and implement relevant upgrades.
- Develop, test, and operationalize AIOps use cases.
- Ensure platform operation meets high availability standards and aligns with customer SLA.
About k-ID
k-ID stands at the forefront of privacy-first compliance and age verification infrastructure, recognized as one of TIME’s Best Inventions of 2025 and a Tech Pioneer by the World Economic Forum. As a recipient of Fast Company’s accolade for the Next Big Things in Tech, we are creating the Age Layer for the internet, a vital framework that empowers digital platforms to seamlessly verify age and manage global compliance.
Our flagship platform, driven by the Compliance Development Kit (CDK) and AgeKit, serves as the trusted engine for the world’s foremost game publishers and digital ecosystems. We transform fragmented, manual compliance processes into a unified API that efficiently handles age verification, parental consent, and regulatory logic across over 200 markets. Supported by leading venture capital firms such as a16z and Lightspeed, k-ID is entering an exciting growth phase to set the benchmark for global digital safety.
About the Role
We are seeking a dynamic Senior Site Reliability Engineer to ensure k-ID's reliability at scale. This pivotal role resides within our production backbone, where you will take ownership of and enhance the systems that maintain the availability, observability, security, and resilience of our platform as we experience traffic growth and an expanding client base.
You will engage in infrastructure, tooling, deployment workflows, incident response, and systems design to ensure our scalability without compromise. This position is not about closing tickets; we seek a proactive individual who can assess systems, pinpoint vulnerabilities, and fortify them. You should possess a keen understanding of failure modes, blast radius, deployment safety, recovery time, cost efficiency, and the realities of managing production systems under pressure. Comfort with coding, automating processes, and collaborating closely with engineers to enhance reliability through improved architecture and operational practices is essential.
About ClickHouse
Ranked among the 2025 Forbes Cloud 100, ClickHouse stands as a leading innovator in the private cloud sector. With a customer base exceeding 3,000 and an annual recurring revenue (ARR) growth of over 250% year-on-year, we excel in real-time analytics, data warehousing, observability, and AI workloads.
Our recent $400 million Series D funding round underscores our rapid growth and momentum. In just three months, renowned clients like Capital One, Lovable, Decagon, Polymarket, and Airwallex have adopted or expanded their use of our platform. They join industry giants such as Meta, Cursor, Sony, and Tesla who rely on our technology.
We invite you to join us on our mission to revolutionize the way organizations harness their data!
About the Role
As we aim to provide our customers with dependable and secure services, we are expanding our Site Reliability Engineering team. In this role, you will lead initiatives to guarantee the reliability, availability, scalability, and performance of our cloud infrastructure. Collaborating with teams across Control Plane, Data Plane, Core, Security, Support, and Operations, you will guide the design and implementation of scalable, secure, and resilient distributed systems. You will also oversee incident management, conduct post-mortem analyses, and drive continuous improvements in our Cloud services. Utilizing your software engineering skills, you will develop platforms and tools to enhance operational and engineering efficiencies in ClickHouse Cloud. This position offers a unique chance to significantly contribute to the high-performance, elastic, and limitless scale of ClickHouse Cloud.
What Will You Do?
- Work collaboratively with various engineering teams at ClickHouse to design and implement scalable, secure, and highly available systems.
- Establish and manage service level objectives (SLOs) and service level agreements (SLAs) for ClickHouse Cloud.
- Ensure comprehensive monitoring and alerting for all infrastructure components in ClickHouse Cloud, enabling timely incident detection and resolution.
- Refine incident response processes and conduct post-mortem analyses for outages, partnering with the support team to communicate effectively with affected customers.
- Continuously enhance the reliability and performance of our ClickHouse services.
- Plan and lead Chaos Engineering initiatives to identify potential vulnerabilities.
Join Our Team at Plaud Inc.
Plaud is at the forefront of developing the world’s most reliable AI work companion, designed to enhance productivity through innovative note-taking solutions. Since our inception in 2023, we have gained the trust of over 1.5 million users globally. Our mission is to amplify human intelligence by constructing advanced interfaces and infrastructures that capture, extract, and utilize information from various forms of communication.
Headquartered in San Francisco and incorporated in Delaware, Plaud Inc. is pioneering the integration of human and AI intelligence through an innovative hardware-software blend. We adhere to the highest standards of data security and privacy protection, maintaining ISO 27001, ISO 27701, GDPR, SOC 2, HIPAA, and EN 18031 compliance.
Why Join Plaud?
- Experience working with a bootstrapped, rapidly growing company, achieving a remarkable $250 million revenue run rate in just three years.
- Help define the future of human-AI interaction.
- Engage with cutting-edge AI technologies and play a pivotal role in our global expansion efforts.
- Collaborate with a passionate team that values innovation and customer success.
- Advance your career in a culture that promotes continuous learning and development.
As a Site Reliability Engineer (SRE) and Environment Engineer in the Banking sector, you will play a pivotal role in enhancing application reliability and operational efficiency.
This contract position based in Singapore involves:
- Managing the software deployment lifecycle, from development to production, ensuring systematic release schedules.
- Overseeing multiple test environments on the Bank’s core platform, ensuring proper configurations and connectivity of satellite applications.
- Supporting IT project executions through comprehensive test executions and regressions.
- Conducting regular health checks to ensure system connectivity, consistency, and data integrity across all testing environments.
- Coordinating deployment processes for both production and test environments and provisioning environments for various testing phases.
- Maintaining proactive communication with stakeholders about environment statuses, managing expectations, and highlighting risks and issues.
- Collaborating with global teams to support environment-related changes.
- Reviewing and executing deployment instructions accurately for both production and test environments.
- Working closely with the change manager to coordinate all releases.
About Airwallex
Airwallex is a pioneering global payments and financial platform, uniquely designed to streamline operations for businesses around the world. With our exceptional blend of proprietary technology and software, we empower over 200,000 companies globally, including industry leaders such as Brex, Rippling, Navan, Qantas, and SHEIN, with integrated solutions that cover everything from business accounts and payment processing to spend management and treasury operations, all tailored for a global audience.
Founded in Melbourne, Airwallex boasts a diverse team of over 2,000 innovative professionals across 26 offices worldwide. With a valuation of US$8 billion and support from top-tier investors including T. Rowe Price, Visa, Mastercard, Robinhood Ventures, Sequoia Capital, Salesforce Ventures, DST Global, and Lone Pine Capital, we are at the forefront of transforming the future of global finance. If you're ready to take on the most ambitious challenges of your career, we invite you to join us.
Attributes We Value
We seek builders with entrepreneurial spirit eager to make a significant impact, accelerate their learning, and take true ownership of their work. You should possess strong expertise in your field, complemented by analytical thinking and a passion for our mission and operating principles. You are quick to act with sound judgment, driven by curiosity to explore deeply, and you make informed decisions based on foundational principles, balancing speed with thoroughness.
Collaboration and humility are vital traits; you can transform initial ideas into fully realized products and ensure tasks are completed efficiently. You leverage AI to enhance productivity and solve challenges swiftly. In this role, you will tackle intricate, high-profile challenges alongside exceptional colleagues, advancing your career while we build the future of global banking. If this resonates with you, let’s create the future together.
About the Team
The Engineering team at Airwallex comprises a vibrant mix of innovators, builders, and problem solvers committed to empowering businesses to operate without constraints. We thrive in a collaborative, fast-paced environment, relentlessly pushing the boundaries of what’s achievable in the fintech sector. Our focus is on technical excellence, continual learning, and a profound sense of ownership, all while creating scalable, reliable, and secure products that enable businesses to expand globally.
Our Site Reliability Engineering (SRE) team is paving the way for innovative engineering solutions, addressing a variety of challenges and setting a benchmark for other teams to emulate. This team is accountable for the availability, performance, and reliability of our systems, ensuring seamless operations across our platform.
Join AECOM as an Instrumentation & Automation Engineer and become part of a dynamic team that shapes the future of engineering and construction. In this role, you will be responsible for designing, implementing, and maintaining advanced instrumentation and automation systems, ensuring they meet project specifications and industry standards.
Join our innovative team at csit as a Network Reliability Engineer, where you'll be instrumental in constructing robust and resilient network infrastructures utilizing state-of-the-art technologies, including cloud-based solutions and software-defined networking such as SD-WAN, ACI, and NSX. A solid understanding of IT infrastructure systems and familiarity with the latest advancements in networking technologies and platforms are essential. We seek a collaborative team player eager to embrace new challenges and stay updated with the rapidly changing technology landscape.
Join our innovative team at csit as a Network Reliability Staff Engineer, where you will play a crucial role in developing robust network infrastructure. You will leverage advanced technologies, including cloud-based solutions and software-defined networking, such as SD-WAN, ACI, and NSX. A solid understanding of IT infrastructure systems and familiarity with the latest networking technologies is essential. As a technical expert within our team, you will be encouraged to embrace new challenges and stay updated with the rapidly changing technology landscape.
As a Site Reliability Engineer (SRE) focused on Globalization, you will play a pivotal role in ensuring the robustness and availability of our next-generation international infrastructure. As our client, a fast-growing global consumer internet platform, scales its operations across international markets, you'll be instrumental in building a resilient architecture that supports millions of users worldwide. This role involves working on multi-region architecture, global traffic routing, and large-scale distributed systems, directly influencing the reliability and scalability of our evolving platform.
Key Responsibilities:
- Global Architecture & Disaster Recovery: Collaborate in designing and implementing a global infrastructure architecture. Own cross-region architecture, disaster recovery (DR), and high availability (HA) capabilities. Enable critical systems for multi-region deployment, disaster recovery failover, and fault isolation.
- Overseas Infrastructure Platform Deployment & Operations: Build, deploy, operate, and optimize core infrastructure platforms in overseas regions, ensuring consistency and reliability between international and domestic environments.
- Reliability Engineering & Incident Response: Develop a comprehensive reliability engineering framework for international systems, including observability systems, incident response mechanisms, and root cause analysis processes.
- Internationalization Infrastructure Enablement: Understand overseas business requirements and architectural constraints to drive the implementation of infrastructure capabilities in global environments.
- Cross-Team Collaboration & System Alignment: Work closely with domestic infrastructure, product engineering, and platform teams to ensure alignment with internal architecture standards and best practices.
Key Responsibilities:
• Develop and implement comprehensive test plans and test cases for our infrastructure platforms.
• Create and manage automated testing suites for diverse infrastructure components.
• Conduct both manual and automated testing to ensure the quality and reliability of our systems.
• Analyze testing results and report defects with detailed reproduction steps.
• Collaborate with development and operations teams to enhance testing processes and continuously refine testing methodologies and tools.
• Document testing procedures meticulously and maintain up-to-date test documentation.
• Track and report on test coverage alongside quality metrics.
• Employ Chaos Engineering practices to uncover system vulnerabilities.
• Contribute to the formulation of Service Level Objectives (SLOs) and error budgets.
Squarepoint Capital
Position Overview:
Join our Risk team as a Reliability Software Engineer, where you will be instrumental in maintaining the performance, stability, and availability of our Risk software systems. The Risk platform at Squarepoint is essential for position management, profit/loss computation, inventory management, and internal order routing. These vital systems must handle high volumes of trading data efficiently and reliably, necessitating strong software development capabilities and analytical skills.
Your primary focus will be on developing firm-wide platforms aimed at enhancing Squarepoint's observability, preventing functional and performance regressions, and automating operational processes. You will implement domain-specific logic tailored for various Risk sub-teams using these platforms. Examples of our projects include:
- Observability: Our health check platform simplifies the implementation of health checks across teams at Squarepoint. It supports generic health checks set up through configuration, as well as a 'plug-n-play' architecture for custom health checks.
- Preventing functional/performance regressions: We are creating a platform to automate benchmarking by managing job scheduling, hardware resources, metric collection, reporting results, and integrating with GitLab.
- Automation: We are developing a self-service automation platform allowing users to request system configuration changes via a Jira portal, which automatically schedules jobs to apply approved changes.
Operational continuity is vital; therefore, our responsibilities include:
- Level-2 support: Each team member participates in a daily support rotation, prioritizing incident response during business hours over project work.
Join NCS3 as an Automation Engineer, where you will play a pivotal role in enhancing automation solutions to drive operational efficiency. You will be responsible for developing, implementing, and maintaining automated processes that contribute to our ambitious goals. Collaborate with cross-functional teams to identify automation opportunities and design innovative solutions.
About k-ID
k-ID is a pioneer in privacy-first compliance and age verification infrastructure, setting the standard for digital safety. We were celebrated as one of TIME’s Best Inventions of 2025, recognized as a Tech Pioneer by the World Economic Forum, and featured in Fast Company’s Next Big Things in Tech. Our mission is to create the Age Layer for the internet, a crucial infrastructure enabling digital platforms to verify age and manage compliance seamlessly across global markets.
Powered by our Compliance Development Kit (CDK) and AgeKit, our core platform is the trusted backbone for the world’s leading game publishers and digital ecosystems, streamlining fragmented compliance through a unified API that efficiently handles age verification, parental consent, and regulatory requirements in over 200 markets. Supported by esteemed venture capital firms, including a16z and Lightspeed, k-ID is poised for significant growth.
About The Role
We are seeking a Lead Site Reliability Engineer and NOC Lead to spearhead production reliability and operational excellence across our platform.
In this senior position, you will be accountable for the reliability, availability, observability, and operational maturity of k-ID’s systems while leading the Network Operations Center (NOC) function. Your role extends beyond merely responding to incidents; you will build systems, processes, tools, and team standards that minimize incident frequency and severity, ensuring rapid resolution when they occur.
This role surpasses our senior NOC hires, as we need someone capable of establishing the operational model for the NOC, enhancing technical standards for incident management, collaborating closely with engineering leadership, and driving the long-term reliability roadmap for the business. You should be adept at transitioning between hands-on technical tasks, operational leadership, incident command, and team development.
Join aumovio as an Application Automation Engineer Intern, where you will have the opportunity to dive into the world of software automation and contribute to innovative projects. This internship will allow you to work alongside experienced engineers, enhancing your skills and gaining valuable insights into the automation processes that drive our applications.
Join our dynamic team as a Linux & Ansible Automation Engineer (Level 2)! We are seeking a highly motivated and skilled engineer with a minimum of 3 years’ hands-on experience in infrastructure automation, particularly utilizing Ansible and the Ansible Automation Platform (AAP). In this role, you will be instrumental in developing and managing automation solutions that drive operational efficiency and ensure the reliability of our Linux environments.
Key Responsibilities:
- Design, develop, and maintain Ansible playbooks, roles, and collections for the automated setup of systems, deployment of applications, and provisioning of infrastructure.
- Oversee and support the Ansible Automation Platform (AAP), including the management of job templates, inventories, workflows, and credentials.
- Collaborate with DevOps, Cloud, and Security teams to implement Infrastructure as Code (IaC) solutions effectively.
- Automate critical tasks such as patch management, compliance verification, and system hardening on Linux servers.
- Diagnose and resolve issues related to automation processes and system performance.
- Produce and maintain detailed documentation of automation workflows and best practices.
- Mentor junior engineers and work closely with cross-functional teams to foster automation initiatives.
Qualifications:
Required Skills and Experience:
- Minimum of 3 years of experience in automation using Ansible.
- Strong expertise in Linux system administration (e.g., RHEL, CentOS, Ubuntu).
- Proficient in developing and managing Ansible playbooks, roles, and modules.
- Hands-on experience with Ansible Automation Platform or AWX.
- Familiarity with scripting languages such as Bash or Python.
- Competent in using Git and adhering to best practices in version control.
- Basic knowledge of CI/CD pipelines and their integration with automation tools.
- Excellent troubleshooting and problem-solving skills.
Preferred Qualifications:
- Red Hat Certified Specialist in Ansible Automation or similar certification.
- Experience with public cloud services, including AWS, Azure, or Google Cloud.
- Familiarity with containerization technologies like Docker or Podman.
- Understanding of monitoring tools such as Prometheus and Grafana.
- Experience with IT service management tools like ServiceNow.
We are seeking a dedicated Site Reliability Engineer (SRE) to enhance and maintain the availability of our trading binary systems. This role requires you to be on duty during the European team's off-hours, ensuring uninterrupted operations.
Your Key Responsibilities:
- Overseeing operational management of trading activities, with a focus on proactive monitoring.
- Managing incidents, including rapid escalation and mitigation strategies.
- Participating in on-call duty to address critical issues.
- Performing debugging tasks using C++ and Python, along with classifying issues effectively.
- Developing observability metrics and trading analytics to support our trading systems.
- Keeping abreast of financial and technical news by reading relevant materials and monitoring exchange newsletters.
Our Ideal Candidate Will Have:
- A Bachelor’s degree in a quantitative field such as Computer Science, Engineering, Physics, or Mathematics.
- At least 5 years of experience in a Site Reliability Engineering role.
- Programming proficiency in Python or Go is preferred.
- Strong knowledge of Unix systems.
- Experience deploying, configuring, and managing Linux-based servers, including Docker, Kubernetes, and Grafana.
- Ability to identify opportunities for platform improvements within a complex technical landscape.
- Exceptional communication skills, capable of engaging with both internal teams and external clients.
- Proficiency in English at B2/Upper-Intermediate level or higher.
- A proactive approach and willingness to learn about new domains.
Bjak is a financial services company working to make affordable financial solutions available throughout ASEAN. Headquartered in Malaysia, Bjak operates the largest insurance portal in Southeast Asia, helping millions compare and select insurance policies at Bjak.com. The team builds with custom APIs, trading systems, and data science to simplify access to financial services that were previously hard to reach. Bjak has developed products that address complex regulatory requirements, including a global platform for online purchase of investment-linked life and health insurance with instant agent access.
Role overview
This Senior QA Automation Engineer position is based in Singapore. The role centers on building and maintaining reliable, secure platforms that support Bjak’s mission to improve access to financial services. The work directly impacts the user experience and helps ensure the company’s platforms remain dependable and safe.
ABOUT LUMILENS
At Lumilens, we are at the forefront of creating essential photonics infrastructure that will drive the future of AI supercomputing. Our innovations range from chip-to-chip optical interconnects to scalable photonic engines, paving the way for a revolution in computing: faster, cooler, and significantly more efficient.
As a well-funded startup supported by Mayfield and led by industry veterans who have successfully developed transformative technologies, this is not just another job. It’s a ground-floor opportunity to fundamentally rethink the optical layer from the silicon level up. You will collaborate with a team of top-tier engineers tackling some of the most challenging problems in optics, systems, and scalability. Each line of code you write and every design decision you make will have a lasting impact on the infrastructure of tomorrow.
If you're seeking a mission-driven role with momentum and the opportunity to make a significant impact, join us on this exciting journey. We’re just getting started.
POSITION OVERVIEW
In the role of Optical Manufacturing Test Engineer with a focus on software automation, you will engage in test automation and networking protocols. Your primary responsibility will be to validate complex systems at Layer 1 and Layer 2 using advanced tools such as Ixia, while also developing Python-based frameworks. This critical role is essential for validating and ensuring the quality of Lumilens’ photonic components, modules, and systems. You will blend hands-on production testing with software test development and reporting, supporting both new product introduction (NPI) and volume manufacturing.
You will collaborate closely with optical design, hardware, and manufacturing teams for design optimization, calibration, characterization, and troubleshooting.
DUTIES/RESPONSIBILITIES
- Work alongside design engineering and manufacturing teams to define and develop software tests for high-volume production of premium optical components.
- Own the end-to-end manufacturing test process, ensuring all modifications and upgrades to test and manufacturing software adhere to change control and established Quality Management System (QMS) requirements, including monitoring software revision control, bug tracking, tester updates, regression testing, and test logs.
- Maintain up-to-date documentation and test specifications, utilizing systematic processes for test cases, results, and defect management.
- Demonstrate proficiency with version control systems such as GitHub or Bitbucket.
- Conduct functional, regression, and performance testing on Layer 1 and Layer 2 systems, ensuring optimal performance and reliability.

