About the job
Join AvePoint as a Senior Splunk Engineer focused on Automation and Reliability Engineering Projects!
Project Overview
- Contribute to Automation and Reliability Engineering efforts and operations.
Key Responsibilities
- Oversee Observability Engineering and Governance initiatives.
- Design and maintain enterprise SIEM solutions compliant with operational resilience frameworks (e.g., MAS TRM, DORA, APRA CPS 230).
- Lead the deployment, configuration, and optimization of Splunk for comprehensive visibility across infrastructure, applications, networks, and user experiences.
- Establish and uphold telemetry data governance standards covering metrics, logs, and traces to ensure consistency, compliance, and security.
- Integrate Splunk with incident management, ITSM, and AIOps systems for predictive alerting and anomaly detection.
- Serve as the SIEM/Splunk subject matter expert (SME) for architecture reviews, upgrades, and performance enhancements.
Reliability Engineering and Automation:
- Implement and advocate for Site Reliability Engineering (SRE) frameworks and reliability practices for critical systems.
- Design and automate runbooks, alerts, and self-healing workflows using Python, Ansible, and Terraform.
- Collaborate with Application, Infrastructure, and Cyber teams to incorporate reliability principles into the delivery lifecycle.
- Conduct resilience, chaos, and capacity testing in accordance with business continuity and disaster recovery standards.
- Define and monitor error budgets, reliability scorecards, and service health indicators for production workloads.
Cloud and Platform Integration:
- Engineer SIEM solutions for cloud-native workloads in AWS and Azure, ensuring visibility across compute, storage, and network layers.
- Integrate Splunk and cloud observability tools into CI/CD pipelines and landing zones for continuous compliance.
- Implement infrastructure-as-code (IaC) models using Terraform and Ansible for consistent and auditable provisioning.
- Work alongside Cloud, DevOps, and Security teams to ensure telemetry aligns with audit, compliance, and operational risk requirements.
Operational Excellence and Collaboration:
- Drive reductions in incident recurrence, Mean Time to Recovery (MTTR), and manual intervention through observability-led automation.
- Partner with Service Delivery, Cyber, and Application teams to facilitate predictive incident prevention and root cause transparency.
- Develop and maintain executive dashboards and reports highlighting availability, reliability KPIs, and operational risk indicators.

