About the job
qodeworld is seeking a Senior Site Reliability Architect to join the team in Austin, Texas. This position focuses on unified observability, proactive detection, AIOps, and GenAI-driven operations for distributed financial services platforms. The role requires deep technical expertise in designing and maintaining reliable, high-performance systems across complex architectures.
Role overview
The Senior Site Reliability Architect will drive enhancements in platform reliability and performance. This includes building SLI/SLO-driven monitoring, implementing dynamic thresholds, and developing intelligent alerting and AI/ML-based anomaly detection. The position is central to evolving operational practices from reactive alerting to proactive, insight-driven approaches.
Key responsibilities
- Design and deploy unified observability dashboards that integrate metrics, logs, traces, events, and system topology.
- Establish and manage SLIs, SLOs, and error budgets aligned with business goals.
- Create actionable dashboards for operational, engineering, and leadership teams.
- Implement advanced alerting strategies using both static and dynamic thresholds.
- Apply AI/ML/AIOps technologies to detect anomalies, forecast incidents, and reduce MTTR.
- Shift monitoring practices from reactive alerting to proactive insights.
- Incorporate noise reduction, alert correlation, and root cause analysis into alerting workflows.
- Use baseline modeling, seasonality detection, and anomaly scoring to inform dynamic thresholds and anomaly detection.
- Oversee and resolve issues in multi-service architectures, including microservices, APIs, Kafka and other streaming platforms, and cloud infrastructure managed as Infrastructure as Code (for example, with Terraform).
- Analyze and trace issues across upstream/downstream dependencies, streaming platforms, infrastructure, and application code.
- Work extensively with Dynatrace (mandatory requirement).
- Utilize tools such as OpenTelemetry, Prometheus/Grafana, ELK/EFK, and cloud-native monitoring solutions (AWS, Azure, GCP).
- Manipulate and enrich JSON-formatted telemetry data.
- Apply GenAI/LLMs for incident summarization, root cause explanations, runbook recommendations, and auto-remediation suggestions.
- Collaborate with platform teams to operationalize GenAI technologies safely.
Requirements
- 15+ years of experience in Site Reliability Engineering or Production Engineering.
- Strong background in unified observability, AIOps, and related fields.
- Proven experience with AI/ML technologies and cloud-native environments.

