Senior Site Reliability Architect - Unified Observability & AIOps

qodeworld - Texas, United States
On-site, Full-time

Experience Level

Senior

Qualifications

  • 15+ years in Site Reliability Engineering or Production Engineering
  • Strong background in Unified Observability, AIOps, and related fields
  • Proven experience in AI/ML technologies and cloud-native environments

About the job

qodeworld is seeking a Senior Site Reliability Architect to join the team in Austin, Texas. This position focuses on unified observability, proactive detection, AIOps, and GenAI-driven operations for distributed financial services platforms. The role requires deep technical expertise in designing and maintaining reliable, high-performance systems across complex architectures.

Role overview

The Senior Site Reliability Architect will drive enhancements in platform reliability and performance. This includes building SLI/SLO-driven monitoring, implementing dynamic thresholds, and developing intelligent alerting and AI/ML-based anomaly detection. The position is central to evolving operational practices from reactive alerting to proactive, insight-driven approaches.

Key responsibilities

  • Design and deploy unified observability dashboards that integrate metrics, logs, traces, events, and system topology.
  • Establish and manage SLIs, SLOs, and error budgets aligned with business goals.
  • Create actionable dashboards for operational, engineering, and leadership teams.
  • Implement advanced alerting strategies using both static and dynamic thresholds.
  • Apply AI/ML/AIOps technologies to detect anomalies, forecast incidents, and reduce MTTR.
  • Shift monitoring practices from reactive alerting to proactive insights.
  • Incorporate noise reduction, alert correlation, and root cause analysis.
  • Use baseline modeling, seasonality detection, and anomaly scoring.
  • Oversee and resolve issues in multi-service architectures, including microservices, APIs, Kafka/streaming platforms, and cloud infrastructure (Terraform, Infrastructure as Code).
  • Analyze and trace issues across upstream/downstream dependencies, streaming platforms, infrastructure, and application code.
  • Work extensively with Dynatrace (mandatory requirement).
  • Utilize tools such as OpenTelemetry, Prometheus/Grafana, ELK/EFK, and cloud-native monitoring solutions (AWS, Azure, GCP).
  • Manipulate and enrich telemetry using JSON.
  • Apply GenAI/LLMs for incident summarization, root cause explanations, runbook recommendations, and auto-remediation suggestions.
  • Collaborate with platform teams to operationalize GenAI technologies safely.
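
For candidates who want a concrete picture of the dynamic-threshold and anomaly-scoring work listed above, here is a minimal sketch in Python. It is an illustration only, not qodeworld's tooling or Dynatrace's method: the function names `anomaly_scores` and `breaches`, the rolling window of 5 samples, and the z-score cutoff of 3.0 are all assumptions chosen for the example.

```python
from statistics import mean, stdev


def anomaly_scores(values, window=5):
    """Score each sample by its distance from a rolling baseline.

    The baseline is the mean of the preceding `window` samples, and the
    score is the absolute deviation divided by their standard deviation
    (a z-score). Samples with fewer than two prior points score 0.0.
    """
    scores = []
    for i, value in enumerate(values):
        history = values[max(0, i - window):i]
        if len(history) < 2:
            scores.append(0.0)  # not enough history to form a baseline
            continue
        baseline = mean(history)
        spread = stdev(history)
        if spread == 0:
            # A flat baseline: any deviation at all is maximally surprising.
            scores.append(0.0 if value == baseline else float("inf"))
        else:
            scores.append(abs(value - baseline) / spread)
    return scores


def breaches(values, window=5, threshold=3.0):
    """Return the indices whose anomaly score exceeds the threshold."""
    scores = anomaly_scores(values, window)
    return [i for i, score in enumerate(scores) if score > threshold]


if __name__ == "__main__":
    # Hypothetical latency samples in milliseconds, with one spike.
    latencies_ms = [100, 102, 98, 101, 99, 100, 250, 101]
    print(breaches(latencies_ms))
```

Unlike a static threshold, the alerting boundary here moves with recent behavior, so the same code can watch a 100 ms service and a 2 s batch job without retuning. A production version would typically also model seasonality, which this sketch does not attempt.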

Requirements

  • 15+ years of experience in Site Reliability Engineering or Production Engineering.
  • Strong background in unified observability, AIOps, and related fields.
  • Proven experience with AI/ML technologies and cloud-native environments.

About qodeworld

qodeworld is at the forefront of innovative financial services technology, delivering scalable solutions that enhance operational efficiency and reliability. Our team is dedicated to fostering a culture of excellence, teamwork, and continuous improvement.
