About the job
Key Objectives
• Ensure comprehensive visibility of application health, performance, and user experience across vital business services.
• Deploy and oversee Application Performance Monitoring (APM) tools that provide real-time insights, transaction tracing, and detailed diagnostics of application behavior.
• Assist in performance engineering, incident prevention, and swift root cause analysis of application-related issues.
• Collaborate with DevOps, development, and infrastructure teams to cultivate a performance-driven culture supported by actionable telemetry.
Responsibilities Specific to the Role
• Implement, configure, and manage enterprise-level APM tools throughout the application ecosystem.
• Develop dashboards, alert protocols, and SLA-based benchmarks for essential business applications.
• Enable distributed tracing and service mapping to visualize and diagnose performance across complex dependencies.
• Work closely with application owners and development teams to resolve recurring issues and enhance code-level performance.
• Ensure seamless integration of APM tools within CI/CD pipelines and incident response workflows.
• Lead performance monitoring initiatives during significant application launches or peak operational periods.
• Conduct APM health evaluations and guarantee consistent telemetry coverage across development, testing, and production environments.
General Functional Duties
• Uphold monitoring standards and documentation, including runbooks, dashboards, and escalation procedures.
• Help define and track service-level objectives (SLOs) and service-level agreements (SLAs).
• Participate in post-incident reviews and performance retrospectives to enhance visibility and minimize Mean Time To Recovery (MTTR).
• Support the automation of alert routing, event correlation, and ticket enrichment using observability data.
• Stay updated on trends in application monitoring, including the adoption of OpenTelemetry and AI-driven diagnostics.
• Provide out-of-hours performance troubleshooting during P1/P2 incidents as part of a rotating schedule.

