About the job
Location: NYC Global HQ (Hybrid: 3 days in office)
About DoubleVerify
DoubleVerify (DV) provides digital performance solutions that help advertisers and agencies verify campaign quality, optimize results, and measure business impact with independent third-party analytics. Since 2008, DV has worked with Fortune 500 companies, agencies, publishers, and digital ad platforms to improve transparency and drive better outcomes in digital advertising. Learn more at www.doubleverify.com.
What You Will Do
- Improve the reliability, scalability, and performance of digital media measurement platforms.
- Set up and refine observability practices, including metrics collection, dashboards, and alerting, to support proactive reliability improvements.
- Reduce Mean Time to Recovery (MTTR) for critical incidents by automating processes, improving observability, and enhancing monitoring.
- Lead incident response efforts, especially for Sev1 and Sev2 incidents, and drive resolutions.
- Maintain high availability for infrastructure and services across GCP, AWS, OCI, and on-premises systems.
- Guide technical projects from planning through deployment, keeping stakeholders informed and collaborating with teams.
- Design and deploy automation that reduces manual work and improves efficiency, including deployment workflows, validation scripts, and self-service tools.
- Use AI-assisted development tools to speed up automation and troubleshooting. Build integrations and Model Context Protocol (MCP) servers to support monitoring platforms and AI-driven analysis.
- Apply Infrastructure-as-Code methods using Terraform, Helm charts, Python scripts, and configuration management tools for consistent, version-controlled deployments.
- Develop and update documentation, runbooks, and Standard Operating Procedures (SOPs) in Confluence to support consistent incident response.

