About the job
Abnormal Security is looking for a Platform & Infrastructure Software Engineer II to join the Platform & Infrastructure (PI) team in Bangalore (hybrid). This role focuses on building and evolving the core platforms that support Abnormal’s products and growth. The position centers on observability and developer tooling, with a direct impact on engineering efficiency and system reliability.
Key Focus Areas
- Observability Platform: Take charge of the monitoring, metrics, and alerting infrastructure used by all engineering teams. Work with Prometheus, Chronosphere, and Grafana to improve real-time visibility into system performance. Build dashboards, manage metric pipelines at scale, and optimize observability across production environments in the US, EU, and GovCloud.
Your Impact
- Lead the observability stack (Prometheus, Chronosphere, Grafana, PagerDuty), giving teams the tools to quickly identify and resolve production issues. Your work will help engineers operate faster and with greater confidence.
- Design and refine platforms and developer tools that reduce friction, speed up deployments, and simplify pipeline creation, so product teams can focus on building features instead of troubleshooting infrastructure.
- Set and maintain SLAs and SLOs for shared infrastructure, balancing resilience and cost for the systems that support Abnormal’s products.
- Make architectural decisions about alerting pipelines and cross-environment deployments that shape product capabilities and delivery speed.
What You Will Do
- Work closely with the Tech Lead, Engineering Manager, and Product Manager to design, build, and launch key platform features, from technical documentation through production deployment.
- Own features end-to-end: scope, implement, test, deploy, and monitor them across US, EU, and GovCloud environments.
- Manage one to three critical services in Observability (Prometheus, Chronosphere, Grafana, and the PagerDuty pipeline) or Data Infrastructure (Airflow, Spark), ensuring they remain reliable, performant, and up to date.
- Participate in on-call rotations to triage, diagnose, and resolve production issues independently, building deep operational understanding of your systems.
- Increase system resilience by automating runbooks, refining SLAs/SLOs, and proactively finding performance bottlenecks or potential failure points.
- Take responsibility for the reliability of your code, including writing unit and integration tests and adding observability instrumentation.
- Develop platforms, tools, and APIs that help other engineering teams deliver their work efficiently.

