About the job
At Block, we are more than just a company; we are a collective of diverse teams united by a common mission of economic empowerment. Our foundational teams — including People, Finance, Counsel, Hardware, Information Security, and Platform Infrastructure Engineering — collaborate across various business sectors and global time zones to create inclusive policies, provide financial forecasting, deliver legal support, secure our systems, and nurture innovative initiatives. Every challenge we face opens new opportunities, and we value diverse perspectives to uncover them. We invite you to bring yours to Block.
The Role
As a vital member of our Site Reliability Engineering (SRE) team, you will take on the dual responsibility of proactively enhancing and reactively managing the reliability of Block's platform and critical infrastructure. You are driven by metrics, possess a systems-oriented mindset, and are dedicated to building distributed platforms that facilitate safe, scalable product development.
You will utilize and continuously refine AI-driven tools and automation to boost observability, expedite incident detection and response, and minimize operational toil. This includes applying AI techniques to incident analysis, alert tuning, and operational workflows.
Your role will also include primary platform on-call duty (12-hour shifts for one week at a time, recurring every few weeks depending on team size) in support of Block's most critical (Tier 0) services. In this capacity, you will lead incident command, coordinate mitigation efforts, and ensure effective escalation during high-severity incidents.
You Will
- Build and extend platforms to enhance system reliability.
- Collaborate on team objectives that prioritize reliability across the entire company.
- Standardize reliability tools across multiple platforms and departments.
- Triage, coordinate, and lead stabilization efforts for severity 0–1 incidents.
- Serve as the primary on-call engineer, maintaining clear escalation paths and demonstrating leadership during escalations.
- Drive improvements in platform-wide reliability, shared operational tools, and safe deployment patterns.
- Leverage AI-driven systems to enhance signal detection, reduce noise, and accelerate root cause analysis.
- Design and implement safe deployment strategies (including progressive delivery, automated rollback, and guardrails).
You Have
- A strong inclination towards identifying root causes in complex systems and implementing necessary fixes.
- Proven technical initiative and leadership on prior projects, particularly backend or platform projects.
- Experience with AI-driven tools for observability, incident analysis, or automation.
- A mindset that naturally re-evaluates existing processes to drive continual improvement.

