About the job
Join Alpheya, an innovative B2B WealthTech startup headquartered in Abu Dhabi, backed by industry titans such as BNY Mellon, America's oldest bank, and Lunate, a leading $50B AUM alternative asset management firm. With a successful funding round of $300 million, we are on a mission to revolutionize wealth technology.
Our goal is to empower our clients' wealth franchises by providing unique experiences, financial solutions, and insightful analytics. Our cutting-edge digital wealth management platform aims to assist banks and financial institutions in the Middle East to effectively engage affluent, high-net-worth (HNW), and ultra-high-net-worth (UHNW) investor segments.
As a startup, we leverage the extensive capabilities of large organizations while fostering agile, cross-functional teams that drive innovation and efficiency.
Role Overview
We are creating a dedicated engineering team focused on production incident response and advanced debugging, ensuring permanent solutions across application, data, and deployment layers.
This role goes beyond traditional operations: you will write code, implement fixes safely, and harden our platform so that issues do not recur. It combines engineering and operations and gives you complete ownership of outcomes, from investigating incidents and deploying code fixes to implementing preventative measures through testing and observability.
Key Responsibilities
- Lead the production incident response process, including triage, mitigation, stakeholder communication, and team coordination.
- Debug and resolve issues across Go services (mandatory) and other relevant services (e.g., Node.js).
- Work across service boundaries (GraphQL/RPC), using distributed tracing to follow requests and identify performance bottlenecks.
- Troubleshoot Kubernetes workloads and deployments effectively.
- Diagnose PostgreSQL/CNPG-related issues.
- Manage production bugs that affect application and data pipelines (ETL/Snowflake mappings), including backfills/replays and data quality validation.
- Implement preventative strategies: add regression tests, enhance observability, and maintain operational documentation.
- Drive reliability improvements by establishing SLOs/SLIs, enhancing alert quality, and ensuring release readiness across teams.
- Automate post-deployment validation (smoke/regression) using tools such as Playwright, so that every fix ships with test coverage that guards against regressions.

