About Us
At Heidi, we believe healthcare should have a more harmonious flow—one that prioritizes continuous and compassionate care. Our mission is to develop an AI Care Partner that collaborates with healthcare professionals to achieve this vision.
We are a diverse team of medical practitioners, engineers, designers, researchers, and visionaries dedicated to creating tools that allow clinicians to concentrate on what truly matters: their patients.
In just 18 months, Heidi has enabled healthcare professionals to reclaim over 18 million hours, facilitating 73 million patient visits across 116 countries. We currently support more than two million patient visits globally each week.
With nearly $100 million in funding, we are expanding our reach across the US, UK, Canada, and Europe, collaborating with top-tier health systems such as the NHS, Beth Israel Lahey Health, and Monash Health.
Your Role
Incident Response and On-Call Duties:
Take part in incident management, addressing production issues, aiding in service restoration, and ensuring effective communication throughout. As you gain experience, you'll lead incidents from start to finish.
Enhancing Operational Reliability:
Identify and address recurring issues and reliability threats, delivering fixes through better alerting, automation, system changes, or process improvements.
Ownership of Production Environment:
Manage and enhance Kubernetes clusters, cloud infrastructure, and core platform services, gradually increasing your ownership as you become more familiar with our systems.
Observability Improvement:
Refine dashboards, alerts, logs, and traces to enable quicker issue detection and resolution, focusing on actionable insights.
Minimizing Operational Toil:
Automate routine tasks, streamline runbooks, and enhance tools to simplify on-call responsibilities and daily operations.
Facilitating Safe Changes:
Enhance deployment methods, rollback strategies, and operational readiness to reduce the risk of incidents caused by changes.
Contribution to Operational Practices:
Document and maintain runbooks, engage in blameless post-mortems, and assist in refining incident response protocols over time.
Collaboration with Engineering Teams:
Work closely with product and feature teams to ensure seamless integration and functionality.

