About the job
Site Reliability Engineer - AI Infrastructure
Location: Global Remote / San Francisco · Full-Time
About Andromeda
Andromeda Cluster, founded by Nat Friedman and Daniel Gross, gives early-stage startups access to the kind of advanced AI infrastructure that was previously exclusive to hyperscalers. We began with a single managed cluster that quickly reached capacity, which led us to build robust systems, networking, and orchestration layers to make AI infrastructure more accessible than ever.
Today, we work with top AI labs, data centers, and cloud providers to deliver compute precisely when and where it's needed most. Our platform routes training and inference jobs across a global supply chain, bringing flexibility and efficiency to one of the fastest-growing markets in the world.
Our vision is to create a liquidity layer for global AI compute — a marketplace that dynamically moves the infrastructure and workloads essential for AGI, akin to the capital flows in global financial markets.
We are looking for talented people who excel in AI infrastructure, research, and engineering to join our pioneering team.
Your Responsibilities
Provision, configure, and manage Kubernetes clusters for clients across various service providers.
Develop automation tools to enhance the deployment and integration of clusters.
Troubleshoot customer issues related to networking, storage, scheduling, and system layers.
Enhance the reliability and scalability of training and inference infrastructures.
Design and implement monitoring, alerting, and observability solutions for critical systems.
Collaborate with engineering and product teams to plan and deliver infrastructure for new services.
Participate in on-call rotations and incident response, leading postmortems and driving reliability improvements.
Ideal Candidate Profile
A minimum of 5 years of experience in Site Reliability Engineering (SRE), DevOps, or infrastructure engineering roles.
Solid foundation in Linux systems and networking principles.
Extensive expertise in Kubernetes and container orchestration at scale.
Proficient with Infrastructure-as-Code tooling (Terraform, Helm, etc.).