Site Reliability Engineer - AI Infrastructure

Andromeda Cluster · Global Remote / San Francisco, CA
Remote Full-time

Experience Level

Mid to Senior

Qualifications

Experience: 5+ years in SRE, DevOps, or infrastructure roles.

Technical Skills: Strong Linux and networking fundamentals, deep Kubernetes experience, proficiency with Infrastructure-as-Code tools.

About the job

Location: Global Remote / San Francisco · Full-Time

About Andromeda

Andromeda Cluster, established by Nat Friedman and Daniel Gross, aims to democratize access to advanced AI infrastructure, previously exclusive to hyperscalers, for early-stage startups. Our journey began with a single managed cluster that quickly reached capacity, which pushed us to develop robust systems, networking, and orchestration layers to make AI infrastructure more accessible than ever.

Today, we collaborate with top AI laboratories, data centers, and cloud service providers to deliver compute resources precisely when and where they are needed most. Our platform routes training and inference jobs across a global supply chain, providing flexibility and efficiency in one of the fastest-growing markets worldwide.

Our vision is to create a liquidity layer for global AI compute: a marketplace that dynamically moves the infrastructure and workloads essential for AGI, much as capital flows through global financial markets.

We are on the lookout for talented individuals who excel in AI infrastructure, research, and engineering to join our pioneering team.

Your Responsibilities

  • Provision, configure, and manage Kubernetes clusters for clients across various service providers.

  • Develop automation tools to enhance the deployment and integration of clusters.

  • Troubleshoot customer issues related to networking, storage, scheduling, and system layers.

  • Enhance the reliability and scalability of training and inference infrastructure.

  • Design and implement monitoring, alerting, and observability solutions for critical systems.

  • Work collaboratively with engineering and product teams to strategize and deliver infrastructure for new services.

  • Engage in on-call duties and incident response, leading postmortems and reliability enhancements.

Ideal Candidate Profile

  • A minimum of 5 years of experience in Site Reliability Engineering (SRE), DevOps, or infrastructure engineering roles.

  • Solid foundation in Linux systems and networking principles.

  • Extensive expertise in Kubernetes and container orchestration at scale.

  • Proficient in Infrastructure-as-Code methodologies (Terraform, Helm, etc.).

About Andromeda Cluster

Andromeda Cluster is revolutionizing AI infrastructure accessibility for startups, enabling them to harness the power of advanced compute resources previously reserved for large-scale enterprises. Our innovative solutions support a global ecosystem of AI research and development.
