Zettabyte Space logo

Senior/Staff Backend Engineer - Distributed Systems at Zettabyte Space

Zettabyte SpaceUnited States
On-site Full-time

Clicking Apply Now takes you to AutoApply where you can tailor your resume and apply.


Experience Level

Senior

Qualifications

Candidates should possess strong analytical and problem-solving skills, along with a collaborative mindset to work effectively within a team-oriented environment.

About the job

About Zettabyte Space

At Zettabyte Space, we are dedicated to revolutionizing AI computing by making it ubiquitous, seamless, and limitless. Our vision is to create a cloud where AI operates effortlessly, anytime, anywhere. Join us in designing the infrastructure for an AI-first future.

Role Overview

We are seeking a skilled Backend Engineer to develop systems that orchestrate GPU clusters for AI workloads. You will be responsible for creating APIs that manage GPU allocation, memory handling, compute scheduling, and multi-tenant isolation, addressing challenges that are distinct to AI infrastructure, surpassing traditional backend engineering. As a key member of our backend team, you will tackle problems such as: How can we efficiently share valuable GPU resources among users? How do we navigate GPU memory limitations for large AI models? How can we ensure optimal quality of service when workloads compete for resources? This is your chance to contribute to an infrastructure where every API call could represent thousands of dollars worth of compute usage per hour, directly influencing the affordability of AI model training for startups.

Key Responsibilities

  • Design APIs that simplify complex GPU operations for developers

  • Develop scheduling algorithms that maximize GPU utilization while adhering to SLA requirements

  • Create resource management systems for the GPU lifecycle, provisioning, allocation, scheduling, and deallocation

  • Implement usage tracking and billing systems for GPU hours, memory use, and compute efficiency

  • Establish monitoring for GPU-specific metrics, perform health checks, and enable automatic failure recovery

  • Construct multi-tenancy systems with resource isolation, quota management, and equitable scheduling

  • Optimize cold start times for model serving and develop efficient model loading techniques

  • Collaborate with frontend engineers to present complex infrastructure through user-friendly interfaces

  • Utilize AI-assisted coding tools (e.g., GitHub Copilot, Claude Code, Cursor IDE) to enhance productivity and code quality.

Ideal Candidate Profile

  • A minimum of 5 years of backend engineering experience, particularly with distributed systems

  • Proficient in Go, Python, or similar backend programming languages

  • Experience with resource scheduling, orchestration, and API design (REST, GraphQL, gRPC)

About Zettabyte Space

Zettabyte Space is at the forefront of AI computing, aiming to create an accessible and efficient environment for AI technologies through innovative cloud solutions.

Similar jobs

Browse all companies, explore by city & role, or SEO search pages.

Tailoring 0 resumes

We'll move completed jobs to Ready to Apply automatically.