About the job
About Zettabyte Space
At Zettabyte Space, we are dedicated to revolutionizing AI computing by making it ubiquitous, seamless, and limitless. Our vision is to create a cloud where AI operates effortlessly, anytime, anywhere. Join us in designing the infrastructure for an AI-first future.
Role Overview
We are seeking a skilled Backend Engineer to develop systems that orchestrate GPU clusters for AI workloads. You will build APIs that manage GPU allocation, memory handling, compute scheduling, and multi-tenant isolation, addressing challenges that are specific to AI infrastructure and go beyond traditional backend engineering.
As a key member of our backend team, you will tackle problems such as: How can we efficiently share valuable GPU resources among users? How do we work within GPU memory limits for large AI models? How can we maintain quality of service when workloads compete for resources? This is your chance to contribute to infrastructure where a single API call could represent thousands of dollars' worth of compute per hour, directly influencing how affordable AI model training is for startups.
Key Responsibilities
Design APIs that simplify complex GPU operations for developers
Develop scheduling algorithms that maximize GPU utilization while adhering to SLA requirements
Create resource management systems for the full GPU lifecycle: provisioning, allocation, scheduling, and deallocation
Implement usage tracking and billing systems for GPU hours, memory use, and compute efficiency
Establish monitoring for GPU-specific metrics, implement health checks, and enable automatic failure recovery
Construct multi-tenancy systems with resource isolation, quota management, and equitable scheduling
Optimize cold start times for model serving and develop efficient model loading techniques
Collaborate with frontend engineers to present complex infrastructure through user-friendly interfaces
Use AI-assisted coding tools (e.g., GitHub Copilot, Claude Code, Cursor) to enhance productivity and code quality
Ideal Candidate Profile
A minimum of 5 years of backend engineering experience, particularly with distributed systems
Proficiency in Go, Python, or a similar backend programming language
Experience with resource scheduling, orchestration, and API design (REST, GraphQL, gRPC)
