About the job
Join the innovative team at Perplexity as an AI Infrastructure Engineer. In this role, you will apply your expertise in Kubernetes, Slurm, Python, C++, and PyTorch, primarily on AWS. You will collaborate closely with our Inference and Research teams to design, deploy, and optimize our large-scale AI training and inference clusters.
Responsibilities
Architect, deploy, and manage scalable Kubernetes clusters tailored for AI model inference and training workloads.
Oversee and enhance Slurm-based HPC environments for distributed training of large language models.
Create robust APIs and orchestration systems for training pipelines and inference services.
Implement effective resource scheduling and job management systems across diverse compute environments.
Evaluate system performance, identify bottlenecks, and implement enhancements across both training and inference infrastructures.
Develop monitoring, alerting, and observability solutions specifically designed for ML workloads running on Kubernetes and Slurm.
Quickly respond to system outages and collaborate with multiple teams to ensure high uptime for critical training runs and inference services.
Optimize cluster utilization and implement autoscaling strategies to meet dynamic workload demands.
Qualifications
Extensive experience in Kubernetes administration, including custom resource definitions, operators, and cluster management.
Proficiency in Slurm workload management, including job scheduling, resource allocation, and cluster optimization.
Demonstrated experience in deploying and managing distributed training systems at scale.
In-depth knowledge of container orchestration and the architecture of distributed systems.
Solid familiarity with LLM architecture and training processes, including Multi-Head Attention, Multi-Query and Grouped-Query Attention, and distributed training strategies.
Experience in managing GPU clusters and optimizing compute resource utilization.
Required Skills
Advanced Kubernetes administration and YAML configuration management skills.
Expertise in Slurm job scheduling, resource management, and cluster configuration.
Proficiency in Python and C++ programming with a focus on systems and infrastructure automation.

