About the job
Join Krea's Innovative Team
At Krea, we are at the forefront of developing next-generation AI creative tools. Our commitment lies in making AI an intuitive and controllable medium for creatives. We aspire to create tools that enhance human creativity rather than replace it.
We view AI as a transformative medium that enables expressions across diverse formats—text, images, video, sound, and even 3D. Our focus is on creating smarter, more adaptable tools that leverage this medium effectively.
The Role of Supercomputing and AI Infrastructure at Krea
Our team is responsible for building and managing the foundational infrastructure that supports Krea's research and inference processes. This includes distributed training systems, over 1000 Kubernetes GPU clusters, and extensive petabyte-scale data pipelines. Much of our work involves creating bespoke solutions, such as custom distributed datastores, job orchestration systems, and advanced streaming pipelines, which are designed to handle modern AI workloads efficiently.
Key Projects You Will Contribute To:
Distributed Data Systems: Design and implement multi-stage pipelines to transform petabytes of raw data into clean, annotated datasets; run classification models across billions of images; deploy and integrate large language models to caption extensive multimedia data.
GPU Infrastructure: Manage distributed training and inference across 1000+ GPU Kubernetes clusters; address orchestration and scaling challenges for large-scale GPU job processing; optimize research workflows across multiple datacenters.
Distributed Training: Profile and enhance dataloaders streaming thousands of images per second; troubleshoot InfiniBand networking during extensive training runs; develop fault tolerance systems for large-scale pretraining; collaborate with researchers to refine reinforcement learning infrastructure.
Applied ML Pipelines: Identify clean scenes in millions of videos utilizing distributed shot-boundary detection; tailor and train models to sift through billions of images for specific queries; construct systems that link raw cluster capacity with research outcomes.

