About Recraft
Founded in the US in 2022 and now thriving in London, UK, Recraft is revolutionizing the creative landscape with an innovative AI tool designed for professional designers, illustrators, and marketers. We are committed to setting a new benchmark in the realm of image generation.
Our platform empowers creators to swiftly generate and refine original images, vector art, illustrations, icons, and 3D graphics using AI technology. With over 3 million users across 200 countries producing hundreds of millions of stunning images, Recraft is just getting started in its journey to redefine creativity.
Join us for professional growth, contribute to large-scale projects, and help shape the future of creativity. Our mission is clear: to make Recraft an indispensable tool for every designer and to elevate the industry standard. We are dedicated to ensuring that creators retain full control over their creative processes, providing them with cutting-edge tools to transform their ideas into reality.
If you are driven by the passion to push the boundaries of AI, we invite you to join our team!
Position Overview
As a part of Recraft, you will play a pivotal role in developing the next generation of generative models for images and text. We are seeking an ML Data Engineer to enhance our data pipelines for unstructured data, primarily focusing on images. Your work will ensure our training workflows are efficient, reliable, and scalable. You will design and manage high-throughput ingestion and preprocessing on Kubernetes, evolve our internal data-pipeline framework, and collaborate closely with ML engineers to deliver datasets that significantly improve model quality.
Key Responsibilities
- Design and maintain robust data-ingestion pipelines to source and prepare large-scale datasets of images (and occasionally text/HTML) from open, publicly accessible, and authorized sources.
- Manage the complete data flow: from raw data to quality filtering, deduplication, validation, and creation of training-ready artifacts.
- Enhance our Kubernetes-based data-pipeline framework, including distributed job handling, retries, monitoring, and automation.
- Use S3-style object storage effectively: efficient data layouts, lifecycle management, throughput optimization, and cost control.
- Implement additional tools for pipeline observability, including progress tracking, health visualization, performance metrics, and alert systems to facilitate rapid iteration.
- Collaborate closely with ML engineers to align datasets with training requirements and speed up experimentation.

