About the job
About Cantina Labs:
Cantina Labs is at the forefront of social AI, innovating a comprehensive suite of cutting-edge, real-time models that redefine expression, personality, and realism. Our mission is to breathe life into characters, revolutionizing storytelling, connection, and creativity. The Cantina platform, our flagship social AI offering, is just the beginning of our journey.
Join us in shaping the future of AI and its impact on human creativity and social interactions!
About the Role:
We are seeking an Applied Machine Learning Engineer to build and improve the data pipelines that support our large video generation models. The position focuses on collecting large volumes of video data, preparing high-quality training samples, and developing robust preprocessing, filtering, and parsing workflows. You will manage annotation pipelines across platforms like MTurk and oversee the entire lifecycle of training data, from raw ingestion to polished, model-ready samples that directly improve the quality of our models. The role sits at the intersection of data engineering and machine learning research and is essential for turning complex real-world data into the driving force behind our models.
What You’ll Do:
- Develop and maintain scalable data pipelines for large video generation models, encompassing data ingestion, filtering, preprocessing, and dataset curation, utilizing tools such as AWS S3 and DynamoDB.
- Design and execute annotation workflows on platforms such as MTurk and Prolific, focusing on task design, quality control, and label validation.
- Train and enhance smaller supporting models used for data filtering, quality assessment, preprocessing, and other components of the machine learning pipeline.
- Collaborate closely with research and engineering teams to transform experimental workflows into scalable and repeatable systems that facilitate model training and evaluation.
- Ensure data quality throughout the pipeline by identifying bottlenecks, failure modes, and low-quality sources, and continuously refine tools and processes.
- Create internal tools and automation that streamline dataset preparation, launch annotation jobs, monitor outputs, and support the full model development process.
- Lead larger pipeline projects from inception to completion, including new dataset creation initiatives or enhancements to labeling and preprocessing infrastructure.
