About the job
The Technical Staff Member (Data) - World Models position at Reka centers on building and maintaining large-scale data systems for AI model training. The role can be based in the US or Singapore, or performed remotely. The focus is on integrating and processing petabyte-scale multimodal datasets while keeping the data infrastructure sustainable and efficient.
What you will do
- Lead the development of data pipelines and storage solutions to handle vast and varied datasets for model training.
- Create automated, resource-efficient tools and systems that process many small, diverse datasets at scale.
- Design, automate, and maintain Python ETL pipelines (using Spark or Ray) for large multimodal data processing.
- Develop and maintain systems for data cataloging, lineage tracking, quality assurance, integrity checks, access management, and lifecycle oversight.
- Support colleagues by providing internal tools, documentation, and guidance on data best practices.
- Act as the primary steward of the organization’s datasets, ensuring their quality, accessibility, and overall health.
Key challenges
- Build high-performance pipelines that process petabyte-scale datasets across thousands of CPUs and hundreds of GPUs.
- Adapt data formats, storage, and processing methods to keep pace with AI advancements while maintaining backward compatibility.
- Scale data infrastructure to support rapid organizational growth.
- Ensure the platform remains flexible for handling heterogeneous datasets and ad-hoc analytics needs.
Requirements
- Expertise in data engineering, especially Python ETL pipelines, plus familiarity with infrastructure, data formats, and large-scale storage systems.
- Experience in managing datasets, annotations, and data versioning for machine learning model training.
- Solid understanding of fundamental machine learning concepts to collaborate effectively with researchers and inform platform decisions.
- Ability to draft clear specifications for AI agents and maintain strong human oversight of AI-generated outputs.
- Demonstrated initiative, ownership, and effective communication in managing workload and priorities.
