About the job
Maincode is seeking a Signal Engineer in Melbourne to help shape the future of Matilda, a leading Australian language model. This role focuses on transforming massive amounts of raw data into high-quality training sets, directly influencing model performance. Both engineering and editorial skills are essential, as the work blends technical pipeline development with careful judgment about data quality.
Key Responsibilities
- Design and implement large-scale data pipelines for ingesting, cleaning, deduplicating, filtering, and scoring training data, at volumes ranging from terabytes to petabytes.
- Develop classifiers and heuristics to distinguish valuable data from irrelevant or low-quality input.
- Experiment with different dataset combinations to identify what best improves the language model.
- Build tools for exploring, sampling, and auditing the data corpus.
- Collaborate closely with researchers and training engineers to align data strategies with model objectives.
Requirements
- Strong engineering background, particularly with Python, data tools, distributed processing, and building reliable data pipelines.
- High attention to detail, recognizing that small errors multiply quickly at this data scale.
- Ability to assess the quality of training data accurately.
- Comfort working with extremely large and complex datasets.
- Interest in how data decisions influence model behavior.
- Quick to pick up new concepts; prior experience with large language models is not required.
Preferred Experience
- Experience with web-scale corpora or pre-training data pipelines.
- Familiarity with unstructured text data.
- Knowledge of distributed data frameworks such as Spark or Ray.
- Background in deduplication, quality classification, or tokenization.
Additional Details
This is a full-time, in-person position based in Melbourne. Maincode cannot provide visa sponsorship, so candidates must already hold valid and unrestricted work rights in Australia.

