TLDR: We are seeking an experienced Audio and Multimodal Machine Learning Engineer to develop, train, and deploy cutting-edge speech, audio, and multimodal AI models for an advanced AI safety platform that handles over 100 million API calls each month.

About Us

White Circle is at the forefront of AI safety, dedicated to creating a reliable and optimized framework for AI systems. Our innovative platform is powered by simple natural-language policies that dictate the acceptable behaviors of AI models. We automate the testing, enforcement, and continuous enhancement of these policies to ensure they scale effectively.

- Backed by $11 million from leading investors, founders, and executives from organizations like OpenAI, Anthropic, HuggingFace, Mistral, DeepMind, and Datadog.
- Processing over 100 million API calls monthly.
- We specialize in fine-tuning and training our own large language models to outperform both open-source and proprietary alternatives in speed and cost.

Our team is small yet immensely focused. If you are eager to tackle complex challenges, quickly see your contributions in production, and shape the future of AI safety, we want you on our team.

Your Responsibilities:

- Train and refine large-scale audio and multimodal models from scratch and using pretrained checkpoints.
- Design and execute experiments, including architectural modifications, data mixes, and training methodologies.
- Create and maintain audio data pipelines, transforming raw recordings into training-ready datasets.
- Optimize models for production environments, focusing on quantization, distillation, and streaming inference.
- Implement end-to-end model deployment, ensuring low-latency serving from research checkpoints.
- Collaborate with research teams to translate experimental concepts into deployable features.
- Establish key evaluation metrics and benchmarks that are crucial for product performance.

Ideal Candidate:

- 3+ years of experience in training large-scale deep learning models within audio, speech, or acoustic realms.
- Proficient in PyTorch and experienced in distributed training frameworks (such as DeepSpeed, FSDP, etc.).
- Familiar with audio/speech architectures like Audio Qwen, Whisper, HuBERT, or Conformer.
- Experience with multimodal architectures such as Audio Flamingo, Omni Qwen, etc.

Jan 16, 2026