About the job
Who We Are
At TwelveLabs, we are at the forefront of creating innovative multimodal foundation models capable of interpreting videos in a manner akin to human understanding. Our models have set new benchmarks in video-language modeling and are fundamentally reshaping how we engage with and analyze diverse media formats.
With over $110 million in Seed and Series A funding, we are supported by leading venture capital firms including NVIDIA’s NVentures, NEA, Radical Ventures, and Index Ventures, alongside esteemed AI pioneers like Fei-Fei Li, Silvio Savarese, and Alexandr Wang. Our headquarters are in San Francisco, complemented by a significant presence in the APAC region from our Seoul office, reflecting our dedication to fostering global innovation.
Our strategic partnerships with NVIDIA and AWS provide us with access to state-of-the-art hardware, including B300s, enabling us to explore the frontiers of video AI technology.
As a global organization, we celebrate the distinct journeys of each individual. Our diverse cultural, educational, and life backgrounds empower us to consistently challenge traditional norms. We seek passionate individuals who resonate with our mission and are eager to contribute to transformative advancements in technology. Join us in revolutionizing video comprehension and multimodal AI.
About the Team
The Pegasus team is integral to TwelveLabs' video understanding initiatives and spearheads the development of Pegasus, our Video Analysis product. We focus on building multimodal video analysis systems that excel at instruction following and generate complex, hierarchically organized outputs. We prioritize delivering products with real-world impact over isolated research, working as a goal-driven, cross-functional team of ML researchers and engineers.
Our responsibilities span a wide array of challenges: large-scale distributed training of multimodal LLMs from pre-training through reinforcement learning, precise temporal segmentation and structured metadata extraction for practical applications, extension of temporal context lengths to several hours, and data curation pipelines that enable well-aligned evaluations and performance gains through improved training data.
Our team uses cutting-edge hardware, including NVIDIA B300s, to push the limits of video analysis systems and accelerate our transition from research to production.

