About the job
About Us:
TransPerfect, a prominent name in the translation software industry with an energetic start-up culture, is on the lookout for a talented and enthusiastic Backend Developer to join our dynamic Artificial Intelligence (AI) team. Here, you will have the chance to influence the future of AI within a global organization. With over a decade of experience, beginning with the development of its first machine translation models, our AI team has become a cornerstone of innovation in machine translation, generative AI, natural language processing, and automation.
We are in search of a skilled backend developer eager to explore the limits of technology and create a meaningful impact in the AI domain. You will collaborate with a diverse and global team of experts based in the USA, Spain, Portugal, and India. If you are driven by a passion for building robust and scalable solutions that extend AI capabilities to users, this position is perfect for you.
About the Role:
As a Backend Developer, you will tackle the "last mile" of document processing: transforming complex, unstructured PDFs into well-formatted, editable .docx files. Your mission extends beyond text extraction; you will aim to faithfully reproduce the visual and structural intent of the original documents, including nested tables, multi-column layouts, font hierarchies, and styling.
You will lead the research and development of our document conversion pipeline, balancing strategic decision-making with hands-on development that draws on both engineering and AI skills.
Your responsibilities will include:
- Comparative Analysis: Conduct an in-depth evaluation of commercial solutions (ABBYY, Adobe, AWS Textract) against open-source/AI-native options (Mistral OCR, Docling, Nougat, LlamaParse).
- Benchmarking: Define metrics for "format fidelity" to objectively assess how well each tool reproduces headers, footers, tables, and styles.
- Pipeline Development: Create a Python-based workflow that integrates OCR engines with document generation libraries (such as python-docx or Pandoc).
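
To give a flavor of the benchmarking work, here is a minimal sketch of how a "format fidelity" score might be computed. It is an illustration only, not our actual method: the `Block` schema, the function names `order_fidelity` and `kind_recall`, and the sample data are all hypothetical, and it assumes each document has already been reduced to a flat list of structural elements by some tool-specific extraction step that is not shown.

```python
from collections import Counter
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Block:
    """One structural element of a document (hypothetical schema)."""
    kind: str        # e.g. "header", "footer", "table", "paragraph"
    style: str = ""  # e.g. "Heading 1", "Normal", or "2x3" for a table shape


def order_fidelity(source, converted):
    """Similarity of the two block sequences, as a ratio in [0, 1]."""
    return SequenceMatcher(None, source, converted, autojunk=False).ratio()


def kind_recall(source, converted):
    """Per element kind: share of source blocks that reappear unchanged."""
    src, out = Counter(source), Counter(converted)
    matched, total = Counter(), Counter()
    for block, n in src.items():
        total[block.kind] += n
        matched[block.kind] += min(n, out[block])
    return {kind: matched[kind] / total[kind] for kind in total}


# Made-up example: the converter kept everything except the table shape.
source = [
    Block("header", "Heading 1"),
    Block("paragraph", "Normal"),
    Block("table", "2x3"),
    Block("footer", "Normal"),
]
converted = [
    Block("header", "Heading 1"),
    Block("paragraph", "Normal"),
    Block("table", "2x2"),
    Block("footer", "Normal"),
]

print(order_fidelity(source, converted))  # 0.75
print(kind_recall(source, converted))
```

In this sketch the sequence score penalizes reordered or missing elements, while the per-kind recall shows which element types a tool tends to lose (here, the table). A real benchmark would likely need fuzzier matching, for example partial credit for a table with the right shape but wrong borders.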

