About the job
Overview
We are looking for a highly skilled Principal Solutions Architect to lead the end-to-end design, sizing, and deployment of infrastructure aligned with NVIDIA's AI Factory. In this technical, customer-facing role, you will translate complex AI and machine learning workload requirements into robust, fully engineered infrastructure solutions spanning colocation facilities, GPU compute, high-performance networking, parallel storage, and the complete NVIDIA AI software ecosystem.
You will serve as a trusted technical advisor to enterprise and hyperscale clients, collaborating closely with sales, product, and engineering teams to win and execute transformative AI infrastructure initiatives. Your work will shape how organizations build and operate production AI Factories capable of training frontier models, running large inference fleets, and accelerating data science workflows at scale.
Your Impact
Solution Design & Architecture
- Facilitate discovery workshops to gather AI/ML workload requirements, including model training scale, inference SLAs, data pipeline throughput, and multi-tenancy needs.
- Architect comprehensive AI Factory solutions in accordance with NVIDIA reference architectures, integrating colocation, GPU compute, networking, storage, and software components.
- Create detailed Bills of Materials (BOMs), rack elevation diagrams, network topology diagrams, and power/cooling budgets for client proposals.
- Define GPU cluster architectures utilizing NVIDIA DGX, HGX, and MGX systems with B200, B300, and GB300 Blackwell SXM and NVLink-Switch configurations.
- Design RTX PRO 6000 Blackwell Server Edition deployments tailored for inference-optimized and enterprise AI workloads.
- Conduct workload sizing and TCO/ROI modeling to justify infrastructure dimensioning for training, fine-tuning, and inference at scale.
Colocation & Facility Planning
- Outline colocation requirements, including critical power load (MW-scale), UPS and generator configurations, and PUE targets.
- Design high-density GPU deployments using air-cooled, direct liquid cooling (DLC), and rear-door heat exchanger setups.
- Specify meet-me room (MMR) and cross-connect requirements; define carrier-neutral telecom diversity strategies.
- Engage with colocation providers and data center operators to validate...