
Distributed AI Support Engineer

GRNET S.A.
Athens, Attica, Greece
On-site · Full-time





Focus Areas

1. User Support and Operations
Deliver first-line support for AI on HPC workloads (e.g., LLM, computer vision, and other GPU-accelerated workloads), including ticket triage, rapid diagnosis of job failures, and escalation of hardware issues when necessary. Assist users in creating, reviewing, and debugging Slurm job scripts that launch multi-GPU/multi-node jobs with tools such as torchrun, accelerate launch, or deepspeed, and support Ray/DeepSpeed and vLLM inference workflows as applicable.

2. AI/LLM Software Stacks and Containers
Maintain and test shared AI/LLM and computer vision stacks for both HPC and Cloud environments (including PyTorch, DDP/FSDP, Hugging Face Transformers & Accelerate, PEFT/LoRA, Unsloth, DeepSpeed, bitsandbytes, TensorFlow, RAPIDS, Ray, vLLM, and related tools). Ensure compatibility with NVIDIA drivers, CUDA, and NCCL, and design, publish, and support recommended Apptainer/Singularity containers for training and inference tasks.

3. Debugging, Diagnostics, and Performance
Diagnose common AI/LLM failures (such as CUDA errors, NCCL timeouts, GPU out-of-memory conditions, and distributed hangs). Validate driver/CUDA/NCCL stacks, and profile and tune workloads with tools such as PyTorch Profiler, NVIDIA Nsight (Systems/Compute), TensorBoard, MLflow, and Weights & Biases (WandB).

4. Distributed Training, Quantization, and Inference
Guide users in scalable distributed training with PyTorch DDP/FSDP and DeepSpeed, focusing on techniques such as ZeRO, pipeline parallelism, and tensor parallelism.
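As a flavor of the day-to-day work in Focus Area 1, a multi-node torchrun launch under Slurm might be sketched as below. This is a minimal illustrative job script, not GRNET's actual configuration: the partition name, GPU counts, rendezvous port, and train.py are hypothetical placeholders to be adapted to the real cluster.

```shell
#!/bin/bash
#SBATCH --job-name=llm-train
#SBATCH --nodes=2                # number of nodes (hypothetical)
#SBATCH --ntasks-per-node=1      # one torchrun launcher per node
#SBATCH --gres=gpu:4             # GPUs per node; adjust to the node spec
#SBATCH --partition=gpu          # hypothetical partition name

# Rendezvous endpoint: the first node in the allocation.
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

# srun starts one torchrun per node; torchrun then spawns one
# worker process per GPU and coordinates the distributed job.
srun torchrun \
  --nnodes="$SLURM_NNODES" \
  --nproc-per-node=4 \
  --rdzv-backend=c10d \
  --rdzv-endpoint="${MASTER_ADDR}:29500" \
  train.py
```

Much of the first-line support described above amounts to reviewing scripts like this one with users: checking that the GPU request, the per-node process count, and the rendezvous settings are mutually consistent before a job ever reaches the queue.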

About the job

Why Join Us

At GRNET S.A., we are dedicated to enhancing Internet connectivity and providing high-quality e-Infrastructures and advanced services to the Greek Educational, Academic, and Research communities. Our mission is to bridge the digital divide while ensuring equal participation for all members in the global Society of Knowledge. Our services span sectors including Education, Research, Health, and Culture.

In 2026, GRNET will proudly host the DAEDALUS supercomputer, poised to be among the top supercomputers in Europe. This state-of-the-art facility will serve the demands of Pharos, the Greek AI factory, particularly for AI workflows. Built on HPE’s NVIDIA GH200 architecture, DAEDALUS is designed to deliver approximately 89 petaflops of sustained performance (115 petaflops peak) for traditional HPC, AI, and Big Data workloads, supported by 1 PB of high-performance NVMe and 10 PB of usable storage.

As a Distributed AI Support Engineer, your role will be pivotal in assisting researchers, startups, and industry teams in transforming this cutting-edge infrastructure into tangible AI innovations. You will collaborate with leading European universities, supercomputing centers, and industrial partners within the broader EuroHPC ecosystem. While familiarity with all the technologies mentioned is not necessary, a strong foundation in AI and Python programming, along with a willingness to learn, is essential.

About GRNET S.A.

GRNET S.A. is committed to providing innovative Internet connectivity and advanced services tailored for the Greek Educational, Academic, and Research communities. Our efforts are directed towards minimizing the digital divide and facilitating equal access to the global knowledge society.
