Senior HPC & GPU Infrastructure Engineer

Sciforium, San Francisco
On-site, Full-time

Experience Level

Senior

Qualifications

The ideal candidate will have a strong background in high-performance computing and GPU technologies, coupled with hands-on experience in Linux system administration. Proficiency in managing machine learning frameworks and tools such as CUDA, PyTorch, and JAX is essential. You should also have excellent problem-solving skills and the ability to work collaboratively in a fast-paced environment. A Bachelor's degree in Computer Science, Engineering, or a related field is preferred.

About the job

At Sciforium, we are at the forefront of AI infrastructure, pioneering advanced multimodal AI models and an innovative, high-efficiency serving platform. With substantial backing from AMD and a dedicated team of engineers, we are rapidly expanding our capabilities to support the next generation of frontier AI models and real-time applications.

About the Role

We are looking for a highly skilled Senior HPC & GPU Infrastructure Engineer who will be responsible for ensuring the health, reliability, and performance of our GPU compute cluster. As the primary custodian of our high-density accelerator environment, you will serve as the crucial link between hardware operations, distributed systems, and machine learning workflows. This position encompasses a range of responsibilities, from hands-on Linux systems engineering and GPU driver setup to maintaining the ML software stack (CUDA/ROCm, PyTorch, JAX, vLLM). If you are passionate about optimizing hardware performance, enjoy troubleshooting GPUs at scale, and aspire to create world-class AI infrastructure, we would love to hear from you.

Your Responsibilities

1. System Health & Reliability (SRE)

  • On-Call Response: Be the primary responder for system outages, GPU failures, node crashes, and other cluster-wide incidents, ensuring rapid issue resolution to minimize downtime.

  • Cluster Monitoring: Develop and maintain monitoring protocols for GPU health, thermal behavior, PCIe/NVLink topology issues, memory errors, and general system load.

  • Vendor Liaison: Collaborate with data center personnel, hardware vendors, and on-site technicians for repairs, RMA processing, and physical maintenance of the cluster.

2. Linux & Network Administration

  • OS Management: Oversee the installation, patching, and maintenance of Linux distributions (Ubuntu / CentOS / RHEL), ensuring consistent configuration, kernel tuning, and automation for large node fleets.

  • Security & Access Controls: Set up VPNs, iptables/firewalls, SSH hardening, and network routing to secure our computing infrastructure.

  • Identity & Storage Management: Manage LDAP/FreeIPA/AD for user identity and administer distributed file systems like NFS, GPFS, or Lustre.

3. GPU & ML Stack Engineering

  • Deployment & Bring-Up: Spearhead the deployment of new GPU nodes, including BIOS configuration and software integration to ensure optimal performance.

About Sciforium

Sciforium is a cutting-edge AI infrastructure company committed to developing next-generation multimodal AI models and a proprietary, high-efficiency serving platform. With significant investment backing and direct support from AMD, we are rapidly growing our team to build the comprehensive stack that powers advanced AI models and real-time applications.
