About the job
Join Our Innovative Team
- The Machine Learning Engineer (Infra) will be part of the ML Platform Team within the Product Division at Toss Securities.
- The primary goal of the ML Platform Team is to create an optimal machine learning platform that enables the efficient and stable development and operation of various AI/ML services at Toss Securities.
- The ML Engineer (Infra) will focus on maximizing the efficiency of large-scale AI infrastructure, finely controlling resource usage, and pushing infrastructure performance to its limit.
Your Responsibilities
- Design and reliably operate high-performance AI computing environments.
  - Design and operate top-of-the-line GPU clusters (H100, B300 series) connected via InfiniBand and high-performance storage (400 Gbps) in a Kubernetes environment.
  - Go beyond merely building infrastructure: optimize networks and storage to extract the hardware's full performance.
- Develop a comprehensive control system for the entire AI infrastructure.
  - Build an observability system that integrates and monitors AI resources distributed across internal infrastructure and external clouds.
  - Implement management features that prevent resource monopolization by individual services and allocate resources precisely according to importance.
- Create automation tools for maximally efficient resource usage.
  - Analyze actual usage patterns to build tools that recommend 'just-right' resource allocations and avoid waste.
  - Implement features that automatically scale up or down based on real-time model performance or error rates, and reallocate GPUs where needed.
- Establish an environment for identifying and resolving model performance bottlenecks.
  - Build profiling environments that accurately pinpoint slowdowns during model training or serving.
  - Support the analysis and remediation of performance degradation between hardware and software.
Who We Are Looking For
- You have experience building and operating Kubernetes-based ML infrastructures that handle large-scale traffic.
- You take responsibility for operating live services reliably, going beyond development alone.
- When issues arise, you persistently analyze and debug until the root cause is resolved.
- You possess a solid understanding of system resources (GPU/CPU/Memory/Network/Storage) and have experience building monitoring systems for them.
- You value the process of solving various problems that arise during service operations and strengthening the system.
Preferred Qualifications
- Experience in unified monitoring of resource usage in large-scale clusters.
- Experience building systems to systematically control resources through Quota and Rate Limits.
- Experience with open-source platforms like Kubeflow or Kubernetes, including modifying them in depth when needed.
- Experience analyzing and optimizing bottlenecks at the kernel level using tools like Nsight Systems/Compute or PyTorch Profiler.
- Experience designing workload-tailored measures to reduce cost or improve performance (rightsizing, cost optimization).
- Experience leveraging GPU virtualization technologies like MIG and MPS to maximize resource utilization.

