Toss Securities

Machine Learning Engineer - Infrastructure

On-site Full-time

Qualifications

Experience building and operating Kubernetes-based ML infrastructure for large-scale traffic is required. A sense of responsibility for the stable operation of live services, experience analyzing and debugging the root causes of issues, and a strong understanding of system resources are essential. Experience strengthening systems by solving problems that arise during service operations is highly valued.

About the job

Join Our Innovative Team

  • The Machine Learning Engineer (Infra) will be part of the ML Platform Team within the Product Division at Toss Securities.
  • The primary goal of the ML Platform Team is to create an optimal machine learning platform that enables the efficient and stable development and operation of various AI/ML services at Toss Securities.
  • The ML Engineer (Infra) will focus on maximizing the efficiency of large-scale AI infrastructure, controlling resource usage at a fine-grained level, and pushing infrastructure performance to its limits.

Your Responsibilities

  • Design and reliably operate high-performance AI computing environments.
    • Design and operate state-of-the-art GPU clusters (H100 and B300 series) connected via InfiniBand and high-performance storage (400 Gbps) within a Kubernetes environment.
    • Go beyond building infrastructure: optimize networks and storage to extract the hardware's full performance.
  • Develop a comprehensive control system for the entire AI infrastructure.
    • Create an observability system that integrates and monitors AI resources distributed across internal infrastructure and external cloud environments.
    • Implement management features to prevent resource monopolization by specific services and allocate resources precisely based on importance.
  • Create automation tools for the most efficient resource usage.
    • Analyze actual usage patterns and develop tools that recommend 'just-right' resource allocations to avoid waste.
    • Implement features that automatically scale up or down based on real-time model performance or error rates, and reallocate GPUs where necessary.
  • Establish an environment for identifying and resolving model performance bottlenecks.
    • Build profiling environments to accurately pinpoint slowdowns during model training or serving.
    • Support the analysis and resolution of performance degradation that arises between hardware and software.

Who We Are Looking For

  • You have experience building and operating Kubernetes-based ML infrastructure that handles large-scale traffic.
  • You take responsibility for reliably operating live services, not just developing them.
  • You have experience persistently analyzing and debugging issues until their root causes are resolved.
  • You possess a solid understanding of system resources (GPU/CPU/Memory/Network/Storage) and have experience building monitoring systems for them.
  • You value the process of solving the problems that arise during service operations and strengthening the system as a result.

Preferred Qualifications

  • Experience in unified monitoring of resource usage in large-scale clusters.
  • Experience building systems that systematically control resources through quotas and rate limits.
  • Experience with open-source platforms such as Kubeflow or Kubernetes, including modifying them in depth where needed.
  • Experience analyzing and optimizing bottlenecks at the kernel level using tools such as Nsight Systems/Compute or PyTorch Profiler.
  • Experience designing work that reduces cost or improves performance based on workload characteristics (rightsizing, cost optimization).
  • Experience leveraging GPU sharing and virtualization technologies such as MIG and MPS to maximize resource utilization.

About Toss Securities

Toss Securities is a leading company in the financial technology sector, dedicated to leveraging artificial intelligence and machine learning to enhance our services. We are committed to innovation and excellence in providing top-tier financial solutions.
