About the job
About the Team You'll Join
- The ML Engineer (Platform) at Toss Securities is part of the ML Platform Team within the Product Division.
- The mission of the ML Platform Team is to create an optimal machine learning platform that enables efficient and stable development and operation of various AI/ML services at Toss Securities.
Your Responsibilities Upon Joining
Develop and enhance the Gateway system, the entry point for ML services.
- Develop and operate a Gateway system based on FastAPI that handles enterprise-level LLM API requests.
- In the FastAPI-based Gateway application, design and implement authentication, routing, traffic control, fault isolation (Circuit Breaker, Fallback), large-scale TPS processing, and load balancing strategies from both application and infrastructure perspectives.
Manage and serve ML services.
- Directly operate a machine learning model serving system in a Kubernetes environment.
- Design and improve the LLM serving architecture to operate stably under large traffic conditions.
- Monitor latency, error rates, and resource usage for models in service, and analyze and resolve operational issues.
- Identify root causes of failures and implement structural improvements, including operational policies and architecture.
Develop and manage a common ML platform for the company.
- Develop and manage a common platform for efficiently operating the training and serving of internal ML/LLM models based on Kubeflow.
- Continuously monitor and optimize the performance and resources of workloads executed on the platform.
Build infrastructure for LLM-based services.
- Operate LLM services using various serving frameworks such as vLLM, SGLang, and Triton.
- Manage the environment to ensure stable operation of training and serving workloads on high-performance GPU clusters like H100/B300.
- Build and operate a large-scale data training environment for finance domain-specific LLMs.
We Are Looking for Candidates Who:
- Are proficient in one or more programming languages such as Python, Go, Java, or Kotlin, and have experience designing and developing API servers in production environments.
- Have experience developing or operating API Gateways (Nginx, Kong, etc.) or LLM Routers (LiteLLM, Envoy AI Gateway, etc.), with a background in handling high-volume traffic and incident response.
- Have experience operating log and event pipelines for serving, integrated with Kafka, Elasticsearch, and Kibana.
- Have experience defining monitoring metrics for model serving and configuring and operating dashboards using Prometheus and Grafana.
- Have experience operating ML/LLM model serving using KServe, BentoML, vLLM, SGLang, etc.
- Have experience directly managing MLOps components (Kubeflow, KServe, Airflow, Argo CD, MLflow, etc.) in Kubernetes environments and debugging and resolving issues.
- Can go beyond short-term fixes to issues that arise during service operation, designing and applying long-term improvements based on root cause analysis.
Additional Preferred Experience:
- Experience in related fields or technologies will be a plus.