About the job
Bold is seeking an experienced Engineer/Senior Engineer to take charge of our production Machine Learning (ML) environments and MLOps infrastructure, working across our infrastructure and data science teams. You will manage daily operations, respond to ad-hoc requests, and collaborate on cross-team projects. Your expertise will help us architect robust, scalable MLOps pipelines using AWS services such as SageMaker, EMR, and OpenSearch.
ABOUT THIS TEAM
The Infrastructure team at Bold provides essential services including automation, observability, cloud/server/network architectures, CI/CD, infrastructure as code, database administration, incident management, vendor management, security and compliance, and continuous skill development. These services enhance efficiency, minimize errors, and ensure rapid and reliable application releases while upholding security and compliance. Our TechOps team empowers various teams to monitor applications and infrastructure, build resilient systems, resolve IT service issues, manage vendor relationships, and ensure robust cloud security. Additionally, we prioritize continuous learning and the adoption of cutting-edge technologies to deliver outstanding value to the organization.
WHAT YOU’LL DO
- Design and implement microservices within advanced DevOps pipelines, collaborating with data science and ML teams to optimize model serving on Kubernetes platforms that use a service mesh.
- Create comprehensive observability frameworks to reduce Mean Time to Recovery (MTTR) and ensure 24/7 platform reliability.
- Develop secure, automated CI/CD workflows featuring governance and compliance controls.
- Lead 24/7 on-call engineering efforts as well as CDN and security hardening, with a focus on cost optimization for cloud IaaS and PaaS.
- Oversee infrastructure operations, managing production deployments and hybrid cloud environments.
- Foster cross-functional collaboration with development, QA, operations, and data teams across global time zones to scale OpenSearch/EMR platforms and enhance platform operations.

