About the job
Bold is looking for an experienced DevOps Engineer to take ownership of our production machine learning environments and MLOps infrastructure, working closely with our infrastructure and data science teams. You will manage daily support tasks, respond to ad-hoc requests, and lead cross-team projects. The role involves architecting scalable MLOps pipelines using AWS services such as SageMaker, EMR, and OpenSearch.
ABOUT THIS TEAM
The Infrastructure team delivers a variety of services, including automation, observability, cloud/server/network architecture, CI/CD, infrastructure as code, database administration, incident management, vendor management, security and compliance, and continuous skill development. Our services enhance efficiency, minimize errors, and ensure rapid, reliable application releases while upholding security and compliance standards. The TechOps team assists in monitoring applications and infrastructure, developing resilient environments, troubleshooting IT service challenges, managing vendor relationships, and safeguarding cloud security and compliance. We emphasize continuous learning and the integration of new technologies to maximize our value to the organization.
WHAT YOU’LL DO
- Design and productionize microservices using advanced DevOps pipelines, collaborating with data science/ML teams to optimize model serving on Kubernetes platforms with a service mesh.
- Create comprehensive observability frameworks to reduce mean time to recovery (MTTR) and ensure 24/7 platform reliability.
- Develop secure, automated CI/CD workflows while incorporating governance and compliance controls.
- Lead around-the-clock on-call engineering efforts, with a focus on CDN and security hardening and on cost optimization in IaaS and PaaS environments.
- Oversee infrastructure operations, manage production deployments, and operate hybrid cloud environments.
- Facilitate cross-functional collaboration with development, QA, operations, and data teams across global time zones to improve the scaling and operation of the OpenSearch/EMR platforms.

