Senior Automation & Tools Engineer

Alembic, San Francisco HQ
On-site, Full-time

Experience Level

Senior

Qualifications

Must-Have Qualifications

  • Minimum of 5 years of experience in tools development, Site Reliability Engineering (SRE), DevOps, or platform engineering roles.
  • Proficiency with Infrastructure as Code (IaC) tools such as Ansible, Helm, or Kustomize.
  • Strong programming skills in general-purpose languages such as Python or Go.
  • Extensive experience with containerization technologies, particularly Docker and Kubernetes.
  • Solid understanding of Linux systems and networking fundamentals.
  • Experience with monitoring and observability tools (e.g., Prometheus, Grafana, Datadog, ELK, OpenTelemetry).
  • Familiarity with CI/CD tools and pipelines (e.g., GitHub Actions, ArgoCD).
  • Ability to troubleshoot complex systems and automate solutions using scripting languages.

About the job

About the Role

We are seeking a highly skilled Senior Automation and Tools Engineer to enhance our platform's scalability, reliability, observability, and operational excellence. In this pivotal role, you will collaborate with engineers and data scientists to design, automate, and sustain the infrastructure that drives our core platform, which includes data pipelines, machine learning workloads, and real-time analytics systems.

This is a hands-on, impactful position with visibility across the tech stack, providing a unique opportunity to influence the future of our infrastructure and operations.

Key Responsibilities

  • Collaborate closely with Site Reliability Engineers (SREs) and developers to automate and manage datacenter and cloud-based infrastructure, focusing on build and deployment systems and configuration management.
  • Enhance system reliability and performance through automation, observability, and proactive capacity planning.
  • Develop and configure tools for datacenter provisioning, configuration management, and observability using both standard and custom tools.
  • Elevate developer experience by providing self-service tools.
  • Implement and uphold monitoring, alerting, and incident response processes including Service Level Objectives (SLOs), runbooks, and on-call rotations.
  • Foster collaboration across engineering and data science teams to cultivate a culture of performance and reliability.
  • Maintain security, compliance, and operational readiness across our physical and cloud infrastructures.
  • Lead post-incident reviews and continuous improvement initiatives.

About Alembic

Alembic is at the forefront of innovative technology solutions, providing powerful infrastructure capabilities to enhance operational efficiency and deliver exceptional performance. We are committed to fostering a culture of collaboration and continuous improvement, empowering our teams to achieve excellence in everything they do.
