About the job
LILA Sciences is seeking a Staff DevOps Engineer to lead the design, deployment, and ongoing improvement of infrastructure and delivery systems. This role blends platform engineering, site reliability, and DevOps practices to build scalable, automated systems that support reliable software delivery in cloud and Kubernetes environments. Collaboration is central: you will work closely with software developers, laboratory scientists, and machine learning engineers to build the backbone for automated scientific analysis and experiment orchestration.
Key Responsibilities
- Design and implement Kubernetes-based infrastructure to support scientific services, machine learning pipelines, and platform workloads. Harden clusters for production with RBAC, network policies, and compliance with Pod Security Standards.
- Set up and maintain CI/CD pipelines using GitHub Actions or GitLab CI. Apply best practices including build attestations, SBOM generation, dependency scanning, and container image hardening.
- Apply Infrastructure-as-Code principles with Terraform and Helm. Implement policy-as-code guardrails (using OPA, Kyverno, or Checkov) and enable drift detection.
- Manage AWS cloud infrastructure, including EKS clusters, IAM with least privilege, VPC/PrivateLink networking, KMS/Secrets Manager, ECR, S3, and centralized logging and monitoring systems.
- Develop platform tools to improve deployment, observability, and developer workflows, supporting self-service with secure defaults.
- Drive reliability engineering practices, focusing on SLOs/SLIs, incident response, capacity planning, and performance optimization across the technology stack.
- Implement software supply chain security, including artifact signing, registry governance, and vulnerability management.
- Build and maintain QA and testing infrastructure: static analysis, code quality gates in CI pipelines, automated end-to-end and browser-based regression testing, ephemeral environments for pull request validation, and pre-merge quality checks.
- Automate operational processes and develop tooling in Python or Go to streamline infrastructure operations and integrate telemetry with observability platforms.
Requirements
- Significant experience in DevOps, Site Reliability Engineering, or Platform Engineering within large-scale cloud environments.
- Proficiency with cloud deployments (AWS, GCP, etc.) using Infrastructure-as-Code tools such as Terraform and Helm, and expertise in containerization.
- Strong background in CI/CD systems (GitHub Actions, GitLab CI, or Jenkins) and familiarity with GitOps workflows.
- Advanced scripting skills in Python or similar languages for automation and tool creation.
- Comprehensive understanding of Kubernetes operations, including deployments, networking, storage, observability, and troubleshooting.
This position is based in Cambridge, MA, USA.

