About the job
About Sieve
Sieve is a pioneering AI research lab dedicated solely to video data. We combine exabyte-scale video infrastructure with state-of-the-art video understanding techniques and a wide range of data sources to create unparalleled datasets that redefine video modeling. Video accounts for 80% of global internet traffic and has become the vital digital medium fueling creativity, communication, gaming, AR/VR, and robotics. At Sieve, we aim to eliminate the most significant bottleneck hindering the expansion of these applications: access to high-quality training data.
Through strategic partnerships with leading AI labs, our team of just 12 generated $XXM last quarter alone. Earlier this year, we secured Series A funding from elite firms including Matrix Partners, Swift Ventures, Y Combinator, and AI Grant.
About the Role
As we process petabytes of video across numerous nodes and cloud environments, ensuring reliability, observability, and security is essential to our growth.
We are seeking our inaugural Reliability Engineer, who will focus entirely on fortifying the infrastructure that underpins Sieve. This role demands high ownership and a deep understanding of:
System throughput and stability
Monitoring and incident management
Security principles, including least-privilege design
Minimizing operational burdens for the entire engineering team
You will collaborate closely with our CTO and founding engineers to develop the foundational tools that empower our engineering efforts.
This position is ideal for an engineer who is passionate about reliability, throughput, observability, and security. You are proactive in anticipating potential failure modes, reducing operational risks, and designing resilient systems.
When a system fails, you take it personally, and you thrive under the weight of that responsibility.
What You'll Be Doing
Collaborate with engineering to design and validate infrastructure supporting PB-scale workloads
Develop and manage Terraform-based multi-cloud deployments
Enhance cloud and data security (SSO, IAM, least privilege access, auditability)
Lead incident response efforts and strengthen systems against failures
Create CI/CD systems to minimize user errors and maximize safety
Establish monitoring and alerting frameworks (Prometheus, OpenTelemetry, VictoriaMetrics)

