About the job
Join Databricks as we embark on a transformative journey to revolutionize the data lifecycle, from ingestion through ETL and BI to ML/AI, all within a unified platform. Our vision is to move from traditional data warehouse architectures to the Lakehouse paradigm, as detailed in the CIDR 2021 paper. This new architecture addresses critical challenges such as data staleness, reliability, total cost of ownership, data lock-in, and limited use-case support.
At Databricks, we are developing the next generation of decoupled query engines and structured storage systems designed to surpass specialized data warehouses in relational query performance. At the same time, we aim to retain the expressiveness and robustness of general-purpose systems, like Apache Spark™, so that they accommodate diverse workloads ranging from ETL to advanced data science applications. You will play an essential role in this multi-year endeavor.
As a valued member of our team, you will be tasked with designing cutting-edge systems that leapfrog current state-of-the-art technologies in the following areas:
- Query compilation and optimization
- Distributed query execution and scheduling
- Vectorized execution engine
- Data security
- Resource management
- Transaction coordination
- Efficient storage structures (encodings, indexes)
- Automatic physical data optimization

