📊 Data Layouts and Execution Models
📍 EPFL – Master in Data Science, Year 2 (2026) 🔗 Code Repository: GitHub
This project focuses on building the core components of an in-memory database engine by implementing various data layouts and query execution models. The implementation stores all data in memory and relies on a simple memory allocation strategy for fixed-size records, bypassing the need for a buffer manager.
The system architecture is divided into two main components: data storage and query execution.
1. Data Layouts
The storage engine supports three distinct relational table layouts:
- Row-oriented format (NSM): Stores data contiguously by row using the RowStorage class.
- Columnar format (DSM): Organizes data by attribute through the ColumnStore class.
- Partition Attributes Across (PAX): A hybrid approach implemented via PAXStore that groups tuples into pages while keeping a columnar layout within each page.
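To make the difference between the three layouts concrete, here is a minimal Java sketch that stores the same table of fixed-size `long` attributes in each of them. The class names (`NsmTable`, `DsmTable`, `PaxTable`), the `get(row, col)` accessor, and the `pageSize` parameter are assumptions made for illustration; they are not the project's RowStorage, ColumnStore, or PAXStore classes.

```java
// Illustrative sketch only: hypothetical classes, not the project's storage classes.

/** NSM: all attributes of one tuple sit next to each other in memory. */
final class NsmTable {
    private final long[] data;   // row-major: [r0c0, r0c1, r1c0, r1c1, ...]
    private final int numCols;

    NsmTable(long[][] rows, int numCols) {
        this.numCols = numCols;
        this.data = new long[rows.length * numCols];
        for (int r = 0; r < rows.length; r++) {
            System.arraycopy(rows[r], 0, data, r * numCols, numCols);
        }
    }

    long get(int row, int col) { return data[row * numCols + col]; }
}

/** DSM: one contiguous array per attribute. */
final class DsmTable {
    private final long[][] columns;   // columns[col][row]

    DsmTable(long[][] rows, int numCols) {
        columns = new long[numCols][rows.length];
        for (int r = 0; r < rows.length; r++) {
            for (int c = 0; c < numCols; c++) {
                columns[c][r] = rows[r][c];
            }
        }
    }

    long get(int row, int col) { return columns[col][row]; }
}

/** PAX: tuples are grouped into pages; inside a page, values are stored column by column. */
final class PaxTable {
    private final long[][][] pages;   // pages[page][col][slot]
    private final int pageSize;       // maximum number of tuples per page

    PaxTable(long[][] rows, int numCols, int pageSize) {
        this.pageSize = pageSize;
        int numPages = (rows.length + pageSize - 1) / pageSize;   // ceiling division
        pages = new long[numPages][][];
        for (int p = 0; p < numPages; p++) {
            int start = p * pageSize;
            int len = Math.min(pageSize, rows.length - start);    // last page may be partial
            pages[p] = new long[numCols][len];
            for (int s = 0; s < len; s++) {
                for (int c = 0; c < numCols; c++) {
                    pages[p][c][s] = rows[start + s][c];
                }
            }
        }
    }

    long get(int row, int col) { return pages[row / pageSize][col][row % pageSize]; }

    int numPages() { return pages.length; }
}
```

All three classes answer `get(row, col)` with the same value; only the address computation differs. In this sketch, reading a whole tuple touches one contiguous region under NSM, reading one attribute for all tuples touches one array under DSM, and PAX keeps a tuple inside a single page while still laying out each attribute contiguously within that page.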
2. Execution Models
The query engine implements seven core relational operators: Scan, Select, Project, Sort, Limit, Aggregate, and Hash Inner Join. These operators are executed under three different processing models:
- Tuple-at-a-time (Volcano Model): Processes a single tuple per next() call. This engine operates over the RowAccessibleStore interface, making it compatible with both row and PAX data layouts.
- Column-at-a-time: Executes operators on entire columns simultaneously, returning the full result as a sequence of columns. This engine uses the ColumnAccessibleStore interface.
- Vector-at-a-time: Processes data in chunks of PAX pages (a vector of tuples) per next() call. This model uses the PAXStore as its native data source.
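The three models differ mainly in how much data moves per call. The Java sketch below implements a Scan plus a Select (filter) under each model to show that contrast. Every name in it (`TupleOp`, `VectorOp`, `ColumnSelect`, and so on) is hypothetical, tuples are simplified to arrays of `long`, and end-of-stream is signalled with `null`; none of this is taken from the project's operator or store interfaces.

```java
import java.util.function.LongPredicate;

// Illustrative sketch only: hypothetical operators, not the project's API.

/** Tuple-at-a-time: next() yields one tuple, or null when the input is exhausted. */
interface TupleOp { long[] next(); }

final class TupleScan implements TupleOp {
    private final long[][] rows;
    private int pos = 0;
    TupleScan(long[][] rows) { this.rows = rows; }
    public long[] next() { return pos < rows.length ? rows[pos++] : null; }
}

final class TupleSelect implements TupleOp {
    private final TupleOp child;
    private final int col;
    private final LongPredicate pred;
    TupleSelect(TupleOp child, int col, LongPredicate pred) {
        this.child = child; this.col = col; this.pred = pred;
    }
    public long[] next() {
        long[] t;
        while ((t = child.next()) != null) {   // pull until a tuple qualifies
            if (pred.test(t[col])) return t;
        }
        return null;
    }
}

/** Column-at-a-time: one call consumes whole columns and materializes whole columns. */
final class ColumnSelect {
    static long[][] apply(long[][] columns, int col, LongPredicate pred) {
        int n = columns.length == 0 ? 0 : columns[0].length;
        int[] sel = new int[n];                // positions of qualifying tuples
        int k = 0;
        for (int i = 0; i < n; i++) {
            if (pred.test(columns[col][i])) sel[k++] = i;
        }
        long[][] out = new long[columns.length][k];
        for (int c = 0; c < columns.length; c++) {
            for (int j = 0; j < k; j++) out[c][j] = columns[c][sel[j]];
        }
        return out;
    }
}

/** Vector-at-a-time: next() yields one batch laid out as batch[col][slot], or null. */
interface VectorOp { long[][] next(); }

final class VectorScan implements VectorOp {
    private final long[][][] pages;            // pages[page][col][slot], PAX-like
    private int pos = 0;
    VectorScan(long[][][] pages) { this.pages = pages; }
    public long[][] next() { return pos < pages.length ? pages[pos++] : null; }
}

final class VectorSelect implements VectorOp {
    private final VectorOp child;
    private final int col;
    private final LongPredicate pred;
    VectorSelect(VectorOp child, int col, LongPredicate pred) {
        this.child = child; this.col = col; this.pred = pred;
    }
    public long[][] next() {
        long[][] in;
        while ((in = child.next()) != null) {
            long[][] out = ColumnSelect.apply(in, col, pred);    // filter one batch
            if (out.length > 0 && out[0].length > 0) return out; // skip empty batches
        }
        return null;
    }
}
```

A pipeline is built by nesting operators, for example `new TupleSelect(new TupleScan(rows), 1, v -> v >= 25)`, and drained by calling `next()` until it returns `null`. In this sketch the vectorized Select reuses the column-wise filter on one page at a time, which is one way to see vector-at-a-time as a middle ground: it amortizes the per-call overhead of the Volcano model without materializing full intermediate columns.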
🛠 Tools & Libraries:
- Scala / Java (JDK 21)
- GitLab & Git
- IntelliJ IDEA
🧠 Techniques:
- In-Memory Database Architecture
- Data Storage Layouts (NSM, DSM, PAX)
- Query Execution Pipelines
- Relational Algebra Implementation