📊 Data Layouts and Execution Models

📍 EPFL – Master in Data Science, Year 2 (2026)
🔗 Code Repository: GitHub

This project focuses on building the core components of an in-memory database engine by implementing various data layouts and query execution models. The implementation stores all data in memory and relies on a simple memory allocation strategy for fixed-size records, bypassing the need for a buffer manager.
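As a rough illustration of what a simple allocation strategy for fixed-size records can look like, the sketch below packs records back to back in one growable array and addresses them by record id. The class name FixedSizeArena, its methods, and the choice of long-typed fields are assumptions made for this example; they are not taken from the project code.

```java
import java.util.Arrays;

/** Hypothetical append-only arena for fixed-size records made of long fields. */
final class FixedSizeArena {
    private final int fieldsPerRecord;
    private long[] slots;          // all records packed back to back
    private int recordCount = 0;

    FixedSizeArena(int fieldsPerRecord, int initialCapacity) {
        if (fieldsPerRecord <= 0) {
            throw new IllegalArgumentException("fieldsPerRecord must be positive");
        }
        this.fieldsPerRecord = fieldsPerRecord;
        this.slots = new long[fieldsPerRecord * Math.max(1, initialCapacity)];
    }

    /** Appends one record and returns its record id (its position in the arena). */
    int append(long... record) {
        if (record.length != fieldsPerRecord) {
            throw new IllegalArgumentException("expected " + fieldsPerRecord + " fields");
        }
        int offset = recordCount * fieldsPerRecord;
        if (offset + fieldsPerRecord > slots.length) {
            // Grow by doubling; since every record has the same size, no free list is needed.
            slots = Arrays.copyOf(slots, slots.length * 2);
        }
        System.arraycopy(record, 0, slots, offset, fieldsPerRecord);
        return recordCount++;
    }

    /** Reads one field of a record by plain offset arithmetic. */
    long get(int recordId, int field) {
        if (recordId < 0 || recordId >= recordCount || field < 0 || field >= fieldsPerRecord) {
            throw new IndexOutOfBoundsException("record " + recordId + ", field " + field);
        }
        return slots[recordId * fieldsPerRecord + field];
    }

    int size() {
        return recordCount;
    }
}
```

Because every record occupies the same number of slots, locating a record is a single multiplication, which is what makes a buffer manager unnecessary in this simplified setting.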

The system architecture is divided into two main components: data storage and query execution.

1. Data Layouts

The storage engine supports three distinct relational table layouts:

Row-oriented format (NSM): Stores each row's attributes contiguously, implemented by the RowStorage class.
Columnar format (DSM): Stores each attribute's values contiguously, implemented by the ColumnStore class.
Partition Attributes Across (PAX): A hybrid layout, implemented by PAXStore, that groups tuples into pages and stores the data within each page column by column.
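The difference between the three layouts comes down to where the value at (row, column) lands in memory. The sketch below expresses that as index arithmetic over a flat int array. The class LayoutOffsets, its method names, and the fixed rowsPerPage parameter are assumptions made for this example and do not reflect how RowStorage, ColumnStore, or PAXStore are actually written.

```java
/** Hypothetical index arithmetic for NSM, DSM, and PAX over a flat array. */
final class LayoutOffsets {
    /** NSM: all attributes of a row are adjacent. */
    static int nsm(int row, int col, int numCols) {
        return row * numCols + col;
    }

    /** DSM: all values of an attribute are adjacent. */
    static int dsm(int row, int col, int numRows) {
        return col * numRows + row;
    }

    /** PAX: rows are grouped into pages; inside a page, values are stored per column. */
    static int pax(int row, int col, int numCols, int rowsPerPage) {
        int page = row / rowsPerPage;
        int rowInPage = row % rowsPerPage;
        return page * rowsPerPage * numCols + col * rowsPerPage + rowInPage;
    }

    static int[] flattenNsm(int[][] table) {
        int numRows = table.length, numCols = table[0].length;
        int[] flat = new int[numRows * numCols];
        for (int r = 0; r < numRows; r++)
            for (int c = 0; c < numCols; c++)
                flat[nsm(r, c, numCols)] = table[r][c];
        return flat;
    }

    static int[] flattenDsm(int[][] table) {
        int numRows = table.length, numCols = table[0].length;
        int[] flat = new int[numRows * numCols];
        for (int r = 0; r < numRows; r++)
            for (int c = 0; c < numCols; c++)
                flat[dsm(r, c, numRows)] = table[r][c];
        return flat;
    }

    /** The last page is padded with zeros when numRows is not a multiple of rowsPerPage. */
    static int[] flattenPax(int[][] table, int rowsPerPage) {
        int numRows = table.length, numCols = table[0].length;
        int pages = (numRows + rowsPerPage - 1) / rowsPerPage;
        int[] flat = new int[pages * rowsPerPage * numCols];
        for (int r = 0; r < numRows; r++)
            for (int c = 0; c < numCols; c++)
                flat[pax(r, c, numCols, rowsPerPage)] = table[r][c];
        return flat;
    }
}
```

For the four-row table {{1, 10}, {2, 20}, {3, 30}, {4, 40}} with two rows per page, NSM yields 1, 10, 2, 20, 3, 30, 4, 40; DSM yields 1, 2, 3, 4, 10, 20, 30, 40; and PAX yields 1, 2, 10, 20, 3, 4, 30, 40.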

2. Execution Models

The query engine implements seven core relational operators: Scan, Select, Project, Sort, Limit, Aggregate, and Hash Join (inner). Each operator is implemented under three processing models:

Tuple-at-a-time (Volcano Model): Processes a single tuple per next() call. This engine operates over the RowAccessibleStore interface, making it compatible with both row and PAX data layouts.
Column-at-a-time: Each operator consumes and produces entire columns in a single call, returning the full result as a sequence of columns. This engine uses the ColumnAccessibleStore interface.
Vector-at-a-time: Processes one vector of tuples, corresponding to a PAX page, per next() call. This model uses PAXStore as its native data source.
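To make the contrast between the models concrete, the sketch below implements a small Scan, Select, Limit pipeline in the tuple-at-a-time style, next to a Select written in the column-at-a-time style. Everything here (the ExecutionSketch class, the Operator interface, tuples as long arrays, null as the end-of-stream signal) is an assumption made for this example; the project's RowAccessibleStore and ColumnAccessibleStore interfaces are not reproduced.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongPredicate;
import java.util.stream.IntStream;

final class ExecutionSketch {
    /** Volcano-style operator: each call yields one tuple, or null when exhausted. */
    interface Operator {
        long[] next();
    }

    static Operator scan(long[][] rows) {
        int[] cursor = {0};
        return () -> cursor[0] < rows.length ? rows[cursor[0]++] : null;
    }

    static Operator select(Operator child, int col, LongPredicate pred) {
        return () -> {
            long[] tuple;
            // Pull from the child until a tuple passes the predicate.
            while ((tuple = child.next()) != null) {
                if (pred.test(tuple[col])) return tuple;
            }
            return null;
        };
    }

    static Operator limit(Operator child, int n) {
        int[] remaining = {n};
        return () -> {
            if (remaining[0] <= 0) return null;
            remaining[0]--;
            return child.next();
        };
    }

    /** Pulls tuples from the root operator until it is exhausted. */
    static List<long[]> drain(Operator root) {
        List<long[]> out = new ArrayList<>();
        for (long[] t = root.next(); t != null; t = root.next()) out.add(t);
        return out;
    }

    /** Column-at-a-time select: a single call filters and returns whole columns. */
    static long[][] selectColumns(long[][] columns, int col, LongPredicate pred) {
        int numRows = columns.length == 0 ? 0 : columns[0].length;
        int[] keep = IntStream.range(0, numRows)
                .filter(i -> pred.test(columns[col][i]))
                .toArray();
        long[][] out = new long[columns.length][keep.length];
        for (int c = 0; c < columns.length; c++)
            for (int k = 0; k < keep.length; k++)
                out[c][k] = columns[c][keep[k]];
        return out;
    }
}
```

In the tuple-at-a-time pipeline, Limit stops pulling from its child as soon as it has produced enough tuples, so Scan never reads the remaining rows. The column-at-a-time Select always touches the full column. A vector-at-a-time operator sits between the two: its next() would return a batch of tuples rather than a single one.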

🛠 Tools & Libraries:

Scala / Java (JDK 21)
GitLab & Git
IntelliJ IDEA

🧠 Techniques:

In-Memory Database Architecture
Data Storage Layouts (NSM, DSM, PAX)
Query Execution Pipelines
Relational Algebra Implementation