🍿 Spark Movie Recommender

📍 EPFL – Master in Data Science, Year 2 (2026)
🔗 Code Repository: GitHub

This project implements scale-out data processing pipelines on Apache Spark to power the recommendation engine of a video streaming application. Using a dataset based on MovieLens, the system precomputes statistics and serves personalized movie recommendations to end users.

The architecture is divided into four core processing milestones:

Data Loading & Ratings Aggregation: Building pipelines to load, parse, and persist large-scale CSV data into Spark RDDs. This includes computing per-movie average ratings, executing keyword-based aggregate queries, and incrementally maintaining these aggregates as new, append-only rating batches arrive.
Rating Prediction: Implementing two distinct prediction models to estimate how a user would rate unseen movies. The first is a Baseline Predictor that incorporates user bias and normalizes rating deviations. The second utilizes Spark MLlib’s Alternating Least Squares (ALS) algorithm to learn and predict ratings via collaborative filtering.
LSH-Based Recommendation: Designing a Locality-Sensitive Hashing (LSH) index over movie keywords to answer approximate nearest-neighbor queries efficiently. The index hashes movies whose keyword sets have high Jaccard similarity into the same buckets, so a query is compared only against the candidates in its buckets rather than paying the O(D²) cost of exhaustive pairwise comparisons.
HNSW-Based Retrieval: Utilizing Hierarchical Navigable Small World (HNSW) search over dense numeric embeddings. The system extracts the learned user and movie embeddings from the ALS model and builds an in-process index with the Hnswlib library. Each query uses the target user's embedding to retrieve candidate movies by inner-product similarity; the candidates are then post-filtered and ranked.
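
The incremental maintenance described in the first milestone can be illustrated with a small sketch. It is plain Java rather than Spark code, and it is not taken from the project repository: the class name `IncrementalAverages` and the `{movieId, rating}` row layout are assumptions made for illustration. The idea is to keep a running (sum, count) pair per movie, so that each append-only batch updates the averages without rescanning earlier ratings.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: keeps a running (sum, count) per movie across batches. */
final class IncrementalAverages {
    // movieId -> {sum of ratings, number of ratings}
    private final Map<Integer, double[]> state = new HashMap<>();

    /**
     * Folds one append-only batch into the running state.
     * Each row is {movieId, rating}; this layout is an assumption of the sketch.
     */
    void applyBatch(double[][] batch) {
        for (double[] row : batch) {
            int movieId = (int) row[0];
            double[] acc = state.computeIfAbsent(movieId, id -> new double[2]);
            acc[0] += row[1]; // running sum
            acc[1] += 1.0;    // running count
        }
    }

    /** Average rating for a movie, or 0.0 if it has no ratings yet. */
    double average(int movieId) {
        double[] acc = state.get(movieId);
        return (acc == null || acc[1] == 0.0) ? 0.0 : acc[0] / acc[1];
    }
}
```

Storing the pair rather than the average itself is what makes the update associative: two partial (sum, count) pairs can be added in any order. On a Spark pair RDD the same merge could be expressed with a per-key reduction such as `reduceByKey`.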

🛠 Tools & Libraries:

Apache Spark & Spark MLlib
Scala / Java (JDK 21)
Hnswlib
GitLab & IntelliJ IDEA

🧠 Techniques:

Resilient Distributed Datasets (RDD) Processing
Collaborative Filtering (Matrix Factorization / ALS)
Locality-Sensitive Hashing (LSH)
Approximate Nearest Neighbor (ANN) Vector Search (HNSW)
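
To make the LSH technique concrete, here is a minimal plain-Java sketch of one common way to build an LSH index for Jaccard similarity: MinHash signatures split into bands. The project description does not say which hash family or banding scheme is used, so MinHash, the class name `KeywordLsh`, and the parameters `bands`, `rowsPerBand`, and `seed` are all assumptions for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Random;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical MinHash-banding index over movie keyword sets. */
final class KeywordLsh {
    private static final long PRIME = 2147483647L; // 2^31 - 1
    private final int bands;
    private final int rowsPerBand;
    private final long[] a; // multipliers of the hash functions h_i(x) = (a_i * x + b_i) mod PRIME
    private final long[] b; // offsets of the hash functions
    private final Map<String, Set<Integer>> buckets = new HashMap<>();

    KeywordLsh(int bands, int rowsPerBand, long seed) {
        this.bands = bands;
        this.rowsPerBand = rowsPerBand;
        int n = bands * rowsPerBand;
        a = new long[n];
        b = new long[n];
        Random rnd = new Random(seed);
        for (int i = 0; i < n; i++) {
            a[i] = 1 + rnd.nextInt(Integer.MAX_VALUE - 1);
            b[i] = rnd.nextInt(Integer.MAX_VALUE);
        }
    }

    /** MinHash signature: for each hash function, the minimum value over the keywords. */
    long[] signature(Set<String> keywords) {
        long[] sig = new long[a.length];
        Arrays.fill(sig, Long.MAX_VALUE);
        for (String kw : keywords) {
            long h = kw.hashCode();
            for (int i = 0; i < sig.length; i++) {
                long v = Math.floorMod(a[i] * h + b[i], PRIME);
                if (v < sig[i]) sig[i] = v;
            }
        }
        return sig;
    }

    // Bucket key for one band: the band index plus that band's slice of the signature.
    private String bandKey(long[] sig, int band) {
        int from = band * rowsPerBand;
        return band + ":" + Arrays.toString(Arrays.copyOfRange(sig, from, from + rowsPerBand));
    }

    /** Indexes a movie under one bucket per band. */
    void add(int movieId, Set<String> keywords) {
        long[] sig = signature(keywords);
        for (int band = 0; band < bands; band++) {
            buckets.computeIfAbsent(bandKey(sig, band), k -> new HashSet<>()).add(movieId);
        }
    }

    /** Movies sharing at least one band bucket with the query keywords. */
    Set<Integer> candidates(Set<String> keywords) {
        long[] sig = signature(keywords);
        Set<Integer> out = new TreeSet<>();
        for (int band = 0; band < bands; band++) {
            Set<Integer> hit = buckets.get(bandKey(sig, band));
            if (hit != null) out.addAll(hit);
        }
        return out;
    }

    /** Exact Jaccard similarity, usable to re-rank the candidates. */
    static double jaccard(Set<String> x, Set<String> y) {
        Set<String> inter = new HashSet<>(x);
        inter.retainAll(y);
        int union = x.size() + y.size() - inter.size();
        return union == 0 ? 0.0 : (double) inter.size() / union;
    }
}
```

A query only touches the buckets its own signature maps to, so the exact `jaccard` check runs on a short candidate list instead of every movie. More bands with fewer rows each raise recall at the cost of larger candidate sets; the right trade-off depends on the data.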