🍿 Spark Movie Recommender
📍 EPFL – Master in Data Science, Year 2 (2026) 🔗 Code Repository: GitHub
This project focuses on implementing scale-out data processing pipelines over Apache Spark to power a video streaming application’s recommendation engine. Using a dataset based on MovieLens, the system pre-computes statistics and serves personalized movie recommendations to end-users.
The architecture is divided into four core processing milestones:
- Data Loading & Ratings Aggregation: Building pipelines to load, parse, and persist large-scale CSV data into Spark RDDs. This includes computing per-movie average ratings, executing keyword-based aggregate queries, and incrementally maintaining these aggregates as new, append-only rating batches arrive.
- Rating Prediction: Implementing two distinct prediction models to estimate how a user would rate unseen movies. The first is a Baseline Predictor that incorporates user bias and normalizes rating deviations. The second utilizes Spark MLlib’s Alternating Least Squares (ALS) algorithm to learn and predict ratings via collaborative filtering.
- LSH-Based Recommendation: Designing a Locality-Sensitive Hashing (LSH) index over movie keywords to perform approximate nearest-neighbor queries efficiently. The index groups movies whose keyword-set signatures match, a proxy for high Jaccard similarity, which avoids the O(D²) cost of exhaustive pairwise comparisons.
- HNSW-Based Retrieval: Utilizing Hierarchical Navigable Small World (HNSW) search over dense numeric embeddings. The system extracts learned user and movie embeddings from the ALS model and builds an in-process index using the Hnswlib library. Queries use the target user’s embedding to retrieve candidate movies by inner-product distance; the candidates are then post-filtered and ranked.
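
The incremental maintenance mentioned in the first milestone can be illustrated with a small plain-Java sketch. It is not the project's Spark code: the `IncrementalAverages` class, the `Rating` record, and its fields are hypothetical names chosen for illustration. The idea shown is that keeping a running (sum, count) pair per movie lets each append-only batch update the averages without rescanning earlier ratings.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class IncrementalAverages {
    // Hypothetical rating record (assumption): who rated which movie, and the score.
    public record Rating(int userId, int movieId, double rating) {}

    // Per-movie running state: index 0 holds the rating sum, index 1 the rating count.
    private final Map<Integer, double[]> state = new HashMap<>();

    // Fold one append-only batch into the running aggregates.
    public void applyBatch(List<Rating> batch) {
        for (Rating r : batch) {
            double[] acc = state.computeIfAbsent(r.movieId(), k -> new double[2]);
            acc[0] += r.rating();
            acc[1] += 1;
        }
    }

    // Average is derived on read; movies without ratings report 0.0 (a convention assumed here).
    public double average(int movieId) {
        double[] acc = state.get(movieId);
        return (acc == null || acc[1] == 0) ? 0.0 : acc[0] / acc[1];
    }

    public static void main(String[] args) {
        IncrementalAverages agg = new IncrementalAverages();
        agg.applyBatch(List.of(new Rating(1, 10, 4.0), new Rating(2, 10, 2.0)));
        System.out.println("after batch 1: " + agg.average(10));
        agg.applyBatch(List.of(new Rating(3, 10, 5.0)));
        System.out.println("after batch 2: " + agg.average(10));
    }
}
```

In a Spark setting the same (sum, count) pairs would typically live in a keyed RDD and be merged with each new batch, but that mapping is an assumption about the design rather than something stated above.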
🛠 Tools & Libraries:
- Apache Spark & Spark MLlib
- Scala / Java (JDK 21)
- Hnswlib
- GitLab & IntelliJ IDEA
🧠 Techniques:
- Resilient Distributed Datasets (RDD) Processing
- Collaborative Filtering (Matrix Factorization / ALS)
- Locality-Sensitive Hashing (LSH)
- Approximate Nearest Neighbor (ANN) Vector Search (HNSW)
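
To make the LSH technique listed above concrete, here is a minimal plain-Java sketch of MinHash signatures with banding over keyword sets. The project's exact hashing scheme is not described here, so the choice of MinHash with banding is an assumption, and the `MinHashLshSketch` class, its parameters, and the `mix` helper are hypothetical. Movies whose signatures agree on at least one band land in the same bucket and become query candidates.

```java
import java.util.*;

public class MinHashLshSketch {
    private final int[] seeds;  // one seed per simulated hash function
    private final int bands;    // number of bands the signature is split into
    private final int rows;     // signature entries per band
    // Bucket key (band index plus band contents) mapped to the movie ids stored there.
    private final Map<String, Set<Integer>> buckets = new HashMap<>();

    public MinHashLshSketch(int bands, int rows, long seed) {
        this.bands = bands;
        this.rows = rows;
        Random rnd = new Random(seed);
        this.seeds = new int[bands * rows];
        for (int i = 0; i < seeds.length; i++) seeds[i] = rnd.nextInt();
    }

    // Integer bit mixer (the MurmurHash3 32-bit finalizer) used to derive seeded hashes.
    static int mix(int x) {
        x ^= x >>> 16; x *= 0x85ebca6b;
        x ^= x >>> 13; x *= 0xc2b2ae35;
        x ^= x >>> 16;
        return x;
    }

    // MinHash signature: for each seeded hash, keep the minimum value over all keywords.
    public int[] signature(Set<String> keywords) {
        int[] sig = new int[seeds.length];
        Arrays.fill(sig, Integer.MAX_VALUE);
        for (String kw : keywords) {
            int h = kw.hashCode();
            for (int i = 0; i < seeds.length; i++) {
                int v = mix(h ^ seeds[i]);
                if (v < sig[i]) sig[i] = v;
            }
        }
        return sig;
    }

    // One bucket key per band, built from that band's slice of the signature.
    private List<String> bandKeys(int[] sig) {
        List<String> keys = new ArrayList<>();
        for (int b = 0; b < bands; b++) {
            int[] slice = Arrays.copyOfRange(sig, b * rows, (b + 1) * rows);
            keys.add(b + ":" + Arrays.toString(slice));
        }
        return keys;
    }

    public void add(int movieId, Set<String> keywords) {
        for (String key : bandKeys(signature(keywords))) {
            buckets.computeIfAbsent(key, k -> new HashSet<>()).add(movieId);
        }
    }

    // Candidates are all movies sharing at least one band bucket with the query.
    public Set<Integer> query(Set<String> keywords) {
        Set<Integer> out = new TreeSet<>();
        for (String key : bandKeys(signature(keywords))) {
            out.addAll(buckets.getOrDefault(key, Set.of()));
        }
        return out;
    }

    public static void main(String[] args) {
        MinHashLshSketch index = new MinHashLshSketch(4, 2, 42L);
        index.add(1, Set.of("space", "alien", "robot"));
        index.add(2, Set.of("romance", "paris"));
        System.out.println(index.query(Set.of("space", "alien", "robot")));
    }
}
```

More rows per band make collisions stricter, while more bands make them more likely, so the two parameters trade precision against recall. Candidates returned by `query` would normally be re-scored with exact Jaccard similarity before being recommended.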