milanghimire.info.np

Credit Risk Scoring System using XGBoost

A detailed engineering project showcasing system design, data processing, and production-ready implementation.

System Architecture

Overview

This project builds a content-based movie recommendation system using the TMDB dataset. It recommends movies based on similarity in metadata such as genres, overview, tagline, and production details.

Instead of using collaborative filtering or deep learning, the system relies on classical NLP techniques and vector space modeling.


Problem Statement

Streaming platforms need recommendation systems to improve content discovery. However, user-based filtering requires large behavioral datasets.

This project solves the cold-start problem by using content-based similarity, where recommendations are generated purely from movie metadata.


Dataset Processing

The dataset (movies_metadata.csv) was cleaned and filtered to retain relevant features.

Selected Columns

title, genres, overview, tagline, vote_average, popularity, id, original_language, production_companies, release_date


Data Cleaning

Steps Performed

  • Removed duplicate records
  • Dropped rows with missing titles
  • Filled missing values in:
    • overview → empty string
    • tagline → empty string
    • original_language → empty string
    • release_date → empty string
df = df.drop_duplicates().reset_index(drop=True)
df = df.dropna(subset=['title'])

df['overview'] = df['overview'].fillna('')
df['tagline'] = df['tagline'].fillna('')

Feature Engineering

Genre & Production Parsing

The dataset stores structured JSON-like strings, converted using ast.literal_eval.

import ast

df['genres'] = df['genres'].apply(
    lambda x: " ".join([i['name'] for i in ast.literal_eval(x)])
)

df['production_companies'] = df['production_companies'].apply(
    lambda x: " ".join([i['name'] for i in ast.literal_eval(x)])
)

Feature Combination

All important text features were merged into a single field:

df['combined_features'] = df['genres'] + ' ' + df['overview'] + ' ' + df['tagline']
This ensures that all semantic signals are captured in one representation.

NLP Preprocessing

A custom text preprocessing pipeline was applied:

Steps:

  1. Lowercasing
  2. Removing punctuation (regex)
  3. Stopword removal
  4. Stemming using Porter Stemmer
def preprocess_text(text):
    text = str(text).lower()
    text = re.sub(r'[^\w\s]', ' ', text)
    words = text.split()

    words = [word for word in words if word not in stop_words]
    words = [ps.stem(word) for word in words]

    return ' '.join(words)

Vectorization (TF-IDF)

Text data is converted into numerical vectors using TF-IDF.

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(
    max_features=50000,
    ngram_range=(1,2),
    stop_words='english'
)

tfidf_matrix = tfidf.fit_transform(df['combined_features'])

Why TF-IDF?

Captures importance of words Reduces influence of common words Works well for similarity-based systems Similarity Calculation

Cosine similarity is used to measure distance between movie vectors.

  • Each movie is represented as a vector
  • Similarity is computed by angle between vectors
  • Recommendation Engine
  • Index Mapping
indices = pd.Series(df.index, index=df['title']).drop_duplicates()

Recommendation Function

from sklearn.metrics.pairwise import cosine_similarity

def recommend(movie, n=10):
    movie = movie.title()

    if movie not in indices:
        return ["Movie not found"]

    idx = indices[movie]

    sim_score = cosine_similarity(tfidf_matrix[idx], tfidf_matrix).flatten()

    similar_idx = sim_score.argsort()[::-1][1:n+1]

    return df['title'].iloc[similar_idx]

Model Persistence (Deployment Ready)

The trained components are saved using pickle:

import pickle

pickle.dump(tfidf_matrix, open('tfidf_matrix.pkl', 'wb'))
pickle.dump(indices, open('indices.pkl', 'wb'))
pickle.dump(tfidf, open('tfidf.pkl', 'wb'))

df.to_pickle('df.pkl')

Saved Artifacts

  • TF-IDF Vectorizer (vocabulary)
  • TF-IDF Matrix (learned representation)
  • Index mapping (title → row index)
  • Dataset (metadata storage) Example Output

Input:

Toy Story

Output:

  • Similar Movie 1
  • Similar Movie 2
  • Similar Movie 3
  • ....

Key Insight

This system demonstrates that strong recommendations can be built without deep learning or user behavior data purely using statistical NLP and vector similarity.

Deployment

The recommendation logic was exposed using a FastAPI backend, which serves movie suggestions through REST APIs. The frontend was built using Next.js and fetches real-time recommendations from the backend. The backend is deployed on Render, while the frontend is hosted on Vercel for production use. The final system is fully functional, allowing users to search any movie and receive instant recommendations based on content similarity.

Live Demo

https://movie-recomendation-using-cosine-si.vercel.app/

You can search any movie and instantly get recommendations powered by TF-IDF + cosine similarity. The free tire is used so you may feel the performance issue.