Tutorial - MovieLens

This tutorial provides an end-to-end walkthrough to get familiar with the Crossing Minds API, using the open MovieLens dataset.

Setup

You can start by downloading the MovieLens dataset from the GroupLens Research Project at the University of Minnesota: ml-25m.zip

If the quality of the recommendations is not important at this point, prefer the smaller dataset: ml-small.zip

This tutorial uses our Python client. Make sure the Python package is installed:

pip install xminds

Create a client and log in to your account using the POST login/root/ endpoint, then create a new database using the POST databases/ endpoint and log in to it:

from xminds.api.client import CrossingMindsApiClient

ROOT_EMAIL = 'your@email.com'
ROOT_PASSWORD = 'yourp@ssw0rd'

ITEM_ID_TYPE = 'uint32'
USER_ID_TYPE = 'uint32'

api = CrossingMindsApiClient()
api.login_root(ROOT_EMAIL, ROOT_PASSWORD)
db = api.create_database(
    name='My MovieLens DB',
    description='Crossing Minds API tutorial using MovieLens dataset',
    item_id_type=ITEM_ID_TYPE,
    user_id_type=USER_ID_TYPE)
api.login_individual(ROOT_EMAIL, ROOT_PASSWORD, db['id'])

Then we can create the item properties using the POST items-properties/ endpoint: the year, genres and tags.

api.create_item_property('year', 'int16')
api.create_item_property('genres', 'unicode18', repeated=True)
api.create_item_property('tags', 'unicode20', repeated=True)

We use unicode18 (U18 in numpy) for the genres property because every MovieLens genre is at most 18 characters long. Similarly, we use unicode20 (U20 in numpy) for the tags property because, after cleaning the tag strings, we keep only those with at most 20 characters. See properties to find all supported types.
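
As a quick check of the sizing above, here is a minimal sketch using three genre values chosen for illustration. It shows that the longest one fits in U18, and that numpy silently truncates strings longer than the declared width, which is why tags are filtered by length before being stored.

```python
import numpy

# A few MovieLens genre values, used here only for illustration
sample_genres = ['Action', 'Film-Noir', '(no genres listed)']
longest = max(len(genre) for genre in sample_genres)

# Fixed-width unicode arrays keep at most N characters per value
genres_arr = numpy.array(sample_genres, dtype='U18')
too_long = numpy.array(['x' * 25], dtype='U20')

print(longest)           # length of the longest sample genre
print(genres_arr[-1])    # stored without truncation
print(len(too_long[0]))  # longer strings are cut to the declared width
```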

Parsing MovieLens Dataset

Before we can upload all the items with their tag information, we need to load, extract and clean the data. In the MovieLens dataset, some values like the year or genres need to be parsed from strings. This requires some cleanup steps we will perform using pandas (≥1.0.0).

Because we are using the Python client, we can use the array-optimized format to represent tags and genres. This avoids building millions of Python dicts and gives a significant speed-up from vectorized operations.
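
To illustrate the idea, here is a minimal sketch on three toy items, assuming the two-field layout (item_index, value_id) used later in this tutorial. The toy data and variable names are ours; the point is only that a repeated property becomes one flat array with one row per (item, value) pair instead of one dict per item.

```python
import numpy

# Toy repeated property: one list of genres per item (illustrative data)
genres_per_item = [
    ['Adventure', 'Comedy'],
    ['Drama'],
    ['Action', 'Crime', 'Thriller'],
]
counts = [len(genres) for genres in genres_per_item]

# One row per (item, value) pair, pointing back to the item's position
flat = numpy.empty(sum(counts),
                   dtype=[('item_index', 'uint16'), ('value_id', 'U18')])
flat['item_index'] = numpy.repeat(numpy.arange(len(genres_per_item)), counts)
flat['value_id'] = [g for genres in genres_per_item for g in genres]

print(flat['item_index'].tolist())  # [0, 0, 1, 2, 2, 2]
print(flat['value_id'].tolist())
```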

Parsing the Movies File

We start by loading the items file movies.csv and extracting the year and genres data.

import numpy
import pandas

DATA_PATH = 'path/to/data/'

# Load items csv
items_df = pandas.read_csv(f'{DATA_PATH}movies.csv')
n_items = len(items_df)

# Prepare an empty numpy array with id and year (the only non-repeated properties)
items = numpy.empty(n_items, [('item_id', ITEM_ID_TYPE), ('year', 'int16')])
items['item_id'] = items_df['movieId']

# Extract the year from the title, formatted as 'title (year)'; replace missing data by 0
items_year_str = items_df['title'].str.extract(r'\((\d+)\)$', expand=False)
items['year'] = pandas.to_numeric(items_year_str).fillna(0).to_numpy()

# Extract genres, formatted as 'Genre1|Genre2|Genre3'
items_genres_df = items_df['genres'].str.split('|').explode().reset_index()
items_genres = numpy.empty(len(items_genres_df),
                          [('item_index', 'uint16'), ('value_id', 'U18')])
items_genres['item_index'] = items_genres_df['index']
items_genres['value_id'] = items_genres_df['genres']
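
The two extraction steps above can be tried on a small hand-made frame before running them on the full file. This is a sketch with made-up rows (the third title deliberately has no year); pandas.to_numeric is one possible way to turn the extracted strings into numbers.

```python
import pandas

toy_df = pandas.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Toy Story (1995)', 'Heat (1995)', 'Untitled Project'],
    'genres': ['Adventure|Comedy', 'Action|Crime|Thriller', '(no genres listed)'],
})

# Year: capture the digits inside the trailing parentheses, 0 when absent
year_str = toy_df['title'].str.extract(r'\((\d+)\)$', expand=False)
years = pandas.to_numeric(year_str).fillna(0).astype('int16').to_numpy()

# Genres: one row per (item, genre); the 'index' column is the item position
genres_df = toy_df['genres'].str.split('|').explode().reset_index()

print(years.tolist())               # [1995, 1995, 0]
print(genres_df['index'].tolist())  # [0, 0, 1, 1, 1, 2]
```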

Parsing the Tags File

We then load the tags file tags.csv, apply simple cleaning on strings, and convert the movieId into index in the items array.

# Load tags (discarding userId and timestamp columns)
tags_df = pandas.read_csv(f'{DATA_PATH}tags.csv')
tags_df = tags_df[['movieId', 'tag']].drop_duplicates()

# Do some simple cleaning on tags data to merge different spellings
# remove all non-alphanumeric characters (e.g. dash, space, parenthesis)
# (regex=True is required: recent pandas versions treat the pattern as a literal by default)
tags_df['tag'] = tags_df['tag'].str.replace(r'[^a-zA-Z0-9]', '', regex=True)
# convert to lower case
tags_df['tag'] = tags_df['tag'].str.lower()
# remove all tags with more than 20 chars (probably noise)
tags_df = tags_df[tags_df['tag'].str.len() <= 20]

# Build mapping from item-id to item-index (or -1 if the movieId does not exist)
# the largest movieId is small (~210k), so a plain array is enough and no hashmap is needed
items_id2idx = numpy.full(items['item_id'].max() + 1, -1, 'int32')
items_id2idx[items['item_id']] = numpy.arange(n_items)

# Build items_tags containing tags data in array-optimized format
# (check for unknown movieId before storing as unsigned, where -1 would wrap around)
tags_item_index = items_id2idx[tags_df['movieId']]
assert (tags_item_index != -1).all(), 'unknown movieId in tags.csv'
items_tags = numpy.empty(len(tags_df),
                         dtype=[('item_index', 'uint16'), ('value_id', 'U20')])
items_tags['item_index'] = tags_item_index
items_tags['value_id'] = tags_df['tag']
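
To see what the cleaning and the id-to-index lookup do, here is a small sketch on made-up tags and item IDs. Dropping duplicates again after cleaning is our own addition for the illustration, since different spellings collapse to the same string.

```python
import numpy
import pandas

toy_tags = pandas.DataFrame({
    'movieId': [10, 10, 30, 30],
    'tag': ['Sci-Fi', 'sci fi', 'SciFi', 'Time Travel'],
})

# Strip non-alphanumeric characters, then lower-case: spellings are merged
toy_tags['tag'] = (toy_tags['tag']
                   .str.replace(r'[^a-zA-Z0-9]', '', regex=True)
                   .str.lower())
toy_tags = toy_tags.drop_duplicates()

# Lookup table from item ID to position in the items array (-1 if unknown)
toy_item_ids = numpy.array([10, 20, 30], dtype='uint32')
id2idx = numpy.full(int(toy_item_ids.max()) + 1, -1, dtype='int32')
id2idx[toy_item_ids] = numpy.arange(len(toy_item_ids))
tag_item_index = id2idx[toy_tags['movieId'].to_numpy()]

print(toy_tags['tag'].tolist())  # ['scifi', 'scifi', 'timetravel']
print(tag_item_index.tolist())   # [0, 2, 2]
```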

Parsing the Ratings File

We then load the user/item interactions file ratings.csv.

# Load ratings
ratings_df = pandas.read_csv(f'{DATA_PATH}ratings.csv')
ratings_df = ratings_df.rename(columns={'userId': 'user_id', 'movieId': 'item_id'})

# Convert to array
ratings = ratings_df.to_records(index=False, column_dtypes={
    'user_id': USER_ID_TYPE,
    'item_id': ITEM_ID_TYPE,
    'rating': 'float32',
    'timestamp': 'float64',
})
ratings = ratings.view(ratings.dtype.fields, numpy.ndarray)

# Scale rating values from [0.5, 5] to [1, 10]
ratings['rating'] = 2 * ratings['rating']
assert ((1 <= ratings['rating']) & (ratings['rating'] <= 10)).all()
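
The same conversion can be checked on a few made-up ratings. In this sketch we fill a structured array field by field, which is an alternative to to_records used here only to keep the example short; the field names and types mirror the ones above.

```python
import numpy
import pandas

toy_ratings_df = pandas.DataFrame({
    'user_id': [1, 1, 2],
    'item_id': [10, 30, 10],
    'rating': [0.5, 3.5, 5.0],  # MovieLens half-star scale
    'timestamp': [964982703, 964981247, 964982224],
})

# Structured array with one field per column
toy_ratings = numpy.empty(len(toy_ratings_df), dtype=[
    ('user_id', 'uint32'),
    ('item_id', 'uint32'),
    ('rating', 'float32'),
    ('timestamp', 'float64'),
])
for name in toy_ratings.dtype.names:
    toy_ratings[name] = toy_ratings_df[name].to_numpy()

# Doubling maps the [0.5, 5] scale onto [1, 10]
toy_ratings['rating'] *= 2
print(toy_ratings['rating'].tolist())  # [1.0, 7.0, 10.0]
```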

Uploading the Items and the Ratings

Sending the Data

The next step is to upload all the items, with their genres and tags, and all the ratings. For this we use the array-optimized bulk endpoints PUT items-bulk/ and PUT ratings-bulk/. Depending on your internet connection and on which dataset you loaded, this may take a few minutes.

items_m2m = {
    'genres': items_genres,
    'tags': items_tags,
}
api.create_or_update_items_bulk(items, items_m2m)
api.create_or_update_ratings_bulk(ratings)
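
Since the upload can take a few minutes, it may be worth validating the arrays locally first. The helper below is hypothetical (check_bulk_payload is not part of the xminds package) and only sketches three checks that follow from the array layout used in this tutorial.

```python
import numpy

def check_bulk_payload(items, items_m2m, ratings):
    """Hypothetical helper: return a list of problems found (empty if none)."""
    problems = []
    if len(numpy.unique(items['item_id'])) != len(items):
        problems.append('duplicate item_id in items')
    for name, m2m in items_m2m.items():
        # item_index must point to an existing row of the items array
        if len(m2m) and m2m['item_index'].max() >= len(items):
            problems.append(f'{name}: item_index out of range')
    unknown = ~numpy.isin(ratings['item_id'], items['item_id'])
    if unknown.any():
        problems.append(f'{int(unknown.sum())} rating(s) with unknown item_id')
    return problems

# Toy payload with two deliberate mistakes
m2m_dtype = [('item_index', 'uint16'), ('value_id', 'U18')]
toy_items = numpy.array([(10, 1995), (20, 0)],
                        dtype=[('item_id', 'uint32'), ('year', 'int16')])
toy_m2m = {'genres': numpy.array([(0, 'Drama'), (2, 'Action')], dtype=m2m_dtype)}
toy_ratings = numpy.array(
    [(1, 10, 8.0), (1, 99, 6.0)],
    dtype=[('user_id', 'uint32'), ('item_id', 'uint32'), ('rating', 'float32')])

problems = check_bulk_payload(toy_items, toy_m2m, toy_ratings)
print(problems)
```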

Models Training and Optimization

Before testing the recommendations, we need to wait for the API to train and optimize the models. Our UI dashboard is still under construction, so for now we call the following Python method to poll the status of the database until it is ready.

Depending on the size of the dataset, you may need to wait more than 10 minutes. It is a good time to take a coffee break while the API is crunching numbers for you.

api.wait_until_ready(timeout=10*60, sleep=5)
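
For readers unfamiliar with polling, the general pattern looks like the sketch below. This is not the client's implementation, which we do not reproduce here; poll_until and its is_ready callable are hypothetical names, and the fake status sequence stands in for real status checks.

```python
import time

def poll_until(is_ready, timeout, sleep):
    """Call is_ready() every `sleep` seconds until it returns True.

    Returns True on success, False if `timeout` seconds elapsed first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(sleep)

# Fake status source: not ready twice, then ready
statuses = iter([False, False, True])
ready = poll_until(lambda: next(statuses), timeout=5, sleep=0.01)
print(ready)  # True
```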

Getting Recommendations

As soon as the models are ready, we can start requesting movie recommendations. We cover three kinds of recommendations: item-item, user-item, and session-item.

Item-Item Recommendations

The first kind of recommendations we try are item-item recommendations, that is, recommending items similar to a given item. For this we call the endpoint GET recommendation/items/<str:item_id>/items/.

item_id = 50  # The Usual Suspects (1995)
response = api.get_reco_item_to_items(item_id, amt=8)
print('\n'.join(map(str, response['items_id'])))

"""
4878
7438
2329
293
47
1089
32587
6502
"""

The API only returns item IDs, which are integers in our case. You can get details about the respective movies by navigating to the MovieLens website, such as https://movielens.org/movies/6365.

For this tutorial, since we already have the data about all movies, we can use pandas to find the respective title given the ID:

indexed_items_df = items_df.set_index('movieId')

def print_items(items_id, values=None):
    items_title = indexed_items_df.loc[items_id]['title']
    if values is None:
        print('\n'.join(
            f'{iid:10d} {name}'
            for iid, name in zip(items_id, items_title)))
    else:
        print('\n'.join(
            f'{iid:10d} {value:4.1f} {name}'
            for iid, value, name in zip(items_id, values, items_title)))

print_items(response['items_id'])
"""
  4878 Donnie Darko (2001)
  7438 Kill Bill: Vol. 2 (2004)
  2329 American History X (1998)
   293 Léon: The Professional (a.k.a. The Professional) (Léon) (1994)
    47 Seven (a.k.a. Se7en) (1995)
  1089 Reservoir Dogs (1992)
 32587 Sin City (2005)
  6502 28 Days Later (2002)
"""

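A lookup with .loc raises a KeyError when an ID is missing from the local movies file. If that can happen in your setup, a tolerant variant is sketched below using reindex; lookup_titles and the placeholder string are our own illustrative choices, and the two titles are taken from the output above.

```python
import pandas

toy_items_df = pandas.DataFrame({
    'movieId': [47, 1089],
    'title': ['Seven (a.k.a. Se7en) (1995)', 'Reservoir Dogs (1992)'],
})
titles_by_id = toy_items_df.set_index('movieId')['title']

def lookup_titles(items_id):
    """Return one title per ID, with a placeholder for unknown IDs."""
    return titles_by_id.reindex(items_id, fill_value='<unknown item>').tolist()

print(lookup_titles([1089, 999999]))
```
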
Profile-based Recommendations

The second kind of recommendations to get familiar with are user-item recommendations, that is, recommending items to a user for whom we have already uploaded ratings in the API. For this we call the endpoint GET recommendation/users/<str:user_id>/items/.

This is the ideal endpoint to call for users in your database, because the API has already computed the models given the ratings of these users. The endpoint is therefore very fast, and can be called in real time at production scale.

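# pick a random user with fewer than 30 ratings (so we can visualize their ratings)
# bincount also creates zero-count entries for unused IDs (e.g. 0), so exclude them
n_rtgs_by_user = numpy.bincount(ratings['user_id'])
user_id = numpy.random.choice(
    numpy.where((n_rtgs_by_user > 0) & (n_rtgs_by_user < 30))[0])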

# print the user ratings
user_rtgs = numpy.sort(ratings[ratings['user_id'] == user_id], order='rating')[::-1]
print_items(user_rtgs['item_id'], user_rtgs['rating'])
"""
  1831 10.0 Lost in Space (1998)
  1196 10.0 Star Wars: Episode V - The Empire Strikes Back (1980)
  1210  9.0 Star Wars: Episode VI - Return of the Jedi (1983)
 60684  8.0 Watchmen (2009)
  3439  8.0 Teenage Mutant Ninja Turtles II: The Secret of the Ooze (1991)
  1270  8.0 Back to the Future (1985)
  1261  8.0 Evil Dead II (Dead by Dawn) (1987)
   780  8.0 Independence Day (a.k.a. ID4) (1996)
   480  8.0 Jurassic Park (1993)
   260  8.0 Star Wars: Episode IV - A New Hope (1977)
 27660  7.0 Animatrix, The (2003)
  4571  7.0 Bill & Ted's Excellent Adventure (1989)
  2953  7.0 Home Alone 2: Lost in New York (1992)
  2231  7.0 Rounders (1998)
  1373  7.0 Star Trek V: The Final Frontier (1989)
    47  7.0 Seven (a.k.a. Se7en) (1995)
  3740  6.0 Big Trouble in Little China (1986)
  2004  6.0 Gremlins 2: The New Batch (1990)
  2841  5.0 Stir of Echoes (1999)
   986  4.0 Fly Away Home (1996)
  8376  1.0 Napoleon Dynamite (2004)
  1924  1.0 Plan 9 from Outer Space (1959)
"""

# get recommendations
response = api.get_reco_user_to_items(user_id, amt=8)
print_items(response['items_id'])
"""
  260 Star Wars: Episode IV - A New Hope (1977)
 2571 Matrix, The (1999)
 1196 Star Wars: Episode V - The Empire Strikes Back (1980)
 1198 Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)
 1210 Star Wars: Episode VI - Return of the Jedi (1983)
  589 Terminator 2: Judgment Day (1991)
 1270 Back to the Future (1985)
 2028 Saving Private Ryan (1998)
"""

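In the output above, some recommended movies are ones the user has already rated. If you prefer to separate those on the client side, a minimal sketch with numpy.isin follows; the IDs are copied from the two listings above, and the variable names are ours.

```python
import numpy

# Item IDs rated by the user, copied from the listing above
rated_ids = numpy.array([
    1831, 1196, 1210, 60684, 3439, 1270, 1261, 780, 480, 260, 27660,
    4571, 2953, 2231, 1373, 47, 3740, 2004, 2841, 986, 8376, 1924])
# Recommended item IDs, in the order returned
reco_ids = numpy.array([260, 2571, 1196, 1198, 1210, 589, 1270, 2028])

already_rated = numpy.isin(reco_ids, rated_ids)
seen = reco_ids[already_rated].tolist()
unseen = reco_ids[~already_rated].tolist()

print(seen)    # [260, 1196, 1210, 1270]
print(unseen)  # [2571, 1198, 589, 2028]
```
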
Session-based Recommendations

The last kind of recommendations are session-item recommendations, that is, recommending items to an unknown or anonymous user given a list of ratings sent with the request. For this we call the endpoint POST recommendation/sessions/items/. The HTTP verb is POST because the ratings are provided in the request body: GET with query parameters would be a poor way to encode dozens or hundreds of ratings. Even though the verb is POST, no state is stored in the API: this endpoint creates neither users nor ratings.

This endpoint is ideal for cold-start users, when you need to recommend items to an anonymous user before they have even signed up to your service.

session_ratings = numpy.asarray([
    (1196, 10.),
    (1210, 10.),
    ( 780,  8.),
    (1373,  7.),
    (1924,  1.),
], [('item_id', ITEM_ID_TYPE), ('rating', 'float32')])
response = api.get_reco_session_to_items(session_ratings, amt=8)
print_items(response['items_id'])
"""
1291 Indiana Jones and the Last Crusade (1989)
1527 Fifth Element, The (1997)
1610 Hunt for Red October, The (1990)
1580 Men in Black (a.k.a. MIB) (1997)
2571 Matrix, The (1999)
1214 Alien (1979)
2115 Indiana Jones and the Temple of Doom (1984)
 637 Sgt. Bilko (1996)
"""

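In practice the session ratings usually come from application data rather than a literal. The sketch below builds the same kind of array from a dict of MovieLens star ratings, assuming the [1, 10] scale used throughout this tutorial; make_session_ratings is a hypothetical helper, not part of the client.

```python
import numpy

def make_session_ratings(stars_by_item):
    """Hypothetical helper: dict {item_id: stars in [0.5, 5]} -> ratings array."""
    for stars in stars_by_item.values():
        if not 0.5 <= stars <= 5:
            raise ValueError(f'star rating out of range: {stars}')
    session = numpy.empty(len(stars_by_item),
                          dtype=[('item_id', 'uint32'), ('rating', 'float32')])
    session['item_id'] = list(stars_by_item.keys())
    # Doubling maps MovieLens stars onto the [1, 10] scale
    session['rating'] = [2 * stars for stars in stars_by_item.values()]
    return session

session = make_session_ratings({1196: 5.0, 1373: 3.5, 1924: 0.5})
print(session['item_id'].tolist())  # [1196, 1373, 1924]
print(session['rating'].tolist())   # [10.0, 7.0, 1.0]
```
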
Updating the Models with new Data

Warning: this section is a work in progress and will be updated soon.

# add a rating for user_id: give the minimal rating (1.0) to "Saving Private Ryan" (item 2028)
api.create_or_update_rating(user_id, 2028, 1.0)

# get recommendations again
# see how "Saving Private Ryan" has been replaced by "Alien"
response = api.get_reco_user_to_items(user_id, amt=8)
print_items(response['items_id'])
"""
  260 Star Wars: Episode IV - A New Hope (1977)
 2571 Matrix, The (1999)
  589 Terminator 2: Judgment Day (1991)
 1196 Star Wars: Episode V - The Empire Strikes Back (1980)
 1210 Star Wars: Episode VI - Return of the Jedi (1983)
 1214 Alien (1979)
 1198 Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)
 1270 Back to the Future (1985)
"""