
Build a Recommender System using Keras and TensorFlow 2 in Python

Deep Learning, Keras, Recommender Systems, Python · 2 min read


TL;DR Learn how to build a Recommender System using Keras and TensorFlow 2 in Python. Prepare a user-item dataset from Amazon Movies and TV reviews, set up an evaluation method based on Hit Ratio and NDCG, and get to know Neural Collaborative Filtering.

Data

The data we’re going to use is a collection of Amazon reviews, hosted by the Stanford Network Analysis Project. We’re going to focus on the Movies and TV reviews subset.

We need to convert the data into a user-item matrix. Let’s start by loading the data:

import numpy as np
import pandas as pd

df = pd.read_json('reviews_Movies_and_TV_5.json.gz', lines=True)

Pandas is nice enough to let us create a DataFrame without actually extracting the archive. We also instruct it to read each line as an individual record by setting lines=True.

We’re interested in a subset of the available columns:

df = df[['reviewerID', 'asin', 'unixReviewTime', 'overall']]

The overall column is simply the rating (between 1 and 5). What about the unixReviewTime?

We’ll rank the items each user reviewed based on it, preserving the temporal order. But first, let’s change those column names:

df.columns = ['user', 'item', 'timestamp', 'rating']

Let’s rank the reviews for each user and add a new column with the rank itself:

df['rank'] = df.groupby("user")["timestamp"].rank(ascending=True, method='dense')
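
If you’re wondering about the method='dense' part, here’s a tiny toy example (the values are made up): tied timestamps share a rank, and the next rank isn’t skipped:

s = pd.Series([10, 20, 20, 30])

# Dense ranking: ties share a rank, ranks stay consecutive
s.rank(ascending=True, method='dense').tolist()  # [1.0, 2.0, 2.0, 3.0]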

We no longer need the timestamp column:

df.drop("timestamp", axis=1, inplace=True)

Next, we’ll assign integer ids to each item and user:

user_mappings = {k: v for v, k in enumerate(df.user.unique())}
item_mappings = {k: v for v, k in enumerate(df.item.unique())}
df['user'] = df['user'].map(user_mappings)
df['item'] = df['item'].map(item_mappings)

We can now convert all values to integers. We’ll also get the user and item counts:

df = df[['user', 'item', 'rank', 'rating']].astype(np.int64)
n_users = df.user.nunique()
n_items = df.item.nunique()

Finally, we’ll create train and test datasets. Let’s start by sorting by rank:

dfc = df.copy()
dfc.sort_values(['user', 'rank'], ascending=[True, True], inplace=True)
dfc.reset_index(inplace=True, drop=True)

Splitting our data is a bit tricky:

test = dfc.groupby('user').tail(1)
train = pd.merge(dfc, test, on=['user', 'item'],
                 how='outer', suffixes=('', '_y'))
train = train[train.rating_y.isnull()]
test = test[['user', 'item', 'rating']]
train = train[['user', 'item', 'rating']]

We use each user’s last rating for testing. To get the rest for training, we merge the test dataset back into the dataframe and drop the rows we’ve already used for testing.
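
As a quick sanity check (this snippet isn’t from the original post), we can verify that every user contributes exactly one test example and that no rows were lost:

# Each user should appear exactly once in the test set
assert (test.groupby('user').size() == 1).all()

# Train and test should add up to the full dataset
# (assuming each (user, item) pair appears only once)
assert len(train) + len(test) == len(dfc)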

Testing Method

We’ll interleave the one rating we know about with 99 random items that the user hasn’t rated. We’ll rank the results using our model and evaluate how high the rated item is placed. We’re going to use two metrics: Hit Ratio and Normalized Discounted Cumulative Gain.

Hit Ratio - the number of hits in a list of n ranked items. In our case, a hit is the item actually rated by the user.

Normalized Discounted Cumulative Gain (NDCG) - a measure of how highly relevant (useful) results are ranked in a list. The Wikipedia entry has a good in-depth explanation and example.
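
To make the two metrics concrete, here’s a minimal sketch for our setting of a single relevant item per ranked list (the function names hit_ratio and ndcg are mine, not from the original post):

import numpy as np

def hit_ratio(ranked_items, true_item):
    # 1 if the rated item made it into the top-n list, 0 otherwise
    return int(true_item in ranked_items)

def ndcg(ranked_items, true_item):
    # With a single relevant item, the ideal DCG is 1, so NDCG reduces
    # to 1 / log2(position + 2), where position is 0-based
    if true_item in ranked_items:
        position = list(ranked_items).index(true_item)
        return 1.0 / np.log2(position + 2)
    return 0.0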

We’ll start by building lists of rated and non-rated items:

from joblib import Parallel, delayed

all_items = dfc.item.unique()
rated_items = (dfc.groupby("user")['item']
               .apply(list)
               .reset_index()
               ).item.tolist()

def sample_not_rated(item_list, n=99):
    return np.random.choice(np.setdiff1d(all_items, item_list), n)

non_rated_items = Parallel(n_jobs=4)(
    delayed(sample_not_rated)(ri) for ri in rated_items
)

Next, we’ll build the negative (non-rated) examples:

negative = pd.DataFrame({'negative': non_rated_items})
negative[['item_n' + str(i) for i in range(99)]] = \
    pd.DataFrame(negative.negative.values.tolist(), index=negative.index)
negative.drop('negative', axis=1, inplace=True)
negative = negative.stack().reset_index()
negative = negative.iloc[:, [0, 2]]
negative.columns = ['user', 'item']
negative['rating'] = 0

test_negative = (pd.concat([test, negative])
                 .sort_values('user', ascending=True)
                 .reset_index(drop=True)
                 )

And ensure that the first element for each user is the item they actually rated:

test_negative.sort_values(
    ['user', 'rating'],
    ascending=[True, False],
    inplace=True
)

Neural Collaborative Filtering

Neural Collaborative Filtering (NCF) (introduced in the 2017 paper by He et al.) is a general framework for building Recommender Systems using (Deep) Neural Networks.

One of the main contributions is the idea that one can replace the matrix factorization with a Neural Network. That way, you can learn an arbitrary function that explains the interaction between users and items.

NCF seems to provide a great performance improvement over conventional approaches. Another point by the authors is that the model performance seems to increase as the Neural Nets become deeper.

The authors present one realization of the NCF framework: the “NeuMF” (Neural Matrix Factorization) model. It has two parts (basically two Neural Nets, combined) - Generalized Matrix Factorization (GMF) and a good old Multi-Layer Perceptron (MLP).

The GMF part is an element-wise product of the user and item embeddings. The MLP model uses the same embeddings as input and can, of course, be arbitrarily deep.

A final layer takes the concatenation of both models’ outputs and produces the predictions.

Model
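
The complete model code is in the notebook linked below, but a minimal NeuMF sketch in Keras, following the architecture described above, might look like this (the embedding size and layer widths are illustrative assumptions, and I give each part its own embeddings, as in the original paper):

import tensorflow as tf
from tensorflow.keras import layers

def build_neumf(n_users, n_items, n_emb=8):
    user = layers.Input(shape=(1,), name='user')
    item = layers.Input(shape=(1,), name='item')

    # GMF part: element-wise product of user and item embeddings
    gmf_user = layers.Flatten()(layers.Embedding(n_users, n_emb)(user))
    gmf_item = layers.Flatten()(layers.Embedding(n_items, n_emb)(item))
    gmf = layers.Multiply()([gmf_user, gmf_item])

    # MLP part: its own embeddings, concatenated and passed through dense layers
    mlp_user = layers.Flatten()(layers.Embedding(n_users, n_emb)(user))
    mlp_item = layers.Flatten()(layers.Embedding(n_items, n_emb)(item))
    mlp = layers.Concatenate()([mlp_user, mlp_item])
    for units in (32, 16, 8):
        mlp = layers.Dense(units, activation='relu')(mlp)

    # Final layer: concatenate both parts and predict the interaction
    out = layers.Dense(1, activation='sigmoid')(
        layers.Concatenate()([gmf, mlp]))
    return tf.keras.Model(inputs=[user, item], outputs=out)

model = build_neumf(n_users, n_items)

# Treat ratings as implicit feedback: 1 for rated, 0 for sampled negatives
model.compile(optimizer='adam', loss='binary_crossentropy')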

Conclusion

Run the complete notebook in your browser

The complete project on GitHub


