— Deep Learning, Keras, Recommender Systems, Python — 2 min read
TL;DR Learn how to prepare the Amazon Movies and TV reviews dataset for a Recommender System: build a user-item dataset, split it into train and test sets, sample negative examples, and get ready to train a Neural Collaborative Filtering model.
The data we’re going to use is a collection of Amazon reviews. The dataset is hosted on the Stanford Network Analysis Project. We’re going to focus on the Movies and TV reviews subset.
We need to convert the data into a user-item matrix. Let’s start by loading the data:
import pandas as pd

df = pd.read_json('reviews_Movies_and_TV_5.json.gz', lines=True)
Pandas is nice enough to let us create a DataFrame without actually extracting the archive. We also instruct it to read each line as an individual record by setting lines=True.
We’re interested in a subset of the available columns:
df = df[['reviewerID', 'asin', 'unixReviewTime', 'overall']]
The overall column is simply the rating (between 1 and 5). What about the unixReviewTime?
We’ll use it to rank the items each user reviewed, preserving the temporal order. But first, let’s change those column names:
df.columns = ['user', 'item', 'timestamp', 'rating']
Let’s rank the reviews for each user and add a new column with the rank itself:
df['rank'] = df.groupby("user")["timestamp"].rank(ascending=True, method='dense')
We no longer need the timestamp column:
df.drop("timestamp", axis=1, inplace=True)
Next, we’ll assign integer ids to each item and user:
user_mappings = {k: v for v, k in enumerate(df.user.unique())}
item_mappings = {k: v for v, k in enumerate(df.item.unique())}
df['user'] = df['user'].map(user_mappings)
df['item'] = df['item'].map(item_mappings)
We can now convert all values to integers. We’ll also get the user and item counts:
import numpy as np

df = df[['user', 'item', 'rank', 'rating']].astype(np.int64)
n_users = df.user.nunique()
n_items = df.item.nunique()
Finally, we’ll create train and test datasets. Let’s start by sorting by rank:
dfc = df.copy()
dfc.sort_values(['user', 'rank'], ascending=[True, True], inplace=True)
dfc.reset_index(inplace=True, drop=True)
Splitting our data is a bit tricky:
test = dfc.groupby('user').tail(1)
train = pd.merge(dfc, test, on=['user', 'item'],
                 how='outer', suffixes=('', '_y'))
train = train[train.rating_y.isnull()]
test = test[['user', 'item', 'rating']]
train = train[['user', 'item', 'rating']]
We use each user’s last rating for testing. To get the rest for training, we merge the test set back into the DataFrame and drop the rows we’ve already used for testing.
We’ll interleave the one rating we know about with 99 random items that the user hasn’t rated. We’ll rank the results using our model and evaluate how high the rated item is placed. We’re going to use two metrics: Hit Ratio and Normalized Discounted Cumulative Gain.
Hit Ratio - the number of hits in a list of n ranked items. In our case, a hit is the item the user actually rated.
Normalized Discounted Cumulative Gain (NDCG) - a measure of how high relevant (useful) results are ranked in a list. The Wikipedia entry has a good in-depth explanation and example.
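Here is a minimal sketch of how these metrics can be computed in our setup, where every ranked list contains exactly one relevant item (the hit_ratio and ndcg helpers are my own illustration, not code from the notebook):

def hit_ratio(ranked_items, true_item):
    # A hit means the item the user actually rated made it into the list
    return int(true_item in ranked_items)

def ndcg(ranked_items, true_item):
    # With a single relevant item, the ideal DCG equals 1, so NDCG reduces
    # to the discounted gain at the position of the hit
    if true_item in ranked_items:
        position = ranked_items.index(true_item)
        return 1.0 / np.log2(position + 2)
    return 0.0

Both return values between 0 and 1 - the higher the rated item is placed, the better.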
We’ll start by building lists of rated and non-rated items:
from joblib import Parallel, delayed

all_items = dfc.item.unique()
rated_items = (dfc.groupby("user")['item']
               .apply(list)
               .reset_index()
               ).item.tolist()

def sample_not_rated(item_list, n=99):
    return np.random.choice(np.setdiff1d(all_items, item_list), n)

non_rated_items = Parallel(n_jobs=4)(
    delayed(sample_not_rated)(ri) for ri in rated_items
)
Next, we’ll build the negative (non-rated) examples:
negative = pd.DataFrame({'negative': non_rated_items})
negative[['item_n' + str(i) for i in range(99)]] = \
    pd.DataFrame(negative.negative.values.tolist(), index=negative.index)
negative.drop('negative', axis=1, inplace=True)
negative = negative.stack().reset_index()
negative = negative.iloc[:, [0, 2]]
negative.columns = ['user', 'item']
negative['rating'] = 0

test_negative = (pd.concat([test, negative])
                 .sort_values('user', ascending=True)
                 .reset_index(drop=True)
                 )
And ensure that the first element for every user is the actual rated item:
test_negative.sort_values(
    ['user', 'rating'],
    ascending=[True, False],
    inplace=True
)
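As a quick sanity check (this snippet is my own addition), every user should now have 100 candidate items - the one rated item followed by 99 negatives. Since we assigned user ids with enumerate, user 0 exists:

sample = test_negative[test_negative.user == 0]
assert len(sample) == 100          # 1 rated item + 99 negatives
assert sample.iloc[0].rating > 0   # the real rating comes first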
Neural Collaborative Filtering (NCF) (introduced in this paper) is a general framework for building Recommender Systems using (Deep) Neural Networks.
One of the main contributions is the idea that one can replace the matrix factorization with a Neural Network. That way, you can learn an arbitrary function that explains the interaction between users and items.
NCF seems to provide a significant performance improvement over conventional approaches. The authors also point out that model performance seems to increase as the Neural Nets become deeper.
The authors present one realization of the NCF framework: the “NeuMF” (Neural Matrix Factorization) model. It has two parts (basically two Neural Nets combined) - Generalized Matrix Factorization (GMF) and a good old Multi-Layer Perceptron (MLP).
The GMF computes an element-wise product of the user and item embeddings. The MLP uses the same embeddings as input and, of course, can be arbitrarily deep.
A final layer takes the concatenation of the outputs of both models and produces the predictions.
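Here is a minimal Keras sketch of that architecture, following the description above. The embedding size and the dense layer widths are arbitrary choices for illustration, not the configuration from the paper:

from tensorflow.keras.layers import (Input, Embedding, Flatten,
                                     Multiply, Concatenate, Dense)
from tensorflow.keras.models import Model

n_factors = 8  # embedding size - an arbitrary choice for this sketch

user_input = Input(shape=(1,), name='user')
item_input = Input(shape=(1,), name='item')

user_embedding = Flatten()(Embedding(n_users, n_factors)(user_input))
item_embedding = Flatten()(Embedding(n_items, n_factors)(item_input))

# GMF part: element-wise product of the user and item embeddings
gmf = Multiply()([user_embedding, item_embedding])

# MLP part: the same embeddings, concatenated and passed through dense layers
mlp = Concatenate()([user_embedding, item_embedding])
mlp = Dense(32, activation='relu')(mlp)
mlp = Dense(16, activation='relu')(mlp)

# Final layer: concatenate both parts and output a prediction
prediction = Dense(1, activation='sigmoid')(Concatenate()([gmf, mlp]))

model = Model(inputs=[user_input, item_input], outputs=prediction)
model.compile(optimizer='adam', loss='binary_crossentropy')

One possible training setup, hinted at by our labeling above, is binary cross-entropy: the model learns to tell rated items (label 1) apart from the sampled negatives (label 0).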
Run the complete notebook in your browser
The complete project on GitHub