[Edit: WordPress doesn’t import images very nicely, I’ll update with the original images as soon as possible]
I had the pleasure of interning with the data science team at RetailMeNot this past summer and at the end of the summer I wrote up a blog post that explains some of the high level concepts that surround my project.
Recommender systems are an integral part of the user experience of major online services. The front page of Amazon and Netflix alone contain numerous recommendations of products and movies tailored to the interest of the user. Of course, curating these types of suggestions at scale requires leveraging user’s history to gain some insight about what might be of interest to them. There are several different approaches to suggesting content to users and given the immense business value of making proper recommendations, it’s worth understanding a variety of techniques employed by major applications like RetailMeNot.
Perhaps the most well-known use of the collaborative filtering method was in the competition for the Netflix Prize, a million dollar challenge that provided contestants nearly 100 million ratings that users gave to Netflix movies in an attempt to find the algorithm that could best predict the ratings in a test set of nearly 3 million ratings. With data that was limited to just triples formed by the user ID, movie ID, and the rating, it’s not possible to explicitly utilize any characteristics about the user like their age, income, or gender to infer what a good recommendation might be. With limited information, collaborative filtering leverages the assumption that similar users should have similar preferences.
Measuring similarity between users requires some method of representing the tastes of a user that can directly be compared.
Representing a user as a vector of their ratings for every movie allows the recommender system to leverage some useful matrix properties for collaborative filtering. First, it gives the recommender some intuitive functions to measure user to user similarity. The following are all metrics for the similarity between two vectors:
Pearson Correlation Coefficient
Jaccard Similarity (for bit vectors/binary attributes)
Of these methods, cosine similarity is the metric that performs the best across cross validation tests on several data sets. With this notion of similarity, the concept of finding two similar users from a pool of users can be extended to a more systematic approach using matrix operations. Starting with a user-item matrix where the entries in the matrix are ratings or numerical expressions of the preference that the user u has for item i. After adjusting matrix A such that the columns of A all have norm 1 (meaning that each column in A is divided by the norm of that column vector), computing AAT returns a |U| x |U| user similarity matrix, where each entry represents the cosine similarity between user r and user c. From here, it is a pretty natural and trivial operation to find a user quite similar to user r through sorting that user’s row on the matrix and then finding a movie that user c has rated favorably that user r has not yet rated.
This approach is a fairly naïve approach to a recommender system, and there have been several variations of algorithms that manipulate the user-item matrix to generate recommendations. For example, the previous method leverages what is traditionally known as User-User collaborative filtering, but in a similar manner, Item-Item collaborative filtering computes an Item-Item matrix by normalizing the row vectors of the user-item matrix A to unit vectors and calculating ATA. This |I| x |I| matrix conveys the similarity between item r and item c in each entry. Using a user’s historical interactions with certain items, this matrix can easily be sorted to find similar undiscovered items that the user may rate favorably.
The use cases for certain implementations of collaborative filtering are dependent on the circumstances in which that recommender may be used and the costs and benefits that they present. For example, an advantage of item-item collaborative filtering is that given enough training data, the item-item similarity model does not have to be recomputed often. This is because new users being added to the platform shouldn’t substantially affect how conceptually similar two items are. Thus, new users can be given recommendations without having to reconstruct the user-item matrix since the model already can provide similar items to that which the user has interacted with. The same is not true of user-user collaborative filtering since a new user would need to be incorporated into the user-item matrix to find another similar user for recommendations.
Graph Based Collaborative Filtering
Part of the 2016 Summer Intern program at RetailMeNot was an exploration of graph based approaches to recommendation problems. The spectral representation of the traditional collaborative filtering algorithm is a sparse bipartite graph that treats users and items as distinct types of nodes and draws edges between these nodes based on the type of interaction the user had with a particular item. For rating based algorithms, these edges may be weighted based on the rating, but other algorithms may choose to make these edges unweighted.
An advantage of using graphs for collaborative filtering is that they can give an intuitive interpretation of what a similar user or item looks like based on the degree of separation between them and the number of paths between a pair of users/items.
For example, it’s pretty easy to see that items 3-7 are candidates for recommendations for user A because they are some of the items that users of User A’s items also use. In other words, they are two degrees of separation removed from user A. As more information such as the rating and general popularity of these relationships are taken into account, the ranking of these recommendations can be refined further.
One important consideration that has been glanced over so far is the inherent nature of the user-item matrix to have missing entries where a user has simply not interacted with a certain item. In fact, if there were no missing entries in this matrix, then there would be no need for any type of recommender system at all. But unsurprisingly, the user-item matrix is quite sparse since most users won’t have any rating for the vast majority of items that a platform offers. Deciding how to deal with these missing entries is a central parameter in many collaborative filtering algorithms and can make a substantial difference in the performance of one recommender system compared to another.
Matrix completion algorithms are methods of computing an approximation of the user-item matrix often through some factorization of the matrix that uses some assumption about the missing entries. For instance, we can either assume that the missing entries are zero or that they are the average rating for that item:
The next step is to factorize the matrix Q into two or more matrices such that the matrix multiplication of these matrices result in an approximation of Q. This approximation is then treated as the rating matrix that can be sorted to make recommendations.
One common approach for matrix completion is known as Alternating Least Squares. It approximates the m x n matrix Q by factorizing it into a m x k and a k x n matrix. k represents the number of latent factors that ALS tries to capture in the matrix.
ALS takes a gradient descent approach to matrix factorization meaning that it takes steps towards minimizing a cost function associated with the problem. In this case, the cost that we’re attempting to minimize is the difference between the approximated matrix and Q. ALS gets its name from the process of alternating the approximation and update of each of the factor matrices based on the other through the following cost functions:
Each cost function above is in regards to one user or item. The first term in each cost function explains the factorizations’ distance away from Q. The weight matrix is a diagonal matrix with 0 or 1 entries on the diagonal depending on whether there was a rating for that user-item pair (1 if there was). So Wu may represent an |I| x |I| weight matrix for User Bob with a 1 on every diagonal entry corresponding to an item that Bob rated. This is to emphasize fitting the approximation to the rating that are in the training set and not the ones that are subject to prediction. The second term on each of these matrices are regularization terms that prevent overfitting to Q so that the recommendations are more meaningful than the original sparse matrix Q.
Finally, one other way to complete the matrix without factorization is to use something akin to a K-Nearest Neighbors algorithm to fill in the rating of a user. For instance, the similarity of two users can be defined as (1-mean squared difference) between the entries that the two vectors have completed. From here, each missing rating for user u can be the average of the ratings made for those items by the k nearest neighbors of user u.
Unlike collaborative filtering, content based filtering does try to infer something about the user’s interests and the item’s properties and doesn’t rely on the tastes of other users to make recommendations. For example, associating tags with movies and using the user’s historical interactions and ratings with those movies to generate recommendations is a method of content based filtering. Of course, leveraging content based filtering requires more information than just user and item IDs; thus a couple of important questions have to be answered. First, formulating a meaningful way to describe items is necessary for this type of analysis. One idea associated with collaborative filtering is to use tags or categories that would determine what buckets certain items belong too. Then a user can be recommended other items with similar tags based on their interactions with items on the site. Of course, the difficulty here is determining what meaningful tags would be for each item that a site offers and scaling up the tagging process for a service with several offerings.
Depending on what type of data an item is associated with, there are different ways of relating tags to items. Social tagging is the process by which items are assigned crowd or user sourced tags. Tweets and YouTube videos go through this process. If this process thoroughly tags items, then using these item-tag associations is a natural way to do content-based filtering.
Otherwise, if items have a title or description associated with them, it is possible to decompose the title or the description into the more relevant words in these phrases. This process is subject to more noise and needs to be accompanied by a few extra steps. First, it helps to filter out common stop words that aren’t relevant in describing a product like “the” or “a” or any other words that might be common in these descriptions despite not adding any valuable insight into what the product is. Additionally, item descriptions can often come from various sources, contain irregular characters, and consist of useless information like serial numbers. It can be helpful to use a regex filter to remove non-alphabetical characters or non-alphanumeric characters. Finally, several words that have the same root contribute the same meaning to an item description but would be registered as different words because they are spelled differently (Examples: “women” and “woman”, “phone” and “phones”). At a high level, lemmatization is the process of reducing a word down to its root, and the nltk module in Python has a good implementation of this approach, which can help convert similar tags into identical tags.
With item to tag associations constructed, it’s not too difficult to create the item-tag matrix to represent this data. In much the same way that the item similarity matrix was constructed in the collaborative filtering method, it’s possible to construct an item similarity matrix with the item-tag matrix as well. The difference here is that instead of relating items through their common users, this algorithm relates items through their tags.
Content based filtering however, is unique from collaborative filtering methods in that there are some information retrieval and natural language processing tools that can be used to enhance the intuition of item similarity by inferring some insight about the context in which these words appear.
The first common method of contextualizing terms in documents is known as Latent Semantic Analysis. LSA leverages the property that all matrices have a Singular Value Decomposition to construct comparable vectors out of terms and the “documents” that they appear in. LSA is often used in information retrieval, in which the literature often refers to terms and documents. For the purposes of recommender systems, tags represent terms and items represent documents. The SVD is a factorization of an m x n matrix A into 3 new matrices:
- U is an m x m matrix composed of m orthogonal unit eigenvectors of the term-term matrix (AAT). Also known as the left singular vectors of A.
- Σ is an m x n matrix with k non-negative diagonal entries where k is the rank of the matrix. These are known as the singular values of A which are also the square roots of the eigenvalues of (AAT) or (ATA).
- VT is an n x n matrix composed of n orthogonal unit vectors of the document-document matrix (ATA). Also known as the right singular vectors of A.
The SVD can be truncated to form a low rank approximation of the original matrix and reduce the number of dimensions that represent each document and term. The significance of this truncation is that the reduced term and document vectors capture information about terms that commonly co-occur with other terms.
The last component involves formulating a way to represent a query. Since a query is just a collection of words, a query vector can be represented as the centroid of the word vectors that make up that query. Because documents are now mapped into the same space that terms and queries are due to the truncated SVD, it is now possible to assess the similarity between a query and a document based on the vectors associated with either. The cosine similarity metric from earlier is a good candidate here as well, and with the following matrix operation it’s trivial to find the document most relevant to the query.
A user interacts with different items on the site that all have these tags and over time these tags can be accumulated into a query. This approach works well for users that have interacted more thoroughly with the site, but there is often not enough context for this algorithm to work well with new users.
Graph Based Content Filtering
Most approaches to graph based content filtering involve representing tags and items as nodes with edges drawn between item nodes and the relevant tag nodes that describe them. There is a many to many relationship between tag nodes and item nodes. For example, there may be edges between the item “laptop” and the tags “electronics” and “portable”; likewise, “electronics” may be adjacent to the nodes “TV” and “computer” as well. Most graph based approaches to this problem differ in the ways that they use edge weights to curate recommendations.
While it’s possible to represent the LSA algorithm as a graph traversal problem, there are other interesting ways to use a graph for content based filtering as well. One interesting approach to this utilizes the famous Word2Vec algorithm alongside a graph to generate recommendations.
Word2Vec is also an algorithm that tries to learn the context in which certain keywords appear, but unlike LSA, it uses a shallow neural network to either predict the words around a term or predict a term based on the words around it. This algorithm is actually one that works better for item titles and descriptions than it would for social tagging because it relies on words having context. The nature of tags dictates that they are unordered and independent of each other making them a poor candidate for this model.
An important feature of Word2Vec is its ability to map terms to an n-dimensional vector using gradient descent. Unlike LSA which transforms documents to a lower dimensional vector based on context information of term-concurrence, this model treats both the documents and queries as the centroid of the term vectors associated with the document or query.
Edges can be drawn between category nodes and items where the edge weights correspond to the cosine similarity between the item/document vector and the category/query vector. From here, it is possible to traverse the graph to find the longest path given a path length constraint and the initial category nodes that correspond to the user’s interests.
As the name suggests, hybrid filtering adopts principles from collaborative and content based filtering. This method uses the information in the user-tag and item-tag matrix to augment the user-item matrix. Similar to the pagerank algorithm, hybrid filtering uses a graph to represent the idea of a stochastic random walk with a Markov Chain.
Given these three matrices, formulated in the same way as in the two prior methods, it’s possible to compute the similarity between each pair of items or users in the following manner:
Then initialize the user-item matrix to the sparse matrix:
These similarity matrices can be used as a state transition probability matrix. For example, to simulate users talking a random walk on the item graph, we use the user-item graph and the item similarity matrix:
Conversely, it’s possible to simulate an item taking a random walk on the user graph by using the user-item matrix and the user similarity matrix:
Combining these results upon convergence yields a new dense, complete user-item matrix:
The new user-item matrix can be used for inferences about recommendations by sorting and filtering out the new user preference vectors just like in ALS. While the result of ALS and the random walk method are both a complete user-item matrix, the random walk approach is not directly trying to minimize a cost function, and it incorporates information pertaining to more than just one matrix rather than trying to factorize just the user-item matrix.
The goal of a recommender system is ideally to recommend items that a user would otherwise not interact with. A common problem that recommender systems may face is that they often recommend content that is already popular. To account for this bias, we can borrow a technique used in textual mining to determine the value of a word to a document is inverse document frequency.
Textual mining often deals with documents and terms where documents contain terms. Since certain words like “the” and “a” appear often across many documents, these terms are not as significant to the meaning of the document as other more unique terms like “basketball” or “water” that may appear less frequently between documents. For short documents, a good way of ranking these words is often through the inverse document frequency metric:
Where N is the number of documents in the set and nt is the number of documents that contain the term in question. So how does this relate to recommendation? The analogy of documents and terms can be translated to represent users and items respectively instead. Now N represents the number of users and nt represents the number of users that have interacted with an item in question. From here, one approach may be to multiply the scalar IDF score of each item with the item vector associated with it before proceeding with any recommendation algorithm such that the original ratings are adjusted. Another approach may weight the final recommendation list that the algorithm develops against the IDF scores to penalize items that are already popular and strengthen those that aren’t as well known.
Depending on how important ranking is to your use case, it may be okay to approach recommendation with a testing mindset similar to that of classification. For example, on mobile, RetailMeNot recommends 6 stores you may like in a 3 by 2 grid, and arguably, the order in which they are presented in that grid is not of principal concern, but what is presented is more important. Just like with the MPR measure, this validation technique holds out a set of user-item interactions for training the recommender, and marks the rest of the items that a user interacts with as positive test examples. For user-item pairs that have ratings like Netflix movies, there may need to be additional criteria to determine if the user had a positive interaction with this item. In this case, an assumption could be that a user had a positive experience if they interacted with a movie and gave it a rating greater than 4 stars.
The recommender system can then return a ranked list of recommendations once it is a trained for a user. In order to validate this list, the first n recommendations can be classified as positive examples by the recommendation classifier. This assigns equal importance to all of the top n rankings, but allows access to the vast suite of validation tools available to binary classifiers using the confusion matrix such as precision, recall, ROC curves, F-measure, etc…
Mean Percentile Rank is a weighted form of recall that validates ranked lists based on the following formula:
rui represents the ratings of the held out user-item ratings. rankui is a value between 0 and 1 based on the rank predicted by the recommender. The function to translate between the raw ranking to rankui is:
An MPR closer to 0 indicates a higher performing recommender system while a score of .5 indicates virtually random recommendation performance. One concern with MPR is that the metric can punish the recommender just for having higher dimensional data on the ranking list, but this effect is diminished as the test set expands.
Alternating Least Squares:
Lantent Semantic Analysis:
k-NN recommendation and others: