In data mining, a recommender system is an active information filtering system that aims to present the information items that will likely interest the user. For example, Google uses this to show you relevant advertisements, Netflix to recommend you movies that you might like, and Amazon to recommend you relevant products.
The steps to create a recommender system are:
Organize this information
Use this information for the purpose of making a recommendation, as accurate as possible.
The challenge here is to get a dataset and to use it in order to be as accurate as possible in the recommendation process. As example, let’s create a music recommender system.
We will use a database based on Million Song Dataset. The database has two tables : train(userID, songID, plays) for train triplets, and song(songID, title, release, artistName, year) for songs metadata.
There are many frameworks and libraries for data mining, for this example :
Scikit-learn: machine learning library for the Python programming language. (free software)
Graphlab: is a graph-based, high performance, distributed computation framework. (free for academic use)
First, we consider the songs plays as “ratings” for the songs by users, in other words, more a user listens to the same song more he likes it (and higher he evaluates it). We can do some analysis of the data to get as much information as possible to improve our prediction system afterward.
We note here about 58% of the songs were listened to only once.
Basic Recommender System
The most basic method is simply to recommend the most listened songs ! this method may seem obvious and too easy, but it actually works in many cases and to solve many problems like the cold start.
The principle is to base the similarity on the songs listened to by the user. Therefore two songs are considered similar if they were already listened to by the same user (which means the plays/ratings are ignored). There is many algorithms for item similarity like cosine, pearson and jaccard.
In the following code, we will use the Graphlab’s item similarity recommender (which uses jaccard by default) to calculate similarity, train the model, and calculate the RMSE.
We get an overall RMSE of 6.776336098094174
In this category of recommendation algorithm, we have the choice between favoring “ranking performance” (predecting the order of the songs that a user will like) or “rating performance” (predicting the exact number of songs plays by a user).
We note from the analysis we made at the beginning of the database that we can exclude listening to songs listened only once, and we assume that a single play of a song is not enough to take it into consideration:
Clearly, we get even better results with this optimization, the overall RMSE is 9.804972175313402
Let’s use an example user to see songs the model recommends for him:
TA DA ! we get songs recommendations as expected :
If we recommend songs for all the users, we will get the follow plot showing the distribution of calculated scores by rank :
Recommendations can be generated by a wide range of algorithms. While user-based or item-based collaborative filtering methods are simple and intuitive, matrix factorization techniques are usually more effective because they allow us to discover the latent features underlying the interactions between users and items.