Michael Rose

Software generalist, backend engineer, bit-herder, building distributed systems on the JVM for FullContact.

Mahout & Dropwizard: Collaborative Filtering & Recommenders


Recommenders are in use all throughout the web. Chances are, you've interacted with dozens of recommender systems through the course of simply finding this article.

Amazon is built on recommender systems.

Amazon Recommendations

Using techniques such as item-based recommenders and frequent-itemset generation, Amazon can not only find what's relevant to you, but also determine things related to what you're browsing and surface items frequently bought with yours (and perhaps provide a combo bonus to shake the cobwebs from your wallet).

As with any Machine Learning task, the first step is to define your problem. What are you trying to do? Who is your customer? Is it your Business Intelligence team? Is it the consumer browsing your webapp? Is it your fault detection system attempting to draw relations? Spike detection?

The problem I'm attempting to solve is a fairly common one. On a site (which will remain nameless for now) members rate threads containing media. For the sake of simplicity, let's say the threads are albums posted by members along with their reviews of said albums. The members of this site are extremely opinionated, and want good recommendations.

The problem with a 1-5 star rating system is that it really doesn't tell you anything about the content other than the perceived quality of the resource. It unfortunately loses a lot of resolution which might be important to your use case. Some members may enjoy an album for its rich harmonics and energetic, uplifting lyrics. Some may hate it because it's a social commentary on the sad state of US Foreign Policy. Such users generally fall into cohorts of like-minded users who all tend to rate things on similar scales. Nickelback's newest album could easily garner a rotten 1-star from all the Sabaton and Hammerfall fans.

The theory is to exploit this similarity between members to cluster them. Once clustered, the preference sets of all the users in a cluster can be overlaid and, with very basic set magic, used to find new content the user hasn't seen before: content that, based on his cohort, could lead to a very interesting musical adventure for him. Absolutely fantastic! How about some details? This is known as collaborative filtering.
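As a toy illustration of that set magic (plain Java collections, not Mahout API; the IDs are made up), take the items the cohort liked and subtract the items the user has already rated:

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

public class CohortDemo {
    public static void main(String[] args) {
        // Thread IDs the user's cohort rated highly (union over neighbors)
        Set<Long> cohortLiked = new TreeSet<>(Arrays.asList(101L, 102L, 103L, 104L));
        // Thread IDs the user has already rated
        Set<Long> alreadySeen = new TreeSet<>(Arrays.asList(101L, 103L));

        // Candidate recommendations: liked by the cohort, unseen by the user
        Set<Long> candidates = new TreeSet<>(cohortLiked);
        candidates.removeAll(alreadySeen);
        System.out.println(candidates); // prints [102, 104]
    }
}
```

Real collaborative filtering weights these candidates by similarity rather than treating the cohort as a flat set, but the core idea is the same.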

The first step is a neighborhood generation algorithm. A user's neighborhood is an attempt to define that user's cohort. Having a neighborhood allows you to sample from it and create a network. In Mahout, there are two implementations: ThresholdUserNeighborhood, which includes every user whose similarity to you exceeds a given threshold, and NearestNUserNeighborhood, which takes the N users most similar to you.

Similarity is a mathematical operation that takes two users (or items) and compares them in a quantitative fashion. You can read more on similarity metrics.

Pearson Similarity is a good choice for rating-based data. It's the same algorithm used in determining the correlation of data — see a scatterplot, is it trending or is it unfocused bullshit?

Scatter Plot & Correlation

Pearson correlation determines if your data generally trends. We’ll exploit this to find users similar to each other, even the ones that consistently rate items at different baselines.
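To make that concrete, here's a tiny from-scratch sketch of the Pearson computation (not the Mahout implementation). Two users who rank the same albums the same way but rate at different baselines still come out perfectly correlated:

```java
import java.util.Arrays;

public class PearsonDemo {
    // Pearson correlation over two users' ratings of the same co-rated items.
    static double pearson(double[] a, double[] b) {
        double meanA = Arrays.stream(a).average().orElse(0);
        double meanB = Arrays.stream(b).average().orElse(0);
        double num = 0, devA = 0, devB = 0;
        for (int i = 0; i < a.length; i++) {
            num  += (a[i] - meanA) * (b[i] - meanB);
            devA += (a[i] - meanA) * (a[i] - meanA);
            devB += (b[i] - meanB) * (b[i] - meanB);
        }
        return num / Math.sqrt(devA * devB);
    }

    public static void main(String[] args) {
        // A tough grader and a generous grader with identical taste:
        double[] grumpy = {1, 2, 3};
        double[] cheery = {3, 4, 5};
        System.out.println(pearson(grumpy, cheery)); // prints 1.0
    }
}
```

Because each user's mean is subtracted out, the different baselines (harsh vs. generous raters) wash away and only the trend remains.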

Cool. So we've defined our similarity metrics. We've explored neighborhoods and decided a threshold-based neighborhood might work well. What's next? Let's get some data.

Preparing your data

Preparing your data is easy. All you need is a CSV (comma-separated) or TSV (tab-separated) file. For ease of reading, I'll be using CSV, but TSV works just as well.

The format of a Mahout datafile looks a little like this:

userID,itemID[,preference[,timestamp]]

  • userID and itemID are parsed as longs
  • The preference value is a double and is optional. If omitted, it's treated as a binary preference.
  • timestamp is useful but optional
  • Additional fields and lines prefixed with # are ignored
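For example, the first few rows of a ratings file might look like this (IDs, values, and timestamps here are all made up):

```
# userID,itemID,preference,timestamp
5,120,4.0,1335745200
5,128,2.5,1335831600
6,120,5.0,1335918000
20,137,3.5,1336004400
```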

By the end of your extraction, any .csv will do. You can optionally compress your datafile with gzip or zip, and FileDataModel will have no problem reading it.

My MySQL extraction script looks like this:

extract_ratings.sh
echo "SELECT tr.userid,tr.threadid,tr.vote,UNIX_TIMESTAMP(tr.date) as date,t.forumid FROM threadrate tr LEFT JOIN thread t USING(threadid) WHERE date IS NOT NULL" | \
mysql vbulletin | \
tr '\t' ',' | \
tail -n +2 > ratings.csv

The Code

Such code can be written with a basic understanding of the algorithm and a head for details. Fortunately, all the hard work has been done for you in Apache Mahout. Mahout is a machine learning toolkit, a swiss-army knife of getting stuff done, and doing it fast. Much of Mahout's work can be farmed off to your favorite Hadoop cluster, which makes it ideal for a lot of data.

I don't have a lot of data, a little over 50,000 rows, but it's all real data, so there isn't an established right answer. Another good dataset to play with is the GroupLens dataset, as the parameters of that data have been well established.

The next step is to build a basic evaluator. You can use this to explore the effect of different similarity measures, user neighborhoods, preference inferrers, and other techniques to improve your algorithm for your use case.

ThreadRecommenderEvaluator.java
import com.google.common.collect.Lists;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.common.Weighting;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.eval.AverageAbsoluteDifferenceRecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.model.GenericDataModel;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.ThresholdUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.*;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

import java.io.File;
import java.io.IOException;
import java.util.List;

public class ThreadRecommenderEvaluator {
    public static void main(String[] args) throws IOException, TasteException {
        // Load the ratings file fully into memory
        DataModel model = new GenericDataModel(GenericDataModel.toDataMap(new FileDataModel(new File("ratings.csv"))));

        List<UserSimilarity> similarities = Lists.newArrayList(
                new LogLikelihoodSimilarity(model),
                new PearsonCorrelationSimilarity(model, Weighting.UNWEIGHTED),
                new PearsonCorrelationSimilarity(model, Weighting.WEIGHTED),
                new EuclideanDistanceSimilarity(model, Weighting.UNWEIGHTED),
                new EuclideanDistanceSimilarity(model, Weighting.WEIGHTED),
                new TanimotoCoefficientSimilarity(model),
                new SpearmanCorrelationSimilarity(model),
                new CityBlockSimilarity(model)
        );

        for (final UserSimilarity similarity : similarities) {
            RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {
                public Recommender buildRecommender(DataModel model) throws TasteException {
                    //UserNeighborhood neighborhood = new NearestNUserNeighborhood(50, similarity, model);
                    //similarity.setPreferenceInferrer(new AveragingPreferenceInferrer(model));
                    UserNeighborhood neighborhood = new ThresholdUserNeighborhood(0.8, similarity, model);
                    return new GenericUserBasedRecommender(model, neighborhood, similarity);
                }
            };

            evaluate(similarity, recommenderBuilder, model);
        }
    }

    public static void evaluate(UserSimilarity similarity, RecommenderBuilder builder, DataModel model) throws TasteException {
        RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();

        // Train on 80% of each user's preferences, evaluate using 95% of the users
        double score = evaluator.evaluate(builder, null, model, 0.8, 0.95);
        System.out.println(similarity.getClass().getName());
        System.out.println(score);
    }
}

The closer the score is to zero, the better the results are, theoretically.
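That score from AverageAbsoluteDifferenceRecommenderEvaluator is just the mean absolute difference between predicted and actual preference values on the held-out ratings. A quick sketch of the arithmetic, with made-up numbers:

```java
public class MaeDemo {
    // Mean absolute error between actual and predicted preference values
    static double mae(double[] actual, double[] predicted) {
        double sum = 0;
        for (int i = 0; i < actual.length; i++) {
            sum += Math.abs(predicted[i] - actual[i]);
        }
        return sum / actual.length;
    }

    public static void main(String[] args) {
        double[] actual    = {4.0, 3.0, 5.0};  // held-out ratings
        double[] predicted = {3.5, 3.0, 4.0};  // recommender's predictions
        System.out.println(mae(actual, predicted)); // prints 0.5
    }
}
```

So a score of 0.67 means predictions are off by about two-thirds of a star on average, while 0.017 is nearly dead-on.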

Evaluator output
20:57:34.105 [main] INFO  o.a.m.c.t.i.e.AbstractDifferenceRecommenderEvaluator - Beginning evaluation using 0.8 of GenericDataModel[users:5,6,20...]
20:57:34.112 [main] INFO  o.a.m.c.t.i.model.GenericDataModel - Processed 1500 users
20:57:34.114 [main] INFO  o.a.m.c.t.i.e.AbstractDifferenceRecommenderEvaluator - Beginning evaluation of 833 users
20:57:34.114 [main] INFO  o.a.m.c.t.i.e.AbstractDifferenceRecommenderEvaluator - Starting timing of 833 tasks in 8 threads
20:57:34.122 [pool-4-thread-1] INFO  o.a.m.c.t.impl.eval.StatsCallable - Average time per recommendation: 1ms
20:57:34.123 [pool-4-thread-1] INFO  o.a.m.c.t.impl.eval.StatsCallable - Approximate memory used: 13MB / 85MB
20:57:34.123 [pool-4-thread-1] INFO  o.a.m.c.t.impl.eval.StatsCallable - Unable to recommend in 69 cases
20:57:34.570 [main] INFO  o.a.m.c.t.i.e.AbstractDifferenceRecommenderEvaluator - Evaluation result: 0.6703931756508649
org.apache.mahout.cf.taste.impl.similarity.EuclideanDistanceSimilarity
0.6703931756508649
20:57:34.578 [main] INFO  o.a.m.c.t.i.e.AbstractDifferenceRecommenderEvaluator - Beginning evaluation using 0.8 of GenericDataModel[users:5,6,20...]
20:57:34.585 [main] INFO  o.a.m.c.t.i.model.GenericDataModel - Processed 1504 users
20:57:34.588 [main] INFO  o.a.m.c.t.i.e.AbstractDifferenceRecommenderEvaluator - Beginning evaluation of 858 users
20:57:34.588 [main] INFO  o.a.m.c.t.i.e.AbstractDifferenceRecommenderEvaluator - Starting timing of 858 tasks in 8 threads
20:57:34.598 [pool-5-thread-1] INFO  o.a.m.c.t.impl.eval.StatsCallable - Average time per recommendation: 2ms
20:57:34.599 [pool-5-thread-1] INFO  o.a.m.c.t.impl.eval.StatsCallable - Approximate memory used: 18MB / 85MB
20:57:34.599 [pool-5-thread-1] INFO  o.a.m.c.t.impl.eval.StatsCallable - Unable to recommend in 54 cases
20:57:35.061 [main] INFO  o.a.m.c.t.i.e.AbstractDifferenceRecommenderEvaluator - Evaluation result: 0.01664333884893026
org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity
0.01664333884893026

We now have a decent recommender which appears to perform well on the validation set. By not training with 100% of the data, we can avoid overfitting the model and see how well our model and data really fit the problem we're trying to solve.

For my use case, Pearson worked better than the rest. Extend GenericUserBasedRecommender and we’re off to the races!

ThreadRecommender.java
public class ThreadRecommender extends GenericUserBasedRecommender {
    public ThreadRecommender(DataModel dataModel, UserNeighborhood neighborhood, UserSimilarity similarity) {
        super(dataModel, neighborhood, similarity);
    }

    public static ThreadRecommender newRecommender() throws Exception {
        File f = new File("ratings.csv");
        FileDataModel model = new FileDataModel(f, false, 0);

        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new ThresholdUserNeighborhood(0.8, similarity, model);
        return new ThreadRecommender(model, neighborhood, similarity);
    }
}

The recommender works, but then the question becomes: how do you make it useful? In a Java application, the answer is relatively simple: just embed it. But this is vBulletin, a PHP forum package, so we need to interface the Java and PHP components. With that in mind, I set out to wrap the recommender in an HTTP interface useful cross-process and via XHR (AJAX) requests.

The first technology I tried was the mahout-integration package, which contains a servlet useful for serving out recommendations as a plain-text response, XML, or JSON. The configuration is easy as well, and you can embed Jetty in a shaded jar with almost no issues. The code is pretty weak sauce, but if it works, it works, right?

RecServer.java
public class RecServer {
    public static void main(String[] args) throws Exception{
        Server server = new Server(3000);
        Context root = new Context(server,"/", Context.SESSIONS);
        ServletHolder sh = new ServletHolder(RecommenderServlet.class);
        sh.setInitParameter("recommender-class", "com.xorlev.ml.ThreadRecommender");
        root.addServlet(sh, "/recommender");
        server.start();
    }
}

Wrong. While the Jetty+Servlet approach is easy, I have the requirement of updating tastes on my recommender as well as periodically refreshing the data model without a full stop/start cycle of the application (during which time it'd be unavailable — unacceptable, this is life or death :) ). I set out to enhance the default servlet to support this behavior, only to be handily thwarted by final classes. My initial thought was to delegate to the recommender (hah).

RecommenderServlet.java
    @Override
    public void doGet(HttpServletRequest request,
                      HttpServletResponse response) throws ServletException {

        if (request.getParameter("refresh") != null) {
            recommender.refresh(null);
            recommender.getDataModel().refresh(null);
            response.setContentType("text/plain");
            response.setCharacterEncoding("UTF-8");

            try {
                response.getWriter().write("OK");
            } catch (IOException ioe) {
                throw new ServletException(ioe);
            }
        } else {
            recommendStories(request, response);
        }

    }

Okay, I understand the Open/Closed Principle, but in practice it's a headache and a half if the class isn't written to support common operations. Normally I'd delegate to the servlet, but the recommender itself is marked private, and I didn't really want to deal with using the RecommenderSingleton to get the recommender again for an operation that should already be supported. In the end I grabbed the source code and modified it to my own needs, flying brazenly in the face of DRY code and OCP.

This worked in production, but I really didn't like it. As my requirements shifted, I felt worse and worse about the code and the momentary downtimes (I needed to support faster reloading), so I decided to rewrite the service in Dropwizard. Dropwizard is a lightweight, high-productivity collection of some of the best battle-tested libraries Java has to offer, wrapped up with a great environment manager and configuration manager. Mix in Coda Hale's best-of-breed Metrics library and you're all set. Like many of the most brilliant ideas, in hindsight Coda's mashup seems as obvious as it is powerful.
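A Dropwizard service is driven by a YAML configuration file. A minimal, hypothetical config for this service might do little more than pin the ports (the exact keys depend on the Dropwizard version; the 0.6.x-era `http` block is shown here):

```yaml
# Hypothetical config.yml for the recommender service
http:
  port: 3000
  adminPort: 3001
```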

ThreadRecommendationResource.java
@Path("/recommend/thread")
@Produces(MediaType.APPLICATION_JSON)
public class ThreadRecommendationResource {
    private Recommender recommender;

    public ThreadRecommendationResource() throws Exception {
        recommender = ThreadRecommender.newRecommender();
    }

    @GET
    @Timed
    public List<RecommendedItem> getRecommendations(@QueryParam("userId") Optional<Integer> userId) throws TasteException {
        if (userId.isPresent()) {
            return recommender.recommend(userId.get(), 10);
        }

        return Collections.emptyList();
    }
}

Writing a basic resource to serve out RecommendedItems is incredibly easy. RecommendedItem, already being a basic POJO, is effortlessly serialized by Jackson to JSON. XML is just as easy to produce.
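Assuming Jackson's default bean serialization of RecommendedItem (its getItemID() and getValue() getters), a response for a user looks something like this (IDs and scores are made up):

```json
[
  {"itemID": 4521, "value": 4.73},
  {"itemID": 1839, "value": 4.41}
]
```

Those itemID fields are exactly what the PHP side picks out below.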

By the end of it, I felt a little disappointed that it wasn’t more arcane, but pleased at how easy everything was.

vBulletin Integration
if (isset($vbulletin->userinfo['userid']) && is_numeric($vbulletin->userinfo['userid'])) {
    $ch = curl_init();

    curl_setopt($ch, CURLOPT_URL, "http://127.0.0.1:3000/recommend/thread?userId=" . $vbulletin->userinfo['userid']);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

    $recs = json_decode(curl_exec($ch));

    function extract_ids($item) {
        return $item->itemID;
    }


    if (is_array($recs)) {
        $ids = array_map("extract_ids", $recs);
    } else {
        $ids = array();
    }
    curl_close($ch);
}

On the other end, a basic json_decode() makes the data useful again to display to users.

In a future post, I'll go over using the JDBC backend for recommendations to allow for seamless reloading of preferences, and building endpoints to establish new preferences in real time.