Michael Rose

Software generalist, backend engineer, Hadoop/Storm bit-herder, building distributed systems on the JVM. Working for FullContact.

Dropwizard Deployment for the Lazy Using Rake and Net::SSH


Today I embarked on automating the deployment of one of my basic Dropwizard/Angular.JS projects. This project’s purpose is to extract unstructured data from forum posts and turn it into data directly consumable by an API. It’s split into four major pieces:

  • Core — reusable representation objects to keep things tidy between phases
  • Extraction — Extracts data from forum posts by executing a large unwieldy JDBC query and iterating over the results, then pushing each into MongoDB
  • Query — Dropwizard/JAX-RS REST API for exposing the data found by the extraction phase
  • Frontend — Angular.JS frontend app that consumes the REST API

Like many projects, this one started out with a somewhat manual deployment, but today I wanted to automate it. Jenkins is a little heavyweight for a personal project, and I don’t really need a full-blown continuous integration framework. Capistrano is closer to what I wanted, but is too Rails-flavored to be immediately useful for my needs.


Rake is simple. Stupid simple. It’s a Ruby DSL for running commands, a step up from a hand-rolled bash script.

Mahout & Dropwizard: Collaborative Filtering & Recommenders


Recommenders are in use all throughout the web. Chances are, you’ve interacted with dozens of recommender systems in the course of simply finding this article.

Amazon is built on recommender systems.

Amazon Recommendations

Using techniques such as item-based recommenders and itemset generation, Amazon can not only find what’s relevant to you, but also determine what’s related to what you’re browsing and find items frequently bought with yours (and perhaps provide a combo bonus to shake cobwebs from your wallet).

As with any Machine Learning task, the first step is to define your problem. What are you trying to do? Who is your customer? Is it your Business Intelligence team? Is it the consumer browsing your webapp? Is it your fault detection system attempting to draw relations? Spike detection?

The problem I’m attempting to solve is a somewhat common one. On a site (which will remain nameless for now) members rate threads containing media. For the sake of simplicity, let’s say the threads are albums posted by members along with their reviews of said albums. The members of this site are extremely opinionated, and want good recommendations.
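To make the idea of an item-based recommender concrete for the album scenario, here is a minimal plain-Java sketch. It is not Mahout’s API and not necessarily the approach the finished project takes: it simply scores album-to-album cosine similarity from member ratings (treating a missing rating as zero, which is only one of several common conventions) and predicts a rating as a similarity-weighted average. The class name, the member names, the album names, and the sample ratings are all made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class ItemBasedSketch {

    /** Cosine similarity between two albums; an album a member has not rated counts as 0. */
    static double cosine(Map<String, Map<String, Double>> ratings, String a, String b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map<String, Double> byAlbum : ratings.values()) {
            Double ra = byAlbum.get(a);
            Double rb = byAlbum.get(b);
            if (ra != null) normA += ra * ra;
            if (rb != null) normB += rb * rb;
            if (ra != null && rb != null) dot += ra * rb;
        }
        if (normA == 0.0 || normB == 0.0) return 0.0; // nobody rated one of the albums
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Predicts a member's rating for an album as a similarity-weighted average of their own ratings. */
    static double predict(Map<String, Map<String, Double>> ratings, String member, String album) {
        Map<String, Double> mine = ratings.get(member);
        if (mine == null) return Double.NaN; // unknown member
        double weighted = 0.0, totalSim = 0.0;
        for (Map.Entry<String, Double> e : mine.entrySet()) {
            if (e.getKey().equals(album)) continue;
            double sim = cosine(ratings, album, e.getKey());
            if (sim <= 0.0) continue; // ignore albums with no overlap
            weighted += sim * e.getValue();
            totalSim += sim;
        }
        return totalSim == 0.0 ? Double.NaN : weighted / totalSim;
    }

    static void rate(Map<String, Map<String, Double>> ratings, String member, String album, double value) {
        ratings.computeIfAbsent(member, k -> new HashMap<>()).put(album, value);
    }

    /** Hypothetical ratings on a 1-5 scale: member -> (album -> rating). */
    static Map<String, Map<String, Double>> sampleRatings() {
        Map<String, Map<String, Double>> ratings = new HashMap<>();
        rate(ratings, "alice", "album-a", 5);
        rate(ratings, "alice", "album-b", 5);
        rate(ratings, "alice", "album-c", 1);
        rate(ratings, "bob", "album-a", 4);
        rate(ratings, "bob", "album-b", 4);
        rate(ratings, "carol", "album-a", 1);
        rate(ratings, "carol", "album-c", 5);
        rate(ratings, "dave", "album-a", 5);
        rate(ratings, "dave", "album-c", 1);
        return ratings;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Double>> ratings = sampleRatings();
        System.out.printf("similarity(album-b, album-a) = %.3f%n",
                cosine(ratings, "album-b", "album-a"));
        System.out.printf("predicted rating of album-b for dave = %.3f%n",
                predict(ratings, "dave", "album-b"));
    }
}
```

In this toy data, dave has not rated album-b, but it is rated much like album-a (which he loved) and barely overlaps with album-c (which he hated), so the prediction leans toward his album-a rating. A real system would also need to deal with sparsity, rating bias, and scale, which is where a library like Mahout earns its keep.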

Groovy on Storm


At work, we end up using Groovy. A lot. In fact, most of our infrastructure is built using some combination of Groovy, Java, and a hint of Clojure. When Nathan Marz’s Storm came out we were understandably excited: forcing Hadoop to attempt to be a real time system was an absolute nightmare. Don’t try that. It’s not smart. Hadoop is a batch system.

Get out of there cat. You are not a data-centric batch processing solution.

We rearchitected, clearly. Hadoop is now being used the way its creator intended, for bulk processing, and we’re playing with Storm as a way forward for other cool projects, such as moving our realtime backend processes over to it. This includes webhook processing and a lot more.

The first step is getting Groovy to run on Storm. In a fit of sleeplessness, I created a shell project and corresponding pom.xml to get everything going.


The result is a very basic project with a slightly modified ExclamationTopology (no inner classes or semicolons) and a working pom.xml which uses antrun to compile the Groovy code into an executable jar. After all is said and done, one need only run java -jar target/groovy-storm-0.1.0-jar-with-dependencies.jar

Pretty simple. Now I can get started on the hard work and use the storm-contrib packages to feed from SQS (our primary message queue).


Data Platforms & ActiveRecord


I’m working on a data platform for my 9th and final data mining project on the KDD Cup dataset, transforming four text files of data and relations into a real, queryable dataset. In practice, I’m creating a data warehouse, albeit an extraordinarily simplified one.

Data transformation is almost never fun. It’s rote, dry, and time-consuming work. Today I decided to move away from my general MO for one-off data transformers and not embed SQL directly into the project.

Instead I decided to work with ActiveRecord. Active Record is a wonderful object-relational mapping pattern implemented by many different libraries in tons of languages. Ruby, my language of choice in this course (and one of my favorite languages in general; I love Ruby), has a great ActiveRecord package, the one used by Ruby on Rails.

Initial Post


I’ve read that all programmers should blog. Doing so forces us to think critically about what we do and to explain ourselves – certainly something I lack at times. This is my attempt to rectify my shortcomings in terms of being able to express myself in words.

This blog is also somewhat cathartic for me. I’m a Computer Science student, a scant semester away from graduating from the somewhat-known Colorado School of Mines. Though upon graduating, I’ll turn around the same day and make sure I’m registered for my Fall 2012 classes. I’ve become entangled in a 5-year combined bachelor’s/master’s program. A frustrating endeavor. I love my field dearly but loathe the oftentimes frustrating coursework and exams which do little to test knowledge or problem solving, only testing recall.

As anyone who knows me will tell you, recall isn’t always my strongest quality.

To whinge about my academic woes is not the focus of this blog. I’m a Software Engineer, Computer Scientist, and tinkerer. I’ve found myself wanting to write down some of my thoughts on software and design practices. My explorations into Scala and Clojure are topics I feel are worth writing about. Maybe one day someone will read my blog and gain some useful insight (or indeed, correct my incorrect insights).

Here’s to hoping I succeed.