Hal Varian, Google’s chief economist, gave a nice summary of a major need of our era:
“The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill in the next decades, not only at the professional level but even at the educational level for elementary school kids, for high school kids, for college kids. Because now we really do have essentially free and ubiquitous data. So the complementary scarce factor is the ability to understand that data and extract value from it.
“I think statisticians are part of it, but it’s just a part. You also want to be able to visualize the data, communicate the data, and utilize it effectively. … being able to access, understand, and communicate the insights you get from data analysis —are going to be extremely important.”
Hal Varian, Google’s Chief Economist, 2009
Martin O’Leary recently posted some sound advice for Kaggle competitors. You can find the three-paragraph version in the Kaggle wiki.
Here I’ll break it into four key points:
- Spend a while on visualization, making graphs of various properties of the data and trying to get a feel for how everything fits together.
- Test the performance of a variety of standard algorithms (random forests, SVMs, elastic net, etc.) to see how they compare. It’s often very informative to look at which data points are the least well predicted by standard algorithms, as this can give you a good idea of what direction to move in. (Be warned: Home-brew algorithms can be useful later on in a project, but in the early stages you want to try out as many things as possible, not get bogged down in the details of implementing a particular algorithm.)
- Then move into the nitty-gritty details once you have a sense for the lay of the land.
- Of course, all this assumes a certain kind of problem, where the data is already in numeric/categorical form. For more “interesting” datasets, such as the recent Automated Essay Scoring competition, a lot of the early work is in feature extraction — just looking for numbers which you can pull out of the data. That tends to be a bit more creative, and I use a variety of tools to see what works best. However, one of the joys of this kind of problem is that every one is different, so it’s hard to give general advice.
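To make the feature-extraction idea in the last point a little more concrete, here is a minimal Python sketch of "looking for numbers which you can pull out of the data" for essay-like text. The function name `extract_features`, the choice of features, and the two sample essays are illustrative assumptions; they are not taken from the Automated Essay Scoring competition or from O’Leary’s own tooling.

```python
import re
from statistics import mean


def extract_features(essay: str) -> dict:
    """Pull a few simple numeric features out of raw essay text.

    The feature set is a hypothetical starting point, not a recommended one.
    """
    # Words: runs of letters, optionally with one internal apostrophe.
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", essay)
    # Sentences: non-empty chunks separated by '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    lowered = [w.lower() for w in words]
    return {
        "word_count": len(words),
        "sentence_count": len(sentences),
        "avg_word_length": mean([len(w) for w in words]) if words else 0.0,
        "words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
        # Share of distinct words: a crude proxy for vocabulary richness.
        "vocab_ratio": len(set(lowered)) / len(words) if words else 0.0,
    }


# Made-up sample texts, used only to show the shape of the output.
essays = [
    "The cat sat. The cat ran!",
    "Data is everywhere. Understanding it is the hard part.",
]
feature_rows = [extract_features(e) for e in essays]
for row in feature_rows:
    print(row)
```

Each essay becomes one row of numbers, which puts the data into the numeric form that the standard algorithms from the second point expect. Which features actually help is an empirical question, so a sketch like this is best treated as a way to try ideas quickly.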