Martin O’Leary recently posted some sound advice for Kaggle competitors. You can find the three-graph version in the Kaggle wiki.
Here I’ll break it into four key points:
- Spend a while on visualization, making graphs of various properties of the data and trying to get a feel for how everything fits together.
- Test the performance of a variety of standard algorithms (random forests, SVMs, elastic net, etc.) to see how they compare. It’s often very informative to look at which data points are the least well predicted by standard algorithms, as this can give you a good idea of what direction to move in. (Be warned: Home-brew algorithms can be useful later on in a project, but in the early stages you want to try out as many things as possible, not get bogged down in the details of implementing a particular algorithm.)
- Then move into the nitty-gritty details once you have a sense for the lay of the land.
- Of course, all this assumes a certain kind of problem, where the data is already in numeric/categorical form. For more “interesting” datasets, such as the recent Automated Essay Scoring competition, a lot of the early work is in feature extraction — just looking for numbers which you can pull out of the data. That tends to be a bit more creative, and I use a variety of tools to see what works best. However, one of the joys of this kind of problem is that every one is different, so it’s hard to give general advice.