Jan 2008
Book Review: Collective Intelligence, Part I
16-Jan-2008 Filed in: Software
Development
- Useful algorithms
- Shown in real-life situations
- With real data that you can get yourself
This book is the first I've read that relies on the current maturity level of the web. A few years ago, it wouldn't have been possible to provide so many relevant examples. In a few years, the API's might be out-of-date -- but don't let that stop you -- the algorithms are timeless.
The choice of Python is a good one for a few reasons. Firstly, it's pretty clear to read even if you don't know it. Secondly, list-comprehensions are a pretty good way to think about data crunching (which these algorithms do a lot), and, finally, there are many freely available Python libraries to deal with the public APIs he uses.
To give you an idea of Toby's approach, I'm going to run through my favorite chapter so far, the one on Optimization. The quick description of the problem domain should be familiar to every programmer -- solve NP complete problems like the traveling salesperson. Basically, Optimization techniques can be used in situations where (1) you can't analytically solve the problem (2) the solution set is too large to exhaustively search (3) you can express the cost of a possible solution such that better solutions have a lower cost than worse ones and (4) solutions that are "near" each other have similar values for cost.
Toby's first example is how a family can choose flights so that they arrive at the same airport from all over the country at nearly the same time and depart for home a few days later at nearly the same time. The cost function is a combination of the actual costs of the tickets and the difference in time between the earliest and latest flights. He goes on to provide a dataset of flights, describes setting up optimization problems in general, and shows the python code to implement the cost function.
Then he introduces a few optimization algorithms, implements them, and discusses their relative strengths and weaknesses. The next section applies these techniques to real data on Kayak.com using the standard Python DOM parser, minidom. The final section shows how to apply these techniques to other problems like pairing students in a dorm and network visualization. Other chapters get a similarly exhaustive treatment.
I really hope that this is the future of technical books.