Chapter 6 of Programming Collective Intelligence (PCI) demonstrates how to classify documents based on their content.
I used one extra Java open source library for this chapter, and its implementation was completely painless.
What a pleasure: a simple Maven include and that's it, you have a little file-based or in-memory SQL database in your code.
My full Java implementation of some of the topics is available on my GitHub repo, but here I will highlight the Fisher method (look up Fisher's discriminant analysis, or LDA, if you want to get a lot more technical).
What has made PCI a good book is its ability to summarise quite complex theoretical and mathematical concepts down to basics and code, for us lowly developers to use practically.
"the Fisher method calculates the probability of a category for each feature of the document, then combines the probabilities and test to see if the set of probabilities is more or less likely than a random set. This method also returns a probability for each category that can be compared to others"
During the writing of this post, I discovered the following blog:
Shape of data
It seems well worth the read, so I will be spending the next couple of days on it before continuing with PCI chapter 7: Decision Trees.