Tuesday, July 30, 2013

Document Filtering - Classifiers

Chapter 6 of Programming Collective Intelligence (PCI) demonstrates how to classify documents based on their content.
I used one extra Java open source library for this chapter, and its integration was completely painless.
What a pleasure: a simple Maven include, and that's it, a little file- or memory-based SQL db in your code.

My full Java implementation of some of the topics is available on my GitHub repo, but I will highlight the Fisher method here (look up Fisher's discriminant analysis or LDA if you want to get a lot more technical).
What has made PCI a good book is its ability to distil quite complex theoretical and mathematical concepts down to basics and code that we lowly developers can use practically.
To Quote:
"the Fisher method calculates the probability of a category for each feature of the document, then combines the probabilities and tests to see if the set of probabilities is more or less likely than a random set. This method also returns a probability for each category that can be compared to others"

Although for this particular chapter and example, I would have liked a bit more layman's information surrounding the chi-squared distribution, as you will notice in the TODO in my implementation.
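To make the quoted description concrete, here is a rough sketch of Fisher's probability combination in Java. The class and method names are my own, not the book's, and it assumes the per-feature category probabilities have already been calculated elsewhere:

```java
import java.util.List;

public class FisherCombiner {

    // Fisher's method: -2 * sum(ln(p)) follows a chi-squared distribution
    // with 2n degrees of freedom when the n probabilities are random, so a
    // low tail probability means the set is unlikely to be random.
    public static double fisherProbability(List<Double> probabilities) {
        double fisherScore = 0.0;
        for (double p : probabilities) {
            fisherScore += Math.log(p);
        }
        fisherScore *= -2.0;
        return inverseChiSquared(fisherScore, probabilities.size() * 2);
    }

    // Upper-tail probability of the chi-squared distribution, using the
    // closed-form series that exists for even degrees of freedom.
    static double inverseChiSquared(double chi, int df) {
        double m = chi / 2.0;
        double sum = Math.exp(-m);
        double term = sum;
        for (int i = 1; i < df / 2; i++) {
            term *= m / i;
            sum += term;
        }
        return Math.min(sum, 1.0);
    }
}
```

Feeding it neutral probabilities (all 0.5) gives a result near the middle of the range, which is exactly what you want from an "is this random?" test.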
During the writing of this post, I discovered the following blog:
Shape of data
It seems well worth the read; I will be spending the next couple of days on that before continuing with PCI chapter 7: Decision Trees.

Monday, July 22, 2013

Optimization techniques - Simulated Annealing & Genetic Algorithms

Chapter 5 of Programming Collective Intelligence (PCI) deals with optimisation problems.

To Quote:
"Optimisation finds the best solution to a problem by trying many different solutions and scoring them to determine their quality. Optimisation is typically used in cases where there are too many possible solutions to try them all"

Before embarking on this chapter, I decided it would be best to quickly learn Python. There just seems to be a lot of Python around as soon as you start learning and reading about machine learning and data analysis; it can't actually be ignored.
(Still not sure why this is the case, but I set out to get up and running with Python in one weekend.)

Some of the resources I used:

As a developer, learning the basics of Python really isn't very difficult; to be honest, it probably took me longer to find a development environment I was happy with. Consoles and text editors just don't do it for me.

The main ones I investigated were:
Ninja IDE (Free)
Eclipse + PyDev (Free)
PyCharm ($99)

I spent quite a bit of time playing with Ninja IDE and Eclipse, but there were just little things that kept bugging me, from strange shortcuts to highlighting correct code/syntax as incorrect.

10 minutes after installing PyCharm, I was sold. To be fair, I am probably not the best person to judge.
I code in IntelliJ daily and actually ended up converting all the Java developers in my department from Eclipse to IntelliJ... I also did the majority of my Objective-C work in AppCode. In other words, I am a JetBrains fanboy, happy to hand over my money for an awesome tool.

Getting back to PCI, there were a couple of issues with the code in this chapter, which caused me (a person who had just learnt Python) a little bit of pain, because I figured the code had to be right and I was just doing something wrong. Eventually I went searching and found:

PCI Errata

With that I corrected the issues in the hillclimb and genetic algorithm functions:

The Java implementation of the 3 functions ended up twice as long, looking like:
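To give an idea of the shape of that code, here is a minimal hill-climbing sketch in Java. It is my own simplified version with a generic cost function, not the book's exact implementation:

```java
import java.util.Arrays;
import java.util.Random;
import java.util.function.ToDoubleFunction;

public class HillClimb {

    // Basic hill climbing: start from a random solution inside the domain,
    // then repeatedly move to the best neighbour (each element nudged up or
    // down by one) until no neighbour improves the cost.
    public static int[] optimise(int[][] domain, ToDoubleFunction<int[]> cost, Random random) {
        int[] solution = new int[domain.length];
        for (int i = 0; i < domain.length; i++) {
            solution[i] = domain[i][0] + random.nextInt(domain[i][1] - domain[i][0] + 1);
        }
        while (true) {
            int[] best = solution;
            double bestCost = cost.applyAsDouble(solution);
            for (int i = 0; i < domain.length; i++) {
                for (int delta : new int[]{-1, 1}) {
                    int value = solution[i] + delta;
                    if (value < domain[i][0] || value > domain[i][1]) continue;
                    int[] neighbour = Arrays.copyOf(solution, solution.length);
                    neighbour[i] = value;
                    double neighbourCost = cost.applyAsDouble(neighbour);
                    if (neighbourCost < bestCost) {
                        best = neighbour;
                        bestCost = neighbourCost;
                    }
                }
            }
            if (best == solution) return solution; // local minimum reached
            solution = best;
        }
    }
}
```

The obvious weakness, which the chapter then addresses with simulated annealing and genetic algorithms, is that this happily gets stuck in a local minimum.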

And unlike my previous posts on PCI, I didn't use a whole bunch of open source libraries; I only added one:
Java Tuples.

The whole Chapter 5 optimisation solution is in my blog GitHub repo; the concepts used in both the simulated annealing and genetic algorithm implementations could easily be adapted and reused if you are looking for a simple example of those techniques.
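The genetic algorithm side follows the same pattern as the hill climber. A stripped-down sketch, with my own names and parameter choices (elite fraction, single-step mutation) rather than the book's or my repo's exact code:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Random;
import java.util.function.ToDoubleFunction;

public class GeneticOptimiser {

    // Minimal genetic algorithm: keep the elite top of each generation,
    // then refill the population with mutated or crossed-over elites.
    public static int[] evolve(int[][] domain, ToDoubleFunction<int[]> cost,
                               int populationSize, int generations, Random random) {
        List<int[]> population = new ArrayList<>();
        for (int p = 0; p < populationSize; p++) {
            int[] candidate = new int[domain.length];
            for (int i = 0; i < domain.length; i++) {
                candidate[i] = domain[i][0] + random.nextInt(domain[i][1] - domain[i][0] + 1);
            }
            population.add(candidate);
        }
        int eliteCount = Math.max(2, populationSize / 5);
        for (int g = 0; g < generations; g++) {
            population.sort(Comparator.comparingDouble(cost::applyAsDouble));
            List<int[]> next = new ArrayList<>(population.subList(0, eliteCount));
            while (next.size() < populationSize) {
                if (random.nextDouble() < 0.5) {
                    // mutate one element of a random elite by +/- 1, clamped
                    int[] child = next.get(random.nextInt(eliteCount)).clone();
                    int i = random.nextInt(domain.length);
                    child[i] = Math.max(domain[i][0], Math.min(domain[i][1],
                            child[i] + (random.nextBoolean() ? 1 : -1)));
                    next.add(child);
                } else {
                    // single-point crossover of two random elites
                    int[] a = next.get(random.nextInt(eliteCount));
                    int[] b = next.get(random.nextInt(eliteCount));
                    int split = random.nextInt(domain.length);
                    int[] child = new int[domain.length];
                    for (int i = 0; i < domain.length; i++) {
                        child[i] = i < split ? a[i] : b[i];
                    }
                    next.add(child);
                }
            }
            population = next;
        }
        population.sort(Comparator.comparingDouble(cost::applyAsDouble));
        return population.get(0);
    }
}
```

Because the elites always survive, the best solution can only improve from one generation to the next.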

Now for Chapter 6 ... Document Filtering...

Wednesday, July 3, 2013

Mini Search Engine - Just the basics, using Neo4j, Crawler4j, Graphstream and Encog

Continuing to chapter 4 of Programming Collective Intelligence (PCI), which is implementing a search engine.
I may have bitten off a little more than I should have in one exercise. Instead of using the normal relational database construct used in the book, I figured I had always wanted to have a look at Neo4j, so now was the time. Just to say, this isn't necessarily the ideal use case for a graph db, but how hard could it be to kill three birds with one stone?

Working through the tutorials trying to reset my SQL Server, Oracle mindset took a little longer than expected, but thankfully there are some great resources around Neo4j.

Just a couple:
neo4j - learn
Graph theory for busy developers

Since I just wanted to run this as a little exercise, I decided to go for an in-memory implementation rather than running it as a service on my machine. In hindsight this was probably a mistake; the tools and web interface would have helped me visualise my data graph quicker in the beginning.

As you can only have one writable instance of the in-memory implementation, I made a little double-checked-locking singleton factory to create and clear the DB.
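The double-checked locking pattern itself looks like this. In the real code the field holds Neo4j's graph database service; a plain StringBuilder stands in here so the sketch is self-contained:

```java
public class GraphDbFactory {

    // Must be volatile so a half-constructed instance is never visible
    // to other threads. In my actual repo this is the Neo4j database.
    private static volatile StringBuilder instance;

    private GraphDbFactory() {
    }

    // Double-checked locking: the unsynchronised first check keeps the
    // lock off the hot path; the second check inside the lock guarantees
    // only one thread ever creates the database.
    public static StringBuilder getInstance() {
        if (instance == null) {
            synchronized (GraphDbFactory.class) {
                if (instance == null) {
                    instance = new StringBuilder("in-memory graph db");
                }
            }
        }
        return instance;
    }

    public static synchronized void clear() {
        instance = null;
    }
}
```

Without the volatile keyword this pattern is broken on the Java memory model, which is the detail that makes it worth writing down.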

Then, using Crawler4j, I created a graph of all the URLs starting with my blog, their relationships to other URLs, and all the words and indexes of the words that those URLs contain.
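Crawler4j drives the crawl through a shouldVisit(...) hook that you override to decide which links to follow. The filtering logic can be sketched without the library itself; the seed and URLs below are purely hypothetical examples:

```java
import java.util.regex.Pattern;

public class CrawlFilter {

    // Skip obvious non-HTML resources so the crawler only parses pages.
    private static final Pattern BINARY_EXTENSIONS =
            Pattern.compile(".*\\.(css|js|gif|jpe?g|png|pdf|zip)$", Pattern.CASE_INSENSITIVE);

    // Mirrors the decision Crawler4j asks you to make in shouldVisit:
    // stay within the seed site and ignore binary resources.
    public static boolean shouldVisit(String seedPrefix, String url) {
        String lower = url.toLowerCase();
        return lower.startsWith(seedPrefix) && !BINARY_EXTENSIONS.matcher(lower).matches();
    }
}
```

The visit(...) callback is then where each accepted page's words and outgoing links get written into the graph.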

After the data was collected, I could query it and perform the functions of a search engine. For this I decided to use Java futures, as that was another thing I had only read about and not yet implemented. In my day-to-day working environment we use WebLogic / CommonJ work managers within the application server to perform the same task.
I then went about creating a task for each of the following: word frequency, document location, PageRank, and a neural network (with fake input/training data) to rank the pages returned based on the search criteria. All the code is in my public GitHub blog repo.
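The futures pattern, using word frequency as the example task, can be sketched like this. It is my own simplified version with a plain ExecutorService standing in for the WebLogic work managers:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SearchTasks {

    // Submit one Callable per document, then gather the partial counts
    // through Future.get() and merge them into one total map.
    public static Map<String, Long> wordFrequencies(List<String> documents)
            throws InterruptedException, ExecutionException {
        ExecutorService executor = Executors.newFixedThreadPool(4);
        try {
            List<Future<Map<String, Long>>> futures = new ArrayList<>();
            for (String document : documents) {
                futures.add(executor.submit(() -> countWords(document)));
            }
            Map<String, Long> totals = new HashMap<>();
            for (Future<Map<String, Long>> future : futures) {
                future.get().forEach((word, count) -> totals.merge(word, count, Long::sum));
            }
            return totals;
        } finally {
            executor.shutdown();
        }
    }

    private static Map<String, Long> countWords(String document) {
        Map<String, Long> counts = new HashMap<>();
        for (String word : document.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1L, Long::sum);
            }
        }
        return counts;
    }
}
```

Each of the other ranking metrics just becomes another Callable submitted to the same executor.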

Disclaimer: the neural network task either didn't have enough data to be effective, or I implemented the data normalisation incorrectly, so it is currently not very useful. I'll return to it once I have completed the journey through the whole PCI book.

The one task worth sharing was the PageRank one. I quickly read some of the theory for it, decided I am not that clever, and went searching for a library that had it implemented. I discovered Graphstream, a wonderful open source project that does a WHOLE lot more than just PageRank; check out their video.

From there it was simple to implement the PageRank task for this exercise.
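For the concept itself, a minimal power-iteration PageRank, independent of Graphstream, might look like this (my own sketch, not the library's implementation):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SimplePageRank {

    // Power iteration over an adjacency list: each page shares its current
    // rank equally among its outgoing links, damped by the usual 0.85.
    public static Map<String, Double> rank(Map<String, List<String>> links, int iterations) {
        double damping = 0.85;
        int n = links.size();
        Map<String, Double> ranks = new HashMap<>();
        for (String page : links.keySet()) {
            ranks.put(page, 1.0 / n);
        }
        for (int it = 0; it < iterations; it++) {
            Map<String, Double> next = new HashMap<>();
            for (String page : links.keySet()) {
                next.put(page, (1.0 - damping) / n); // base rank for every page
            }
            for (Map.Entry<String, List<String>> entry : links.entrySet()) {
                double share = ranks.get(entry.getKey()) / entry.getValue().size();
                for (String target : entry.getValue()) {
                    next.merge(target, damping * share, Double::sum);
                }
            }
            ranks = next;
        }
        return ranks;
    }
}
```

A library earns its keep on the details this sketch ignores: dangling pages with no outgoing links, convergence checks, and scale.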

In between all of this I found a great implementation of sorting a map by its values on Stack Overflow.
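The idea is simple enough to sketch from scratch; a generic version (mine, not the exact Stack Overflow answer) for ranking words or pages by score:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MapSorter {

    // Sort a map by value, descending: copy the entries out, sort them
    // with a comparator, then rebuild a LinkedHashMap, which is the only
    // standard Map that preserves insertion order.
    public static <K, V extends Comparable<V>> Map<K, V> sortByValueDesc(Map<K, V> map) {
        List<Map.Entry<K, V>> entries = new ArrayList<>(map.entrySet());
        Collections.sort(entries, new Comparator<Map.Entry<K, V>>() {
            @Override
            public int compare(Map.Entry<K, V> a, Map.Entry<K, V> b) {
                return b.getValue().compareTo(a.getValue());
            }
        });
        Map<K, V> sorted = new LinkedHashMap<>();
        for (Map.Entry<K, V> entry : entries) {
            sorted.put(entry.getKey(), entry.getValue());
        }
        return sorted;
    }
}
```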

The Maven dependencies used to implement all of this:

Now to chapter 5 on PCI... Optimisation.

Monday, July 1, 2013

A couple useful Oracle XE admin commands

I struggled a bit trying to get my local Oracle XE up and running after a couple of months of it lying dormant.

Firstly: Oracle XE 11g sets password expiry by default. Quite annoying...
So my system account was locked.
To unlock it, I did the following at the Windows command prompt:
set ORACLE_HOME=ORACLE_PATH (D:\OracleXe\app\oracle\product\11.2.0\server in my case)
sqlplus / as sysdba
ALTER USER SYSTEM IDENTIFIED BY password;

If the account is locked, run:
ALTER USER SYSTEM ACCOUNT UNLOCK;

Then, to ensure that it does not expire again:
ALTER PROFILE DEFAULT LIMIT PASSWORD_LIFE_TIME UNLIMITED;


One more thing I needed to change, since I had installed a local Tomcat, was the default HTTP port for XE.
This can be done as follows, where 3010 is the new port:
EXEC DBMS_XDB.SETHTTPPORT(3010);
