Sunday, August 25, 2013

Things I learned while reading Programming Collective Intelligence.

I have been working through Programming Collective Intelligence over the last couple of months. I have to say it's probably been one of the best learning experiences I have had in my years of programming. Compared to some of my previous technology stack / paradigm changes:
Muggle -> VB4
VB6 -> Java
Java -> .Net
Java -> iOS mobile / game development
This one is the biggest, not so much because of the technology stack, but because of the sheer size and complexity of all things ML and AI. Not coming from a mathematical / statistical background, it's really quite a deep hole to jump into, and quite a challenge.

Not only did this book walk me through a bunch of machine learning and data analysis theory, it got me to learn Python, and in translating the examples to Java I was introduced to a whole bunch of Java-related tools and frameworks.

I created blog posts for chapters 2-8, and decided to just work through the Python for chapters 9, 10, 11 and 12, for two reasons:
1. Improve my Python.
2. Get it done so I can move on to my new personal project: using all this ML and Python knowledge to create a cross-platform application with a rich UI using either Kivy or Qt.

To list some of the ML / data analysis topics covered in PCI:

  • Classifiers 
  • Neural Networks
  • Clustering
  • Web crawlers 
  • Data indexers 
  • PageRank algorithm 
  • Genetic Algorithms
  • Simulated Annealing
  • K-Nearest Neighbours
  • Bayesian filtering
  • Decision trees 
  • Support vector machines
  • Kernel Methods
  • Linear Regression
  • Evolving intelligence 



The Java tools, libs and frameworks investigated:



Python tools, libs and resources discovered:



Thursday, August 22, 2013

Getting Kivy to run on Mac OS X with PyCharm and virtualenv

I just had a bit of a struggle getting Kivy to run from my PyCharm IDE; this is how I solved it.

My initial Python environment setup was done as follows:

I installed my Python framework via MacPorts.
For Python 3:
sudo port install py33-numpy py33-scipy py33-matplotlib py33-ipython +notebook py33-pandas py33-sympy py33-nose
For Python 2.7:
sudo port install py27-numpy py27-scipy py27-matplotlib py27-ipython +notebook py27-pandas py27-sympy py27-nose

To set the MacPorts Python as the default:
For Python 3:
sudo port select --set ipython ipython33
sudo port select --set python python33
For Python 2.7:
sudo port select --set ipython ipython27
sudo port select --set python python27

Adding this to your .profile is probably a good idea when you use MacPorts:
export PATH=/opt/local/bin:/opt/local/sbin:$PATH

This installs pretty much all the major packages (some of which I could install via PyCharm's package install interface), including Cython, which Kivy needs.

Then in PyCharm I created a virtual environment and installed pip into it. I could install Kivy directly in PyCharm, but it still requires PyGame to actually run.
I found that PyGame requires X11 / XQuartz, which is no longer bundled with OS X and can be downloaded from:
http://xquartz.macosforge.org/landing/

Once that is installed, run the MacPorts Mercurial install first, or you'll get "The command named 'hg' could not be found":
sudo port install mercurial
Then from the bin directory of my virtualenv I could install PyGame:
./pip-2.7 install hg+http://bitbucket.org/pygame/pygame

After that I could execute my app from the run configurations within PyCharm.
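For reference, the kind of minimal app I was running from PyCharm (a standard Kivy hello-world; the class name is just an example):

from kivy.app import App
from kivy.uix.label import Label

class HelloApp(App):
    def build(self):
        # The widget returned here becomes the root of the window
        return Label(text='Hello Kivy')

if __name__ == '__main__':
    HelloApp().run()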

Saturday, August 17, 2013

Creating a price model using k-Nearest Neighbours + Genetic Algorithm

Chapter 8 of Programming Collective Intelligence (PCI) explains the usage and implementation of the k-Nearest Neighbours (k-NN) algorithm.

Simply put:
k-NN is an algorithm that uses the k nearest neighbours of an item to determine its class (or, as in this price model, a numeric value). To pick those neighbours it uses a distance / similarity score function, in this example Euclidean distance.

PCI takes it a little further to help with accuracy in some scenarios. This includes using a weighted average of the neighbours, and then using either simulated annealing or a genetic algorithm to determine the best feature weights (sketched below), building on Optimization techniques - Simulated Annealing & Genetic Algorithms.
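Roughly, the idea is: scale each input feature by a weight, measure how well k-NN predicts held-out rows, and hand that error to the optimizer as a cost to minimise. A minimal sketch, assuming a predict(train_rows, query) function like the weighted k-NN further down; this crossvalidate is a simplified stand-in for the book's version:

import random

def rescale(rows, scale):
    # Each row is (inputs, result); multiply every input by its feature weight
    return [([w * v for w, v in zip(scale, inputs)], result)
            for inputs, result in rows]

def crossvalidate(predict, rows, test_frac=0.1, trials=5):
    # Average squared error over a few random train/test splits
    total = 0.0
    for _ in range(trials):
        shuffled = rows[:]
        random.shuffle(shuffled)
        cut = max(1, int(len(shuffled) * test_frac))
        test, train = shuffled[:cut], shuffled[cut:]
        total += sum((predict(train, inputs) - result) ** 2
                     for inputs, result in test) / len(test)
    return total / trials

def make_cost_function(predict, rows):
    # The annealing / genetic optimizers minimise a function of a weight vector
    def costf(scale):
        return crossvalidate(predict, rescale(rows, scale))
    return costf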

As with all the previous chapters, the code is in my GitHub repository.

So the similarity score function is slightly different to the one used earlier (which was inverted to return 1 for an exact match).
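In the book's Python it's essentially plain Euclidean distance, where 0 means identical and larger means less similar (my Java version in the repo follows the same idea):

import math

def euclidean(v1, v2):
    # 0 for identical vectors; grows as the vectors move apart
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))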

I updated the simulated annealing and genetic algorithm code, as I had originally implemented them using ints... (lesson learnt: when doing anything to do with ML or AI, stick to doubles).

Then, finally, putting it all together: my Java implementation of the PCI example.
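For reference, a rough Python sketch of the weighted estimate the chapter builds: take the k closest rows, weight each result by a Gaussian of its distance (so near neighbours count more), and return the weighted average. The sigma default here is an assumption; the book tunes it:

import math

def euclidean(v1, v2):
    # Same distance function as above
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

def gaussian(dist, sigma=10.0):
    # Weight falls off smoothly with distance and never quite reaches zero
    return math.e ** (-dist ** 2 / (2 * sigma ** 2))

def weighted_knn(rows, query, k=5):
    # rows is a list of (inputs, result) pairs
    neighbours = sorted(((euclidean(inputs, query), result)
                         for inputs, result in rows), key=lambda d: d[0])
    total, weight_sum = 0.0, 0.0
    for dist, result in neighbours[:k]:
        w = gaussian(dist)
        total += w * result
        weight_sum += w
    return total / weight_sum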

While reading up some more on k-NN I also stumbled upon the following blog posts.
The first describes some of the difficulties around using k-NN:
k-Nearest Neighbors - dangerously simple

And the second gives a great overview of k-NN:
A detailed introduction to k-NN algorithm

Sunday, August 11, 2013

Decision Trees

I just completed working through Chapter 7 of Programming Collective Intelligence (PCI). This chapter demonstrates how, when and why you should use the decision tree construct. The method described is the CART technique.

The basic summary is: in a decision tree each branch node represents a choice between a number of alternatives, and each leaf node represents a decision (or classification). This makes the decision tree another supervised machine learning algorithm useful for classifying information.

The main problem to overcome in building a decision tree is identifying the best split of the data points. To find it you go through all the subsets of the data, identify which split gives you the best gain, and start from there.
For some more technical information about this split / gain:
http://en.wikipedia.org/wiki/Information_gain_in_decision_trees
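To make the split / gain idea concrete, a minimal sketch using entropy (the measure the chapter's tree builder uses; it also shows Gini impurity as an alternative):

import math

def entropy(rows):
    # The last column of each row is the class label; entropy measures
    # how mixed the labels are (0 when every row has the same label)
    counts = {}
    for row in rows:
        counts[row[-1]] = counts.get(row[-1], 0) + 1
    return -sum((float(c) / len(rows)) * math.log(float(c) / len(rows), 2)
                for c in counts.values())

def information_gain(rows, set1, set2):
    # Entropy before the split minus the weighted entropy after it;
    # the tree builder tries every column / value and keeps the best gain
    p = float(len(set1)) / len(rows)
    return entropy(rows) - p * entropy(set1) - (1 - p) * entropy(set2)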

The biggest advantages I see in using a decision tree are:
It's easy to interpret and visualise.
Data doesn't need to be normalised or scaled to a range like -1 to 1.

Decision trees, however, can't be used effectively on large datasets with a large number of possible outcomes.

As with my previous Classifiers post, I ended up using an SQLite in-memory DB, as it's such a pleasure to use. I did venture into using LambdaJ, but it ended up being such an ugly line of code that I left it and simply did it manually. I have not looked at the Java 8 implementation of lambdas yet; I just hope it doesn't end in code like this (with a whole bunch of static imports):

falseList.add(filter(not(having(on(List.class).get(col).toString(), equalTo((String) value))), asList(rows)));

So my Java implementation of the PCI decision tree ended up looking like this (all code in GitHub):

(Once again... about 50% more code :) ). I'm really beginning to enjoy Python; I can see it being my first choice for all future AI / ML type work.
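For comparison, the row split that needed the LambdaJ gymnastics above is only a handful of lines in the book's Python (roughly PCI's divideset):

def divide_set(rows, column, value):
    # Pick the test based on whether the value is numeric or nominal
    if isinstance(value, (int, float)):
        split = lambda row: row[column] >= value
    else:
        split = lambda row: row[column] == value
    true_rows = [row for row in rows if split(row)]
    false_rows = [row for row in rows if not split(row)]
    return true_rows, false_rows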

