Zen in the art of IT: Creating a price model using k-Nearest Neighbours + Genetic Algorithm

Saturday, August 17, 2013

Creating a price model using k-Nearest Neighbours + Genetic Algorithm

Chapter 8 of Programming Collective Intelligence (PCI) explains the usage and implementation of the k-Nearest Neighbours algorithm. (k-NN).

Simply put:
k-NN is a classification algorithm that uses (k) for the number of neighbours to determine what class an item will belong to. To determine the neighbours to be used the algorithm uses a distance / similarity score function, in this example (Euclidian Distance).

PCI takes it a little further to help with accuracy in some scenarios. This includes the usage of a weighted average of the neighbours, as well as then using either simulated annealing or genetic algorithms to determine the best weights, building on Optimization techniques - Simulated Annealing & Genetic Algorithms

As with all the previous chapters the code is in my github repository.

So the similarity score function looked like (slightly different to the one used earlier, which was inverted to return 1 if equals):

The simulated annealing and genetic algorithm code I updated as I originally implemented them using Ints... (lesson learnt when doing anything it do with ML or AI, stick to doubles).

Then finally putting it all together my Java implementation of the PCI example

While reading up some more on k-NN I also stumbled upon the following blog posts
First one describing some of the difficulties around using k-NN.
k-Nearest Neighbors - dangerously simple

And then one giving a great overview of k-NN
A detailed introduction to k-NN algorithm