Zen in the art of IT: Predicting the next most probable part of speech.

Sunday, December 15, 2013

Predicting the next most probable part of speech.

I have recently been spending some of my spare time learning about AI and machine learning. After a couple of books, a bunch of tutorials and most of Andrew Ng's Coursera course, I decided: enough with the theory, time for some real code.
During all my late-night reading I also stumbled across the Loebner Prize and its most recent winner Mitsuku, as well as A.L.I.C.E and Cleverbot. To be honest, maybe given my naivety in the field of AI, I expected much more from this "AI" technology: most of these current chatbots are easily confused and honestly not very impressive.
Thankfully I also found Eric Horvitz's video of his AI personal assistant, which resonated with what I wanted to achieve with my ventures into AI.

So, with human interaction as a focus point, I started designing "Aria" (Artificially Intelligent Research Assistant, in reverse). Since most of my development experience is in the Java Enterprise environment, it will be built on a distributed, enterprise scale using the amazing technologies that environment offers, to mention some: Hadoop, Spark, Mahout, Solr, MySQL, Neo4J, Spring...

My moonshot/daydream goal is to improve the interactions between people and computers, but in reality, if I only learn to use, implement and enjoy all that is involved with AI and ML, I will see myself as successful.

So, for my first bit of functional machine learning...

Predicting the next most probable part of speech. One of the issues with natural language processing is that words used in different contexts end up having different meanings and synonyms. To try to assist with this, I figured I would train a neural network on the relevant parts of speech, and then use that to assist in understanding user-submitted text.

The full code for this example is available on GitHub.

I used a number of Java open source libraries for this:
Encog
Neuroph
Stanford NLP
Google Guava

I used a dataset of 29 000 English sentences that I sourced from a bunch of websites and open corpora. I won't be sharing those as I have no clue what the state of the copyright is, so unfortunately to recreate this you'd need to source your own data.

For the neural network implementation, I tried both Neuroph and Encog. Neuroph got my attention first with its great UI, which let me experiment with my neural network visually in the beginning. But as soon as I created my training data, which ended up being about 300MB of 0's and 1's, the UI fell over and didn't allow me to use it. I then began looking at Encog again, as I had used it initially when just starting to read about ML and AI.

When used from code, Neuroph worked with the dataset, but only with BackPropagation; the ResilientPropagation implementation never seemed to return.
So I ended up much preferring Encog. Its resilient propagation implementation (iRPROP+) worked well and reduced the network error to about 0.018 in under 100 iterations, without me having to fine-tune the settings and network architecture.

How this works: I take text data and use the Stanford NLP library to generate a list of the parts of speech in the document. I translate their annotations into an internal enum, and then use that to build up a training data set. I currently persist that to file, just to save some time while testing. I then train and persist the neural network, and test it.

The Parts of Speech Enum:
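
A minimal sketch of what such an enum might look like, assuming a coarse grouping of the Penn Treebank tags that the Stanford tagger emits for English. The constant names, the `fromPennTag` helper and the grouping itself are illustrative assumptions, not the project's actual code.

```java
class PosEnumSketch {

    /** Coarse internal categories (hypothetical names chosen for this sketch). */
    enum PartOfSpeech {
        NOUN, VERB, ADJECTIVE, ADVERB, PRONOUN, DETERMINER,
        PREPOSITION, CONJUNCTION, NUMBER, OTHER;

        /** Maps a Penn Treebank tag such as "NNS" or "VBD" to a coarse category. */
        static PartOfSpeech fromPennTag(String tag) {
            if (tag.startsWith("NN")) return NOUN;                        // NN, NNS, NNP, NNPS
            if (tag.startsWith("VB") || tag.equals("MD")) return VERB;    // verbs and modals
            if (tag.startsWith("JJ")) return ADJECTIVE;                   // JJ, JJR, JJS
            if (tag.startsWith("RB") || tag.equals("WRB")) return ADVERB; // RB, RBR, RBS, WRB
            if (tag.startsWith("PRP") || tag.startsWith("WP")) return PRONOUN;
            if (tag.equals("DT") || tag.equals("PDT") || tag.equals("WDT")) return DETERMINER;
            if (tag.equals("IN")) return PREPOSITION;
            if (tag.equals("CC")) return CONJUNCTION;
            if (tag.equals("CD")) return NUMBER;
            return OTHER;                                                 // punctuation, symbols, etc.
        }
    }

    public static void main(String[] args) {
        for (String tag : new String[] {"DT", "JJ", "NNS", "VBD", "."}) {
            System.out.println(tag + " -> " + PartOfSpeech.fromPennTag(tag));
        }
    }
}
```

Collapsing the fine-grained tags into a handful of categories keeps the one-hot vectors small, at the cost of losing distinctions such as tense or plurality.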

The creation of the training data:
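
One way the "0's and 1's" training data could be built is with a sliding window over each sentence's tags: the previous few tags are one-hot encoded as input and the following tag is the one-hot ideal output. The sketch below assumes a window of three tags and a small stand-in enum; the `WINDOW` size, the `Example` class and the helper names are hypothetical and not taken from the project.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TrainingDataSketch {

    /** Stand-in for the internal enum (hypothetical categories). */
    enum PartOfSpeech { NOUN, VERB, ADJECTIVE, ADVERB, PRONOUN, DETERMINER, PREPOSITION, CONJUNCTION, OTHER }

    /** Number of preceding tags used as context; an assumption for this sketch. */
    static final int WINDOW = 3;

    /** One training pair: encoded context tags and the encoded tag that follows them. */
    static final class Example {
        final double[] input;
        final double[] ideal;

        Example(double[] input, double[] ideal) {
            this.input = input;
            this.ideal = ideal;
        }
    }

    /** Concatenates one one-hot vector per tag, so the result is all 0's and 1's. */
    static double[] oneHot(PartOfSpeech... tags) {
        int size = PartOfSpeech.values().length;
        double[] encoded = new double[tags.length * size];
        for (int i = 0; i < tags.length; i++) {
            encoded[i * size + tags[i].ordinal()] = 1.0;
        }
        return encoded;
    }

    /** Slides a window over one sentence's tags and emits (context, next tag) pairs. */
    static List<Example> buildExamples(List<PartOfSpeech> sentence) {
        List<Example> examples = new ArrayList<Example>();
        for (int start = 0; start + WINDOW < sentence.size(); start++) {
            PartOfSpeech[] context =
                    sentence.subList(start, start + WINDOW).toArray(new PartOfSpeech[0]);
            PartOfSpeech next = sentence.get(start + WINDOW);
            examples.add(new Example(oneHot(context), oneHot(next)));
        }
        return examples;
    }

    public static void main(String[] args) {
        List<PartOfSpeech> sentence = Arrays.asList(
                PartOfSpeech.DETERMINER, PartOfSpeech.ADJECTIVE, PartOfSpeech.NOUN,
                PartOfSpeech.VERB, PartOfSpeech.ADVERB);
        for (Example example : buildExamples(sentence)) {
            System.out.println(Arrays.toString(example.input) + " => " + Arrays.toString(example.ideal));
        }
    }
}
```

With this encoding each example has `WINDOW * categories` inputs and `categories` outputs, which also goes some way to explaining how a text file of such vectors can grow to hundreds of megabytes.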

Train the network:
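
The post trains with Encog's resilient propagation. As a rough, library-free illustration of the idea behind iRPROP+ (and explicitly not Encog's implementation), the sketch below applies the per-weight update rule to a single weight on a toy error function. The constants are the values commonly quoted in the RPROP literature, and the toy function, class and method names are assumptions made for this example.

```java
class IrpropPlusSketch {

    // Commonly quoted RPROP constants; assumed here, not taken from the post.
    static final double ETA_PLUS = 1.2;
    static final double ETA_MINUS = 0.5;
    static final double DELTA_MIN = 1e-6;
    static final double DELTA_MAX = 50.0;

    /** Toy error function with its minimum at w = 3 (purely illustrative). */
    static double error(double w) {
        return (w - 3.0) * (w - 3.0);
    }

    static double gradient(double w) {
        return 2.0 * (w - 3.0);
    }

    /** Runs the iRPROP+ rule on one weight: only the sign of the gradient is used. */
    static double minimize(double w, int iterations) {
        double delta = 0.1;          // current step size for this weight
        double prevGradient = 0.0;
        double prevStep = 0.0;
        double prevError = Double.MAX_VALUE;

        for (int i = 0; i < iterations; i++) {
            double currentError = error(w);
            double grad = gradient(w);
            double signChange = grad * prevGradient;

            if (signChange > 0) {
                // Same direction as last time: grow the step and keep going.
                delta = Math.min(delta * ETA_PLUS, DELTA_MAX);
                prevStep = -Math.signum(grad) * delta;
                w += prevStep;
                prevGradient = grad;
            } else if (signChange < 0) {
                // Overshot a minimum: shrink the step, and undo the last
                // move only if the error actually got worse (the "i" in iRPROP+).
                delta = Math.max(delta * ETA_MINUS, DELTA_MIN);
                if (currentError > prevError) {
                    w -= prevStep;
                }
                prevGradient = 0.0;  // forces the plain-step branch next time
            } else {
                // First iteration or right after a sign change: plain step.
                prevStep = -Math.signum(grad) * delta;
                w += prevStep;
                prevGradient = grad;
            }
            prevError = currentError;
        }
        return w;
    }

    public static void main(String[] args) {
        System.out.println("w after 200 iterations: " + minimize(0.0, 200));
    }
}
```

Because the step size adapts per weight and ignores the gradient's magnitude, there is no learning rate to tune, which fits the experience of getting a low error without fiddling with settings.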

Test:
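
When testing, the network's output is a vector with one value per part of speech, and the prediction is simply the category with the highest value. A minimal sketch of that decoding step, assuming the output is available as a plain `double[]` (for example the array returned by Encog's `MLData.getData()`); the stand-in enum and the `mostProbable` helper are hypothetical.

```java
class PredictionSketch {

    /** Stand-in for the internal enum (hypothetical categories). */
    enum PartOfSpeech { NOUN, VERB, ADJECTIVE, ADVERB, PRONOUN, DETERMINER, PREPOSITION, CONJUNCTION, OTHER }

    /** Returns the category whose output neuron has the highest activation. */
    static PartOfSpeech mostProbable(double[] output) {
        PartOfSpeech[] all = PartOfSpeech.values();
        if (output.length != all.length) {
            throw new IllegalArgumentException(
                    "expected " + all.length + " outputs but got " + output.length);
        }
        int best = 0;
        for (int i = 1; i < output.length; i++) {
            if (output[i] > output[best]) {
                best = i;
            }
        }
        return all[best];
    }

    public static void main(String[] args) {
        // Made-up activations, only to show the call.
        double[] output = {0.05, 0.70, 0.10, 0.02, 0.03, 0.04, 0.03, 0.02, 0.01};
        System.out.println("Predicted next part of speech: " + mostProbable(output));
    }
}
```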