Sunday, February 16, 2014

Local Wikipedia with Solr and Spring Data

Continuing with my little AI / Machine Learning research project... I wanted a decent-sized repository of English text that was not in a complete mess, like a large percentage of the data on the internet. I figured I would try Wikipedia, but what to do with about 40GB of XML? How do I work with and query all that data? Based on a recent implementation at work, where we load something like 200 000 000 records into a Solr cache, I figured Solr would be the way to go, so this is an example of my basic implementation.

Required for this example:

Wikipedia download (warning: it is a 9.9GB file that extracts to about 42GB)
Spring Data (Great Blog / Examples on Spring Data:  Petri Kainulainen's blog)

All the code and unit tests for this post are in my blog GitHub repo.

When setting up Solr from scratch, you can have a look at Solr's wiki or documentation; their documentation is pretty good. There is also an example of importing Wikipedia here. I started with that and made some minor modifications.

For this specific example the Solr config needed (/conf):
For this example (and in the config files below), the paths are:
Solr home: /Development/Solr
Index / Data: /Development/Data/solr_data/wikipedia
Import File: /Development/Data/enwiki-latest-pages-articles.xml

The full import into Solr took about 48 hours on my old 2011 i5 iMac, and the index on my current setup is about 52GB.

Data Config for the import:


Solr Config:

The code for this ended up being quite clean. Spring Data Solr gives you 2 main interfaces, SolrIndexService and SolrCrudRepository. You simply extend / implement these 2, wrap that in a single interface, autowire it from a Spring Java context and you are good to go.





Next thing for me to look at for sourcing data is Spring Social.

Sunday, January 12, 2014

BYG (Bing, Yahoo, Google) Search Wrapper

One small section of my Aria project will be to interface with the current search engines out there. To do this I will require a module that gives me a consistent interface to work with the 3 main providers: Bing, Yahoo! and Google (and any future ones I may want to add). This is a basic example of that module.

First thing required is to set up accounts / projects and the like with the relevant providers.
I won't describe this process as they were all pretty well documented.

Bing Developer Center
Yahoo Developer Network
Google Developers Console

A couple tips for the above sites.

  • Bing: Setup both the web and synonym searches.
  • Yahoo: In the BOSS console, under manage account, put in a daily limit $ amount (or turn off the limit), as they only allow 1 free query a day... so only the first request works.
  • Google: At first it doesn't seem that you can set it up to search the whole web, but after creating your custom search engine you can select "Search the entire web but emphasize included sites", so don't worry about that.

All these providers allow for many options while searching (e.g. images, location, news, video etc.), however in this initial example I have limited it to just a pure and simple web search.

All the code will be available in my blog Github repository.

Going through the main points.
There is a BasicWebSearch interface, that takes the search term and returns SearchResults. 
SearchResults contains results in a map based on a result type enum. 
The implementations of BasicWebSearch namely: BingSearch, GoogleSearch and YahooSearch call the relevant search engine with the search term and then convert the results into a SearchResult. In the case of Yahoo and Bing, I map the JSON result to the SearchResult. Google however does that in their search client included in the dependencies.
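As a rough illustration of the shape described above, here is a minimal sketch in plain Java. Only the names BasicWebSearch, SearchResults and SearchResult come from the post; the ResultType values, the fields on SearchResult and the StubSearch class are assumptions made up for this example, and the stub returns a canned hit instead of calling a real search API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical result categories; this sketch only ever fills WEB.
enum ResultType { WEB, IMAGE, NEWS, VIDEO }

// One hit, reduced to fields assumed to be common to all providers.
class SearchResult {
    final String title;
    final String url;
    final String snippet;

    SearchResult(String title, String url, String snippet) {
        this.title = title;
        this.url = url;
        this.snippet = snippet;
    }
}

// Results held in a map keyed by the result type enum.
class SearchResults {
    private final Map<ResultType, List<SearchResult>> byType = new EnumMap<>(ResultType.class);

    void add(ResultType type, SearchResult result) {
        byType.computeIfAbsent(type, t -> new ArrayList<>()).add(result);
    }

    // Returns an empty list when nothing was stored for the given type.
    List<SearchResult> get(ResultType type) {
        return byType.getOrDefault(type, Collections.emptyList());
    }
}

// The common contract each provider-specific class implements.
interface BasicWebSearch {
    SearchResults search(String searchTerm);
}

// Stand-in provider (hypothetical) so the sketch runs without network access.
class StubSearch implements BasicWebSearch {
    @Override
    public SearchResults search(String searchTerm) {
        SearchResults results = new SearchResults();
        results.add(ResultType.WEB,
                new SearchResult("About " + searchTerm, "http://example.com/" + searchTerm, "stub snippet"));
        return results;
    }
}

class SearchDemo {
    public static void main(String[] args) {
        BasicWebSearch engine = new StubSearch();
        for (SearchResult hit : engine.search("solr").get(ResultType.WEB)) {
            System.out.println(hit.title + " -> " + hit.url);
        }
    }
}
```

Calling code only ever sees BasicWebSearch, so swapping one provider for another should not touch anything outside the implementation class.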

Now for the main code bits:

As this is just an example, I have included the search settings in the following class; be sure to replace them with the relevant values.

As both Bing and Yahoo use an HttpURLConnection, I figured I would centralise the handling of that. The only difference between the 2 is that Bing uses basic authentication, while for Yahoo I went with the OAuth implementation.
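To make the idea of centralised connection handling concrete, here is a small sketch using only the JDK (Java 8 or later for java.util.Base64). The class name SearchConnections, the method names and the timeout values are assumptions for illustration, not the code from the repository. It covers the basic authentication case only; the OAuth signing needed for Yahoo is left out.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Collections;
import java.util.Map;

class SearchConnections {

    // Builds an HTTP Basic "Authorization" header value: "Basic " + base64(user:password).
    static String basicAuthHeader(String user, String password) {
        String credentials = user + ":" + password;
        return "Basic " + Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
    }

    // Prepares a GET connection with shared defaults; provider-specific
    // headers (for example the Authorization header) are passed in by the caller.
    // No network traffic happens until the caller reads the response.
    static HttpURLConnection open(URL url, Map<String, String> headers) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        connection.setRequestMethod("GET");
        connection.setConnectTimeout(5000);   // assumed value, in milliseconds
        connection.setReadTimeout(10000);     // assumed value, in milliseconds
        connection.setRequestProperty("Accept", "application/json");
        for (Map.Entry<String, String> header : headers.entrySet()) {
            connection.setRequestProperty(header.getKey(), header.getValue());
        }
        return connection;
    }
}

class SearchConnectionsDemo {
    public static void main(String[] args) throws IOException {
        // Placeholder credentials and URL; replace with real values before use.
        String auth = SearchConnections.basicAuthHeader("user", "pass");
        HttpURLConnection connection = SearchConnections.open(
                new URL("http://example.com/search?q=solr"),
                Collections.singletonMap("Authorization", auth));
        System.out.println(connection.getRequestMethod() + " " + connection.getURL());
    }
}
```

With this split, a provider class only decides which headers to send; timeouts and the request setup live in one place.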






Google returns a whole bunch of extra information, so I extended the base SearchResult to add all of it, just in case I ever need it.

Maven Dependencies

Sunday, December 15, 2013

Predicting the next most probable part of speech.

I have recently been spending some of my spare time learning about AI and machine learning. After a couple of books, a bunch of tutorials and most of Andrew Ng's Coursera course, I decided enough with the theory, time for some real code.
During all my late night reading I also stumbled across the following: the Loebner Prize and its most recent winner Mitsuku, A.L.I.C.E. and Cleverbot. To be honest, maybe given my naivety in the field of AI, I expected much more from the above "AI" technology; most of these current chatbots are easily confused and honestly not very impressive.
Thankfully I also found Eric Horvitz's video of his AI personal assistant, which resonated with what I wanted to achieve with my ventures into AI.

So, with human interaction as a focus point, I started designing "Aria" (Artificially Intelligent Research Assistant, in reverse). Since most of my development experience is based in the Java Enterprise environment, it will be built on a distributed enterprise scale using the amazing technologies that environment offers, to mention some: Hadoop, Spark, Mahout, Solr, MySQL, Neo4J, Spring...

My moonshot/daydream goal is to better the interactions of people and computers, but in reality if I only learn to use, implement and enjoy all that is involved with AI and ML, I will see myself as successful.

So, for my first bit of functional machine learning...

Predicting the next most probable part of speech. One of the issues with natural language processing is that words used in different contexts end up having different meanings and synonyms. To try to assist with this, I figured I would train a neural network on the relevant parts of speech, and then use that to assist in understanding user-submitted text.

The full code for this example is available on GitHub.

I used a number of Java open source libraries for this:
Stanford NLP
Google Guava

I used a dataset of 29 000 English sentences that I sourced from a bunch of websites and open corpora. I won't be sharing those as I have no clue what the state of the copyright is, so unfortunately to recreate this you'd need to source your own data.

For the neural network implementation, I tried both Neuroph and Encog. Neuroph got my attention first with its great UI, which let me experiment with my neural network visually in the beginning, but as soon as I created my training data, which ended up being about 300MB of 0's and 1's, it fell over and didn't allow me to use it. I then began looking at Encog again, as I had used it initially when just starting to read about ML and AI.

When using Neuroph in code it worked with the dataset, but only with BackPropagation; the ResilientPropagation implementation never seemed to return.
So I ended up much preferring Encog. Its resilient propagation implementation (iRPROP+) worked well and reduced the network error to about 0.018 in under 100 iterations, without me having to fine-tune the settings and network architecture.

How this works: I take text data and use the Stanford NLP library to generate a list of the parts of speech in the document. I translate their annotations into an internal enum, and then use that to build up a training data set. I currently persist that to file, just to save some time while testing. I then train and persist the neural network and test it.
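The step from a list of tags to training data can be sketched roughly as follows. This is a simplified illustration in plain Java, not the code from the repository: the reduced PartOfSpeech enum, the mapping from Penn Treebank tags (the tag set the Stanford tagger emits for English), the window size of 2 and the one-hot encoding are all assumptions made for the example. It would also explain training data that is mostly 0's and 1's.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, heavily reduced tag set; a real enum would cover many more tags.
enum PartOfSpeech { NOUN, VERB, ADJECTIVE, DETERMINER, OTHER }

class PosTrainingData {

    // Maps a Penn Treebank tag such as "NN" or "VBZ" to the internal enum.
    static PartOfSpeech fromPennTag(String tag) {
        if (tag.startsWith("NN")) return PartOfSpeech.NOUN;
        if (tag.startsWith("VB")) return PartOfSpeech.VERB;
        if (tag.startsWith("JJ")) return PartOfSpeech.ADJECTIVE;
        if (tag.equals("DT")) return PartOfSpeech.DETERMINER;
        return PartOfSpeech.OTHER;
    }

    // One-hot encodes a tag: 1.0 at the tag's ordinal, 0.0 everywhere else.
    static double[] oneHot(PartOfSpeech pos) {
        double[] vector = new double[PartOfSpeech.values().length];
        vector[pos.ordinal()] = 1.0;
        return vector;
    }

    // Slides a window over the tags. Each pair holds the input (the window's
    // one-hot vectors concatenated) at index 0 and the ideal output (the
    // one-hot vector of the tag that follows the window) at index 1.
    static List<double[][]> buildPairs(List<PartOfSpeech> tags, int window) {
        int size = PartOfSpeech.values().length;
        List<double[][]> pairs = new ArrayList<>();
        for (int i = 0; i + window < tags.size(); i++) {
            double[] input = new double[window * size];
            for (int j = 0; j < window; j++) {
                System.arraycopy(oneHot(tags.get(i + j)), 0, input, j * size, size);
            }
            pairs.add(new double[][] { input, oneHot(tags.get(i + window)) });
        }
        return pairs;
    }
}

class PosTrainingDemo {
    public static void main(String[] args) {
        // Tags for a phrase like "the quick fox jumps", written out by hand here
        // instead of being produced by the tagger.
        List<PartOfSpeech> tags = new ArrayList<>();
        for (String tag : new String[] { "DT", "JJ", "NN", "VBZ" }) {
            tags.add(PosTrainingData.fromPennTag(tag));
        }
        for (double[][] pair : PosTrainingData.buildPairs(tags, 2)) {
            System.out.println(Arrays.toString(pair[0]) + " -> " + Arrays.toString(pair[1]));
        }
    }
}
```

The input and ideal arrays are in the row format most neural network libraries accept, so they could be collected into two double[][] matrices before training.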

The Parts of Speech Enum:

The creation of the training data:

Train the network:


Sunday, October 13, 2013

Setting up multiple versions of Python on Ubuntu

I recently switched from using a Mac back to a PC. I had originally planned to use both Windows and Linux via dual-boot, but having purchased a Radeon, and with Ubuntu not even starting from the bootable USB, I decided to try running my Python development environment on Windows. After playing with Python on Windows, I found it quite tedious to have both a 2.7.5 and a 3.3.2 environment. I also didn't like having to rely on prebuilt installers for a 'pain'-free install, since trying to compile some of the libs with the required C++ compiler is an even bigger pain.

So I went with a colleague's suggestion of VMware Player 6 and installed Ubuntu.

After breaking a couple installs and recreating VMs left and right, I finally have a process to install and work with multiple versions of Python.

First up, get a whole bunch of dependencies:
sudo apt-get install python-dev build-essential  
sudo apt-get install python-pip
sudo apt-get install libsqlite3-dev sqlite3
sudo apt-get install libreadline-dev libncurses5-dev 
sudo apt-get install libssl1.0.0 tk8.5-dev zlib1g-dev liblzma-dev
sudo apt-get build-dep python2.7
sudo apt-get build-dep python3.3

sudo pip install virtualenv
sudo pip install virtualenvwrapper

Add the virtualenvwrapper settings to ~/.bashrc:
export WORKON_HOME="$HOME/.virtualenvs"
source /usr/local/bin/

Then for Python 2.7:
sudo mkdir /opt/python2.7.5

tar xvfz Python-2.7.5.tgz
cd Python-2.7.5/
./configure --prefix=/opt/python2.7.5
sudo make install

mkvirtualenv --python /opt/python2.7.5/bin/python2 v-2.7.5

Then for Python 3.3:
sudo mkdir /opt/python3.3.2

tar xvfz Python-3.3.2.tgz
cd Python-3.3.2
./configure --prefix=/opt/python3.3.2
sudo make install

mkvirtualenv --python /opt/python3.3.2/bin/python3 v-3.3.2

To view the virtual environments:
lsvirtualenv

To change between them:
workon  [env name] e.g. v-3.3.2

Then to install some of the major scientific and machine learning related packages:
pip install numpy
pip install ipython[all]
pip install cython
sudo apt-get build-dep python-scipy
pip install scipy
pip install matplotlib
pip install scikit-learn
pip install pandas

To stop working on a particular version:
deactivate

Sunday, September 15, 2013

Wordle... so nicely done

Discovered Wordle this morning, pointed to my blog... guess my recent posts really haven't been about java much :)

Wordle: My blog - recent posts
