Wednesday, July 3, 2013

Mini Search Engine - Just the basics, using Neo4j, Crawler4j, Graphstream and Encog

Continuing to chapter 4 of Programming Collective Intelligence (PCI), which covers implementing a search engine.
I may have bitten off a little more than I should have in one exercise. Instead of using the normal relational database construct used in the book, I figured I had always wanted to have a look at Neo4j, so now was the time. Just to say, this isn't necessarily the ideal use case for a graph DB, but how hard could it be to kill 3 birds with 1 stone?

Working through the tutorials while trying to reset my SQL Server / Oracle mindset took a little longer than expected, but thankfully there are some great resources around Neo4j.

Just a couple:
neo4j - learn
Graph theory for busy developers
Graphdatabases

Since I just wanted to run this as a little exercise, I decided to go for an in-memory implementation and not run it as a service on my machine. In hindsight this was probably a mistake; the tools and web interface would have helped me visualise my data graph quicker in the beginning.

As you can only have 1 writable instance of the in-memory implementation, I made a little double-checked-locking singleton factory to create and clear the DB.
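A minimal sketch of such a factory, assuming the Neo4j 1.9-era embedded GraphDatabaseFactory API and a throwaway store path:

```java
// Double-checked-locking singleton around an embedded Neo4j instance.
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public final class GraphDb {

    private static volatile GraphDatabaseService instance;

    private GraphDb() {
    }

    public static GraphDatabaseService getInstance() {
        GraphDatabaseService result = instance;
        if (result == null) {                          // first check, no lock
            synchronized (GraphDb.class) {
                result = instance;
                if (result == null) {                  // second check, locked
                    result = new GraphDatabaseFactory()
                            .newEmbeddedDatabase("target/graphdb");
                    instance = result;
                }
            }
        }
        return result;
    }

    public static synchronized void clear() {
        if (instance != null) {
            instance.shutdown();                       // drop; recreated on next use
            instance = null;
        }
    }
}
```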


Then, using Crawler4j, I created a graph of all the URLs starting with my blog, their relationships to other URLs, and all the words and indexes of the words that those URLs contain.
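A hedged sketch of the crawler side, assuming the crawler4j 3.x-era API; the Neo4j persistence is only described in comments:

```java
// Visits pages under the blog and records the page text and outgoing links.
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class BlogCrawler extends WebCrawler {

    @Override
    public boolean shouldVisit(WebURL url) {
        // restrict the crawl to pages starting at the blog
        return url.getURL().toLowerCase().startsWith("http://www.briandupreez.net");
    }

    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData html = (HtmlParseData) page.getParseData();
            String text = html.getText();
            // create a node for this URL, word/index nodes from 'text', and a
            // LINKS_TO relationship per outgoing link (Neo4j code elided here)
            for (WebURL link : html.getOutgoingUrls()) {
                System.out.println(url + " -> " + link.getURL());
            }
        }
    }
}
```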

After the data was collected, I could query it and perform the functions of a search engine. For this I decided to use Java futures, as it was another thing I had only read about and not yet implemented. In my day-to-day working environment we use WebLogic / CommonJ work managers within the application server to perform the same task.
I then went about creating a task for each of the following: word frequency, document location, PageRank and a neural network (with fake input / training data) to rank the pages returned based on the search criteria. All the code is in my public GitHub blog repo.
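A rough sketch of the futures pattern used, with a stub task standing in for the real Neo4j-backed ones:

```java
// Run the ranking tasks concurrently with an ExecutorService and Futures;
// each task returns a url -> score map and the results are summed.
import java.util.*;
import java.util.concurrent.*;

public class SearchRanker {

    // Placeholder task: the real word frequency, document location, PageRank
    // and neural net tasks would each be a Callable like this.
    static class WordFrequencyTask implements Callable<Map<String, Double>> {
        private final String query;

        WordFrequencyTask(String query) {
            this.query = query;
        }

        @Override
        public Map<String, Double> call() {
            Map<String, Double> scores = new HashMap<>();
            scores.put("http://www.briandupreez.net", 1.0); // stub score
            return scores;
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<Map<String, Double>>> futures = new ArrayList<>();
        futures.add(pool.submit(new WordFrequencyTask("neo4j")));
        // ... submit the document location, PageRank and neural net tasks likewise

        // combine the scores from each task as the Futures complete
        Map<String, Double> combined = new HashMap<>();
        for (Future<Map<String, Double>> future : futures) {
            for (Map.Entry<String, Double> entry : future.get().entrySet()) {
                Double current = combined.get(entry.getKey());
                combined.put(entry.getKey(),
                        current == null ? entry.getValue() : current + entry.getValue());
            }
        }
        pool.shutdown();
        System.out.println(combined);
    }
}
```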

Disclaimer: The neural network task either didn't have enough data to be effective, or I implemented the data normalisation incorrectly, so it is currently not very useful; I'll return to it once I have completed the journey through the whole PCI book.

The one task worth sharing was the PageRank one. I quickly read some of the theory for it, decided I am not that clever, and went searching for a library that had it implemented. I discovered GraphStream, a wonderful open-source project that does a WHOLE lot more than just PageRank; check out their video.

From there it was simple to implement the PageRank task for this exercise.
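A small example of the gs-algo PageRank algorithm on a toy directed graph (assuming the org.graphstream.algorithm.PageRank API):

```java
// Build a tiny directed "links to" graph and compute PageRank over it.
import org.graphstream.algorithm.PageRank;
import org.graphstream.graph.Graph;
import org.graphstream.graph.Node;
import org.graphstream.graph.implementations.SingleGraph;

public class PageRankExample {

    public static void main(String[] args) {
        Graph graph = new SingleGraph("pages");
        graph.addNode("A");
        graph.addNode("B");
        graph.addNode("C");
        graph.addEdge("AB", "A", "B", true);   // true = directed edge
        graph.addEdge("BC", "B", "C", true);
        graph.addEdge("CA", "C", "A", true);

        PageRank pageRank = new PageRank();
        pageRank.init(graph);
        pageRank.compute();

        for (Node node : graph) {
            System.out.printf("%s -> %.4f%n", node.getId(), pageRank.getRank(node));
        }
    }
}
```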



In between all of this I found a great implementation of sorting a map by values on Stack Overflow.
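The approach from that answer, roughly: sort the entries by value, then rebuild a LinkedHashMap to preserve the sorted order:

```java
// Generic sort-map-by-value utility (descending, for url -> score maps).
import java.util.*;

public final class MapUtil {

    public static <K, V extends Comparable<? super V>> Map<K, V> sortByValue(Map<K, V> map) {
        List<Map.Entry<K, V>> entries = new LinkedList<>(map.entrySet());
        Collections.sort(entries, new Comparator<Map.Entry<K, V>>() {
            @Override
            public int compare(Map.Entry<K, V> a, Map.Entry<K, V> b) {
                return b.getValue().compareTo(a.getValue()); // descending
            }
        });
        Map<K, V> sorted = new LinkedHashMap<>();   // keeps insertion order
        for (Map.Entry<K, V> entry : entries) {
            sorted.put(entry.getKey(), entry.getValue());
        }
        return sorted;
    }
}
```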

The Maven dependencies used to implement all of this:
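Roughly the following; the artifact coordinates are the public ones for these libraries, but the version numbers are my assumptions for mid-2013 releases:

```xml
<dependencies>
    <!-- embedded graph database -->
    <dependency>
        <groupId>org.neo4j</groupId>
        <artifactId>neo4j</artifactId>
        <version>1.9</version>
    </dependency>
    <!-- web crawler -->
    <dependency>
        <groupId>edu.uci.ics</groupId>
        <artifactId>crawler4j</artifactId>
        <version>3.5</version>
    </dependency>
    <!-- graph algorithms (PageRank); pulls in gs-core -->
    <dependency>
        <groupId>org.graphstream</groupId>
        <artifactId>gs-algo</artifactId>
        <version>1.2</version>
    </dependency>
    <!-- machine learning -->
    <dependency>
        <groupId>org.encog</groupId>
        <artifactId>encog-core</artifactId>
        <version>3.2.0</version>
    </dependency>
</dependencies>
```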


Now to chapter 5 of PCI... Optimisation.

Monday, July 1, 2013

A couple useful Oracle XE admin commands

I struggled a bit trying to get my local Oracle XE up and running after a couple of months of being dormant.

Firstly: Oracle XE 11g sets password expiry by default. Quite annoying...
So my system account was locked.
To unlock it I did the following in the Windows command prompt:
set ORACLE_SID=XE
set ORACLE_HOME=<ORACLE_PATH> (D:\OracleXe\app\oracle\product\11.2.0\server in my case)
sqlplus / as sysdba
ALTER USER SYSTEM IDENTIFIED BY password;

If the account is locked run:
ALTER USER system ACCOUNT UNLOCK;


Then, to ensure that it does not expire again:

ALTER PROFILE DEFAULT LIMIT
FAILED_LOGIN_ATTEMPTS UNLIMITED
PASSWORD_LIFE_TIME UNLIMITED;

One more thing I needed to change, since I had installed a local Tomcat, is the default HTTP port for XE.
This can be done as follows, where 3010 is the new port:
Exec DBMS_XDB.SETHTTPPORT(3010)

Sunday, June 16, 2013

Blog Categorisation using Encog, ROME, JSoup and Google Guava

Continuing with Programming Collective Intelligence (PCI), the next exercise was using the distance scores to pigeonhole a list of blogs based on the words used within the relevant blog.

I had already found Encog as the framework for the AI / machine learning algorithms; for this exercise I needed an RSS reader and an HTML parser.
The 2 libraries I ended up using were:
ROME
JSoup

For general other utilities and collection manipulations I used:
Google Guava

I kept the list of blogs short and included some of the software bloggers I follow, just to make testing quick. I had to alter the percentages a little from the implementation in PCI, but still got the desired result.

Blogs Used:

http://blog.guykawasaki.com/index.rdf
http://blog.outer-court.com/rss.xml
http://flagrantdisregard.com/index.php/feed/
http://gizmodo.com/index.xml
http://googleblog.blogspot.com/rss.xml
http://radar.oreilly.com/index.rdf
http://www.wired.com/rss/index.xml
http://feeds.feedburner.com/codinghorror
http://feeds.feedburner.com/joelonsoftware
http://martinfowler.com/feed.atom
http://www.briandupreez.net/feeds/posts/default

For the implementation I just went with a main class and a reader class:
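The reader, roughly; a hedged sketch assuming ROME's SyndFeedInput / XmlReader API, with Guava's Multiset counting words per blog:

```java
// ROME pulls the feed, JSoup strips the HTML, Guava counts word frequencies.
import com.google.common.base.Splitter;
import com.google.common.collect.HashMultiset;
import com.google.common.collect.Multiset;
import com.sun.syndication.feed.synd.SyndEntry;
import com.sun.syndication.feed.synd.SyndFeed;
import com.sun.syndication.io.SyndFeedInput;
import com.sun.syndication.io.XmlReader;
import org.jsoup.Jsoup;

import java.net.URL;

public class FeedReader {

    public Multiset<String> readWordCounts(String feedUrl) throws Exception {
        SyndFeed feed = new SyndFeedInput().build(new XmlReader(new URL(feedUrl)));
        Multiset<String> words = HashMultiset.create();
        for (Object entryObj : feed.getEntries()) {
            SyndEntry entry = (SyndEntry) entryObj;
            if (entry.getDescription() == null) {
                continue;
            }
            // strip HTML tags, lower-case, and split on non-word characters
            String text = Jsoup.parse(entry.getDescription().getValue()).text();
            for (String word : Splitter.onPattern("\\W+").omitEmptyStrings()
                    .split(text.toLowerCase())) {
                words.add(word);
            }
        }
        return words;
    }
}
```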


Main:


The Results:


*** Cluster 1 ***
[http://www.briandupreez.net/feeds/posts/default]
*** Cluster 2 ***
[http://blog.guykawasaki.com/index.rdf]
[http://radar.oreilly.com/index.rdf]
[http://googleblog.blogspot.com/rss.xml]
[http://blog.outer-court.com/rss.xml]
[http://gizmodo.com/index.xml]
[http://flagrantdisregard.com/index.php/feed/]
[http://www.wired.com/rss/index.xml]
*** Cluster 3 ***
[http://feeds.feedburner.com/joelonsoftware]
[http://feeds.feedburner.com/codinghorror]
[http://martinfowler.com/feed.atom]

Wednesday, June 12, 2013

Regex POSIX expressions

I can't believe I only found out about these today; I obviously don't use regular expressions enough.

  POSIX Brackets

Quick Reference:

Each entry lists the POSIX class, its description, and the ASCII, Unicode, shorthand and Java equivalents:

[:alnum:] (Alphanumeric characters): ASCII [a-zA-Z0-9], Unicode [\p{L&}\p{Nd}], Java \p{Alnum}
[:alpha:] (Alphabetic characters): ASCII [a-zA-Z], Unicode \p{L&}, Java \p{Alpha}
[:ascii:] (ASCII characters): ASCII [\x00-\x7F], Unicode \p{InBasicLatin}, Java \p{ASCII}
[:blank:] (Space and tab): ASCII [ \t], Unicode [\p{Zs}\t], Java \p{Blank}
[:cntrl:] (Control characters): ASCII [\x00-\x1F\x7F], Unicode \p{Cc}, Java \p{Cntrl}
[:digit:] (Digits): ASCII [0-9], Unicode \p{Nd}, shorthand \d, Java \p{Digit}
[:graph:] (Visible characters, i.e. anything except spaces and control characters): ASCII [\x21-\x7E], Unicode [^\p{Z}\p{C}], Java \p{Graph}
[:lower:] (Lowercase letters): ASCII [a-z], Unicode \p{Ll}, Java \p{Lower}
[:print:] (Visible characters and spaces, i.e. anything except control characters): ASCII [\x20-\x7E], Unicode \P{C}, Java \p{Print}
[:punct:] (Punctuation and symbols): ASCII [!"#$%&'()*+,\-./:;<=>?@[\\\]^_`{|}~], Unicode [\p{P}\p{S}], Java \p{Punct}
[:space:] (All whitespace characters, including line breaks): ASCII [ \t\r\n\v\f], Unicode [\p{Z}\t\r\n\v\f], shorthand \s, Java \p{Space}
[:upper:] (Uppercase letters): ASCII [A-Z], Unicode \p{Lu}, Java \p{Upper}
[:word:] (Word characters: letters, numbers and underscores): ASCII [A-Za-z0-9_], Unicode [\p{L}\p{N}\p{Pc}], shorthand \w
[:xdigit:] (Hexadecimal digits): ASCII [A-Fa-f0-9], Unicode [A-Fa-f0-9], Java \p{XDigit}
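A quick check of a few of the Java equivalents from the reference above:

```java
// Verify some of Java's built-in POSIX character classes with Pattern.
import java.util.regex.Pattern;

public class PosixDemo {
    public static void main(String[] args) {
        System.out.println(Pattern.matches("\\p{Alnum}+", "abc123"));   // true
        System.out.println(Pattern.matches("\\p{Punct}+", "?!.,"));     // true
        System.out.println(Pattern.matches("\\p{XDigit}+", "BEEF42"));  // true
        System.out.println(Pattern.matches("\\p{Space}+", "abc"));      // false
    }
}
```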

Sunday, May 19, 2013

Some Java based AI Frameworks : Encog, JavaML, Weka

While working through Programming Collective Intelligence I found myself spending a lot of time translating the Python code to Java; being typically impatient at my slow progress, I went searching for alternatives.

I found 3:
Encog - Heaton Research
JavaML
Weka

This is by no means an in-depth investigation; I simply downloaded what the relevant projects had available and quickly compared what there was for me to learn from and implement AI-related samples / applications with.
 

Encog

Advantages

  1. YouTube video tutorials
  2. E-books available for both Java and .NET
  3. C# implementation
  4. Clojure wrapper
  5. Seems active

Disadvantages

  1. Quite a large code base to wrap your head around; this is probably due to the size of the domain we are looking at, but it is still much more intimidating to start off with vs. the JavaML library.

JavaML


Advantages

  1. Seems reasonably stable
  2. Well documented source code
  3. Well defined simple algorithm implementations

Disadvantages

  1. Lacks the tutorial support for an AI newbie like myself

Weka


Advantages

Disadvantages

  1. Could not install the Weka 3-7-9 dmg... it kept giving me an "is damaged and can't be opened" error, so I left it there; as Sweet Brown says: "Ain't nobody got time for that".

So, no surprise, I went with Encog and started on their video tutorials...
A couple of hours later: my first JUnit test understanding, training and testing a Hopfield neural network using the Encog libs.
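That first test, roughly; a minimal sketch assuming the Encog 3 HopfieldNetwork and BiPolarNeuralData APIs:

```java
// Store one pattern in a Hopfield network, then recall it from noisy input.
import org.encog.ml.data.specific.BiPolarNeuralData;
import org.encog.neural.thermal.HopfieldNetwork;
import org.junit.Assert;
import org.junit.Test;

public class HopfieldNetworkTest {

    @Test
    public void recallsStoredPatternFromNoisyInput() {
        // train a 4-neuron network with a single pattern: T T F F
        HopfieldNetwork hopfield = new HopfieldNetwork(4);
        hopfield.addPattern(new BiPolarNeuralData(new boolean[]{true, true, false, false}));

        // present a distorted version and let the network settle
        hopfield.setCurrentState(new BiPolarNeuralData(new boolean[]{true, false, false, false}));
        hopfield.runUntilStable(10);

        BiPolarNeuralData result = (BiPolarNeuralData) hopfield.getCurrentState();
        Assert.assertTrue(result.getBoolean(0));
        Assert.assertTrue(result.getBoolean(1));
        Assert.assertFalse(result.getBoolean(2));
        Assert.assertFalse(result.getBoolean(3));
    }
}
```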




Saturday, May 11, 2013

Similarity Score Algorithms

As per my previous post, I am working through Programming Collective Intelligence. The first couple of algorithms described in this book are for finding a similarity score; the methods worked through are Euclidean distance and the Pearson correlation coefficient. The Manhattan distance score is also mentioned, but from what I could find it seems to be just the sum of the absolute differences of the coordinates, instead of the squared differences used in Euclidean distance.

I worked through this and wrote/found some Java equivalents for future use:

Euclidean Distance:
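Something like the following, using the book's 1 / (1 + distance) normalisation so identical inputs score 1.0:

```java
// Euclidean-distance-based similarity for two equal-length score arrays.
public final class Euclidean {

    public static double similarity(double[] a, double[] b) {
        double sumOfSquares = 0.0;
        for (int i = 0; i < a.length; i++) {
            sumOfSquares += Math.pow(a[i] - b[i], 2);  // squared differences
        }
        return 1.0 / (1.0 + Math.sqrt(sumOfSquares));  // 1.0 means identical
    }
}
```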

Pearson Correlation Coefficient:
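And the Pearson correlation coefficient for two equal-length arrays, returning a value between -1 and 1:

```java
// Pearson correlation via the running-sums formulation.
public final class Pearson {

    public static double correlation(double[] x, double[] y) {
        int n = x.length;
        double sumX = 0, sumY = 0, sumXY = 0, sumX2 = 0, sumY2 = 0;
        for (int i = 0; i < n; i++) {
            sumX += x[i];
            sumY += y[i];
            sumXY += x[i] * y[i];
            sumX2 += x[i] * x[i];
            sumY2 += y[i] * y[i];
        }
        double numerator = sumXY - (sumX * sumY / n);
        double denominator = Math.sqrt((sumX2 - sumX * sumX / n)
                * (sumY2 - sumY * sumY / n));
        return denominator == 0 ? 0 : numerator / denominator;
    }
}
```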

Friday, May 3, 2013

Venture into AI, Machine Learning and all those algorithms that go with it.

It's been 4 months since my last blog entry; I took it easy for a little while, as we all need to do from time to time... but before long my brain got these nagging ideas and questions:

How hard can AI and Machine learning actually be?
How does it work?
I bet people are just over-complicating it...
How are they currently trying to solve it?
Is it actually that difficult?
Could it be done differently?

So off I went searching the internet; some of the useful sites I came across:
http://www.ai-junkie.com
Machine-learning Stanford Video course
Genetic algorithm example

I also ended up buying 2 books on Amazon:

Firstly, from many different recommendations:
Programming Collective Intelligence

I will be "working" through this book. While reading I will be translating, implementing and blogging the algorithms defined (in Python) as well as any mentioned that I will research separately in Java. Mainly for my own understanding and for the benefit of reusing them later, and an excuse to play with Java v7.

However, since I want to practically work through that book, I needed another for some "light" reading before sleep. I found one via an article on MIT Technology Review about deep learning; a bit that caught my eye was:


For all the advances, not everyone thinks deep learning can move artificial intelligence toward something rivaling human intelligence. Some critics say deep learning and AI in general ignore too much of the brain’s biology in favor of brute-force computing.
One such critic is Jeff Hawkins, founder of Palm Computing, whose latest venture, Numenta, is developing a machine-learning system that is biologically inspired but does not use deep learning. Numenta’s system can help predict energy consumption patterns and the likelihood that a machine such as a windmill is about to fail. Hawkins, author of On Intelligence, a 2004 book on how the brain works and how it might provide a guide to building intelligent machines, says deep learning fails to account for the concept of time. Brains process streams of sensory data, he says, and human learning depends on our ability to recall sequences of patterns: when you watch a video of a cat doing something funny, it’s the motion that matters, not a series of still images like those Google used in its experiment. “Google’s attitude is: lots of data makes up for everything,” Hawkins says.



So the second book I purchased: On Intelligence.
So far (only up to page 54) 2 things from this book have embedded themselves in my brain:
"Complexity is a symptom of confusion, not a cause" - so so common in the software development world.
&
"AI defenders also like to point out historical instances in which the engineering solution differs radically from natures version"
...
"Some philosophers of mind have taken a shine to the metaphor of the cognitive wheel, that is, an AI solution to some problem that although entirely different from how the brain does it is just as good"

Jeff himself believes we need to look deeper into the brain for a better understanding, but could it be possible to take a completely different approach to solve the "intelligence" problem?

Thursday, January 3, 2013

Weblogic JNDI & Security Contexts

Quite often, when using multiple services / EJBs from different internal teams, we have run into WebLogic context / security errors. We always deduced the issue was how WebLogic handles its contexts; I finally found WebLogic's explanation in their documentation:

JNDI Contexts and Threads

When you create a JNDI Context with a username and password, you associate a user with a thread. When the Context is created, the user is pushed onto the context stack associated with the thread. Before starting a new Context on the thread, you must close the first Context so that the first user is no longer associated with the thread. Otherwise, users are pushed down in the stack each time a new context is created. This is not an efficient use of resources and may result in the incorrect user being returned by ctx.lookup() calls. This scenario is illustrated by the following steps:
  1. Create a Context (with username and credential) called ctx1 for user1. In the process of creating the context, user1 is associated with the thread and pushed onto the stack associated with the thread. The current user is now user1.
  2. Create a second Context (with username and credential) called ctx2 for user2. At this point, the thread has a stack of users associated with it. User2 is at the top of the stack and user1 is below it, so user2 is now the current user.
  3. If you do a ctx1.lookup("abc") call, user2 is used as the identity rather than user1, because user2 is at the top of the stack. To get the expected result, which is to have the ctx1.lookup("abc") call performed as user1, you need to do a ctx2.close() call first. The ctx2.close() call removes user2 from the stack associated with the thread, so that a ctx1.lookup("abc") call now uses user1 as expected (see the sketch after this list).
  4. Note: When the weblogic.jndi.enableDefaultUser flag is enabled, there are two situations where a close() call does not remove the current user from the stack and this can cause JNDI context problems. For information on how to avoid JNDI context problems, see How to Avoid Potential JNDI Context Problems.
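A minimal sketch of steps 1 to 3, assuming the standard weblogic.jndi.WLInitialContextFactory and a local t3 URL; the user names and the "abc" lookup name are placeholders:

```java
// Demonstrates the context stacking: close ctx2 before expecting lookups
// on ctx1 to run as user1.
import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.InitialContext;
import javax.naming.NamingException;

public class JndiContextStacking {

    private static Context createContext(String user, String password) throws NamingException {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "weblogic.jndi.WLInitialContextFactory");
        env.put(Context.PROVIDER_URL, "t3://localhost:7001");
        env.put(Context.SECURITY_PRINCIPAL, user);
        env.put(Context.SECURITY_CREDENTIALS, password);
        return new InitialContext(env);
    }

    public static void main(String[] args) throws NamingException {
        Context ctx1 = createContext("user1", "password1"); // user1 on the stack
        Context ctx2 = createContext("user2", "password2"); // user2 now on top

        ctx2.close();          // pop user2 off the thread's stack first ...
        ctx1.lookup("abc");    // ... so this lookup runs as user1, as expected
        ctx1.close();
    }
}
```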

How to Avoid Potential JNDI Context Problems

Issuing a close() call usually behaves as described in JNDI Contexts and Threads. However, the following is an exception to the expected behavior that occurs when the weblogic.jndi.enableDefaultUser flag is enabled:
Last Used
When using IIOP, an exception to expected behavior arises when there is one Context on the stack and that Context is removed by a close(). The identity of the last context removed from the stack determines the current identity of the user. This scenario is described in the following steps:
  1. Create a Context (with username and credential) called ctx1 for user1. In the process of creating the context, user1 is associated with the thread and stored in the stack, that is, the current identity is set to user1.
  2. Do a ctx1.close() call.
  3. Do a ctx1.lookup()call. The current identity is user1.
  4. Create a Context (with username and credential) called ctx2 for user2. In the process of creating the context, user2 is associated with the thread and stored in the stack, that is, the current identity is set to user2.
  5. Do a ctx2.close() call.
  6. Do a ctx2.lookup()call. The current identity is user2.

Link to the source Weblogic Docs: Weblogic JNDI

Wednesday, October 17, 2012

Setting up and playing with Apache Solr on Tomcat

A while back I had a little time to play with Solr, and was instantly blown away by the performance we could achieve on some of our bigger datasets.
Here are some of my initial setup and configuration learnings, to maybe help someone get it up and running a little faster.
Starting with setting both up on Windows.

Download and extract Apache Tomcat and Solr and copy into your working folders.
Tomcat Setup
If you want Tomcat as a service, install it using the following:
bin\service.bat install
Edit the Tomcat users file under conf:

If you are going to query Solr using international characters (>127) via HTTP GET, you must configure Tomcat to conform to the URI standard by accepting percent-encoded UTF-8: add URIEncoding="UTF-8" to the connector in conf/server.xml.
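For example (the other connector attributes shown are Tomcat defaults, not Solr-specific):

```xml
<!-- conf/server.xml: HTTP connector with percent-encoded UTF-8 enabled -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443"
           URIEncoding="UTF-8"/>
```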

Copy the contents of example\solr to your Solr home directory (D:\Java\apache-solr-3.6.0\home in my case).
Create a context fragment at $CATALINA_HOME/conf/Catalina/localhost/solr.xml pointing to your Solr home.
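Something along these lines, assuming solr.war sits next to the home directory (adjust both paths to your layout):

```xml
<!-- $CATALINA_HOME/conf/Catalina/localhost/solr.xml -->
<Context docBase="D:\Java\apache-solr-3.6.0\solr.war" debug="0" crossContext="true">
  <Environment name="solr/home" type="java.lang.String"
               value="D:\Java\apache-solr-3.6.0\home" override="true"/>
</Context>
```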

Start up Tomcat, log in, and deploy the solr.war.

Solr Setup
It should be available at http://localhost:8080/solr/admin/

To create a quick test using SolrJ that creates and reads data, grab the SolrJ Maven libs and write a JUnit test.

Adding data directly from the DB: first you need to add the relevant DB libs to the classpath, then create a data-config.xml. If you require custom fields, those can be specified under the fields tag in the schema.xml. In the solrconfig.xml, make sure to point to the data-config.xml; the import handler also has to be registered in the solrconfig.xml. Once that is all set up, a full import can be done with the following:
http://localhost:8080/solr/admin/dataimport?command=full-import

Then you should be good to go with some lightning fast data retrieval.
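A quick SolrJ smoke test, assuming the SolrJ 3.6-era HttpSolrServer API and the example schema's id / name fields:

```java
// Write one document to Solr, commit, and query it back.
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrInputDocument;

public class SolrSmokeTest {

    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8080/solr");

        // index a single test document
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "test-1");
        doc.addField("name", "solr smoke test");
        server.add(doc);
        server.commit();

        // query it back
        QueryResponse response = server.query(new SolrQuery("name:solr"));
        System.out.println(response.getResults());
    }
}
```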

Sunday, September 9, 2012

Android App : iBoincStats


A while back I did an iOS app iBoincStats which has since been downloaded about 2300 times.

I have recently submitted another tiny game to Apple, and in the doldrums that is the App Store approval process I set myself a little challenge: download, learn, write and publish iBoincStats for Android before the other application gets approved.

I have to give Android full credit: if you are a Java developer, developing for Android is really simple. I tried getting it all up and running a couple of years back, but with the emulator taking 20+ minutes to start up, I deleted it very quickly. This time, with the latest SDK and IntelliJ IDEA 11, it was just a little slower than the iOS environment and much more usable.

The default "look and feel" on Android really takes a lot more work to make it look as good an iOS app, I didn't really spend
enough time on that.
If anyone actually downloads it, I'll dedicate a little more time to it.
  

iBoincStats (For Android)

This is a simple stats client to view your BOINC project processing statistics.
Enter your cross project id and access your latest stats.
Some of the popular BOINC projects include:
Seti@home
climateprediction.net
Einstein@home
POEM@home
rosetta@home

More information regarding the BOINC project can be found at:
BOINC home
Wikipedia - Berkeley Open Infrastructure for Network Computing
Screen Shots:



Thursday, August 2, 2012

sun.* Packages

Came across this the other day when a coworker used sun.misc.BASE64Decoder. Something didn't feel right about the package name, but I had no justification for why he shouldn't use it. After a bit of searching I found the following link:
Sun Packages FAQ  

Quote:
The java.*, javax.* and org.* packages documented in the Java 2 Platform Standard Edition API Specification make up the official, supported, public interface.
 If a Java program directly calls only API in these packages, it will operate on all Java-compatible platforms, regardless of the underlying OS platform.


 The sun.* packages are not part of the supported, public interface. A Java program that directly calls into sun.* packages is not guaranteed to work on all Java-compatible platforms. In fact, such a program is not guaranteed to work even in future versions on the same platform.

Makes obvious sense now, but it was actually something I was not aware of.
