Sunday, June 16, 2013

Blog Categorisation using Encog, ROME, JSoup and Google Guava

Continuing with Programming Collective Intelligence (PCI), the next exercise was to use distance scores to pigeonhole a list of blogs based on the words used within each blog.
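To make "distance score" concrete, here is a minimal sketch of a Pearson-based distance between two blogs' word-count vectors, in the spirit of the PCI chapter. The class and method names are hypothetical and this is not the code from this post; it assumes both vectors list counts for the same words in the same order.

```java
final class PearsonDistance {

    /**
     * Returns 1 - Pearson correlation for two equal-length word-count vectors.
     * 0 means the blogs use words in the same proportions, 2 means opposite.
     */
    static double distance(double[] a, double[] b) {
        if (a.length != b.length || a.length == 0) {
            throw new IllegalArgumentException("vectors must be non-empty and the same length");
        }
        int n = a.length;
        double sumA = 0, sumB = 0, sumASq = 0, sumBSq = 0, sumProd = 0;
        for (int i = 0; i < n; i++) {
            sumA += a[i];
            sumB += b[i];
            sumASq += a[i] * a[i];
            sumBSq += b[i] * b[i];
            sumProd += a[i] * b[i];
        }
        double num = sumProd - (sumA * sumB / n);
        double den = Math.sqrt((sumASq - sumA * sumA / n) * (sumBSq - sumB * sumB / n));
        // Assumption for this sketch: a constant vector has no defined correlation,
        // so treat the pair as distance 0 rather than dividing by zero.
        if (!(den > 0)) {
            return 0.0;
        }
        return 1.0 - num / den;
    }

    public static void main(String[] args) {
        double[] blogA = {1, 2, 3};
        double[] blogB = {2, 4, 6};
        System.out.println(distance(blogA, blogB));
    }
}
```

Because the score is based on correlation rather than raw counts, a long blog and a short blog with the same word mix still come out close together.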

I had already found Encog as the framework for the AI / machine learning algorithms; for this exercise I also needed an RSS reader and an HTML parser.
The 2 libraries I ended up using were:
ROME (RSS reader)
JSoup (HTML parser)

For general utilities and collection manipulation I used:
Google Guava

I kept the list of blogs short and included some of the software bloggers I follow, just to make testing quick. I had to alter the percentages a little from the implementation in PCI, but still got the desired result.
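The percentages here presumably refer to the word-frequency bounds PCI uses to drop words that are too common or too rare. As a rough sketch of that filtering step (the class name, method name and the example bounds are all assumptions, not the code used for this post), a word is kept only if the fraction of blogs containing it falls strictly between a lower and an upper bound:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class WordFilter {

    /**
     * Keeps words whose share of blogs lies strictly between minFraction and maxFraction.
     * blogsContainingWord maps each word to the number of blogs it appears in.
     */
    static List<String> selectWords(Map<String, Integer> blogsContainingWord,
                                    int totalBlogs,
                                    double minFraction,
                                    double maxFraction) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : blogsContainingWord.entrySet()) {
            double fraction = (double) entry.getValue() / totalBlogs;
            if (fraction > minFraction && fraction < maxFraction) {
                kept.add(entry.getKey());
            }
        }
        Collections.sort(kept); // stable, predictable column order for the word vectors
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("the", 10);   // appears in every blog, too common
        counts.put("java", 4);   // appears in 40% of blogs
        counts.put("encog", 1);  // appears in 10% of blogs
        // Example bounds only; with a short blog list they need tuning.
        System.out.println(selectWords(counts, 10, 0.1, 0.5));
    }
}
```

With only a handful of blogs each word's fraction moves in large steps, which is one reason the bounds may need adjusting for a short list.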

Blogs Used:

For the implementation I just went with a main class and a reader class:


The Results:

*** Cluster 1 ***
*** Cluster 2 ***
*** Cluster 3 ***

Wednesday, June 12, 2013

Regex POSIX expressions

I can't believe I only found out about these today; I obviously don't use regular expressions enough.

POSIX Brackets

Quick Reference:

POSIX | Description | ASCII | Unicode | Shorthand | Java
----- | ----------- | ----- | ------- | --------- | ----
[:alnum:] | Alphanumeric characters | [a-zA-Z0-9] | [\p{L&}\p{Nd}] | | \p{Alnum}
[:alpha:] | Alphabetic characters | [a-zA-Z] | \p{L&} | | \p{Alpha}
[:ascii:] | ASCII characters | [\x00-\x7F] | \p{InBasicLatin} | | \p{ASCII}
[:blank:] | Space and tab | [ \t] | [\p{Zs}\t] | | \p{Blank}
[:cntrl:] | Control characters | [\x00-\x1F\x7F] | \p{Cc} | | \p{Cntrl}
[:graph:] | Visible characters (i.e. anything except spaces, control characters, etc.) | [\x21-\x7E] | [^\p{Z}\p{C}] | | \p{Graph}
[:lower:] | Lowercase letters | [a-z] | \p{Ll} | | \p{Lower}
[:print:] | Visible characters and spaces (i.e. anything except control characters, etc.) | [\x20-\x7E] | \P{C} | | \p{Print}
[:punct:] | Punctuation and symbols | [!"#$%&'()*+,\-./:;<=>?@[\\\]^_`{|}~] | [\p{P}\p{S}] | | \p{Punct}
[:space:] | All whitespace characters, including line breaks | [ \t\r\n\v\f] | [\p{Z}\t\r\n\v\f] | \s | \p{Space}
[:upper:] | Uppercase letters | [A-Z] | \p{Lu} | | \p{Upper}
[:word:] | Word characters (letters, numbers and underscores) | [A-Za-z0-9_] | [\p{L}\p{N}\p{Pc}] | \w |
[:xdigit:] | Hexadecimal digits | [A-Fa-f0-9] | [A-Fa-f0-9] | | \p{XDigit}
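As a quick illustration of the Java forms in the reference above, here is a small sketch using java.util.regex. Java does not accept the [[:alnum:]] bracket syntax, so the \p{...} forms are used instead, and by default they match US-ASCII only. The class and method names are made up for the example.

```java
import java.util.regex.Pattern;

final class PosixClassDemo {

    // \p{Alnum} is Java's form of [:alnum:]
    private static final Pattern ALNUM = Pattern.compile("\\p{Alnum}+");
    // \p{XDigit} is Java's form of [:xdigit:]
    private static final Pattern HEX = Pattern.compile("\\p{XDigit}+");
    // \p{Punct} is Java's form of [:punct:]
    private static final Pattern PUNCT = Pattern.compile("\\p{Punct}");

    /** True if the whole string consists of ASCII letters and digits. */
    static boolean isAlnum(String s) {
        return ALNUM.matcher(s).matches();
    }

    /** True if the whole string consists of hexadecimal digits. */
    static boolean isHex(String s) {
        return HEX.matcher(s).matches();
    }

    /** Removes every ASCII punctuation character from the string. */
    static String stripPunct(String s) {
        return PUNCT.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(isAlnum("abc123"));
        System.out.println(isHex("1aF"));
        System.out.println(stripPunct("Hello, world!"));
    }
}
```

Something like stripPunct is handy when turning blog text into a list of words, since punctuation attached to a word would otherwise make it count as a different word.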
