Sunday, June 16, 2013

Blog Categorisation using Encog, ROME, JSoup and Google Guava

Continuing with Programming Collective Intelligence (PCI), the next exercise was to use distance scores to pigeonhole a list of blogs based on the words used in each blog.
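To make "distance score" concrete, here is a minimal sketch of a Pearson-based distance between two word-count vectors, which is the measure the PCI chapter uses for blogs. The class name, the sample vectors and the choice to return 1.0 when a vector has no variation are my own assumptions for illustration; the Encog-based code may well use a different measure.

```java
final class PearsonDistance {

    /** Returns 1 - r, where r is the Pearson correlation of the two vectors. */
    static double distance(double[] a, double[] b) {
        if (a.length != b.length || a.length == 0) {
            throw new IllegalArgumentException("vectors must be non-empty and of equal length");
        }
        int n = a.length;
        double sumA = 0, sumB = 0, sumASq = 0, sumBSq = 0, sumAB = 0;
        for (int i = 0; i < n; i++) {
            sumA += a[i];
            sumB += b[i];
            sumASq += a[i] * a[i];
            sumBSq += b[i] * b[i];
            sumAB += a[i] * b[i];
        }
        double numerator = sumAB - (sumA * sumB / n);
        double denominator = Math.sqrt((sumASq - sumA * sumA / n) * (sumBSq - sumB * sumB / n));
        if (denominator == 0) {
            // Assumption: a vector with no variation is treated as uncorrelated.
            return 1.0;
        }
        return 1.0 - numerator / denominator;
    }

    public static void main(String[] args) {
        // Hypothetical word counts for three blogs over the same four words.
        double[] blogA = {3, 0, 1, 5};
        double[] blogB = {6, 0, 2, 10}; // same proportions as blogA
        double[] blogC = {0, 4, 3, 0};  // uses the words blogA does not
        System.out.println(distance(blogA, blogB)); // close to 0: very similar
        System.out.println(distance(blogA, blogC)); // above 1: negatively correlated
    }
}
```

A nice property of this measure is that a blog with twice as many posts, and so roughly twice the counts, still ends up close to a blog with the same vocabulary mix.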

I had already found Encog as the framework for the AI / machine learning algorithms. For this exercise I also needed an RSS reader and an HTML parser.
The two libraries I ended up using were:
ROME
JSoup

For other general utilities and collection manipulation I used:
Google Guava
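Once the feed entries are fetched and reduced to text, the remaining step is counting words per blog. The sketch below shows only that counting step using the plain JDK, so it runs without any of the libraries above. The regex tag stripping is a crude stand-in for a real HTML parser such as JSoup, and the class and method names are made up for this example.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

final class WordCounter {

    /** Counts lowercase alphabetic words in a snippet of entry HTML. */
    static Map<String, Integer> countWords(String html) {
        // Crude stand-in for an HTML parser: drop anything that looks like a tag.
        String text = html.replaceAll("<[^>]*>", " ");
        Map<String, Integer> counts = new TreeMap<>();
        // Split on anything that is not a lowercase ASCII letter.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countWords("<p>Java and <b>more</b> Java</p>"));
        // prints {and=1, java=2, more=1}
    }
}
```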

I kept the list of blogs short and included some of the software bloggers I follow, just to make testing quick. I had to alter the percentages a little from the implementation in PCI, but still got the desired result.
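The percentages here are, as far as I can tell, the lower and upper bounds PCI puts on the fraction of blogs a word must appear in before it is kept (words that are too common or too rare are dropped). A rough sketch of that filter, with the bounds passed in as parameters so they are easy to tweak for a short blog list; the names and the sample bounds are assumptions for illustration, and plain JDK collections stand in for Guava.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class WordFilter {

    /** Keeps words whose share of blogs lies strictly between the two bounds. */
    static Set<String> selectWords(List<Set<String>> wordsPerBlog,
                                   double minFraction, double maxFraction) {
        // For every word, count the number of blogs it appears in.
        Map<String, Integer> blogCount = new TreeMap<>();
        for (Set<String> words : wordsPerBlog) {
            for (String word : words) {
                blogCount.merge(word, 1, Integer::sum);
            }
        }
        Set<String> kept = new TreeSet<>();
        int totalBlogs = wordsPerBlog.size();
        for (Map.Entry<String, Integer> entry : blogCount.entrySet()) {
            double fraction = (double) entry.getValue() / totalBlogs;
            if (fraction > minFraction && fraction < maxFraction) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical word sets for four blogs.
        List<Set<String>> blogs = Arrays.asList(
                new HashSet<>(Arrays.asList("the", "java", "encog")),
                new HashSet<>(Arrays.asList("the", "java")),
                new HashSet<>(Arrays.asList("the")),
                new HashSet<>(Arrays.asList("the")));
        // "the" is in every blog (1.0) and is dropped; the others are kept.
        System.out.println(selectWords(blogs, 0.2, 0.8)); // prints [encog, java]
    }
}
```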

Blogs Used:

http://blog.guykawasaki.com/index.rdf
http://blog.outer-court.com/rss.xml
http://flagrantdisregard.com/index.php/feed/
http://gizmodo.com/index.xml
http://googleblog.blogspot.com/rss.xml
http://radar.oreilly.com/index.rdf
http://www.wired.com/rss/index.xml
http://feeds.feedburner.com/codinghorror
http://feeds.feedburner.com/joelonsoftware
http://martinfowler.com/feed.atom
http://www.briandupreez.net/feeds/posts/default

For the implementation I just went with a main class and a reader class:


Main:


The Results:


*** Cluster 1 ***
[http://www.briandupreez.net/feeds/posts/default]
*** Cluster 2 ***
[http://blog.guykawasaki.com/index.rdf]
[http://radar.oreilly.com/index.rdf]
[http://googleblog.blogspot.com/rss.xml]
[http://blog.outer-court.com/rss.xml]
[http://gizmodo.com/index.xml]
[http://flagrantdisregard.com/index.php/feed/]
[http://www.wired.com/rss/index.xml]
*** Cluster 3 ***
[http://feeds.feedburner.com/joelonsoftware]
[http://feeds.feedburner.com/codinghorror]
[http://martinfowler.com/feed.atom]
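
For anyone who wants to see the kind of grouping involved without pulling in a library, below is a bare-bones k-means sketch in plain Java. It is not the Encog code that produced the clusters above, and it assumes a k-means style algorithm with squared Euclidean distance. Seeding the centroids with the first k points is my own simplification to keep the run repeatable, and the points in main are made up.

```java
import java.util.Arrays;

final class KMeansSketch {

    /** Returns, for each row of points, the index of the cluster it ends up in. */
    static int[] cluster(double[][] points, int k, int maxIterations) {
        int n = points.length;
        if (k < 1 || k > n) {
            throw new IllegalArgumentException("k must be between 1 and the number of points");
        }
        int dims = points[0].length;
        // Assumption: seed the centroids with the first k points.
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[c].clone();
        }
        int[] assignment = new int[n];
        Arrays.fill(assignment, -1);
        for (int iter = 0; iter < maxIterations; iter++) {
            boolean changed = false;
            // Assignment step: attach every point to its nearest centroid.
            for (int i = 0; i < n; i++) {
                int best = 0;
                double bestDistance = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double d = squaredDistance(points[i], centroids[c]);
                    if (d < bestDistance) {
                        bestDistance = d;
                        best = c;
                    }
                }
                if (assignment[i] != best) {
                    assignment[i] = best;
                    changed = true;
                }
            }
            if (!changed) {
                break; // converged
            }
            // Update step: move each centroid to the mean of its members.
            for (int c = 0; c < k; c++) {
                double[] sum = new double[dims];
                int count = 0;
                for (int i = 0; i < n; i++) {
                    if (assignment[i] == c) {
                        for (int d = 0; d < dims; d++) {
                            sum[d] += points[i][d];
                        }
                        count++;
                    }
                }
                if (count > 0) {
                    for (int d = 0; d < dims; d++) {
                        centroids[c][d] = sum[d] / count;
                    }
                }
            }
        }
        return assignment;
    }

    static double squaredDistance(double[] a, double[] b) {
        double total = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            total += diff * diff;
        }
        return total;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {10, 10}, {0, 1}, {10, 11}, {1, 0}, {11, 10}};
        System.out.println(Arrays.toString(cluster(points, 2, 100)));
        // prints [0, 1, 0, 1, 0, 1]
    }
}
```

In the blog case each row would be one blog's word-count vector, and the returned indices would map back to the feed URLs.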

Wednesday, June 12, 2013

Regex POSIX expressions

I can't believe I only found out about these today; I obviously don't use regular expressions enough.

POSIX Brackets

Quick Reference:

| POSIX | Description | ASCII | Unicode | Shorthand | Java |
|---|---|---|---|---|---|
| [:alnum:] | Alphanumeric characters | [a-zA-Z0-9] | [\p{L&}\p{Nd}] | | \p{Alnum} |
| [:alpha:] | Alphabetic characters | [a-zA-Z] | \p{L&} | | \p{Alpha} |
| [:ascii:] | ASCII characters | [\x00-\x7F] | \p{InBasicLatin} | | \p{ASCII} |
| [:blank:] | Space and tab | [ \t] | [\p{Zs}\t] | | \p{Blank} |
| [:cntrl:] | Control characters | [\x00-\x1F\x7F] | \p{Cc} | | \p{Cntrl} |
| [:digit:] | Digits | [0-9] | \p{Nd} | \d | \p{Digit} |
| [:graph:] | Visible characters (i.e. anything except spaces, control characters, etc.) | [\x21-\x7E] | [^\p{Z}\p{C}] | | \p{Graph} |
| [:lower:] | Lowercase letters | [a-z] | \p{Ll} | | \p{Lower} |
| [:print:] | Visible characters and spaces (i.e. anything except control characters, etc.) | [\x20-\x7E] | \P{C} | | \p{Print} |
| [:punct:] | Punctuation and symbols | [!"#$%&'()*+,\-./:;<=>?@[\\\]^_`{|}~] | [\p{P}\p{S}] | | \p{Punct} |
| [:space:] | All whitespace characters, including line breaks | [ \t\r\n\v\f] | [\p{Z}\t\r\n\v\f] | \s | \p{Space} |
| [:upper:] | Uppercase letters | [A-Z] | \p{Lu} | | \p{Upper} |
| [:word:] | Word characters (letters, numbers and underscores) | [A-Za-z0-9_] | [\p{L}\p{N}\p{Pc}] | \w | |
| [:xdigit:] | Hexadecimal digits | [A-Fa-f0-9] | [A-Fa-f0-9] | | \p{XDigit} |
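
A quick sketch of how the Java column is used in practice. In a Java string literal the backslash has to be doubled, and by default these classes only match US-ASCII characters unless the pattern is compiled with Pattern.UNICODE_CHARACTER_CLASS. The class name, helper methods and sample strings are just for illustration.

```java
import java.util.regex.Pattern;

final class PosixClassDemo {

    private static final Pattern ALNUM = Pattern.compile("\\p{Alnum}+");
    private static final Pattern XDIGIT = Pattern.compile("\\p{XDigit}+");
    private static final Pattern ALPHA_ASCII = Pattern.compile("\\p{Alpha}+");
    private static final Pattern ALPHA_UNICODE =
            Pattern.compile("\\p{Alpha}+", Pattern.UNICODE_CHARACTER_CLASS);

    /** True if the whole string consists of ASCII letters and digits. */
    static boolean isAlnum(String s) {
        return ALNUM.matcher(s).matches();
    }

    /** True if the whole string consists of hexadecimal digits. */
    static boolean isHex(String s) {
        return XDIGIT.matcher(s).matches();
    }

    /** Removes ASCII punctuation and symbols. */
    static String stripPunctuation(String s) {
        return s.replaceAll("\\p{Punct}", "");
    }

    public static void main(String[] args) {
        System.out.println(isAlnum("abc123"));                 // true
        System.out.println(isAlnum("abc 123"));                // false, the space is not alphanumeric
        System.out.println(isHex("CAFE42"));                   // true
        System.out.println(stripPunctuation("Hello, World!")); // Hello World
        String cafe = "caf\u00e9"; // "cafe" with an accented e
        System.out.println(ALPHA_ASCII.matcher(cafe).matches());   // false, ASCII only by default
        System.out.println(ALPHA_UNICODE.matcher(cafe).matches()); // true with the Unicode flag
    }
}
```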
