Blog Categorisation using Encog, ROME, JSoup and Google Guava
Continuing with Programming Collective Intelligence (PCI), the next exercise was to use distance scores to group a list of blogs based on the words used in each blog.
I had already settled on Encog as the framework for the AI / machine learning algorithms. For this exercise I also needed an RSS reader and an HTML parser.
The two libraries I ended up using were:
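To make the idea of a distance score concrete, here is a minimal plain-Java sketch, not the code used for this post. It assumes each blog has already been reduced to an array of word counts over a shared word list, and it uses the Pearson-based distance that PCI applies in its clustering chapter. The class and method names are hypothetical.

```java
/** Hypothetical helper for illustration; not part of Encog, ROME, JSoup or Guava. */
class WordDistance {

    /**
     * Pearson correlation distance between two word-count vectors.
     * Returns 0.0 when the counts rise and fall together and 2.0 when they are exactly opposed.
     */
    static double pearsonDistance(double[] a, double[] b) {
        if (a.length != b.length || a.length == 0) {
            throw new IllegalArgumentException("vectors must be non-empty and of equal length");
        }
        int n = a.length;
        double sumA = 0, sumB = 0, sumASq = 0, sumBSq = 0, sumProduct = 0;
        for (int i = 0; i < n; i++) {
            sumA += a[i];
            sumB += b[i];
            sumASq += a[i] * a[i];
            sumBSq += b[i] * b[i];
            sumProduct += a[i] * b[i];
        }
        double numerator = sumProduct - (sumA * sumB / n);
        // Math.max guards against a tiny negative value caused by rounding.
        double denominator = Math.sqrt(
                Math.max(0.0, (sumASq - sumA * sumA / n) * (sumBSq - sumBSq(sumB, sumBSq, n))));
        if (denominator == 0.0) {
            // Assumption: a flat vector carries no correlation information,
            // so it is treated as uncorrelated (distance 1.0).
            return 1.0;
        }
        return 1.0 - numerator / denominator;
    }

    /** Returns sumB^2 / n, kept separate only to keep the expression above readable. */
    private static double sumBSq(double sumB, double unused, int n) {
        return sumB * sumB / n;
    }
}
```

Two blogs that use the same words in the same proportions end up at distance 0 even if one of them simply posts more, which is why a correlation-based score suits word counts better than a raw difference.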
ROME
JSoup
For other general utilities and collection manipulation I used:
Google Guava
I kept the list of blogs short and included some of the software bloggers I follow, just to make testing quick. I had to alter the percentages a little from the implementation in PCI, but still got the desired result.
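The percentages in question are the cut-offs that decide which words are kept as features. PCI keeps only words that appear in more than 10% and fewer than 50% of the blogs, and with a short list of feeds those bounds need loosening. The sketch below shows that kind of filter in plain Java. The class name, method name and the bounds used in the example are assumptions for illustration, not the values used for the results in this post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical helper for illustration only. */
class WordFilter {

    /**
     * Keeps the words that occur in more than minFraction and fewer than
     * maxFraction of the blogs. Input maps a blog URL to its distinct words.
     * The result is sorted alphabetically.
     */
    static List<String> keepInformativeWords(Map<String, Set<String>> wordsPerBlog,
                                             double minFraction,
                                             double maxFraction) {
        // Count how many blogs each word appears in.
        Map<String, Integer> blogCount = new TreeMap<>();
        for (Set<String> words : wordsPerBlog.values()) {
            for (String word : words) {
                blogCount.merge(word, 1, Integer::sum);
            }
        }
        int totalBlogs = wordsPerBlog.size();
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : blogCount.entrySet()) {
            double fraction = (double) entry.getValue() / totalBlogs;
            // Words used almost everywhere ("the") or almost nowhere say little about a blog.
            if (fraction > minFraction && fraction < maxFraction) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }
}
```

With only a handful of feeds, a single blog already accounts for a sizeable fraction of the total, so the bounds have a much coarser effect than they do on a list of around a hundred blogs.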
Blogs Used:
http://blog.guykawasaki.com/index.rdf
http://blog.outer-court.com/rss.xml
http://flagrantdisregard.com/index.php/feed/
http://gizmodo.com/index.xml
http://googleblog.blogspot.com/rss.xml
http://radar.oreilly.com/index.rdf
http://www.wired.com/rss/index.xml
http://feeds.feedburner.com/codinghorror
http://feeds.feedburner.com/joelonsoftware
http://martinfowler.com/feed.atom
http://www.briandupreez.net/feeds/posts/default
For the implementation I just went with a main class and a reader class:
Main:
The Results:
*** Cluster 1 ***
[http://www.briandupreez.net/feeds/posts/default]
*** Cluster 2 ***
[http://blog.guykawasaki.com/index.rdf]
[http://radar.oreilly.com/index.rdf]
[http://googleblog.blogspot.com/rss.xml]
[http://blog.outer-court.com/rss.xml]
[http://gizmodo.com/index.xml]
[http://flagrantdisregard.com/index.php/feed/]
[http://www.wired.com/rss/index.xml]
*** Cluster 3 ***
[http://feeds.feedburner.com/joelonsoftware]
[http://feeds.feedburner.com/codinghorror]
[http://martinfowler.com/feed.atom]
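For readers who want to see how a grouping like this can come about, here is a rough k-means sketch in plain Java. It is not Encog's API and not the code behind the results above. It assumes k-means was the clustering step (the three clusters suggest it, but that is an assumption), it uses Euclidean distance for brevity, and it seeds the centroids from the first k rows so that runs are repeatable. All names are hypothetical.

```java
import java.util.Arrays;

/** Hypothetical, simplified k-means for illustration; a library implementation is preferable in practice. */
class SimpleKMeans {

    /** Returns the cluster index assigned to each row. Requires rows.length >= k. */
    static int[] cluster(double[][] rows, int k, int maxIterations) {
        if (rows.length < k) {
            throw new IllegalArgumentException("need at least k rows");
        }
        int dims = rows[0].length;
        // Assumption: seed centroids from the first k rows (simple, but sensitive to row order).
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = rows[c].clone();
        }
        int[] assignment = new int[rows.length];
        Arrays.fill(assignment, -1);

        for (int iteration = 0; iteration < maxIterations; iteration++) {
            // Step 1: assign every row to its nearest centroid.
            boolean changed = false;
            for (int r = 0; r < rows.length; r++) {
                int best = 0;
                double bestDistance = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double distance = squaredDistance(rows[r], centroids[c]);
                    if (distance < bestDistance) {
                        bestDistance = distance;
                        best = c;
                    }
                }
                if (assignment[r] != best) {
                    assignment[r] = best;
                    changed = true;
                }
            }
            if (!changed) {
                break; // assignments are stable, so the centroids would not move either
            }
            // Step 2: move each centroid to the mean of its members.
            for (int c = 0; c < k; c++) {
                double[] sum = new double[dims];
                int members = 0;
                for (int r = 0; r < rows.length; r++) {
                    if (assignment[r] == c) {
                        members++;
                        for (int d = 0; d < dims; d++) {
                            sum[d] += rows[r][d];
                        }
                    }
                }
                if (members > 0) {
                    for (int d = 0; d < dims; d++) {
                        sum[d] /= members;
                    }
                    centroids[c] = sum;
                }
            }
        }
        return assignment;
    }

    private static double squaredDistance(double[] a, double[] b) {
        double total = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            total += diff * diff;
        }
        return total;
    }
}
```

In this setting each row would be one blog's word-count vector and k would be 3. Blogs whose rows share a cluster index would be printed together, as in the listing above.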