Sunday, February 16, 2014

Local Wikipedia with Solr and Spring Data

Continuing with my little AI / Machine Learning research project... I wanted a decent-sized repo of English text that wasn't a complete mess like a large percentage of the data on the internet. I figured I would try Wikipedia, but what do you do with about 40 GB of XML? How do you work with and query all that data? Based on a recent implementation at work, where we load something like 200 000 000 records into a Solr cache, I figured Solr would be the way to go, so this is an example of my basic implementation.

Required for this example:

Wikipedia download (warning: it is a 9.9 GB file that extracts to about 42 GB)
Spring Data (Great Blog / Examples on Spring Data:  Petri Kainulainen's blog)

All the code and unit tests for this post are in my blog GitHub repo.

When setting up Solr from scratch, have a look at Solr's wiki or documentation; it is pretty good. There is also an example of importing Wikipedia here. I started with that and made some minor modifications.

The Solr config needed for this specific example goes in /conf.
The paths used in this example (and in the config files below) are:
Solr home: /Development/Solr
Index / Data: /Development/Data/solr_data/wikipedia
Import File: /Development/Data/enwiki-latest-pages-articles.xml

The full import into Solr took about 48 hours on my old 2011 i5 iMac, and the index on my current setup is about 52 GB.
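
To give a feel for why an import like this has to stream: a file this size can't be loaded into memory, so whatever reads it has to walk it one page element at a time. Below is a minimal Java sketch of that idea using the JDK's StAX parser. It is not the Solr import config used for this post. The class and method names are made up for illustration, the element names (page, title, text) follow the MediaWiki export format, and skipping pages whose text starts with #REDIRECT is an assumption about what you would want to filter out.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration; not part of Solr or Spring Data.
class WikiDumpStreamSketch {

    /** Streams the dump and returns up to `limit` titles of non-redirect pages. */
    static List<String> readTitles(InputStream in, int limit) {
        List<String> titles = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
            try {
                String title = null;
                boolean redirect = false;
                while (reader.hasNext() && titles.size() < limit) {
                    int event = reader.next();
                    if (event == XMLStreamConstants.START_ELEMENT) {
                        String name = reader.getLocalName();
                        if ("page".equals(name)) {
                            // A new page starts: reset the per-page state.
                            title = null;
                            redirect = false;
                        } else if ("title".equals(name)) {
                            title = reader.getElementText();
                        } else if ("text".equals(name)) {
                            // Assumption: redirect pages are not worth keeping.
                            String text = reader.getElementText().trim();
                            redirect = text.toUpperCase().startsWith("#REDIRECT");
                        }
                    } else if (event == XMLStreamConstants.END_ELEMENT
                            && "page".equals(reader.getLocalName())) {
                        // The page is complete: keep it unless it was a redirect.
                        if (title != null && !redirect) {
                            titles.add(title);
                        }
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse the dump", e);
        }
        return titles;
    }

    public static void main(String[] args) {
        // Tiny stand-in for the real dump so the sketch runs on its own.
        String sample = "<mediawiki>"
                + "<page><title>Apache Solr</title><revision><text>Search platform.</text></revision></page>"
                + "<page><title>SOLR</title><revision><text>#REDIRECT [[Apache Solr]]</text></revision></page>"
                + "</mediawiki>";
        InputStream in = new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8));
        readTitles(in, 10).forEach(System.out::println);
    }
}
```

For the real thing you would pass a FileInputStream on the extracted dump instead of the inline sample; only one page is held in memory at a time either way.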

Data Config for the import:


Solr Config:

The code for this ended up being quite clean. Spring Data Solr gives you 2 main interfaces, SolrIndexService and SolrCrudRepository; you simply extend / implement these 2, wrap them in a single interface, autowire it from a Spring Java context, and you're good to go.
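
The shape of that "two interfaces wrapped in one" idea can be sketched in plain Java. This is only an illustration of the pattern, not the code from the repo and not the Spring Data Solr API: PageCrud and PageSearch are hypothetical stand-ins for the CRUD and index/search roles, WikipediaPage is a made-up document type, and the in-memory class stands in for Solr so the sketch runs without a server or a Spring context.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical names throughout; none of these types come from Spring Data Solr.
class WikipediaRepositorySketch {

    /** Made-up document type; a real one would map to the fields in the Solr schema. */
    static final class WikipediaPage {
        final String id;
        final String title;
        final String text;

        WikipediaPage(String id, String title, String text) {
            this.id = id;
            this.title = title;
            this.text = text;
        }
    }

    /** Stand-in for the CRUD role. */
    interface PageCrud {
        void save(WikipediaPage page);
        Optional<WikipediaPage> findById(String id);
        long count();
    }

    /** Stand-in for the index / search role. */
    interface PageSearch {
        List<WikipediaPage> findByTitleContaining(String term);
    }

    /** The single interface the rest of the application depends on. */
    interface WikipediaRepository extends PageCrud, PageSearch {
    }

    /** In-memory implementation standing in for Solr. */
    static final class InMemoryWikipediaRepository implements WikipediaRepository {
        private final Map<String, WikipediaPage> pages = new LinkedHashMap<>();

        @Override
        public void save(WikipediaPage page) {
            pages.put(page.id, page);
        }

        @Override
        public Optional<WikipediaPage> findById(String id) {
            return Optional.ofNullable(pages.get(id));
        }

        @Override
        public long count() {
            return pages.size();
        }

        @Override
        public List<WikipediaPage> findByTitleContaining(String term) {
            // Case-insensitive substring match on the title.
            String needle = term.toLowerCase(Locale.ROOT);
            List<WikipediaPage> hits = new ArrayList<>();
            for (WikipediaPage page : pages.values()) {
                if (page.title.toLowerCase(Locale.ROOT).contains(needle)) {
                    hits.add(page);
                }
            }
            return hits;
        }
    }

    public static void main(String[] args) {
        WikipediaRepository repo = new InMemoryWikipediaRepository();
        repo.save(new WikipediaPage("1", "Apache Solr", "Search platform."));
        repo.save(new WikipediaPage("2", "Spring Framework", "Java framework."));
        for (WikipediaPage hit : repo.findByTitleContaining("solr")) {
            System.out.println(hit.id + ": " + hit.title);
        }
    }
}
```

The point of the single WikipediaRepository interface is that callers only ever see one type, so swapping the in-memory stand-in for a Solr-backed implementation would not touch them.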





Next thing for me to look at for sourcing data is Spring Social.



