Evolving in an AI world.: Local Wikipedia with Solr and Spring Data

Sunday, February 16, 2014

Continuing with my little AI / machine learning research project, I wanted a decent-sized repository of English text that was not a complete mess, unlike a large percentage of the data on the internet. I figured I would try Wikipedia, but what do you do with about 40 GB of XML? How do you work with and query all that data? Based on a recent implementation at work, where we load something like 200 000 000 records into a Solr cache, I figured Solr would be the way to go. This post is an example of my basic implementation.

Required for this example:

Wikipedia download (warning: it is a 9.9 GB file that extracts to about 42 GB)
Solr
Spring Data (Great Blog / Examples on Spring Data: Petri Kainulainen's blog)

All the code and unit tests for this post are in my blog GitHub repo.

When setting up Solr from scratch, have a look at Solr's wiki or documentation; the documentation is pretty good. There is also an example of importing Wikipedia here. I started with that and made some minor modifications.

The Solr config needed for this specific example (under /conf) is listed further down. In this example, and in the config files, I use the following paths:
Solr home: /Development/Solr
Index / Data: /Development/Data/solr_data/wikipedia
Import File: /Development/Data/enwiki-latest-pages-articles.xml

The full import into Solr took about 48 hours on my old 2011 i5 iMac, and the index on my current setup is about 52 GB.
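
The import in this post is driven by the Solr data config, not by hand-written code. Purely as an illustration of why a 42 GB XML file is workable at all, here is a minimal sketch of streaming the dump with the JDK's StAX parser, which reads one element at a time instead of loading the whole file. The class `WikiDumpSketch`, the `Page` type and the `limit` parameter are my own assumptions for the example, and it assumes the dump's usual layout of `<page>` elements containing a `<title>` and a `<text>`.

```java
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of the post's repo: streams <page> elements from a dump.
class WikiDumpSketch {

    // Minimal article holder: the title and the raw wiki markup.
    static final class Page {
        final String title;
        final String text;

        Page(String title, String text) {
            this.title = title;
            this.text = text;
        }
    }

    // Reads at most `limit` pages, pulling one XML event at a time from the reader.
    static List<Page> readPages(Reader xml, int limit) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // The dump needs neither DTDs nor external entities, so switch both off.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);

        List<Page> pages = new ArrayList<>();
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(xml);
            String title = null;
            String text = null;
            while (pages.size() < limit && reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if ("page".equals(name)) {
                        // A new article starts: reset the collected fields.
                        title = null;
                        text = null;
                    } else if ("title".equals(name)) {
                        title = reader.getElementText();
                    } else if ("text".equals(name)) {
                        text = reader.getElementText();
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "page".equals(reader.getLocalName())) {
                    pages.add(new Page(title, text));
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse the dump", e);
        }
        return pages;
    }
}
```

To peek at the real file you could pass in `Files.newBufferedReader(Paths.get("/Development/Data/enwiki-latest-pages-articles.xml"))` with a small `limit`. The returned list is held in memory, so a large limit is not a good idea with this sketch.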

Data Config for the import:

Schema:

Solr Config:

The code for this ended up being quite clean. Spring Data Solr gives you 2 main interfaces, SolrIndexService and SolrCrudRepository. You simply extend / implement these 2, wrap them in a single interface, autowire that from a Spring Java context, and you are good to go.
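
To show the shape of that "wrap the two in a single interface" idea without pulling in any dependencies, here is a rough sketch in plain Java. None of these types are the Spring Data Solr ones or the ones in my repo: `WikiPage`, `WikiPageRepository`, `WikiPageIndexService`, `WikiSolrService` and the in-memory implementation are all hypothetical stand-ins, and the field names are assumptions rather than the actual schema.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical document type; the fields are assumptions, not the real Solr schema.
final class WikiPage {
    final String id;
    final String title;
    final String text;

    WikiPage(String id, String title, String text) {
        this.id = id;
        this.title = title;
        this.text = text;
    }
}

// Stand-in for the query side (the role the Solr repository plays).
interface WikiPageRepository {
    List<WikiPage> findByTitleContaining(String term);
}

// Stand-in for the write side (the role the index service plays).
interface WikiPageIndexService {
    void addToIndex(WikiPage page);

    void deleteFromIndex(String id);
}

// The single interface that callers depend on; it simply combines the two.
interface WikiSolrService extends WikiPageRepository, WikiPageIndexService {
}

// In-memory implementation so the sketch runs without Solr or Spring.
final class InMemoryWikiSolrService implements WikiSolrService {
    private final Map<String, WikiPage> index = new LinkedHashMap<>();

    @Override
    public void addToIndex(WikiPage page) {
        index.put(page.id, page);
    }

    @Override
    public void deleteFromIndex(String id) {
        index.remove(id);
    }

    @Override
    public List<WikiPage> findByTitleContaining(String term) {
        // Case-insensitive substring match on the title.
        String needle = term.toLowerCase(Locale.ROOT);
        List<WikiPage> hits = new ArrayList<>();
        for (WikiPage page : index.values()) {
            if (page.title.toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(page);
            }
        }
        return hits;
    }
}
```

The point of the combined interface is that calling code only ever sees one type to inject, whether the implementation behind it talks to Solr or, as here, to a map in memory for tests.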

Repository:

IndexService:

SolrService:

SpringContext:

The next thing for me to look at for sourcing data is Spring Social.
