The utility converts the source Wikipedia text into a pair of files:
links.tsv
- doc id - the ID of the Wikipedia article
- first token id - the ID of the first token of the link
- last token id - the ID of the last token of the link
- link text - the linked text from the Wikipedia article (the text visible to the user)
- link target - the name of the Wikipedia article that is the target of the link
tokens.tsv
Content of each Wikipedia article broken into tokens:
- doc id - the ID of the Wikipedia article
- token id - the ID of the token within the article
- space - 1 indicates that there was a whitespace character before the token
- token - the content of the token
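To illustrate the `space` column, here is a minimal Ruby sketch that reassembles article text from token rows, assuming the tokens.tsv layout above (the `detokenize` helper is illustrative, not part of the utility):

```ruby
# Reassemble text from token rows of the form
# [doc id, token id, space, token]; a space value of "1"
# means a whitespace character preceded the token.
def detokenize(rows)
  rows.map { |_doc, _tok, space, text| (space == "1" ? " " : "") + text }.join
end

detokenize([["1", "0", "0", "Hello"], ["1", "1", "1", "world"], ["1", "2", "0", "!"]])
# => "Hello world!"
```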
These files are useful for computing statistics, e.g. how many times a given expression appears as a link in the text.
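Such a statistic can be sketched in a few lines of Ruby, assuming the links.tsv layout above (the `link_counts` helper is illustrative, not part of the utility):

```ruby
# Tally link occurrences per visible link text in a links.tsv file.
# Assumed columns: doc id, first token id, last token id, link text, link target.
def link_counts(path)
  counts = Hash.new(0)
  File.foreach(path) do |line|
    _doc_id, _first, _last, text, _target = line.chomp.split("\t")
    counts[text] += 1
  end
  counts
end
```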
The Docker compose file for this service is in the sql subproject. The assumed directory structure is as follows:
cyclopedio/
data/pl/data/
data/pl/rod/
sql/
wikiextract/
The scripts may be run with Docker and docker compose (the docker-compose file is in the sql subproject):
docker compose up # in cyclopedio/sql
# in a separate window
docker exec cyclopedio-wikiextract bundle exec rake build
docker exec cyclopedio-wikiextract bundle exec rake tokens:extract
If you want to disambiguate the Wikipedia pages against the DB, you have to run:
docker exec cyclopedio-wikiextract bundle exec rake tokens:uniq links:count links:convert