Get your own copy of WikiData
Why would you want your own WikiData copy?
The resources behind https://query.wikidata.org/ are scarce and shared by many users, so you might hit the query limits (https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/query_limits) quite quickly.
Prerequisites
Getting a copy of WikiData is not for the faint of heart.
You need quite a bit of patience and some hardware resources to get your own WikiData copy working. The resources you need are a moving target since WikiData is growing all the time. On this page you'll find the documentation for two attempts:
- 2018
- 2020
The successful 2018 attempt was done with a cheap used server from 2009, bought for 50 EUR on eBay in Mönchengladbach. The server originally had 32 GByte of RAM, which we increased to 64 GByte by buying a second one and adding its RAM. In 2018 a 512 GByte SSD was sufficient to speed up the import process from some 14 days to 3.8 days. Specs of the server:
- Mainboard: ASUS KFSN5-D/IST
- CPU: Quad-Core AMD Opteron(tm) Processor 2374 HE
- Speed: 2.20 GHz
- NB speed: 2.00 GHz
The disadvantage of the server is that running it 24 h on 365 days costs more than the server itself: it has a power consumption of some 3 kWh per day, which would cost more than 300 EUR per year. We decided to only switch it on when needed.
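As a rough cross-check of that figure (assuming an electricity price of about 0.28 EUR/kWh, a typical German household rate at the time): 3 kWh per day is roughly 1100 kWh per year, and 1100 kWh × 0.28 EUR/kWh comes to a bit over 300 EUR per year, matching the estimate above.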
First Attempt 2018-01
The start of this attempt was on 2018-01-05. I tried to follow the documented Wikidata Query Service import procedure:
~/wikidata/wikidata-query-rdf/dist/target/service-0.3.0-SNAPSHOT$ nohup ./munge.sh -f data/latest-all.ttl.gz -d data/split -l en,de &
#logback.classic pattern: %d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n
08:23:02.391 [main] INFO org.wikidata.query.rdf.tool.Munge - Switching to data/split/wikidump-000000001.ttl.gz
08:24:21.249 [main] INFO org.wikidata.query.rdf.tool.Munge - Processed 10000 entities at (105, 47, 33)
08:25:07.369 [main] INFO org.wikidata.query.rdf.tool.Munge - Processed 20000 entities at (162, 70, 41)
08:25:56.862 [main] INFO org.wikidata.query.rdf.tool.Munge - Processed 30000 entities at (186, 91, 50)
08:26:43.594 [main] INFO org.wikidata.query.rdf.tool.Munge - Processed 40000 entities at (203, 109, 59)
08:27:24.042 [main] INFO org.wikidata.query.rdf.tool.Munge - Processed 50000 entities at (224, 126, 67)
...
During the import the following exception showed up:
java.nio.file.NoSuchFileException: ./mwservices.json
Import issues
- https://phabricator.wikimedia.org/T164773
- https://phabricator.wikimedia.org/p/Yurik/
- https://www.mediawiki.org/wiki/User:AKlapper_(WMF)
Queries after import
Number of Triples
SELECT (COUNT(*) as ?Triples) WHERE { ?s ?p ?o}
Triples (local 2018 copy): 3.019.914.549
For comparison, the same query on the original WikiData Query Service returned 10.949.664.801 triples at the time.
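If you loaded the data into the Blazegraph backend of the Wikidata Query Service (as in this 2018 attempt), the count query can also be issued from the command line. The sketch below assumes the default endpoint URL of a local WDQS installation, http://localhost:9999/bigdata/namespace/wdq/sparql; adjust it to your setup.

curl -s -G 'http://localhost:9999/bigdata/namespace/wdq/sparql' \
  -H 'Accept: application/sparql-results+json' \
  --data-urlencode 'query=SELECT (COUNT(*) as ?Triples) WHERE { ?s ?p ?o }'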
TypeCount
SELECT ?type (COUNT(?type) AS ?typecount)
WHERE {
?subject a ?type.
}
GROUP by ?type
ORDER by desc(?typecount)
LIMIT 7
<http://wikiba.se/ontology#BestRank> 369637917
schema:Article 61229687
<http://wikiba.se/ontology#GlobecoordinateValue> 5379022
<http://wikiba.se/ontology#QuantityValue> 697187
<http://wikiba.se/ontology#TimeValue> 234556
<http://wikiba.se/ontology#GeoAutoPrecision> 101897
<http://www.wikidata.org/prop/novalue/P17> 37884
Second Attempt 2020-05
Test Environment
- Mac Pro Mid 2010
- 12 core 3.46 GHz
- 64 GB RAM
- macOS High Sierra 10.13.6
- 2 TB 5400 rpm hard disk Seagate Barracuda ST2000DM001 (Blackmagic speed rating: 130 MB/s write, 140 MB/s read)
- 4 TB 7200 rpm hard disk WD Gold WDC WD4002FYYZ (Blackmagic speed rating: 175 MB/s write, 175 MB/s read)
Download and unpack
Sizes:
- download: 67 G
- unpacked: 552 G
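Before unpacking and importing, make sure the target volume has enough free space: the unpacked Turtle file alone takes more than 550 GB, and the failed TDB import documented further down filled up a volume that still had 1.2 TB available. A quick check on the volume used in this setup (/Volumes/Tattu, see below):

df -h /Volumes/Tattu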
wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2
--2020-05-09 17:18:53-- https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 2620::861:1:208:80:154:7, 208.80.154.7
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|2620::861:1:208:80:154:7|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 71897810492 (67G) [application/octet-stream]
Saving to: ‘latest-all.ttl.bz2’
latest-all.ttl.bz2 0%[ ] 147.79M 4.82MB/s eta 3h 56m
...
latest-all.ttl.bz2 100%[===================>] 66.96G 4.99MB/s in 4h 0m
2020-05-09 21:19:25 (4.75 MB/s) - ‘latest-all.ttl.bz2’ saved [71897810492/71897810492]
bzip2 -dk latest-all.ttl.bz2
ls -l
-rw-r--r-- 1 wf admin 592585505631 May 7 08:00 latest-all.ttl
Test counting lines
Simply counting the 15.728.395.994 lines of the Turtle file latest-all.ttl, which roughly corresponds to the number of triples in it, takes around one hour in the test environment.
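The timing below was presumably produced with a plain wc call wrapped in date commands, along the lines of:

date
wc -l latest-all.ttl
date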
Sun May 10 07:13:45 CEST 2020
15728395994 latest-all.ttl
Sun May 10 08:12:50 CEST 2020
Test with Apache Jena
See https://muncca.com/2019/02/14/wikidata-import-in-apache-jena/. After doing the manual download (see below) I decided to create the wikidata2jena script shown further down. With that script the command
nohup ./wikidata2jena&
will start processing the latest-all.ttl file in the background. Make sure that your computer does not go to sleep while the script is running; I am using the Amphetamine macOS app for this. With
tail nohup.out
apache-jena-3.14.0.tar.gz already downloaded
apache-jena-3.14.0 already unpacked
creating data directory
creating temporary directory /Volumes/Tattu/wikidata/tmp
started load phase data at 2020-05-11T16:45:11Z
you can watch the progress of the phases. I assume the data phase alone will take some 2 days to finish.
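That estimate can be roughly cross-checked from the numbers in this document: the dump has some 15.7 billion lines (≈ triples, see the line count above), and the average throughput reported by tdbloader2 drops from about 120.000 to about 73.000 triples per second over the course of the load (see the log excerpts below), so the data phase alone works out to roughly 15.7e9 / 73.000 ≈ 215.000 seconds, i.e. about 2.5 days.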
To see more progress details you might want to call:
tail -f tdb-data-err.log
INFO Elapsed: 15,97 seconds [2020/05/11 18:45:29 MESZ]
INFO Add: 1.550.000 Data (Batch: 112.359 / Avg: 94.414)
...
INFO Elapsed: 505,57 seconds [2020/05/11 18:53:39 MESZ]
INFO Add: 54.550.000 Data (Batch: 111.607 / Avg: 107.803)
...
INFO Elapsed: 5.371,68 seconds [2020/05/11 20:14:45 MESZ]
INFO Add: 665.050.000 Data (Batch: 83.333 / Avg: 123.792)
...
INFO Elapsed: 44.571,30 seconds [2020/05/12 07:08:05 MESZ]
INFO Add: 5.439.050.000 Data (Batch: 163.398 / Avg: 122.029)
...
INFO Elapsed: 50.578,97 seconds [2020/05/12 08:48:12 MESZ]
INFO Add: 6.189.550.000 Data (Batch: 112.612 / Avg: 122.372)
...
INFO Elapsed: 61.268,32 seconds [2020/05/12 11:46:22 MESZ]
INFO Add: 7.470.050.000 Data (Batch: 138.121 / Avg: 121.922)
...
INFO Elapsed: 73.785,44 seconds [2020/05/12 15:14:59 MESZ]
INFO Add: 8.729.050.000 Data (Batch: 63.532 / Avg: 118.301)
...
INFO Elapsed: 89.876,01 seconds [2020/05/12 19:43:09 MESZ]
INFO Add: 9.409.050.000 Data (Batch: 33.222 / Avg: 104.687)
...
INFO Elapsed: 97.888,94 seconds [2020/05/12 21:56:42 MESZ]
INFO Add: 9.584.550.000 Data (Batch: 17.692 / Avg: 97.909)
...
INFO Elapsed: 130.683,99 seconds [2020/05/13 07:03:17 MESZ]
INFO Add: 9.947.050.000 Data (Batch: 4.505 / Avg: 76.108)
...
INFO Add: 9.999.950.000 Data (Batch: 10.273 / Avg: 72.642)
INFO Add: 10.000.000.000 Data (Batch: 11.088 / Avg: 72.639)
INFO Elapsed: 137.665,23 seconds [2020/05/13 08:59:38 MESZ]
INFO Add: 10.000.050.000 Data (Batch: 10.397 / Avg: 72.637)
Output of a failed attempt with insufficient disk space
INFO Elapsed: 352,44 seconds [2020/05/10 11:03:30 MESZ]
INFO Add: 37.050.000 Data (Batch: 107.991 / Avg: 104.985)
...
INFO Elapsed: 4.464,92 seconds [2020/05/10 12:12:03 MESZ]
INFO Add: 545.550.000 Data (Batch: 120.481 / Avg: 122.174)
...
INFO Elapsed: 8.611,05 seconds [2020/05/10 13:21:09 MESZ]
INFO Add: 1.026.050.000 Data (Batch: 128.205 / Avg: 119.149)
...
INFO Elapsed: 30.653,35 seconds [2020/05/10 19:28:31 MESZ]
INFO Add: 3.430.050.000 Data (Batch: 105.042 / Avg: 111.896)
...
INFO Elapsed: 70.746,18 seconds [2020/05/11 06:36:44 MESZ]
INFO Add: 6.149.050.000 Data (Batch: 49.358 / Avg: 86.915)
...
INFO Elapsed: 90.976,65 seconds [2020/05/11 12:13:54 MESZ]
INFO Add: 7.674.050.000 Data (Batch: 12.124 / Avg: 84.348)
...
INFO Elapsed: 96.770,91 seconds [2020/05/11 13:50:29 MESZ]
INFO Add: 7.979.050.000 Data (Batch: 51.334 / Avg: 82.452)
org.apache.jena.atlas.AtlasException: java.io.IOException: No space left on device
So the first attempt failed after some 80% of the data had been loaded in the "data" phase, with the hard disk being full: unfortunately only 1.2 TB of the 2 TB disk had been available.
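Besides the TDB indexes in the data directory, tdbloader2 also needs a lot of temporary space for its external sort runs, which is presumably why the wikidata2jena script below explicitly sets TMPDIR to a directory on the large volume.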
Manual download
wget -c http://mirror.easyname.ch/apache/jena/binaries/apache-jena-3.14.0.tar.gz
--2020-05-10 10:26:38-- http://mirror.easyname.ch/apache/jena/binaries/apache-jena-3.14.0.tar.gz
Resolving mirror.easyname.ch (mirror.easyname.ch)... 77.244.244.134
Connecting to mirror.easyname.ch (mirror.easyname.ch)|77.244.244.134|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20174193 (19M) [application/x-gzip]
Saving to: ‘apache-jena-3.14.0.tar.gz’
apache-jena-3.14.0. 100%[===================>] 19.24M 2.58MB/s in 7.4s
2020-05-10 10:26:45 (2.58 MB/s) - ‘apache-jena-3.14.0.tar.gz’ saved [20174193/20174193]
tar -xvzf apache-jena-3.14.0.tar.gz
wikidata2jena Script
#!/bin/bash
# WF 2020-05-10
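# download Apache Jena 3.14.0 if needed and load the Wikidata dump latest-all.ttl into a local TDB store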
# global settings
jena=apache-jena-3.14.0
tgz=$jena.tar.gz
jenaurl=http://mirror.easyname.ch/apache/jena/binaries/$tgz
data=data
tdbloader=$jena/bin/tdbloader2
getjena() {
  # download
  if [ ! -f $tgz ]
  then
    echo "downloading $tgz from $jenaurl"
    wget $jenaurl
  else
    echo "$tgz already downloaded"
  fi
  # unpack
  if [ ! -d $jena ]
  then
    echo "unpacking $jena from $tgz"
    tar xvzf $tgz
  else
    echo "$jena already unpacked"
  fi
  # create data directory
  if [ ! -d $data ]
  then
    echo "creating $data directory"
    mkdir $data
  else
    echo "$data directory already created"
  fi
}
#
# show the given timestamp
#
timestamp() {
  local msg="$1"
  local ts=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
  echo "$msg at $ts"
}
#
# load data for the given phase
#
loaddata4phase() {
  local phase="$1"
  local data="$2"
  local input="$3"
  timestamp "started load phase $phase"
  if [ -n "$input" ]
  then
    # the data phase gets the turtle file as input ...
    $tdbloader --phase $phase --loc "$data" "$input" > tdb-$phase-out.log 2> tdb-$phase-err.log
  else
    # ... the index phase works on the already loaded data and gets no input file
    $tdbloader --phase $phase --loc "$data" > tdb-$phase-out.log 2> tdb-$phase-err.log
  fi
  timestamp "finished load phase $phase"
}
#
# load data for the given data dir and input
#
loaddata() {
  local data="$1"
  local input="$2"
  loaddata4phase data "$data" "$input"
  loaddata4phase index "$data" ""
}
getjena
wd=$(pwd)
export TMPDIR=$wd/tmp
if [ ! -d $TMPDIR ]
then
  echo "creating temporary directory $TMPDIR"
  mkdir $TMPDIR
else
  echo "using temporary directory $TMPDIR"
fi
loaddata $data latest-all.ttl
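Once the index phase has finished as well, the resulting TDB store in the data directory can be queried directly with the tdbquery tool that ships with Apache Jena, for example to repeat the triple count from the 2018 attempt (a sketch using the paths from the script above):

apache-jena-3.14.0/bin/tdbquery --loc data 'SELECT (COUNT(*) as ?Triples) WHERE { ?s ?p ?o }'

Alternatively the same data directory can be served over HTTP with Apache Jena Fuseki, which is a separate download.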
Links
- https://www.wikidata.org/wiki/Wikidata:Database_download#RDF_dumps
- https://stackoverflow.com/questions/47885637/failed-to-install-wikidata-query-rdf-blazegraph
- https://stackoverflow.com/questions/48020506/wikidata-on-local-blazegraph-expected-an-rdf-value-here-found-line-1/48110100
- https://stackoverflow.com/questions/56768463/wikidata-import-into-virtuoso
- https://stackoverflow.com/questions/14494449/virtuoso-system-requirements
- https://muncca.com/2019/02/14/wikidata-import-in-apache-jena/
- https://users.jena.apache.narkive.com/J1gsFHRk/tdb2-tdbloader-performance