Wikidata Import 2023-04-26
Download
Download Options
https://dumps.wikimedia.org/wikidatawiki/entities
latest-all.json.bz2       19-Apr-2023 19:01   81437052900
latest-all.json.gz        26-Apr-2023 06:43  123717867013
latest-all.nt.bz2         20-Apr-2023 09:17  158037435620
latest-all.nt.gz          19-Apr-2023 15:33  204694424758
latest-all.ttl.bz2        19-Apr-2023 20:37  101383518288
latest-all.ttl.gz         26-Apr-2023 08:18  123942927864
latest-lexemes.json.bz2   26-Apr-2023 03:51     297892886
latest-lexemes.json.gz    26-Apr-2023 03:49     407135019
latest-lexemes.nt.bz2     21-Apr-2023 23:33     768095633
latest-lexemes.nt.gz      21-Apr-2023 23:28    1008192049
latest-lexemes.ttl.bz2    21-Apr-2023 23:29     433401231
latest-lexemes.ttl.gz     21-Apr-2023 23:25     540610049
latest-truthy.nt.bz2      21-Apr-2023 17:41   35992719959
latest-truthy.nt.gz       21-Apr-2023 14:24   59704444949
download result
ls -l latest*.gz
-rw-rw-r-- 1 wf wf 123942927864 Apr 26 10:18 latest-all.ttl.gz
-rw-rw-r-- 1 wf wf 540610049 Apr 22 01:25 latest-lexemes.ttl.gz
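The sizes shown by ls can be compared with the byte counts from the directory listing to confirm that a download is complete. A minimal sketch, assuming bash; the function name check_size is made up for this example:

```bash
# check_size FILE EXPECTED_BYTES
# Compares the local file size with the byte count from the dump listing.
check_size() {
  local file=$1 expected=$2 actual
  actual=$(wc -c < "$file") || return 2
  actual=$((actual))   # strip padding that some wc implementations add
  if [ "$actual" -eq "$expected" ]; then
    echo "$file: ok ($actual bytes)"
  else
    echo "$file: size mismatch, got $actual, expected $expected"
    return 1
  fi
}
```

Example call: check_size latest-all.ttl.gz 123942927864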
download script
cat download.sh
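#!/bin/bash
# WF 2023-04-26
# download wikidata dumps
baseurl=https://dumps.wikimedia.org/wikidatawiki/entities/
for file in latest-all latest-lexemes
do
  for ext in ttl.gz ttl.bz2
  do
    # baseurl already ends in "/", hence the harmless "//" in the logged URLs
    url=$baseurl/$file.$ext
    log=$file-$ext.log
    # wget reports progress on stderr, so capture both streams in the log
    nohup wget "$url" >> "$log" 2>&1 &
  done
done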
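The script starts each wget exactly once, so a transient server error (such as the 503 recorded in the log for latest-lexemes.ttl.bz2) needs a manual retry. A small retry wrapper could cover that case. This is only a sketch, assuming bash; the function name retry and its parameters are not part of the original script:

```bash
# retry MAX_TRIES DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_TRIES attempts have failed.
retry() {
  local max=$1 delay=$2 attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed, retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
  done
}
```

Example call: retry 5 60 wget -c "$url" (wget -c continues a partially downloaded file). nohup cannot run a shell function directly, so for unattended use the wrapper would have to live in its own script.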
download logs
latest-all.ttl.gz 123942927864 8h52m ✓
--2023-04-26 15:38:37-- https://dumps.wikimedia.org/wikidatawiki/entities//latest-all.ttl.gz
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.142, 2620:0:861:2:208:80:154:142
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 123942927864 (115G) [application/octet-stream]
Saving to: ‘latest-all.ttl.gz’
0K .......... .......... .......... .......... .......... 0% 335K 4d4h
50K .......... .......... .......... .......... .......... 0% 220K 5d6h
100K .......... .......... .......... .......... .......... 0% 438K 4d14h
...
121037950K .......... .......... .......... .......... .......... 99% 3.91M 0s
121038000K .......... ..... 100% 181M=8h52m
2023-04-27 00:31:28 (3.70 MB/s) - ‘latest-all.ttl.gz’ saved [123942927864/123942927864]
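The numbers in the wget summary line are consistent with each other: file size divided by average rate gives the reported duration. wget counts in powers of 1024 here (123942927864 bytes are shown as 115G), so "MB/s" is taken as 1024*1024 bytes per second. A small sketch of the arithmetic, assuming awk is available; the function name transfer_time is made up:

```bash
# transfer_time BYTES RATE_MIB_PER_S
# Prints the expected transfer time as <hours>h<minutes>m.
transfer_time() {
  awk -v bytes="$1" -v rate="$2" 'BEGIN {
    secs = bytes / (rate * 1024 * 1024)
    printf "%dh%02dm\n", int(secs / 3600), int((secs % 3600) / 60)
  }'
}
```

transfer_time 123942927864 3.70 prints 8h52m, matching the log above. The same function helps to estimate in advance how long the other dumps will take at a given rate.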
latest-all.ttl.bz2 101383518288 7h27m ✓
--2023-04-26 15:38:37-- https://dumps.wikimedia.org/wikidatawiki/entities//latest-all.ttl.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.142, 2620:0:861:2:208:80:154:142
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 101383518288 (94G) [application/octet-stream]
Saving to: ‘latest-all.ttl.bz2’
0K .......... .......... .......... .......... .......... 0% 219K 5d5h
50K .......... .......... .......... .......... .......... 0% 219K 5d5h
100K .......... .......... .......... .......... .......... 0% 437K 4d8h
99007250K .......... .......... .......... .......... .......... 99% 2.17M 0s
99007300K .......... .......... .......... .......... .. 100% 2.45M=7h27m
2023-04-26 23:06:17 (3.60 MB/s) - ‘latest-all.ttl.bz2’ saved [101383518288/101383518288]
latest-lexemes.ttl.gz 540610049 2m1s ✓
--2023-04-26 15:38:37-- https://dumps.wikimedia.org/wikidatawiki/entities//latest-lexemes.ttl.gz
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.142, 2620:0:861:2:208:80:154:142
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 540610049 (516M) [application/octet-stream]
Saving to: ‘latest-lexemes.ttl.gz’
0K .......... .......... .......... .......... .......... 0% 355K 24m45s
50K .......... .......... .......... .......... .......... 0% 209K 33m23s
100K .......... .......... .......... .......... .......... 0% 416K 29m18s
527850K .......... .......... .......... .......... .......... 99% 62.1M 0s
527900K .......... .......... .......... ......... 100% 23.7M=2m1s
2023-04-26 15:40:39 (4.27 MB/s) - ‘latest-lexemes.ttl.gz’ saved [540610049/540610049]
latest-lexemes.ttl.bz2 433401231 1m45s ✓
attempt by script ❌
--2023-04-26 15:38:37-- https://dumps.wikimedia.org/wikidatawiki/entities//latest-lexemes.ttl.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.142, 2620:0:861:2:208:80:154:142
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.142|:443... connected.
HTTP request sent, awaiting response... 503 Service Temporarily Unavailable
2023-04-26 15:38:38 ERROR 503: Service Temporarily Unavailable.
manual retry latest-lexemes.ttl.bz2
wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-lexemes.ttl.bz2
--2023-04-28 13:44:11-- https://dumps.wikimedia.org/wikidatawiki/entities/latest-lexemes.ttl.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.142, 2620:0:861:2:208:80:154:142
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 433401231 (413M) [application/octet-stream]
Saving to: ‘latest-lexemes.ttl.bz2’
latest-lexemes.ttl. 9%[> ] 38.66M 4.19MB/s eta 1m 53
latest-lexemes.ttl. 100%[===================>] 413.32M 4.25MB/s in 1m 45s
2023-04-28 13:45:57 (3.93 MB/s) - ‘latest-lexemes.ttl.bz2’ saved [433401231/433401231]
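The ✓ and ❌ markers in the headings above follow from the last lines of each log: a successful wget run ends with "saved [received/expected]", a failed one does not. A sketch for checking the log files written by download.sh, assuming bash; the function name summarize_log and its output format are made up:

```bash
# summarize_log LOGFILE
# Prints "OK <bytes>" if the log ends with a complete "saved [n/n]" line,
# "INCOMPLETE <got>/<total>" if the counts differ, "FAILED" otherwise.
summarize_log() {
  local log=$1 line nums got total
  line=$(grep -Eo 'saved \[[0-9]+/[0-9]+\]' "$log" | tail -n 1)
  if [ -z "$line" ]; then
    echo "FAILED"
    return 1
  fi
  nums=${line#saved \[}   # drop the leading "saved ["
  nums=${nums%\]}         # drop the trailing "]"
  got=${nums%/*}
  total=${nums#*/}
  if [ "$got" = "$total" ]; then
    echo "OK $got"
  else
    echo "INCOMPLETE $got/$total"
    return 1
  fi
}
```

Example call: for log in latest-*.log; do echo "$log: $(summarize_log "$log")"; done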
Munging
- https://github.com/wikimedia/wikidata-query-rdf/blob/master/docs/getting-started.md
- https://github.com/wikimedia/wikidata-query-deploy/blob/master/munge.sh
Preparation
Needed installs and settings
sudo apt install maven
export JAVA_HOME=$(update-alternatives --query javadoc | grep Value: | head -n1 | sed 's/Value: //' | sed 's@bin/javadoc$@@')
echo $JAVA_HOME
/usr/lib/jvm/java-11-openjdk-amd64/
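The sed expressions above only strip the trailing bin/javadoc from the path that update-alternatives reports. Since a missing JAVA_HOME made the first mvn package run fail, it may be worth checking the setting before starting the build. A sketch, assuming bash; both function names are made up:

```bash
# java_home_from JAVADOC_PATH
# Strips the trailing bin/javadoc, as the sed expression does.
java_home_from() {
  echo "${1%bin/javadoc}"
}

# has_javadoc JAVA_HOME_DIR
# Succeeds if the directory contains an executable bin/javadoc.
has_javadoc() {
  [ -x "${1%/}/bin/javadoc" ]
}
```

Example call before the build: has_javadoc "$JAVA_HOME" || echo "JAVA_HOME is not usable for javadoc" >&2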
clone and package
mvn package had to be started twice: on the first try JAVA_HOME was not set, so javadoc was not available.
git clone https://gerrit.wikimedia.org/r/wikidata/query/rdf wikidata-query-rdf
Cloning into 'wikidata-query-rdf'...
remote: Counting objects: 111, done
remote: Total 26684 (delta 0), reused 26684 (delta 0)
Receiving objects: 100% (26684/26684), 4.84 MiB | 3.23 MiB/s, done.
Resolving deltas: 100% (13928/13928), done.
cd wikidata-query-rdf/
mvn package
[INFO] Building jar: /home/wf/wikidata-query-rdf/common/target/wikidata-query-common-0.3.124-SNAPSHOT.jar
[INFO]
[INFO] --- maven-javadoc-plugin:3.2.0:jar (attach-javadocs) @ common ---
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Wikidata Query Service 0.3.124-SNAPSHOT:
[INFO]
[INFO] Wikidata Query Service ............................. SUCCESS [ 19.204 s]
...
[INFO] --- maven-assembly-plugin:3.3.0:single (default) @ service ---
[INFO] Reading assembly descriptor: src/assembly/dist.xml
[INFO] Building tar: /home/wf/wikidata-query-rdf/dist/target/service-0.3.124-SNAPSHOT-dist.tar.gz
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Wikidata Query Service 0.3.124-SNAPSHOT:
[INFO]
[INFO] Wikidata Query Service ............................. SUCCESS [ 1.904 s]
[INFO] Shared code ........................................ SUCCESS [ 4.798 s]
[INFO] Wikidata Query RDF Testing Tools ................... SUCCESS [ 15.411 s]
[INFO] Jetty logging dependencies ......................... SUCCESS [ 37.577 s]
[INFO] Blazegraph extension to improve performance for Wikibase SUCCESS [03:14 min]
[INFO] Blazegraph Service Package ......................... SUCCESS [ 33.788 s]
[INFO] Wikidata Query RDF Tools ........................... SUCCESS [01:13 min]
[INFO] Wikidata Query Service Streaming Updater - Common .. SUCCESS [ 4.758 s]
[INFO] Wikidata Query Service Streaming Updater - Producer SUCCESS [04:34 min]
[INFO] Wikidata Query Service Streaming Updater - Consumer SUCCESS [ 7.982 s]
[INFO] MediaWiki OAuth 1.0a Proxy Service ................. SUCCESS [ 37.591 s]
[INFO] rdf-spark-tools .................................... SUCCESS [09:44 min]
[INFO] Wikibase RDF Query Service ......................... SUCCESS [ 10.378 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 21:21 min
[INFO] Finished at: 2023-04-29T14:05:59+02:00
[INFO] ------------------------------------------------------------------------
check dist
cd dist/target
~/wikidata-query-rdf/dist/target$ tar tvfz service-0.3.124-SNAPSHOT-dist.tar.gz
drwxrwxr-x wf/wf 0 2023-04-29 13:32 service-0.3.124-SNAPSHOT/
-rw-rw-r-- wf/wf 1170 2023-04-29 13:32 service-0.3.124-SNAPSHOT/prefixes-sdc.conf
-rwxrwxr-x wf/wf 1277 2023-04-29 13:32 service-0.3.124-SNAPSHOT/wcqs-data-reload.sh
-rwxrwxr-x wf/wf 2656 2023-04-29 13:32 service-0.3.124-SNAPSHOT/runUpdate.sh
-rwxrwxr-x wf/wf 599 2023-04-29 13:32 service-0.3.124-SNAPSHOT/createNamespace.sh
-rw-rw-r-- wf/wf 1483 2023-04-29 13:32 service-0.3.124-SNAPSHOT/default.properties
-rwxrwxr-x wf/wf 6470 2023-04-29 13:32 service-0.3.124-SNAPSHOT/runBlazegraph.sh
-rwxrwxr-x wf/wf 1345 2023-04-29 13:32 service-0.3.124-SNAPSHOT/loadRestAPI.sh
-rwxrwxr-x wf/wf 490 2023-04-29 13:32 service-0.3.124-SNAPSHOT/forAllCategoryWikis.sh
-rwxrwxr-x wf/wf 857 2023-04-29 13:32 service-0.3.124-SNAPSHOT/munge.sh
-rwxrwxr-x wf/wf 1133 2023-04-29 13:32 service-0.3.124-SNAPSHOT/loadCategoryDaily.sh
-rwxrwxr-x wf/wf 882 2023-04-29 13:32 service-0.3.124-SNAPSHOT/loadData.sh
-rw-rw-r-- wf/wf 3412 2023-04-29 13:32 service-0.3.124-SNAPSHOT/RWStore.properties
-rw-rw-r-- wf/wf 315 2023-04-29 13:32 service-0.3.124-SNAPSHOT/prefixes.conf
-rw-rw-r-- wf/wf 2202 2023-04-29 13:32 service-0.3.124-SNAPSHOT/ldf-config.json
-rwxrwxr-x wf/wf 949 2023-04-29 13:32 service-0.3.124-SNAPSHOT/loadCategoryDump.sh
-rwxrwxr-x wf/wf 181 2023-04-29 13:32 service-0.3.124-SNAPSHOT/summarizeEvents.sh
-rw-rw-r-- wf/wf 2307 2023-04-29 13:32 service-0.3.124-SNAPSHOT/mwservices.json
-rwxrwxr-x wf/wf 2167 2023-04-29 13:32 service-0.3.124-SNAPSHOT/runStreamingUpdater.sh
-rw-rw-r-- wf/wf 20669767 2023-04-29 13:50 service-0.3.124-SNAPSHOT/lib/wikidata-query-tools-0.3.124-SNAPSHOT-jar-with-dependencies.jar
-rw-rw-r-- wf/wf 20713599 2023-04-29 13:55 service-0.3.124-SNAPSHOT/lib/streaming-updater-consumer-0.3.124-SNAPSHOT-jar-with-dependencies.jar
-rw-rw-r-- wf/wf 34014443 2023-04-29 13:55 service-0.3.124-SNAPSHOT/lib/streaming-updater-producer-0.3.124-SNAPSHOT-jar-with-dependencies.jar
-rw-rw-r-- wf/wf 6143989 2023-04-29 13:45 service-0.3.124-SNAPSHOT/lib/logging/jetty-logging-0.3.124-SNAPSHOT-jar-with-dependencies.jar
drwxrwxr-x wf/wf 0 2023-04-29 13:32 service-0.3.124-SNAPSHOT/docs/
drwxrwxr-x wf/wf 0 2023-04-29 13:32 service-0.3.124-SNAPSHOT/docs/diagrams/
-rw-rw-r-- wf/wf 11435 2023-04-29 13:32 service-0.3.124-SNAPSHOT/docs/sparql-query-examples.md
-rw-rw-r-- wf/wf 9803 2023-04-29 13:32 service-0.3.124-SNAPSHOT/docs/exploring-linked-data.md
-rw-rw-r-- wf/wf 11358 2023-04-29 13:32 service-0.3.124-SNAPSHOT/docs/LICENSE.Apache
-rw-rw-r-- wf/wf 17986 2023-04-29 13:32 service-0.3.124-SNAPSHOT/docs/LICENSE.GPL
-rw-rw-r-- wf/wf 877 2023-04-29 13:32 service-0.3.124-SNAPSHOT/docs/diagrams/streaming-updater-components.puml
-rw-rw-r-- wf/wf 740 2023-04-29 13:32 service-0.3.124-SNAPSHOT/docs/diagrams/streaming-updater-sequence.puml
-rw-rw-r-- wf/wf 1086 2023-04-29 13:32 service-0.3.124-SNAPSHOT/docs/diagrams/wdqs-high-level.puml
-rw-rw-r-- wf/wf 1556 2023-04-29 13:32 service-0.3.124-SNAPSHOT/docs/diagrams/streaming-updater-deployment.puml
-rw-rw-r-- wf/wf 3014 2023-04-29 13:32 service-0.3.124-SNAPSHOT/docs/getting-started.md
-rw-rw-r-- wf/wf 341 2023-04-29 13:32 service-0.3.124-SNAPSHOT/docs/TODO.md
-rw-rw-r-- wf/wf 1546 2023-04-29 13:32 service-0.3.124-SNAPSHOT/docs/configs.md
-rw-rw-r-- wf/wf 476 2023-04-29 13:32 service-0.3.124-SNAPSHOT/docs/Categories.md
-rw-rw-r-- wf/wf 79416954 2023-04-29 13:49 service-0.3.124-SNAPSHOT/blazegraph-service-0.3.124-SNAPSHOT.war
-rw-rw-r-- wf/wf 9809699 2023-04-29 13:56 service-0.3.124-SNAPSHOT/mw-oauth-proxy-0.3.124-SNAPSHOT.war
-rw-rw-r-- wf/wf 7074499 2023-04-29 13:40 service-0.3.124-SNAPSHOT/jetty-runner-9.4.12.v20180830.jar
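Instead of reading the whole listing, the check can be limited to the scripts that are needed later, for example munge.sh. A sketch, assuming bash, tar and awk; the function name dist_has is made up:

```bash
# dist_has TARBALL NAME
# Succeeds if the gzipped tarball contains an entry whose path ends in /NAME.
dist_has() {
  tar tzf "$1" | awk -v name="/$2" '
    substr($0, length($0) - length(name) + 1) == name { found = 1 }
    END { exit !found }'
}
```

Example call: dist_has service-0.3.124-SNAPSHOT-dist.tar.gz munge.sh && echo "munge.sh found"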
Unpack and make available via symlink
tar xvfz service-0.3.124-SNAPSHOT-dist.tar.gz
# in target directory
ln -s /home/wf/wikidata-query-rdf/dist/target/service-0.3.124-SNAPSHOT service
calling munge.sh
domunge.sh
#!/bin/bash
# WF 2023-04-29
# munge the full dump; put in the background by calling this script via nohup
bzcat latest-all.ttl.bz2 | service/munge.sh -f - -d data -- --skolemize
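By default the exit status of a bash pipeline is that of its last command, so a failing bzcat (for example on a damaged archive) might not show up in the exit status of domunge.sh. A variant with pipefail, as a sketch only: the function name munge_dump and the override variables BZCAT and MUNGE are made up for this example, the munge.sh options are the ones used above.

```bash
# munge_dump DUMP OUTDIR
# Same pipeline as domunge.sh, but fails if either side of the pipe fails.
# BZCAT and MUNGE are hypothetical overrides, mainly useful for dry runs.
munge_dump() {
  local dump=$1 outdir=$2
  (
    set -o pipefail   # limited to this subshell
    "${BZCAT:-bzcat}" "$dump" | "${MUNGE:-service/munge.sh}" -f - -d "$outdir" -- --skolemize
  )
}
```

Example call: munge_dump latest-all.ttl.bz2 data || echo "munge failed" >&2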
nohup ./domunge.sh &
tail -f nohup.out
#logback.classic pattern: %d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n
14:23:31.529 [org.wikidata.query.rdf.tool.rdf.AsyncRDFHandler$RDFActionsReplayer] INFO org.wikidata.query.rdf.tool.Munge - Switching to data/wikidump-000000001.ttl.gz
14:24:17.795 [org.wikidata.query.rdf.tool.rdf.AsyncRDFHandler$RDFActionsReplayer] INFO o.w.q.r.t.r.EntityMungingRdfHandler - Processed 10000 entities at (154, 93, 79)
...
20:20:19.606 [org.wikidata.query.rdf.tool.rdf.AsyncRDFHandler$RDFActionsReplayer] INFO org.wikidata.query.rdf.tool.Munge - Switching to data/wikidump-000000221.ttl.gz
20:20:25.215 [org.wikidata.query.rdf.tool.rdf.AsyncRDFHandler$RDFActionsReplayer] INFO o.w.q.r.t.r.EntityMungingRdfHandler - Processed 22010000 entities at (1464, 1343, 1284)
20:20:30.268 [org.wikidata.query.rdf.tool.rdf.AsyncRDFHandler$RDFActionsReplayer] INFO o.w.q.r.t.r.EntityMungingRdfHandler - Processed 22020000 entities at (1496, 1351, 1287)
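The average munge throughput can be read from nohup.out by relating the last "Processed N entities" line to the first timestamp. A sketch, assuming awk; the function name munge_rate is made up. The log timestamps carry no date, so the calculation is only valid as long as the run has not passed midnight.

```bash
# munge_rate LOGFILE
# Prints entity count, elapsed seconds and average entities per second,
# based on the first timestamp and the last "Processed N entities" line.
munge_rate() {
  awk '
    # convert HH:mm:ss.SSS to seconds since midnight (milliseconds ignored)
    function secs(t,   p) { split(t, p, /[:.]/); return p[1] * 3600 + p[2] * 60 + p[3] }
    /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]/ {
      if (start == "") start = secs($1)
    }
    /Processed [0-9]+ entities/ {
      for (i = 1; i <= NF; i++) if ($i == "Processed") n = $(i + 1)
      last = secs($1)
    }
    END {
      if (n == "" || last <= start) exit 1
      printf "%d entities in %d s (%d entities/s)\n", n, last - start, n / (last - start)
    }' "$1"
}
```

For the log excerpt above (start 14:23:31, 22020000 entities at 20:20:30) munge_rate nohup.out prints: 22020000 entities in 21419 s (1028 entities/s)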