Wikidata Import 2023-04-26

From BITPlan Wiki
Jump to navigation Jump to search

Download ~9 hours

Download Options

https://dumps.wikimedia.org/wikidatawiki/entities

latest-all.json.bz2                                19-Apr-2023 19:01         81437052900
latest-all.json.gz                                 26-Apr-2023 06:43        123717867013
latest-all.nt.bz2                                  20-Apr-2023 09:17        158037435620
latest-all.nt.gz                                   19-Apr-2023 15:33        204694424758
latest-all.ttl.bz2                                 19-Apr-2023 20:37        101383518288
latest-all.ttl.gz                                  26-Apr-2023 08:18        123942927864
latest-lexemes.json.bz2                            26-Apr-2023 03:51           297892886
latest-lexemes.json.gz                             26-Apr-2023 03:49           407135019
latest-lexemes.nt.bz2                              21-Apr-2023 23:33           768095633
latest-lexemes.nt.gz                               21-Apr-2023 23:28          1008192049
latest-lexemes.ttl.bz2                             21-Apr-2023 23:29           433401231
latest-lexemes.ttl.gz                              21-Apr-2023 23:25           540610049
latest-truthy.nt.bz2                               21-Apr-2023 17:41         35992719959
latest-truthy.nt.gz                                21-Apr-2023 14:24         59704444949

download result

ls -l latest*.gz
-rw-rw-r-- 1 wf wf 123942927864 Apr 26 10:18 latest-all.ttl.gz
-rw-rw-r-- 1 wf wf    540610049 Apr 22 01:25 latest-lexemes.ttl.gz

download script

cat download.sh 
#/bin/bash
# WF 2023-04-26
# download wikidata dumps
baseurl=https://dumps.wikimedia.org/wikidatawiki/entities/
for file in latest-all latest-lexemes
do
  for ext in ttl.gz ttl.bz2
  do
    url=$baseurl/$file.$ext
    log=$file-$ext.log
    nohup wget $url >> $log&
  done
done

download logs

latest-all.ttl.gz 123942927864 8h52m

--2023-04-26 15:38:37--  https://dumps.wikimedia.org/wikidatawiki/entities//latest-all.ttl.gz
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.142, 2620:0:861:2:208:80:154:142
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 123942927864 (115G) [application/octet-stream]
Saving to: ‘latest-all.ttl.gz’

     0K .......... .......... .......... .......... ..........  0%  335K 4d4h
    50K .......... .......... .......... .......... ..........  0%  220K 5d6h
   100K .......... .......... .......... .......... ..........  0%  438K 4d14h
...
121037950K .......... .......... .......... .......... .......... 99% 3.91M 0s
121038000K .......... .....                                      100%  181M=8h52m

2023-04-27 00:31:28 (3.70 MB/s) - ‘latest-all.ttl.gz’ saved [123942927864/123942927864]

latest-all.ttl.bz2 101383518288 7h27m

--2023-04-26 15:38:37--  https://dumps.wikimedia.org/wikidatawiki/entities//latest-all.ttl.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.142, 2620:0:861:2:208:80:154:142
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 101383518288 (94G) [application/octet-stream]
Saving to: ‘latest-all.ttl.bz2’

     0K .......... .......... .......... .......... ..........  0%  219K 5d5h
    50K .......... .......... .......... .......... ..........  0%  219K 5d5h
   100K .......... .......... .......... .......... ..........  0%  437K 4d8h
99007250K .......... .......... .......... .......... .......... 99% 2.17M 0s

99007300K .......... .......... .......... .......... ..        100% 2.45M=7h27m

2023-04-26 23:06:17 (3.60 MB/s) - ‘latest-all.ttl.bz2’ saved [101383518288/101383518288]

latest-lexemes.ttl.gz 540610049 2m1s

--2023-04-26 15:38:37--  https://dumps.wikimedia.org/wikidatawiki/entities//latest-lexemes.ttl.gz
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.142, 2620:0:861:2:208:80:154:142
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 540610049 (516M) [application/octet-stream]
Saving to: ‘latest-lexemes.ttl.gz’

     0K .......... .......... .......... .......... ..........  0%  355K 24m45s
    50K .......... .......... .......... .......... ..........  0%  209K 33m23s
   100K .......... .......... .......... .......... ..........  0%  416K 29m18s

527850K .......... .......... .......... .......... .......... 99% 62.1M 0s
527900K .......... .......... .......... .........            100% 23.7M=2m1s

2023-04-26 15:40:39 (4.27 MB/s) - ‘latest-lexemes.ttl.gz’ saved [540610049/540610049]

latest-lexemes.ttl.bz2 43340123 1m45s

attempt by script ❌
--2023-04-26 15:38:37--  https://dumps.wikimedia.org/wikidatawiki/entities//latest-lexemes.ttl.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.142, 2620:0:861:2:208:80:154:142
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.142|:443... connected.
HTTP request sent, awaiting response... 503 Service Temporarily Unavailable
2023-04-26 15:38:38 ERROR 503: Service Temporarily Unavailable.

manual retry latest-lexemes.ttl.bz2

wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-lexemes.ttl.bz2
--2023-04-28 13:44:11--  https://dumps.wikimedia.org/wikidatawiki/entities/latest-lexemes.ttl.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.142, 2620:0:861:2:208:80:154:142
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 433401231 (413M) [application/octet-stream]
Saving to: ‘latest-lexemes.ttl.bz2’

latest-lexemes.ttl.   9%[>                   ]  38.66M  4.19MB/s    eta 1m 53
latest-lexemes.ttl. 100%[===================>] 413.32M  4.25MB/s    in 1m 45s  

2023-04-28 13:45:57 (3.93 MB/s) - ‘latest-lexemes.ttl.bz2’ saved [433401231/433401231]

Munging

Preparation

Needed installs and settings

sudo apt install maven
export JAVA_HOME=$(update-alternatives --query javadoc | grep Value: | head -n1 | sed 's/Value: //' | sed 's@bin/javadoc$@@')
echo $JAVA_HOME
/usr/lib/jvm/java-11-openjdk-amd64/

clone and package

had to start mvn package twice since javadoc was not available JAVA_HOME was not set on first try

git clone https://gerrit.wikimedia.org/r/wikidata/query/rdf wikidata-query-rdf
Cloning into 'wikidata-query-rdf'...
remote: Counting objects: 111, done
remote: Total 26684 (delta 0), reused 26684 (delta 0)
Receiving objects: 100% (26684/26684), 4.84 MiB | 3.23 MiB/s, done.
Resolving deltas: 100% (13928/13928), done.
cd wikidata-query-rdf/
mvn package
[INFO] Building jar: /home/wf/wikidata-query-rdf/common/target/wikidata-query-common-0.3.124-SNAPSHOT.jar
[INFO] 
[INFO] --- maven-javadoc-plugin:3.2.0:jar (attach-javadocs) @ common ---
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Wikidata Query Service 0.3.124-SNAPSHOT:
[INFO] 
[INFO] Wikidata Query Service ............................. SUCCESS [ 19.204 s]
...
[INFO] --- maven-assembly-plugin:3.3.0:single (default) @ service ---
[INFO] Reading assembly descriptor: src/assembly/dist.xml
[INFO] Building tar: /home/wf/wikidata-query-rdf/dist/target/service-0.3.124-SNAPSHOT-dist.tar.gz
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Wikidata Query Service 0.3.124-SNAPSHOT:
[INFO] 
[INFO] Wikidata Query Service ............................. SUCCESS [  1.904 s]
[INFO] Shared code ........................................ SUCCESS [  4.798 s]
[INFO] Wikidata Query RDF Testing Tools ................... SUCCESS [ 15.411 s]
[INFO] Jetty logging dependencies ......................... SUCCESS [ 37.577 s]
[INFO] Blazegraph extension to improve performance for Wikibase SUCCESS [03:14 min]
[INFO] Blazegraph Service Package ......................... SUCCESS [ 33.788 s]
[INFO] Wikidata Query RDF Tools ........................... SUCCESS [01:13 min]
[INFO] Wikidata Query Service Streaming Updater - Common .. SUCCESS [  4.758 s]
[INFO] Wikidata Query Service Streaming Updater - Producer  SUCCESS [04:34 min]
[INFO] Wikidata Query Service Streaming Updater - Consumer  SUCCESS [  7.982 s]
[INFO] MediaWiki OAuth 1.0a Proxy Service ................. SUCCESS [ 37.591 s]
[INFO] rdf-spark-tools .................................... SUCCESS [09:44 min]
[INFO] Wikibase RDF Query Service ......................... SUCCESS [ 10.378 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  21:21 min
[INFO] Finished at: 2023-04-29T14:05:59+02:00
[INFO] ------------------------------------------------------------------------

check dist

cd dist/target
~/wikidata-query-rdf/dist/target$ tar tvfz service-0.3.124-SNAPSHOT-dist.tar.gz 
drwxrwxr-x wf/wf             0 2023-04-29 13:32 service-0.3.124-SNAPSHOT/
-rw-rw-r-- wf/wf          1170 2023-04-29 13:32 service-0.3.124-SNAPSHOT/prefixes-sdc.conf
-rwxrwxr-x wf/wf          1277 2023-04-29 13:32 service-0.3.124-SNAPSHOT/wcqs-data-reload.sh
-rwxrwxr-x wf/wf          2656 2023-04-29 13:32 service-0.3.124-SNAPSHOT/runUpdate.sh
-rwxrwxr-x wf/wf           599 2023-04-29 13:32 service-0.3.124-SNAPSHOT/createNamespace.sh
-rw-rw-r-- wf/wf          1483 2023-04-29 13:32 service-0.3.124-SNAPSHOT/default.properties
-rwxrwxr-x wf/wf          6470 2023-04-29 13:32 service-0.3.124-SNAPSHOT/runBlazegraph.sh
-rwxrwxr-x wf/wf          1345 2023-04-29 13:32 service-0.3.124-SNAPSHOT/loadRestAPI.sh
-rwxrwxr-x wf/wf           490 2023-04-29 13:32 service-0.3.124-SNAPSHOT/forAllCategoryWikis.sh
-rwxrwxr-x wf/wf           857 2023-04-29 13:32 service-0.3.124-SNAPSHOT/munge.sh
-rwxrwxr-x wf/wf          1133 2023-04-29 13:32 service-0.3.124-SNAPSHOT/loadCategoryDaily.sh
-rwxrwxr-x wf/wf           882 2023-04-29 13:32 service-0.3.124-SNAPSHOT/loadData.sh
-rw-rw-r-- wf/wf          3412 2023-04-29 13:32 service-0.3.124-SNAPSHOT/RWStore.properties
-rw-rw-r-- wf/wf           315 2023-04-29 13:32 service-0.3.124-SNAPSHOT/prefixes.conf
-rw-rw-r-- wf/wf          2202 2023-04-29 13:32 service-0.3.124-SNAPSHOT/ldf-config.json
-rwxrwxr-x wf/wf           949 2023-04-29 13:32 service-0.3.124-SNAPSHOT/loadCategoryDump.sh
-rwxrwxr-x wf/wf           181 2023-04-29 13:32 service-0.3.124-SNAPSHOT/summarizeEvents.sh
-rw-rw-r-- wf/wf          2307 2023-04-29 13:32 service-0.3.124-SNAPSHOT/mwservices.json
-rwxrwxr-x wf/wf          2167 2023-04-29 13:32 service-0.3.124-SNAPSHOT/runStreamingUpdater.sh
-rw-rw-r-- wf/wf      20669767 2023-04-29 13:50 service-0.3.124-SNAPSHOT/lib/wikidata-query-tools-0.3.124-SNAPSHOT-jar-with-dependencies.jar
-rw-rw-r-- wf/wf      20713599 2023-04-29 13:55 service-0.3.124-SNAPSHOT/lib/streaming-updater-consumer-0.3.124-SNAPSHOT-jar-with-dependencies.jar
-rw-rw-r-- wf/wf      34014443 2023-04-29 13:55 service-0.3.124-SNAPSHOT/lib/streaming-updater-producer-0.3.124-SNAPSHOT-jar-with-dependencies.jar
-rw-rw-r-- wf/wf       6143989 2023-04-29 13:45 service-0.3.124-SNAPSHOT/lib/logging/jetty-logging-0.3.124-SNAPSHOT-jar-with-dependencies.jar
drwxrwxr-x wf/wf             0 2023-04-29 13:32 service-0.3.124-SNAPSHOT/docs/
drwxrwxr-x wf/wf             0 2023-04-29 13:32 service-0.3.124-SNAPSHOT/docs/diagrams/
-rw-rw-r-- wf/wf         11435 2023-04-29 13:32 service-0.3.124-SNAPSHOT/docs/sparql-query-examples.md
-rw-rw-r-- wf/wf          9803 2023-04-29 13:32 service-0.3.124-SNAPSHOT/docs/exploring-linked-data.md
-rw-rw-r-- wf/wf         11358 2023-04-29 13:32 service-0.3.124-SNAPSHOT/docs/LICENSE.Apache
-rw-rw-r-- wf/wf         17986 2023-04-29 13:32 service-0.3.124-SNAPSHOT/docs/LICENSE.GPL
-rw-rw-r-- wf/wf           877 2023-04-29 13:32 service-0.3.124-SNAPSHOT/docs/diagrams/streaming-updater-components.puml
-rw-rw-r-- wf/wf           740 2023-04-29 13:32 service-0.3.124-SNAPSHOT/docs/diagrams/streaming-updater-sequence.puml
-rw-rw-r-- wf/wf          1086 2023-04-29 13:32 service-0.3.124-SNAPSHOT/docs/diagrams/wdqs-high-level.puml
-rw-rw-r-- wf/wf          1556 2023-04-29 13:32 service-0.3.124-SNAPSHOT/docs/diagrams/streaming-updater-deployment.puml
-rw-rw-r-- wf/wf          3014 2023-04-29 13:32 service-0.3.124-SNAPSHOT/docs/getting-started.md
-rw-rw-r-- wf/wf           341 2023-04-29 13:32 service-0.3.124-SNAPSHOT/docs/TODO.md
-rw-rw-r-- wf/wf          1546 2023-04-29 13:32 service-0.3.124-SNAPSHOT/docs/configs.md
-rw-rw-r-- wf/wf           476 2023-04-29 13:32 service-0.3.124-SNAPSHOT/docs/Categories.md
-rw-rw-r-- wf/wf      79416954 2023-04-29 13:49 service-0.3.124-SNAPSHOT/blazegraph-service-0.3.124-SNAPSHOT.war
-rw-rw-r-- wf/wf       9809699 2023-04-29 13:56 service-0.3.124-SNAPSHOT/mw-oauth-proxy-0.3.124-SNAPSHOT.war
-rw-rw-r-- wf/wf       7074499 2023-04-29 13:40 service-0.3.124-SNAPSHOT/jetty-runner-9.4.12.v20180830.jar

Unpack and make available via symlink

tar xvfz service-0.3.124-SNAPSHOT-dist.tar.gz
# in target directory
ln -s /home/wf/wikidata-query-rdf/dist/target/service-0.3.124-SNAPSHOT service

calling munge.sh

domunge.sh

#!/bin/bash
# WF 2023-04-29
# start munge in background
bzcat latest-all.ttl.bz2 | service/munge.sh -f - -d data -- --skolemize

start domunge.sh and show nohup.out log

nohup ./domunge.sh &
tail -f  nohup.out
#logback.classic pattern: %d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n
14:23:31.529 [org.wikidata.query.rdf.tool.rdf.AsyncRDFHandler$RDFActionsReplayer] INFO  org.wikidata.query.rdf.tool.Munge - Switching to data/wikidump-000000001.ttl.gz
14:24:17.795 [org.wikidata.query.rdf.tool.rdf.AsyncRDFHandler$RDFActionsReplayer] INFO  o.w.q.r.t.r.EntityMungingRdfHandler - Processed 10000 entities at (154, 93, 79)
...
17:10:05.507 [org.wikidata.query.rdf.tool.rdf.AsyncRDFHandler$RDFActionsReplayer] INFO  o.w.q.r.t.r.EntityMungingRdfHandler - Processed 10000000 entities at (936, 919, 1011)
...
19:56:14.353 [org.wikidata.query.rdf.tool.rdf.AsyncRDFHandler$RDFActionsReplayer] INFO  o.w.q.r.t.r.EntityMungingRdfHandler - Processed 20000000 entities at (1191, 1097, 979)
...
22:21:33.096 [org.wikidata.query.rdf.tool.rdf.AsyncRDFHandler$RDFActionsReplayer] INFO  o.w.q.r.t.r.EntityMungingRdfHandler - Processed 30000000 entities at (1228, 1136, 1221)
...
01:10:14.808 [org.wikidata.query.rdf.tool.rdf.AsyncRDFHandler$RDFActionsReplayer] INFO  o.w.q.r.t.r.EntityMungingRdfHandler - Processed 40000000 entities at (766, 764, 988)
...
03:55:33.809 [org.wikidata.query.rdf.tool.rdf.AsyncRDFHandler$RDFActionsReplayer] INFO  o.w.q.r.t.r.EntityMungingRdfHandler - Processed 50000000 entities at (786, 779, 893)
...
06:38:44.146 [org.wikidata.query.rdf.tool.rdf.AsyncRDFHandler$RDFActionsReplayer] INFO  o.w.q.r.t.r.EntityMungingRdfHandler - Processed 60000000 entities at (1343, 1272, 1081)
...
09:07:55.295 [org.wikidata.query.rdf.tool.rdf.AsyncRDFHandler$RDFActionsReplayer] INFO  o.w.q.r.t.r.EntityMungingRdfHandler - Processed 70000000 entities at (527, 804, 1053)
...
11:56:15.113 [org.wikidata.query.rdf.tool.rdf.AsyncRDFHandler$RDFActionsReplayer] INFO  o.w.q.r.t.r.EntityMungingRdfHandler - Processed 80000000 entities at (1103, 944, 993)
...
14:40:23.875 [org.wikidata.query.rdf.tool.rdf.AsyncRDFHandler$RDFActionsReplayer] INFO  o.w.q.r.t.r.EntityMungingRdfHandler - Processed 90000000 entities at (903, 850, 889)
...
17:19:18.387 [org.wikidata.query.rdf.tool.rdf.AsyncRDFHandler$RDFActionsReplayer] INFO  o.w.q.r.t.r.EntityMungingRdfHandler - Processed 100000000 entities at (1416, 1392, 1182)
...
18:42:24.907 [org.wikidata.query.rdf.tool.rdf.AsyncRDFHandler$RDFActionsReplayer] INFO  o.w.q.r.t.r.EntityMungingRdfHandler - Processed 105730000 entities at (2206, 1688, 1409)

check created munge files

grep gz nohup.out | cut -f9 -d" " | tail -5
data/wikidump-000000699.ttl.gz
data/wikidump-000000700.ttl.gz
data/wikidump-000000701.ttl.gz
data/wikidump-000000702.ttl.gz
data/wikidump-000000703.ttl.gz

inspect a sample file

zcat data/wikidump-000000702.ttl.gz  | head -40
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ontolex: <http://www.w3.org/ns/lemon/ontolex#> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix wikibase: <http://wikiba.se/ontology#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix schema: <http://schema.org/> .
@prefix cc: <http://creativecommons.org/ns#> .
@prefix geo: <http://www.opengis.net/ont/geosparql#> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix wd: <http://www.wikidata.org/entity/> .
@prefix data: <https://www.wikidata.org/wiki/Special:EntityData/> .
@prefix s: <http://www.wikidata.org/entity/statement/> .
@prefix ref: <http://www.wikidata.org/reference/> .
@prefix v: <http://www.wikidata.org/value/> .
@prefix wdt: <http://www.wikidata.org/prop/direct/> .
@prefix wdtn: <http://www.wikidata.org/prop/direct-normalized/> .
@prefix p: <http://www.wikidata.org/prop/> .
@prefix ps: <http://www.wikidata.org/prop/statement/> .
@prefix psv: <http://www.wikidata.org/prop/statement/value/> .
@prefix psn: <http://www.wikidata.org/prop/statement/value-normalized/> .
@prefix pq: <http://www.wikidata.org/prop/qualifier/> .
@prefix pqv: <http://www.wikidata.org/prop/qualifier/value/> .
@prefix pqn: <http://www.wikidata.org/prop/qualifier/value-normalized/> .
@prefix pr: <http://www.wikidata.org/prop/reference/> .
@prefix prv: <http://www.wikidata.org/prop/reference/value/> .
@prefix prn: <http://www.wikidata.org/prop/reference/value-normalized/> .
@prefix wdno: <http://www.wikidata.org/prop/novalue/> .

<https://ceb.wikipedia.org/wiki/R%C3%ADo_Pitara> a schema:Article ;
	schema:about wd:Q35416151 ;
	schema:inLanguage "ceb" ;
	schema:isPartOf <https://ceb.wikipedia.org/> ;
	schema:name "Río Pitara"@ceb .

wd:Q35416151 wdt:P625 "Point(-67.2225 9.3544444444444)"^^geo:wktLiteral ;
	wdt:P31 wd:Q47521 ;
	wdt:P1566 "3630072" ;

see https://www.wikidata.org/wiki/Q35416151