Difference between revisions of "Wikidata Import 2023-04-26"
Revision as of 12:44, 29 April 2023
Download
Download Options
https://dumps.wikimedia.org/wikidatawiki/entities
latest-all.json.bz2      19-Apr-2023 19:01   81437052900
latest-all.json.gz       26-Apr-2023 06:43  123717867013
latest-all.nt.bz2        20-Apr-2023 09:17  158037435620
latest-all.nt.gz         19-Apr-2023 15:33  204694424758
latest-all.ttl.bz2       19-Apr-2023 20:37  101383518288
latest-all.ttl.gz        26-Apr-2023 08:18  123942927864
latest-lexemes.json.bz2  26-Apr-2023 03:51     297892886
latest-lexemes.json.gz   26-Apr-2023 03:49     407135019
latest-lexemes.nt.bz2    21-Apr-2023 23:33     768095633
latest-lexemes.nt.gz     21-Apr-2023 23:28    1008192049
latest-lexemes.ttl.bz2   21-Apr-2023 23:29     433401231
latest-lexemes.ttl.gz    21-Apr-2023 23:25     540610049
latest-truthy.nt.bz2     21-Apr-2023 17:41   35992719959
latest-truthy.nt.gz      21-Apr-2023 14:24   59704444949
download result
ls -l latest*.gz
-rw-rw-r-- 1 wf wf 123942927864 Apr 26 10:18 latest-all.ttl.gz
-rw-rw-r-- 1 wf wf 540610049 Apr 22 01:25 latest-lexemes.ttl.gz
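The sizes shown by ls can be compared with the byte counts in the download index to confirm that a dump arrived complete. A minimal sketch of such a check follows; the function name check_size is hypothetical and not part of the original notes.

```shell
#!/bin/bash
# Hypothetical helper (not from the original notes):
# compare a file's size in bytes with an expected value.
check_size() {
  local file=$1 expected=$2 actual
  # wc -c prints the byte count; strip any padding some implementations add
  actual=$(wc -c < "$file" | tr -d '[:space:]')
  if [ "$actual" = "$expected" ]; then
    echo "OK $file $actual"
  else
    echo "MISMATCH $file expected=$expected actual=$actual" >&2
    return 1
  fi
}

# Usage sketch with the sizes from the index (not executed here):
# check_size latest-all.ttl.gz 123942927864
# check_size latest-lexemes.ttl.gz 540610049
```

A size match only shows that the transfer was not truncated; it does not detect corrupted content.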
download script
cat download.sh
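#!/bin/bash
# WF 2023-04-26
# download wikidata dumps
# note: the trailing slash yields the harmless double slash seen in the logged URLs
baseurl=https://dumps.wikimedia.org/wikidatawiki/entities/
for file in latest-all latest-lexemes
do
  for ext in ttl.gz ttl.bz2
  do
    url=$baseurl/$file.$ext
    log=$file-$ext.log
    # start each download in the background; wget writes its progress to stderr
    nohup wget "$url" >> "$log" 2>&1 &
  done
done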
download logs
latest-all.ttl.gz 123942927864 8h52m ✓
--2023-04-26 15:38:37-- https://dumps.wikimedia.org/wikidatawiki/entities//latest-all.ttl.gz
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.142, 2620:0:861:2:208:80:154:142
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 123942927864 (115G) [application/octet-stream]
Saving to: ‘latest-all.ttl.gz’
0K .......... .......... .......... .......... .......... 0% 335K 4d4h
50K .......... .......... .......... .......... .......... 0% 220K 5d6h
100K .......... .......... .......... .......... .......... 0% 438K 4d14h
...
121037950K .......... .......... .......... .......... .......... 99% 3.91M 0s
121038000K .......... ..... 100% 181M=8h52m
2023-04-27 00:31:28 (3.70 MB/s) - ‘latest-all.ttl.gz’ saved [123942927864/123942927864]
latest-all.ttl.bz2 101383518288 7h27m ✓
--2023-04-26 15:38:37-- https://dumps.wikimedia.org/wikidatawiki/entities//latest-all.ttl.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.142, 2620:0:861:2:208:80:154:142
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 101383518288 (94G) [application/octet-stream]
Saving to: ‘latest-all.ttl.bz2’
0K .......... .......... .......... .......... .......... 0% 219K 5d5h
50K .......... .......... .......... .......... .......... 0% 219K 5d5h
100K .......... .......... .......... .......... .......... 0% 437K 4d8h
99007250K .......... .......... .......... .......... .......... 99% 2.17M 0s
99007300K .......... .......... .......... .......... .. 100% 2.45M=7h27m
2023-04-26 23:06:17 (3.60 MB/s) - ‘latest-all.ttl.bz2’ saved [101383518288/101383518288]
latest-lexemes.ttl.gz 540610049 2m1s ✓
--2023-04-26 15:38:37-- https://dumps.wikimedia.org/wikidatawiki/entities//latest-lexemes.ttl.gz
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.142, 2620:0:861:2:208:80:154:142
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 540610049 (516M) [application/octet-stream]
Saving to: ‘latest-lexemes.ttl.gz’
0K .......... .......... .......... .......... .......... 0% 355K 24m45s
50K .......... .......... .......... .......... .......... 0% 209K 33m23s
100K .......... .......... .......... .......... .......... 0% 416K 29m18s
527850K .......... .......... .......... .......... .......... 99% 62.1M 0s
527900K .......... .......... .......... ......... 100% 23.7M=2m1s
2023-04-26 15:40:39 (4.27 MB/s) - ‘latest-lexemes.ttl.gz’ saved [540610049/540610049]
latest-lexemes.ttl.bz2 433401231 1m45s ✓
attempt by script ❌
--2023-04-26 15:38:37-- https://dumps.wikimedia.org/wikidatawiki/entities//latest-lexemes.ttl.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.142, 2620:0:861:2:208:80:154:142
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.142|:443... connected.
HTTP request sent, awaiting response... 503 Service Temporarily Unavailable
2023-04-26 15:38:38 ERROR 503: Service Temporarily Unavailable.
manual retry latest-lexemes.ttl.bz2
wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-lexemes.ttl.bz2
--2023-04-28 13:44:11-- https://dumps.wikimedia.org/wikidatawiki/entities/latest-lexemes.ttl.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.142, 2620:0:861:2:208:80:154:142
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 433401231 (413M) [application/octet-stream]
Saving to: ‘latest-lexemes.ttl.bz2’
latest-lexemes.ttl. 9%[> ] 38.66M 4.19MB/s eta 1m 53
latest-lexemes.ttl. 100%[===================>] 413.32M 4.25MB/s in 1m 45s
2023-04-28 13:45:57 (3.93 MB/s) - ‘latest-lexemes.ttl.bz2’ saved [433401231/433401231]
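The 503 above was resolved by a manual retry. A small retry wrapper could handle such transient errors inside the script instead. This is a sketch only: the function name retry, the attempt count and the delay are assumptions, not something the original script used.

```shell
#!/bin/bash
# Hypothetical helper (not from the original notes):
# run a command up to <attempts> times, sleeping <delay> seconds between tries.
retry() {
  local attempts=$1 delay=$2
  shift 2
  local n=1
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "giving up after $n attempts: $*" >&2
      return 1
    fi
    echo "attempt $n failed, retrying in ${delay}s: $*" >&2
    n=$((n + 1))
    sleep "$delay"
  done
}

# Usage sketch (not executed here); --continue resumes a partial download:
# retry 5 60 wget --continue https://dumps.wikimedia.org/wikidatawiki/entities/latest-lexemes.ttl.bz2
```

wget exits with a non-zero status on a server error such as 503, which is what triggers the next attempt.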
Munging
- https://github.com/wikimedia/wikidata-query-rdf/blob/master/docs/getting-started.md
- https://github.com/wikimedia/wikidata-query-deploy/blob/master/munge.sh
Preparation
Needed installs and settings
sudo apt install maven
export JAVA_HOME=$(update-alternatives --query javadoc | grep Value: | head -n1 | sed 's/Value: //' | sed 's@bin/javadoc$@@')
echo $JAVA_HOME
/usr/lib/jvm/java-11-openjdk-amd64/
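The JAVA_HOME pipeline takes the Value: line reported for javadoc and strips the trailing bin/javadoc. The same step can be written as a single sed call, sketched below. The function name java_home_from_query is hypothetical, and the sketch assumes the Value: line ends in bin/javadoc, as it does on this machine.

```shell
#!/bin/bash
# Hypothetical helper (not from the original notes):
# read `update-alternatives --query javadoc` output on stdin and print the
# JDK directory, i.e. the Value: path without the trailing bin/javadoc.
java_home_from_query() {
  sed -n 's@^Value: \(.*\)bin/javadoc$@\1@p' | head -n1
}

# Usage sketch (not executed here):
# export JAVA_HOME=$(update-alternatives --query javadoc | java_home_from_query)
```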
clone and package
git clone https://gerrit.wikimedia.org/r/wikidata/query/rdf wikidata-query-rdf
Cloning into 'wikidata-query-rdf'...
remote: Counting objects: 111, done
remote: Total 26684 (delta 0), reused 26684 (delta 0)
Receiving objects: 100% (26684/26684), 4.84 MiB | 3.23 MiB/s, done.
Resolving deltas: 100% (13928/13928), done.
cd wikidata-query-rdf/
mvn package
[INFO] Building jar: /home/wf/wikidata-query-rdf/common/target/wikidata-query-common-0.3.124-SNAPSHOT.jar
[INFO]
[INFO] --- maven-javadoc-plugin:3.2.0:jar (attach-javadocs) @ common ---
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Wikidata Query Service 0.3.124-SNAPSHOT:
[INFO]
[INFO] Wikidata Query Service ............................. SUCCESS [ 19.204 s]
calling munge.sh
bzcat dump-ttl.bz2 | munge.sh -f - -d MUNGER_OUTPUT -- --skolemize
# MUNGER_OUTPUT is a folder that you need to create; it will contain many ttl files (the output of the munger)
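Since both .gz and .bz2 dumps were downloaded, the decompressor in front of munge.sh depends on which file is used. A minimal sketch that picks it from the file extension follows; the function name decompressor_for is hypothetical, and the munge.sh options are the ones from the call above.

```shell
#!/bin/bash
# Hypothetical helper (not from the original notes):
# print the command that streams a compressed dump to stdout.
decompressor_for() {
  case "$1" in
    *.bz2) echo bzcat ;;
    *.gz)  echo zcat ;;
    *)     echo "unsupported extension: $1" >&2; return 1 ;;
  esac
}

# Usage sketch (not executed here):
# dump=latest-all.ttl.bz2
# mkdir -p MUNGER_OUTPUT
# "$(decompressor_for "$dump")" "$dump" | munge.sh -f - -d MUNGER_OUTPUT -- --skolemize
```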