Wikidata Import 2023-04-26

From BITPlan Wiki
Revision as of 15:24, 28 April 2023 by Wf (talk | contribs) (→‎Munging)

Download

Download Options

https://dumps.wikimedia.org/wikidatawiki/entities

latest-all.json.bz2                                19-Apr-2023 19:01         81437052900
latest-all.json.gz                                 26-Apr-2023 06:43        123717867013
latest-all.nt.bz2                                  20-Apr-2023 09:17        158037435620
latest-all.nt.gz                                   19-Apr-2023 15:33        204694424758
latest-all.ttl.bz2                                 19-Apr-2023 20:37        101383518288
latest-all.ttl.gz                                  26-Apr-2023 08:18        123942927864
latest-lexemes.json.bz2                            26-Apr-2023 03:51           297892886
latest-lexemes.json.gz                             26-Apr-2023 03:49           407135019
latest-lexemes.nt.bz2                              21-Apr-2023 23:33           768095633
latest-lexemes.nt.gz                               21-Apr-2023 23:28          1008192049
latest-lexemes.ttl.bz2                             21-Apr-2023 23:29           433401231
latest-lexemes.ttl.gz                              21-Apr-2023 23:25           540610049
latest-truthy.nt.bz2                               21-Apr-2023 17:41         35992719959
latest-truthy.nt.gz                                21-Apr-2023 14:24         59704444949

download result

ls -l latest*.gz
-rw-rw-r-- 1 wf wf 123942927864 Apr 26 10:18 latest-all.ttl.gz
-rw-rw-r-- 1 wf wf    540610049 Apr 22 01:25 latest-lexemes.ttl.gz
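
The byte counts in the ls -l output match the sizes in the listing above (123942927864 and 540610049). A minimal sketch of automating that comparison follows; check_size is a hypothetical helper name, not part of any Wikimedia tooling, and the expected size has to be copied by hand from the listing.

```shell
#!/bin/bash
# Sketch: compare a local file's byte count with the size from the dump listing.
# check_size is a hypothetical helper introduced only for illustration.
check_size() {
  local file=$1 expected=$2 actual
  # wc -c on a redirected file prints only the byte count
  actual=$(wc -c < "$file") || return 2
  actual=$((actual))  # strip padding that some wc implementations add
  if [ "$actual" -eq "$expected" ]; then
    echo "OK $file $actual"
  else
    echo "MISMATCH $file: expected $expected, got $actual"
    return 1
  fi
}
```

For the files above this would be called as check_size latest-all.ttl.gz 123942927864 and check_size latest-lexemes.ttl.gz 540610049. A matching size only shows that the transfer was complete, not that the content is intact.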

download script

cat download.sh 
#!/bin/bash
# WF 2023-04-26
# download wikidata dumps
baseurl=https://dumps.wikimedia.org/wikidatawiki/entities/
for file in latest-all latest-lexemes
do
  for ext in ttl.gz ttl.bz2
  do
    url=$baseurl/$file.$ext
    log=$file-$ext.log
    nohup wget $url >> $log&
  done
done

download logs

latest-all.ttl.gz 123942927864 8h52m

--2023-04-26 15:38:37--  https://dumps.wikimedia.org/wikidatawiki/entities//latest-all.ttl.gz
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.142, 2620:0:861:2:208:80:154:142
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 123942927864 (115G) [application/octet-stream]
Saving to: ‘latest-all.ttl.gz’

     0K .......... .......... .......... .......... ..........  0%  335K 4d4h
    50K .......... .......... .......... .......... ..........  0%  220K 5d6h
   100K .......... .......... .......... .......... ..........  0%  438K 4d14h
...
121037950K .......... .......... .......... .......... .......... 99% 3.91M 0s
121038000K .......... .....                                      100%  181M=8h52m

2023-04-27 00:31:28 (3.70 MB/s) - ‘latest-all.ttl.gz’ saved [123942927864/123942927864]

latest-all.ttl.bz2 101383518288 7h27m

--2023-04-26 15:38:37--  https://dumps.wikimedia.org/wikidatawiki/entities//latest-all.ttl.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.142, 2620:0:861:2:208:80:154:142
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 101383518288 (94G) [application/octet-stream]
Saving to: ‘latest-all.ttl.bz2’

     0K .......... .......... .......... .......... ..........  0%  219K 5d5h
    50K .......... .......... .......... .......... ..........  0%  219K 5d5h
   100K .......... .......... .......... .......... ..........  0%  437K 4d8h
...
99007250K .......... .......... .......... .......... .......... 99% 2.17M 0s
99007300K .......... .......... .......... .......... ..        100% 2.45M=7h27m

2023-04-26 23:06:17 (3.60 MB/s) - ‘latest-all.ttl.bz2’ saved [101383518288/101383518288]
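
The average rates reported by wget can be cross-checked from the byte counts and the start and end timestamps in the two logs above. This is only a sketch: rate_mib is a hypothetical helper, and it assumes wget's "MB/s" means 2^20 bytes per second, which is consistent with the figures in the logs.

```shell
#!/bin/bash
# Sketch: recompute wget's average rate from byte count and wall-clock seconds.
# rate_mib is a hypothetical helper; 1 MB is assumed to be 1048576 bytes.
rate_mib() {
  awk -v bytes="$1" -v secs="$2" 'BEGIN { printf "%.2f\n", bytes / secs / 1048576 }'
}

# latest-all.ttl.gz: 15:38:37 to 00:31:28 the next day = 8h52m51s = 31971 s
rate_mib 123942927864 31971   # prints 3.70
# latest-all.ttl.bz2: 15:38:37 to 23:06:17 = 7h27m40s = 26860 s
rate_mib 101383518288 26860   # prints 3.60
```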

latest-lexemes.ttl.gz 540610049 2m1s

--2023-04-26 15:38:37--  https://dumps.wikimedia.org/wikidatawiki/entities//latest-lexemes.ttl.gz
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.142, 2620:0:861:2:208:80:154:142
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 540610049 (516M) [application/octet-stream]
Saving to: ‘latest-lexemes.ttl.gz’

     0K .......... .......... .......... .......... ..........  0%  355K 24m45s
    50K .......... .......... .......... .......... ..........  0%  209K 33m23s
   100K .......... .......... .......... .......... ..........  0%  416K 29m18s
...
527850K .......... .......... .......... .......... .......... 99% 62.1M 0s
527900K .......... .......... .......... .........            100% 23.7M=2m1s

2023-04-26 15:40:39 (4.27 MB/s) - ‘latest-lexemes.ttl.gz’ saved [540610049/540610049]

latest-lexemes.ttl.bz2 433401231 1m45s

attempt by script ❌
--2023-04-26 15:38:37--  https://dumps.wikimedia.org/wikidatawiki/entities//latest-lexemes.ttl.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.142, 2620:0:861:2:208:80:154:142
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.142|:443... connected.
HTTP request sent, awaiting response... 503 Service Temporarily Unavailable
2023-04-26 15:38:38 ERROR 503: Service Temporarily Unavailable.

manual retry latest-lexemes.ttl.bz2

wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-lexemes.ttl.bz2
--2023-04-28 13:44:11--  https://dumps.wikimedia.org/wikidatawiki/entities/latest-lexemes.ttl.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.142, 2620:0:861:2:208:80:154:142
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 433401231 (413M) [application/octet-stream]
Saving to: ‘latest-lexemes.ttl.bz2’

latest-lexemes.ttl.   9%[>                   ]  38.66M  4.19MB/s    eta 1m 53
latest-lexemes.ttl. 100%[===================>] 413.32M  4.25MB/s    in 1m 45s  

2023-04-28 13:45:57 (3.93 MB/s) - ‘latest-lexemes.ttl.bz2’ saved [433401231/433401231]
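
The 503 above ended the scripted attempt immediately, so the file had to be fetched by hand two days later. A possible variant of the download step that lets wget retry on its own is sketched below. It assumes GNU wget 1.19.1 or later for --retry-on-http-error; the retry count and wait time are arbitrary choices, and dump_url, fetch_dump and the WGET override are hypothetical names introduced here for illustration.

```shell
#!/bin/bash
# Sketch: retrying variant of the download step (assumes GNU wget >= 1.19.1).
baseurl=https://dumps.wikimedia.org/wikidatawiki/entities

# Build the dump URL without the doubled slash seen in the logs.
dump_url() {
  printf '%s/%s.%s\n' "${baseurl%/}" "$1" "$2"
}

# Download one dump, resuming partial files and retrying on HTTP 503.
# WGET is a hypothetical override hook so the command can be dry-run,
# e.g. WGET=echo fetch_dump latest-lexemes ttl.bz2
fetch_dump() {
  local file=$1 ext=$2
  "${WGET:-wget}" --continue --tries=5 --waitretry=60 \
    --retry-on-http-error=503 "$(dump_url "$file" "$ext")"
}
```

A background run in the style of download.sh could then look like fetch_dump latest-lexemes ttl.bz2 >> latest-lexemes-ttl.bz2.log 2>&1 &. Whether the server would have answered a retry within those five attempts is not known from the logs.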

Munging

bzcat dump-ttl.bz2 | munge.sh -f - -d MUNGER_OUTPUT -- --skolemize
# MUNGER_OUTPUT is a folder that you must create beforehand; the munger writes many ttl files into it
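
The munging step can be wrapped so that the output folder is created first and the decompressor is chosen by file extension, since both .gz and .bz2 dumps were downloaded above. This is a minimal sketch: munge_dump and the MUNGE override are hypothetical names, and the munge.sh flags are simply the ones from the command above.

```shell
#!/bin/bash
# Sketch: decompress a dump and stream it into the munger.
# munge_dump and the MUNGE override are hypothetical; munge.sh is assumed
# to be on the PATH and to accept the flags used in the command above.
munge_dump() {
  local dump=$1 outdir=$2
  local -a reader
  case "$dump" in
    *.bz2) reader=(bzcat) ;;
    *.gz)  reader=(gzip -dc) ;;
    *)     echo "unsupported compression: $dump" >&2; return 1 ;;
  esac
  mkdir -p "$outdir" || return 1   # the output folder must exist before munging
  "${reader[@]}" "$dump" | "${MUNGE:-munge.sh}" -f - -d "$outdir" -- --skolemize
}
```

With the files from the download section this would be called as munge_dump latest-all.ttl.bz2 MUNGER_OUTPUT.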