Wikidata Import 2023-01-24
Revision as of 11:42, 25 January 2023
Download the latest Wikidata dump (~10 hours)
https://dumps.wikimedia.org/wikidatawiki/entities
latest-all.json.bz2      18-Jan-2023 17:40   79779054481
latest-all.json.gz       18-Jan-2023 10:51  121027823223
latest-all.nt.bz2        19-Jan-2023 17:00  155239026614
latest-all.nt.gz         18-Jan-2023 23:55  200917826250
latest-all.ttl.bz2       19-Jan-2023 04:34   99583991786
latest-all.ttl.gz        18-Jan-2023 19:25  121477047220
latest-lexemes.json.bz2  18-Jan-2023 03:47     270280878
latest-lexemes.json.gz   18-Jan-2023 03:46     369955852
latest-lexemes.nt.bz2    20-Jan-2023 23:32     717929951
latest-lexemes.nt.gz     20-Jan-2023 23:27     947996669
latest-lexemes.ttl.bz2   20-Jan-2023 23:28     402494804
latest-lexemes.ttl.gz    20-Jan-2023 23:25     503140103
latest-truthy.nt.bz2     20-Jan-2023 19:30   35434201681
latest-truthy.nt.gz      20-Jan-2023 16:19   58740712185
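The sizes in the listing are plain byte counts. As a minimal sketch, assuming GNU coreutils `numfmt` is installed, they can be converted to the binary units that wget prints in its `Length:` line. The helper name `human_size` is made up for illustration and is not part of the original page.

```shell
#!/bin/bash
# human_size is a hypothetical helper, not part of the original page.
# It converts a byte count to binary (IEC) units using GNU numfmt.
human_size() {
  numfmt --to=iec "$1"
}

human_size 155239026614   # size of latest-all.nt.bz2 from the listing
human_size 717929951      # size of latest-lexemes.nt.bz2 from the listing
```

This is handy for checking up front whether the target disk has room for the chosen dump.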
wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-lexemes.nt.bz2
--2023-01-25 11:41:45-- https://dumps.wikimedia.org/wikidatawiki/entities/latest-lexemes.nt.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.142, 2620:0:861:2:208:80:154:142
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 717929951 (685M) [application/octet-stream]
Saving to: ‘latest-lexemes.nt.bz2’
...
sudo nohup wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.nt.bz2&
tail -f nohup.out
--2023-01-24 10:33:23-- https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.nt.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.142, 2620:0:861:2:208:80:154:142
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 155239026614 (145G) [application/octet-stream]
Saving to: ‘latest-all.nt.bz2’
0K .......... .......... .......... .......... .......... 0% 337K 5d4h
50K .......... .......... .......... .......... .......... 0% 245K 6d4h
100K .......... .......... .......... .......... .......... 0% 425K 5d11h
...
1000K .......... .......... .......... .......... .......... 0% 58.4M 47h38m
...
10000K .......... .......... .......... .......... .......... 0% 3.66M 11h20m
...
100000K .......... .......... .......... .......... .......... 0% 23.6M 12h37m
...
1000000K .......... .......... .......... .......... .......... 0% 3.40M 9h1m
...
10000000K .......... .......... .......... .......... .......... 6% 3.05M 8h39m
...
100000000K .......... .......... .......... .......... .......... 65% 87.7M 3h18m
...
130000000K .......... .......... .......... .......... .......... 85% 101M 82m47s
...
140000000K .......... .......... .......... .......... .......... 92% 9.05M 44m30s
...
150000000K .......... .......... .......... .......... .......... 98% 3.94M 6m8s
...
151600600K .......... . 100% 623K=9h41m
2023-01-24 20:14:37 (4.25 MB/s) - ‘latest-all.nt.bz2’ saved [155239026614/155239026614]
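The average rate in the last wget line can be cross-checked from the byte count and the elapsed time. The following is only a sketch of that arithmetic; `avg_rate` is a hypothetical helper name, and the 9h41m duration is taken from the wget summary line.

```shell
#!/bin/bash
# avg_rate is a hypothetical helper: takes bytes and seconds, prints MiB/s.
avg_rate() {
  awk -v bytes="$1" -v secs="$2" \
    'BEGIN { printf "%.2f\n", bytes / secs / 1048576 }'
}

# 155239026614 bytes in 9h41m, both figures from the wget output
avg_rate 155239026614 $((9 * 3600 + 41 * 60))
```

The result should agree with the 4.25 MB/s that wget reports, which confirms that wget's "MB" here means 1048576 bytes.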
Decompress
Running bunzip2 under nohup needs care. Started in the background as follows, it aborts and deletes its output file:
nohup bunzip2 latest-all.nt.bz2 &
bunzip2: Control-C or similar caught, quitting.
bunzip2: Deleting output file latest-all.nt, if it exists.
See:
- https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=616002
- https://stackoverflow.com/a/50673585/1497139
Redirect the bunzip2 output and show progress
progress script
cat progress
#!/bin/bash
# show decompression progress
while :
do
  date
  du -sm *.nt
  sleep 10
done
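The progress script above loops forever and has to be stopped by hand. A possible variant, sketched here under the assumption that the decompression runs as a background job of the same user, stops once that process has exited. The function name `watch_progress` and its parameters are illustrative only.

```shell
#!/bin/bash
# watch_progress is a hypothetical variant of the progress script:
# it reports until the process with the given PID has exited.
# Arguments: PID to watch, polling interval in seconds (default 10).
watch_progress() {
  local pid=$1 interval=${2:-10}
  # kill -0 sends no signal; it only tests whether the PID can be signalled
  while kill -0 "$pid" 2>/dev/null; do
    date
    du -sm -- *.nt 2>/dev/null
    sleep "$interval"
  done
  echo "process $pid finished"
}
```

After starting bunzip2 in the background, the function could be called as `watch_progress $!`, since `$!` holds the PID of the most recent background job.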
uncompress
nohup bunzip2 latest-all.nt.bz2 &> /dev/null &
./progress
Wed Jan 25 06:15:32 AM CET 2023
5714 latest-all.nt
...
Wed Jan 25 07:36:32 AM CET 2023
300255 latest-all.nt
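Two readings from the progress output are enough to estimate the decompression rate. The sketch below does the arithmetic for the two samples shown; `mb_per_sec` is a hypothetical helper name, and the 81 minute gap is the difference between the two timestamps.

```shell
#!/bin/bash
# mb_per_sec is a hypothetical helper: takes two du -sm readings
# and the number of seconds between them, prints the rate in MB/s.
mb_per_sec() {
  awk -v a="$1" -v b="$2" -v secs="$3" \
    'BEGIN { printf "%.1f\n", (b - a) / secs }'
}

# 5714 MB at 06:15:32 and 300255 MB at 07:36:32, i.e. 81 minutes apart
mb_per_sec 5714 300255 $((81 * 60))
```

This comes out at roughly 60 MB/s of uncompressed output between those two samples.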
QLever control
https://github.com/ad-freiburg/qlever-control
mkdir qlever
cd qlever
git clone https://github.com/ad-freiburg/qlever-control
Cloning into 'qlever-control'...
remote: Enumerating objects: 426, done.
remote: Counting objects: 100% (266/266), done.
remote: Compressing objects: 100% (170/170), done.
remote: Total 426 (delta 108), reused 231 (delta 95), pack-reused 160
Receiving objects: 100% (426/426), 131.00 KiB | 585.00 KiB/s, done.
Resolving deltas: 100% (163/163), done.
Set up wikidata
mkdir wikidata
cd wikidata/
. ../qlever-control/qlever wikidata
QLEVER CONFIG
Checking your PATH ...
The directory "/hd/eneco/qlever/qlever-control" is already contained in your PATH
Setting up bash autocompletion ...
Done, number of completions: 35
Creating new Qleverfile ...
Copied pre-configured Qleverfile for "wikidata" into current directory.
Setup is complete
Type qlever and use autocompletion to see which actions are available. Add a
"show" in the end to see what an action does without executing it (for example,
qlever index show). Edit your local Qleverfile to change settings. A typical
sequence of actions if you have used a preconfigured Qleverfile is:
qlever get-data
qlever index
qlever start
qlever example-query