Difference between revisions of "Wikidata Import 2023-01-24"
Revision as of 06:17, 25 January 2023
Download the latest Wikidata dump (~10 hours)
https://dumps.wikimedia.org/wikidatawiki/entities
latest-all.json.bz2      18-Jan-2023 17:40   79779054481
latest-all.json.gz       18-Jan-2023 10:51  121027823223
latest-all.nt.bz2        19-Jan-2023 17:00  155239026614
latest-all.nt.gz         18-Jan-2023 23:55  200917826250
latest-all.ttl.bz2       19-Jan-2023 04:34   99583991786
latest-all.ttl.gz        18-Jan-2023 19:25  121477047220
latest-lexemes.json.bz2  18-Jan-2023 03:47     270280878
latest-lexemes.json.gz   18-Jan-2023 03:46     369955852
latest-lexemes.nt.bz2    20-Jan-2023 23:32     717929951
latest-lexemes.nt.gz     20-Jan-2023 23:27     947996669
latest-lexemes.ttl.bz2   20-Jan-2023 23:28     402494804
latest-lexemes.ttl.gz    20-Jan-2023 23:25     503140103
latest-truthy.nt.bz2     20-Jan-2023 19:30   35434201681
latest-truthy.nt.gz      20-Jan-2023 16:19   58740712185
sudo nohup wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.nt.bz2&
tail -f nohup.out
--2023-01-24 10:33:23-- https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.nt.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.142, 2620:0:861:2:208:80:154:142
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 155239026614 (145G) [application/octet-stream]
Saving to: ‘latest-all.nt.bz2’
0K .......... .......... .......... .......... .......... 0% 337K 5d4h
50K .......... .......... .......... .......... .......... 0% 245K 6d4h
100K .......... .......... .......... .......... .......... 0% 425K 5d11h
...
1000K .......... .......... .......... .......... .......... 0% 58.4M 47h38m
...
10000K .......... .......... .......... .......... .......... 0% 3.66M 11h20m
...
100000K .......... .......... .......... .......... .......... 0% 23.6M 12h37m
...
1000000K .......... .......... .......... .......... .......... 0% 3.40M 9h1m
...
10000000K .......... .......... .......... .......... .......... 6% 3.05M 8h39m
...
100000000K .......... .......... .......... .......... .......... 65% 87.7M 3h18m
...
130000000K .......... .......... .......... .......... .......... 85% 101M 82m47s
...
140000000K .......... .......... .......... .......... .......... 92% 9.05M 44m30s
...
150000000K .......... .......... .......... .......... .......... 98% 3.94M 6m8s
...
151600600K .......... . 100% 623K=9h41m
2023-01-24 20:14:37 (4.25 MB/s) - ‘latest-all.nt.bz2’ saved [155239026614/155239026614]
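After a download that runs for hours, it can be worth confirming that the file on disk has the byte count shown in the directory listing (155239026614 for latest-all.nt.bz2). The following is a minimal sketch; the helper name `check_size` is hypothetical and not part of the original procedure.

```shell
#!/bin/bash
# check_size FILE EXPECTED_BYTES
# Hypothetical helper: compares the size of FILE with the expected byte count.
check_size() {
  local file="$1" expected="$2" actual
  # wc -c prints the byte count; the arithmetic expansion strips any padding
  actual=$(( $(wc -c < "$file") ))
  if [ "$actual" -eq "$expected" ]; then
    echo "OK: $file has $actual bytes"
  else
    echo "MISMATCH: $file has $actual bytes, expected $expected"
    return 1
  fi
}

# Example call for the dump downloaded above:
# check_size latest-all.nt.bz2 155239026614
```

This checks only the size, not the content; it would not detect corrupted bytes in a file of the right length.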
unzip
bunzip2 does not work properly under nohup unless it is used with care. A plain background call aborts and deletes its output file:
nohup bunzip2 latest-all.nt.bz2 &
bunzip2: Control-C or similar caught, quitting.
bunzip2: Deleting output file latest-all.nt, if it exists.
see
- https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=616002
- https://stackoverflow.com/a/50673585/1497139
redirect bunzip2 output and show progress
progress script
cat progress
#!/bin/bash
# show decompression progress
while :
do
  date
  du -sm *.nt
  sleep 10
done
uncompress
nohup bunzip2 latest-all.nt.bz2 &> /dev/null &
./progress
Wed Jan 25 06:15:32 AM CET 2023
5714 latest-all.nt
Wed Jan 25 06:15:42 AM CET 2023
6300 latest-all.nt
Wed Jan 25 06:15:52 AM CET 2023
6890 latest-all.nt
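The samples above are 10 seconds apart, so the difference between two readings gives a rough decompression rate. The sketch below does that arithmetic and shows a possible variant of the progress loop that prints the rate directly; the function names `rate_mb_per_s` and `progress_with_rate` are hypothetical and not part of the original script.

```shell
#!/bin/bash
# rate_mb_per_s PREV_MB CUR_MB INTERVAL_SECONDS
# Hypothetical helper: MB written per second between two du -sm samples
# (integer division, so the result is rounded down).
rate_mb_per_s() {
  local prev="$1" cur="$2" interval="$3"
  echo $(( (cur - prev) / interval ))
}

# progress_with_rate FILE [INTERVAL_SECONDS]
# Hypothetical variant of the progress loop that also reports the rate.
progress_with_rate() {
  local file="$1" interval="${2:-10}" prev cur
  # du -sm prints "<size in MB><TAB><name>"; keep only the size
  prev=$(du -sm "$file" | cut -f1)
  while :
  do
    sleep "$interval"
    cur=$(du -sm "$file" | cut -f1)
    echo "$(date) ${cur} MB $(rate_mb_per_s "$prev" "$cur" "$interval") MB/s"
    prev=$cur
  done
}

# Example with the first two samples from the log above:
# rate_mb_per_s 5714 6300 10
```

For the logged samples this works out to a little under 60 MB of output per second. The loop variant would be started with `progress_with_rate latest-all.nt` and stopped with Ctrl-C, like the original script.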