Wikidata Import 2023-01-24

From BITPlan Wiki

Revision as of 07:46, 25 January 2023

Download the latest Wikidata dump (~10 hours)

https://dumps.wikimedia.org/wikidatawiki/entities

latest-all.json.bz2                                18-Jan-2023 17:40         79779054481
latest-all.json.gz                                 18-Jan-2023 10:51        121027823223
latest-all.nt.bz2                                  19-Jan-2023 17:00        155239026614
latest-all.nt.gz                                   18-Jan-2023 23:55        200917826250
latest-all.ttl.bz2                                 19-Jan-2023 04:34         99583991786
latest-all.ttl.gz                                  18-Jan-2023 19:25        121477047220
latest-lexemes.json.bz2                            18-Jan-2023 03:47           270280878
latest-lexemes.json.gz                             18-Jan-2023 03:46           369955852
latest-lexemes.nt.bz2                              20-Jan-2023 23:32           717929951
latest-lexemes.nt.gz                               20-Jan-2023 23:27           947996669
latest-lexemes.ttl.bz2                             20-Jan-2023 23:28           402494804
latest-lexemes.ttl.gz                              20-Jan-2023 23:25           503140103
latest-truthy.nt.bz2                               20-Jan-2023 19:30         35434201681
latest-truthy.nt.gz                                20-Jan-2023 16:19         58740712185
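
The sizes in the listing are raw byte counts. As a reading aid, a small helper can convert them to GiB. This is only a sketch: `to_gib` is a hypothetical name that does not appear on the original page, and `awk` is assumed to be available.

```shell
#!/bin/bash
# to_gib: hypothetical helper (not from the original page)
# converts a byte count to GiB, where 1 GiB = 1024^3 bytes
to_gib() {
  awk -v bytes="$1" 'BEGIN { printf "%.1f\n", bytes / (1024 * 1024 * 1024) }'
}
```

For example, `to_gib 155239026614` prints `144.6`, which matches the rounded `145G` that wget reports for latest-all.nt.bz2.
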
sudo nohup wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.nt.bz2&
tail -f nohup.out
--2023-01-24 10:33:23--  https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.nt.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.142, 2620:0:861:2:208:80:154:142
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 155239026614 (145G) [application/octet-stream]
Saving to: ‘latest-all.nt.bz2’

        0K .......... .......... .......... .......... ..........  0%  337K 5d4h
       50K .......... .......... .......... .......... ..........  0%  245K 6d4h
      100K .......... .......... .......... .......... ..........  0%  425K 5d11h
  ...
     1000K .......... .......... .......... .......... ..........  0% 58.4M 47h38m
  ...
    10000K .......... .......... .......... .......... ..........  0% 3.66M 11h20m
  ...
   100000K .......... .......... .......... .......... ..........  0% 23.6M 12h37m
  ...
  1000000K .......... .......... .......... .......... ..........  0% 3.40M 9h1m
  ...
 10000000K .......... .......... .......... .......... ..........  6% 3.05M 8h39m
...
100000000K .......... .......... .......... .......... .......... 65% 87.7M 3h18m
...
130000000K .......... .......... .......... .......... .......... 85%  101M 82m47s
...
140000000K .......... .......... .......... .......... .......... 92% 9.05M 44m30s
...
150000000K .......... .......... .......... .......... .......... 98% 3.94M 6m8s
...
151600600K .......... .                                          100%  623K=9h41m

2023-01-24 20:14:37 (4.25 MB/s) - ‘latest-all.nt.bz2’ saved [155239026614/155239026614]
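
After a download that runs for hours, it can be worth confirming that the file on disk has the byte count wget announced in its `Length:` line. A minimal sketch, assuming a bash shell; `size_matches` is a hypothetical helper name, not something used on the original page.

```shell
#!/bin/bash
# size_matches FILE EXPECTED_BYTES (hypothetical helper)
# returns 0 if FILE has exactly EXPECTED_BYTES bytes, 1 if not,
# and 2 if the file cannot be read
size_matches() {
  local file="$1" expected="$2" actual
  actual=$(wc -c < "$file") || return 2
  actual="${actual//[[:space:]]/}"   # some wc implementations pad the number
  [ "$actual" = "$expected" ]
}
```

Usage: `size_matches latest-all.nt.bz2 155239026614 && echo "size ok"`. A size match only shows that the transfer is complete; it says nothing about the integrity of the content.
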

unzip

Careful: bunzip2 does not work properly under nohup when started naively. The following attempt aborted and deleted its output file:

nohup bunzip2 latest-all.nt.bz2 &
bunzip2: Control-C or similar caught, quitting.
bunzip2: Deleting output file latest-all.nt, if it exists.

see: redirect bunzip output and show progress
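
A possible variation, not taken from the original page: decompress to standard output with `-c` and redirect that into the target file. This keeps the .bz2 archive and writes any error messages to a log file instead of discarding them. Whether it also avoids the abort shown above is an assumption that has not been tested here; `start_decompress` and the `.log` file name are hypothetical.

```shell
#!/bin/bash
# start_decompress ARCHIVE.bz2 (hypothetical helper)
# decompresses in the background to ARCHIVE (without .bz2), keeps the
# archive, logs stderr to ARCHIVE.log and prints the background PID
start_decompress() {
  local archive="$1"
  local target="${archive%.bz2}"
  # -c writes to stdout; stdin comes from /dev/null so nothing
  # stays attached to the terminal
  nohup bunzip2 -c "$archive" > "$target" 2> "$target.log" < /dev/null &
  echo $!
}
```

Usage: `start_decompress latest-all.nt.bz2`, then watch the growing latest-all.nt file with the progress script.
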

progress script

cat progress
#!/bin/bash
# show decompression progress:
# every 10 seconds print a timestamp and the size in MB of each .nt file
while :
do
  date
  du -sm *.nt
  sleep 10
done

uncompress

nohup bunzip2 latest-all.nt.bz2  &> /dev/null &
./progress 
Wed Jan 25 06:15:32 AM CET 2023
5714	latest-all.nt
...
Wed Jan 25 07:36:32 AM CET 2023
300255	latest-all.nt
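
The two samples above allow a rough throughput estimate: between 06:15:32 and 07:36:32 (81 minutes, i.e. 4860 seconds) the output grew from 5714 MB to 300255 MB. A minimal sketch of that arithmetic; `mb_per_second` is a hypothetical helper name and `awk` is assumed to be available.

```shell
#!/bin/bash
# mb_per_second START_MB END_MB SECONDS (hypothetical helper)
# average growth in MB per second between two du -sm samples
mb_per_second() {
  awk -v a="$1" -v b="$2" -v s="$3" 'BEGIN { printf "%.1f\n", (b - a) / s }'
}
```

`mb_per_second 5714 300255 4860` prints `60.6`, so the decompressed file grew at roughly 60 MB per second during this window.
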