Difference between revisions of "Wikidata Import 2023-01-24"

From BITPlan Wiki

Revision as of 07:15, 26 January 2023

Download latest Wikidata dump (~10 hours)

https://dumps.wikimedia.org/wikidatawiki/entities

latest-all.json.bz2                                18-Jan-2023 17:40         79779054481
latest-all.json.gz                                 18-Jan-2023 10:51        121027823223
latest-all.nt.bz2                                  19-Jan-2023 17:00        155239026614
latest-all.nt.gz                                   18-Jan-2023 23:55        200917826250
latest-all.ttl.bz2                                 19-Jan-2023 04:34         99583991786
latest-all.ttl.gz                                  18-Jan-2023 19:25        121477047220
latest-lexemes.json.bz2                            18-Jan-2023 03:47           270280878
latest-lexemes.json.gz                             18-Jan-2023 03:46           369955852
latest-lexemes.nt.bz2                              20-Jan-2023 23:32           717929951
latest-lexemes.nt.gz                               20-Jan-2023 23:27           947996669
latest-lexemes.ttl.bz2                             20-Jan-2023 23:28           402494804
latest-lexemes.ttl.gz                              20-Jan-2023 23:25           503140103
latest-truthy.nt.bz2                               20-Jan-2023 19:30         35434201681
latest-truthy.nt.gz                                20-Jan-2023 16:19         58740712185
wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-lexemes.nt.bz2
--2023-01-25 11:41:45--  https://dumps.wikimedia.org/wikidatawiki/entities/latest-lexemes.nt.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.142, 2620:0:861:2:208:80:154:142
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 717929951 (685M) [application/octet-stream]
Saving to: ‘latest-lexemes.nt.bz2’

...
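The directory listing gives sizes in bytes, while wget prints a rounded short form (685M for the lexemes file). A minimal sketch of that conversion, assuming wget's M and G are powers of 1024, which reproduces the figures in these logs. The name `to_binary_unit` is made up for illustration and is not part of wget.

```shell
# to_binary_unit BYTES UNIT
# Convert a byte count to a rounded binary unit (M = 1024^2, G = 1024^3).
# Hypothetical helper, only meant to relate the listing to wget's "Length:" line.
to_binary_unit() {
  local bytes=$1 unit=$2
  awk -v b="$bytes" -v u="$unit" 'BEGIN {
    div = (u == "G") ? 1024 * 1024 * 1024 : 1024 * 1024
    printf "%.0f%s\n", b / div, u
  }'
}

to_binary_unit 717929951 M      # latest-lexemes.nt.bz2
to_binary_unit 155239026614 G   # latest-all.nt.bz2
```

This makes it easier to estimate the required disk space from the listing before starting a download.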

sudo nohup wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.nt.bz2&
tail -f nohup.out
--2023-01-24 10:33:23--  https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.nt.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.142, 2620:0:861:2:208:80:154:142
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 155239026614 (145G) [application/octet-stream]
Saving to: ‘latest-all.nt.bz2’

        0K .......... .......... .......... .......... ..........  0%  337K 5d4h
       50K .......... .......... .......... .......... ..........  0%  245K 6d4h
      100K .......... .......... .......... .......... ..........  0%  425K 5d11h
  ...
     1000K .......... .......... .......... .......... ..........  0% 58.4M 47h38m
  ...
    10000K .......... .......... .......... .......... ..........  0% 3.66M 11h20m
  ...
   100000K .......... .......... .......... .......... ..........  0% 23.6M 12h37m
  ...
  1000000K .......... .......... .......... .......... ..........  0% 3.40M 9h1m
  ...
 10000000K .......... .......... .......... .......... ..........  6% 3.05M 8h39m
...
100000000K .......... .......... .......... .......... .......... 65% 87.7M 3h18m
...
130000000K .......... .......... .......... .......... .......... 85%  101M 82m47s
...
140000000K .......... .......... .......... .......... .......... 92% 9.05M 44m30s
...
150000000K .......... .......... .......... .......... .......... 98% 3.94M 6m8s
...
151600600K .......... .                                          100%  623K=9h41m

2023-01-24 20:14:37 (4.25 MB/s) - ‘latest-all.nt.bz2’ saved [155239026614/155239026614]
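The last wget line reports saved and expected byte counts, so a finished download can be cross-checked against the size in the directory listing. A minimal sketch, with `check_size` as a hypothetical helper name. A matching size only rules out truncation, not corruption.

```shell
# check_size FILE EXPECTED_BYTES
# Succeeds when FILE has exactly EXPECTED_BYTES bytes.
check_size() {
  local file=$1 expected=$2 actual
  actual=$(wc -c < "$file") || return 2
  actual=$((actual))   # strip the padding some wc implementations add
  if [ "$actual" -eq "$expected" ]; then
    echo "OK: $file has $actual bytes"
  else
    echo "MISMATCH: $file has $actual bytes, expected $expected" >&2
    return 1
  fi
}
```

Usage for the dump above would be `check_size latest-all.nt.bz2 155239026614`. If the file turns out to be short, `wget -c` with the same URL continues a partial download instead of starting over.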

unzip

Running bunzip2 under a plain nohup does not work: it aborts as if interrupted and deletes the partial output file.

nohup bunzip2 latest-all.nt.bz2 &
bunzip2: Control-C or similar caught, quitting.
bunzip2: Deleting output file latest-all.nt, if it exists.

see

redirect bunzip output and show progress

progress script

cat progress 
#!/bin/bash
# show decompression progress
while :
do
  date
  du -sm *.nt
  sleep 10
done
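The loop above runs until it is interrupted. A variant that stops once the decompression process has exited could look like the following sketch. `watch_progress` and its arguments are illustrative names, and it assumes the PID of the background job is known (for example from `$!`) and that the job runs as the same user, since `kill -0` cannot probe another user's process.

```shell
# watch_progress PID [INTERVAL_SECONDS]
# Print the date and the size of the *.nt files until process PID has exited.
watch_progress() {
  local pid=$1 interval=${2:-10}
  # kill -0 sends no signal, it only tests whether the process still exists
  while kill -0 "$pid" 2>/dev/null; do
    date
    du -sm ./*.nt 2>/dev/null
    sleep "$interval"
  done
  echo "process $pid has finished"
}
```

Called as `watch_progress $! 10` directly after starting the background job, it prints the same output as the progress script and then returns.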

uncompress

nohup bunzip2 latest-all.nt.bz2  &> /dev/null &
./progress 
Wed Jan 25 06:15:32 AM CET 2023
5714	latest-all.nt
...
Wed Jan 25 07:36:32 AM CET 2023
300255	latest-all.nt
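By default bunzip2 removes the compressed file after a successful run, so a failed import would mean another ~10 hour download. A variant of the command above that keeps the archive and writes messages to a log file instead of /dev/null is sketched here. `decompress_bg` and the log file name are assumptions for illustration, `-k` is the bzip2 option to keep the input file, and whether the full stdin/stdout/stderr redirection avoids the abort on every system is also an assumption.

```shell
# decompress_bg FILE.bz2 [LOGFILE]
# Start bunzip2 in the background, detached from the terminal.
# -k keeps the compressed input; stdin, stdout and stderr are all redirected.
decompress_bg() {
  local archive=$1 log=${2:-bunzip2.log}
  nohup bunzip2 -k "$archive" > "$log" 2>&1 < /dev/null &
  echo "started bunzip2 with PID $!"
}
```

Keeping the archive needs enough disk space for both the compressed and the uncompressed file.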

QLever control

https://github.com/ad-freiburg/qlever-control

mkdir qlever
cd qlever
git clone https://github.com/ad-freiburg/qlever-control
Cloning into 'qlever-control'...
remote: Enumerating objects: 426, done.
remote: Counting objects: 100% (266/266), done.
remote: Compressing objects: 100% (170/170), done.
remote: Total 426 (delta 108), reused 231 (delta 95), pack-reused 160
Receiving objects: 100% (426/426), 131.00 KiB | 585.00 KiB/s, done.
Resolving deltas: 100% (163/163), done.

setup wikidata

mkdir wikidata
cd wikidata/
. ../qlever-control/qlever wikidata

QLEVER CONFIG

Checking your PATH ...
The directory "/hd/eneco/qlever/qlever-control" is already contained in your PATH

Setting up bash autocompletion ...
Done, number of completions: 35

Creating new Qleverfile ...
Copied pre-configured Qleverfile for "wikidata" into current directory.

Setup is complete
Type qlever and use autocompletion to see which actions are available. Add a
"show" in the end to see what an action does without executing it (for example,
qlever index show). Edit your local Qleverfile to change settings. A typical
sequence of actions if you have used a preconfigured Qleverfile is:

qlever get-data
qlever index
qlever start
qlever example-query
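The setup message says that appending "show" prints what an action would do without executing it. Before starting the multi-hour sequence, all four actions can be previewed in one go. A minimal sketch, where `preview_actions` is a hypothetical helper and the runner is passed as an argument so the function does not depend on how qlever is installed:

```shell
# preview_actions RUNNER ACTION...
# Call "RUNNER ACTION show" for each action and stop at the first failure.
preview_actions() {
  local runner=$1 action
  shift
  for action in "$@"; do
    echo "== $action =="
    "$runner" "$action" show || return 1
  done
}
```

In the wikidata directory set up above this would be called as `preview_actions qlever get-data index start example-query`.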

start download

qlever get-data
Executing "get-data":

wget -nc https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2 https://dumps.wikimedia.org/wikidatawiki/entities/latest-lexemes.ttl.bz2

Getting data using GET_DATA_CMD from Qleverfile ...

--2023-01-25 16:58:18--  https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.142, 2620:0:861:2:208:80:154:142
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 99583991786 (93G) [application/octet-stream]
Saving to: ‘latest-all.ttl.bz2’

     0K .......... .......... .......... .......... ..........  0%  309K 3d15h
    50K .......... .......... .......... .......... ..........  0%  246K 4d2h
   100K .......... .......... .......... .......... ..........  0%  416K 3d15h
   150K .......... .......... .......... .......... ..........  0% 1.14M 2d23h
   200K .......... .......... .......... .......... ..........  0%  617K 2d17h
   250K .......... .......... .......... .......... ..........  0% 1.23M 2d10h
   300K .......... .......... .......... .......... ..........  0%  647K 2d8h
   350K .......... .......... .......... .......... ..........  0% 1.12M 2d3h
   400K .......... .......... .......... .......... ..........  0% 78.4M 46h13m
   450K .......... .......... .......... .......... ..........  0%  550K 46h31m
   500K .......... .......... .......... .......... ..........  0% 12.5M 42h28m
...
97249700K .......... .......... .......... .......... .......... 99% 2.98M 0s
97249750K .......... .......... .......... .......... .......... 99% 2.90M 0s
97249800K .......... .......... .......... .......... .......... 99%  116M 0s
97249850K .......... .......... .......... .......... .......... 99% 2.95M 0s
97249900K .......... .......... .......... .......... .......... 99% 3.28M 0s
97249950K .......... .......... .......... .......... .         100% 2.87M=6h47m

2023-01-25 23:45:44 (3.89 MB/s) - ‘latest-all.ttl.bz2’ saved [99583991786/99583991786]

2023-01-25 23:47:18 (4.08 MB/s) - ‘latest-lexemes.ttl.bz2’ saved [402494804/402494804]

FINISHED --2023-01-25 23:47:18--
Total wall clock time: 6h 49m 0s
Downloaded: 2 files, 93G in 6h 48m 59s (3.89 MB/s)
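The average rate in the summary can be reproduced from the byte counts and the elapsed time, which helps when estimating how long a future download will take. A minimal sketch, assuming wget's "MB/s" means 1024^2 bytes per second, which matches the 3.89 MB/s in this log. `avg_rate` is a made-up helper name.

```shell
# avg_rate BYTES SECONDS
# Average throughput in units of 1024^2 bytes per second, two decimals.
avg_rate() {
  awk -v b="$1" -v s="$2" 'BEGIN { printf "%.2f\n", b / s / (1024 * 1024) }'
}

# Both files from the log: 99583991786 + 402494804 bytes in 6h 48m 59s.
avg_rate $((99583991786 + 402494804)) $((6 * 3600 + 48 * 60 + 59))
```

At this rate the .ttl.bz2 dump took about 7 hours, compared to almost 10 hours for the larger .nt.bz2 dump.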