Wikidata Import 2023-01-24

From BITPlan Wiki

Revision as of 07:20, 26 January 2023

Download latest Wikidata dump (~10 hours)

https://dumps.wikimedia.org/wikidatawiki/entities

latest-all.json.bz2                                18-Jan-2023 17:40         79779054481
latest-all.json.gz                                 18-Jan-2023 10:51        121027823223
latest-all.nt.bz2                                  19-Jan-2023 17:00        155239026614
latest-all.nt.gz                                   18-Jan-2023 23:55        200917826250
latest-all.ttl.bz2                                 19-Jan-2023 04:34         99583991786
latest-all.ttl.gz                                  18-Jan-2023 19:25        121477047220
latest-lexemes.json.bz2                            18-Jan-2023 03:47           270280878
latest-lexemes.json.gz                             18-Jan-2023 03:46           369955852
latest-lexemes.nt.bz2                              20-Jan-2023 23:32           717929951
latest-lexemes.nt.gz                               20-Jan-2023 23:27           947996669
latest-lexemes.ttl.bz2                             20-Jan-2023 23:28           402494804
latest-lexemes.ttl.gz                              20-Jan-2023 23:25           503140103
latest-truthy.nt.bz2                               20-Jan-2023 19:30         35434201681
latest-truthy.nt.gz                                20-Jan-2023 16:19         58740712185
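A quick sanity check (my own arithmetic, not part of the listing) that the byte counts above match the sizes wget reports later, e.g. for latest-all.ttl.bz2:

```shell
# 99583991786 bytes (latest-all.ttl.bz2 in the listing above) in GiB
awk 'BEGIN{printf "%.1f GiB\n", 99583991786 / 2^30}'   # → 92.7 GiB, which wget rounds to 93G
```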
wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-lexemes.nt.bz2
--2023-01-25 11:41:45--  https://dumps.wikimedia.org/wikidatawiki/entities/latest-lexemes.nt.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.142, 2620:0:861:2:208:80:154:142
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 717929951 (685M) [application/octet-stream]
Saving to: ‘latest-lexemes.nt.bz2’

...

sudo nohup wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.nt.bz2 &
tail -f nohup.out
--2023-01-24 10:33:23--  https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.nt.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.142, 2620:0:861:2:208:80:154:142
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 155239026614 (145G) [application/octet-stream]
Saving to: ‘latest-all.nt.bz2’

        0K .......... .......... .......... .......... ..........  0%  337K 5d4h
       50K .......... .......... .......... .......... ..........  0%  245K 6d4h
      100K .......... .......... .......... .......... ..........  0%  425K 5d11h
  ...
     1000K .......... .......... .......... .......... ..........  0% 58.4M 47h38m
  ...
    10000K .......... .......... .......... .......... ..........  0% 3.66M 11h20m
  ...
   100000K .......... .......... .......... .......... ..........  0% 23.6M 12h37m
  ...
  1000000K .......... .......... .......... .......... ..........  0% 3.40M 9h1m
  ...
 10000000K .......... .......... .......... .......... ..........  6% 3.05M 8h39m
...
100000000K .......... .......... .......... .......... .......... 65% 87.7M 3h18m
...
130000000K .......... .......... .......... .......... .......... 85%  101M 82m47s
...
140000000K .......... .......... .......... .......... .......... 92% 9.05M 44m30s
...
150000000K .......... .......... .......... .......... .......... 98% 3.94M 6m8s
...
151600600K .......... .                                          100%  623K=9h41m

2023-01-24 20:14:37 (4.25 MB/s) - ‘latest-all.nt.bz2’ saved [155239026614/155239026614]
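As a cross-check (my own arithmetic, not from the log), the reported 9h41m wall time is consistent with the file size and the average rate, reading wget's "MB/s" as MiB/s:

```shell
bytes=155239026614              # size of latest-all.nt.bz2 from the log above
rate=$((425 * 1048576 / 100))   # 4.25 MiB/s in bytes per second
secs=$((bytes / rate))
printf '%dh %dm\n' $((secs / 3600)) $((secs % 3600 / 60))   # → 9h 40m, close to wget's 9h41m
```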

bunzip2 and nohup

bunzip2 does not run reliably under nohup unless its output is redirected; started naively, it quits as soon as the terminal signals it:

nohup bunzip2 latest-all.nt.bz2 &
bunzip2: Control-C or similar caught, quitting.
bunzip2: Deleting output file latest-all.nt, if it exists.

see: redirect bunzip2 output and show progress

progress script

cat progress 
#!/bin/bash
# show decompression progress
while :
do
  date
  du -sm *.nt
  sleep 10
done

uncompress

nohup bunzip2 latest-all.nt.bz2  &> /dev/null &
./progress 
Wed Jan 25 06:15:32 AM CET 2023
5714	latest-all.nt
...
Wed Jan 25 07:36:32 AM CET 2023
300255	latest-all.nt
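From the two progress samples above one can estimate the decompression rate (my own arithmetic; the samples are 81 minutes apart):

```shell
start=5714; end=300255    # MB of latest-all.nt at 06:15 and 07:36
mins=81
echo "$(((end - start) / mins)) MB/min"   # roughly 3.6 GB of output per minute
```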

QLever control

https://github.com/ad-freiburg/qlever-control

mkdir qlever
cd qlever
git clone https://github.com/ad-freiburg/qlever-control
Cloning into 'qlever-control'...
remote: Enumerating objects: 426, done.
remote: Counting objects: 100% (266/266), done.
remote: Compressing objects: 100% (170/170), done.
remote: Total 426 (delta 108), reused 231 (delta 95), pack-reused 160
Receiving objects: 100% (426/426), 131.00 KiB | 585.00 KiB/s, done.
Resolving deltas: 100% (163/163), done.

setup wikidata

mkdir wikidata
cd wikidata/
. ../qlever-control/qlever wikidata

QLEVER CONFIG

Checking your PATH ...
The directory "/hd/eneco/qlever/qlever-control" is already contained in your PATH

Setting up bash autocompletion ...
Done, number of completions: 35

Creating new Qleverfile ...
Copied pre-configured Qleverfile for "wikidata" into current directory.

Setup is complete
Type qlever and use autocompletion to see which actions are available. Add a
"show" in the end to see what an action does without executing it (for example,
qlever index show). Edit your local Qleverfile to change settings. A typical
sequence of actions if you have used a preconfigured Qleverfile is:

qlever get-data
qlever index
qlever start
qlever example-query

start download

qlever get-data
executing "get-data":

wget -nc https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2 https://dumps.wikimedia.org/wikidatawiki/entities/latest-lexemes.ttl.bz2

Getting data using GET_DATA_CMD from Qleverfile ...

--2023-01-25 16:58:18--  https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.142, 2620:0:861:2:208:80:154:142
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 99583991786 (93G) [application/octet-stream]
Saving to: ‘latest-all.ttl.bz2’

     0K .......... .......... .......... .......... ..........  0%  309K 3d15h
    50K .......... .......... .......... .......... ..........  0%  246K 4d2h
   100K .......... .......... .......... .......... ..........  0%  416K 3d15h
   150K .......... .......... .......... .......... ..........  0% 1.14M 2d23h
   200K .......... .......... .......... .......... ..........  0%  617K 2d17h
   250K .......... .......... .......... .......... ..........  0% 1.23M 2d10h
   300K .......... .......... .......... .......... ..........  0%  647K 2d8h
   350K .......... .......... .......... .......... ..........  0% 1.12M 2d3h
   400K .......... .......... .......... .......... ..........  0% 78.4M 46h13m
   450K .......... .......... .......... .......... ..........  0%  550K 46h31m
   500K .......... .......... .......... .......... ..........  0% 12.5M 42h28m
...
97249700K .......... .......... .......... .......... .......... 99% 2.98M 0s
97249750K .......... .......... .......... .......... .......... 99% 2.90M 0s
97249800K .......... .......... .......... .......... .......... 99%  116M 0s
97249850K .......... .......... .......... .......... .......... 99% 2.95M 0s
97249900K .......... .......... .......... .......... .......... 99% 3.28M 0s
97249950K .......... .......... .......... .......... .         100% 2.87M=6h47m

2023-01-25 23:45:44 (3.89 MB/s) - ‘latest-all.ttl.bz2’ saved [99583991786/99583991786]

2023-01-25 23:47:18 (4.08 MB/s) - ‘latest-lexemes.ttl.bz2’ saved [402494804/402494804]

FINISHED --2023-01-25 23:47:18--
Total wall clock time: 6h 49m 0s
Downloaded: 2 files, 93G in 6h 48m 59s (3.89 MB/s)

start indexing

qlever index
Executing "index":

for F in latest-lexemes.ttl.bz2 latest-all.ttl.bz2; do bzcat $F | head -1000 | \grep ^@prefix; done | sort -u > wikidata-latest.prefix-definitions
docker run -it --rm -u 0:0 -v /hd/eneco/qlever/wikidata:/index -w /index --entrypoint bash --name qlever.wikidata-latest.index-build adfreiburg/qlever -c "ulimit -Sn 1048576; bzcat -f wikidata-latest.prefix-definitions latest-lexemes.ttl.bz2 latest-all.ttl.bz2 | IndexBuilderMain -F ttl -f - -i wikidata-latest -s wikidata-latest.settings.json --stxxl-memory-gb 10 | tee wikidata-latest.index-log.txt"

2023-01-26 05:16:33.718	- INFO:  QLever IndexBuilder, compiled on Mon Jan 16 11:34:10 UTC 2023 using git hash d7f662
2023-01-26 05:16:33.718	- INFO:  You specified the input format: TTL
2023-01-26 05:16:33.719	- INFO:  You specified "locale = en_US" and "ignore-punctuation = 1"
2023-01-26 05:16:33.719	- INFO:  You specified "num-triples-per-batch = 5,000,000", choose a lower value if the index builder runs out of memory
2023-01-26 05:16:33.719	- INFO:  Integers that cannot be represented by QLever will throw an exception (this is the default behavior)
2023-01-26 05:16:33.719	- INFO:  Processing input triples from /dev/stdin ...
2023-01-26 05:18:07.861	- INFO:  Input triples processed: 100,000,000
2023-01-26 05:19:34.947	- INFO:  Input triples processed: 200,000,000
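The prefix-extraction step of the index command above can be reproduced on toy data (hypothetical filenames, not the real dumps): it keeps only the @prefix declarations from the head of each compressed Turtle file and deduplicates them.

```shell
# build two tiny Turtle files with an overlapping prefix declaration
printf '@prefix wd: <http://www.wikidata.org/entity/> .\nwd:Q1 wd:P1 wd:Q2 .\n' | bzip2 > a.ttl.bz2
printf '@prefix wd: <http://www.wikidata.org/entity/> .\n@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .\n' | bzip2 > b.ttl.bz2

# same pipeline shape as the qlever index step: collect unique @prefix lines
for F in a.ttl.bz2 b.ttl.bz2; do bzcat "$F" | head -1000 | grep '^@prefix'; done | sort -u > toy.prefix-definitions
cat toy.prefix-definitions   # the duplicated wd: prefix survives only once
```

Prepending these definitions before the concatenated dumps (as the docker command does with `bzcat -f`) ensures the parser sees every prefix before any triple that uses it.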