Difference between revisions of "WikiData Import 2022-01-29"

Revision as of 08:44, 7 February 2022

QLever trial

See https://github.com/ad-freiburg/qlever/blob/master/docs/quickstart.md

See QLever/script, as discussed in QLever Issue #562, for a script that makes this attempt easier to reproduce.

Environment

Operating System

lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.3 LTS
Release:	20.04
Codename:	focal

Docker version

docker --version
Docker version 19.03.13, build 4484c46d9d

Memory

wf@merkur:~$ free -h
              total        used        free      shared  buff/cache   available
Mem:           62Gi       1,3Gi        60Gi        30Mi       1,4Gi        60Gi
Swap:          18Gi          0B        18Gi

Disk space

 vendor               | model              | size      | days  | years |
----------------------+--------------------+-----------+-------+-------+
  ST6000VX0023-2EF110 |                    |    6,00TB |   300 |   0.8 |
              SanDisk |     SSD PLUS 120GB |     120GB |   237 |   0.6 |
                  WDC |   WD20EARS-00MVWB1 |    2,00TB |  2232 |   6.1 |
   ST8000DM004-2U9188 |                    |    8,00TB |     8 |   0.0 |

Model Family:     Seagate Barracuda Compute
Device Model:     ST8000DM004-2U9188
Firmware Version: 0001
User Capacity:    8.001.563.222.016 bytes [8,00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches

Steps

The steps below may be automated with the QLever/script.

git clone (1 min)

export QLEVER_HOME=$(pwd)
date;git clone --recursive https://github.com/ad-freiburg/qlever qlever-code;date
Sa 29. Jan 17:12:45 CET 2022
Cloning into 'qlever-code'...
remote: Enumerating objects: 12917, done.
remote: Counting objects: 100% (622/622), done.
remote: Compressing objects: 100% (514/514), done.
remote: Total 12917 (delta 382), reused 206 (delta 103), pack-reused 12295
Receiving objects: 100% (12917/12917), 111.10 MiB | 6.81 MiB/s, done.
Resolving deltas: 100% (10067/10067), done.
Submodule 'third_party/abseil-cpp' (https://github.com/abseil/abseil-cpp.git) registered for path 'third_party/abseil-cpp'
Submodule 'third_party/antlr4' (https://github.com/antlr/antlr4.git) registered for path 'third_party/antlr4'
Submodule 'third_party/googletest' (https://github.com/google/googletest.git) registered for path 'third_party/googletest'
Submodule 'third_party/re2' (https://github.com/google/re2.git) registered for path 'third_party/re2'
Submodule 'third_party/stxxl' (https://github.com/ad-freiburg/stxxl) registered for path 'third_party/stxxl'
Cloning into '/hd/jurob/qlever/qlever-code/third_party/abseil-cpp'...
remote: Enumerating objects: 16122, done.        
remote: Counting objects: 100% (248/248), done.        
remote: Compressing objects: 100% (193/193), done.        
remote: Total 16122 (delta 116), reused 106 (delta 55), pack-reused 15874        
Receiving objects: 100% (16122/16122), 10.39 MiB | 6.85 MiB/s, done.
Resolving deltas: 100% (12398/12398), done.
Cloning into '/hd/jurob/qlever/qlever-code/third_party/antlr4'...
remote: Enumerating objects: 125326, done.        
remote: Counting objects: 100% (2734/2734), done.        
remote: Compressing objects: 100% (1022/1022), done.        
remote: Total 125326 (delta 1488), reused 2292 (delta 1246), pack-reused 122592        
Receiving objects: 100% (125326/125326), 64.11 MiB | 6.72 MiB/s, done.
Resolving deltas: 100% (73647/73647), done.
Cloning into '/hd/jurob/qlever/qlever-code/third_party/googletest'...
remote: Enumerating objects: 23795, done.        
remote: Counting objects: 100% (259/259), done.        
remote: Compressing objects: 100% (146/146), done.        
remote: Total 23795 (delta 134), reused 200 (delta 105), pack-reused 23536        
Receiving objects: 100% (23795/23795), 9.90 MiB | 6.89 MiB/s, done.
Resolving deltas: 100% (17517/17517), done.
Cloning into '/hd/jurob/qlever/qlever-code/third_party/re2'...
remote: Enumerating objects: 6996, done.        
remote: Counting objects: 100% (617/617), done.        
remote: Compressing objects: 100% (388/388), done.        
remote: Total 6996 (delta 392), reused 399 (delta 223), pack-reused 6379        
Receiving objects: 100% (6996/6996), 3.49 MiB | 6.94 MiB/s, done.
Resolving deltas: 100% (5287/5287), done.
Cloning into '/hd/jurob/qlever/qlever-code/third_party/stxxl'...
remote: Enumerating objects: 40982, done.        
remote: Counting objects: 100% (45/45), done.        
remote: Compressing objects: 100% (31/31), done.        
remote: Total 40982 (delta 15), reused 28 (delta 7), pack-reused 40937        
Receiving objects: 100% (40982/40982), 14.13 MiB | 6.77 MiB/s, done.
Resolving deltas: 100% (30914/30914), done.
Submodule path 'third_party/abseil-cpp': checked out 'b9b925341f9e90f5e7aa0cf23f036c29c7e454eb'
Submodule path 'third_party/antlr4': checked out 'e4c1a74c66bd5290364ea2b36c97cd724b247357'
Submodule path 'third_party/googletest': checked out 'e2239ee6043f73722e7aa812a459f54a28552929'
Submodule path 'third_party/re2': checked out '13ebb377c6ad763ca61d12dd6f88b1126bd0b911'
Submodule path 'third_party/stxxl': checked out 'a4f884f2a2b4ea078c34c48e1e9a0003f4619f00'
Submodule 'extlib/foxxll' (https://github.com/ad-freiburg/foxxll.git) registered for path 'third_party/stxxl/extlib/foxxll'
Cloning into '/hd/jurob/qlever/qlever-code/third_party/stxxl/extlib/foxxll'...
remote: Enumerating objects: 21414, done.        
remote: Counting objects: 100% (28/28), done.        
remote: Compressing objects: 100% (22/22), done.        
remote: Total 21414 (delta 9), reused 13 (delta 4), pack-reused 21386        
Receiving objects: 100% (21414/21414), 4.60 MiB | 2.48 MiB/s, done.
Resolving deltas: 100% (15789/15789), done.
Submodule path 'third_party/stxxl/extlib/foxxll': checked out '8cbca7bedcdb0b84a6de99e927c5fa27a4bbbfb2'
Submodule 'extlib/tlx' (https://github.com/joka921/tlx.git) registered for path 'third_party/stxxl/extlib/foxxll/extlib/tlx'
Cloning into '/hd/jurob/qlever/qlever-code/third_party/stxxl/extlib/foxxll/extlib/tlx'...
remote: Enumerating objects: 3418, done.        
remote: Counting objects: 100% (53/53), done.        
remote: Compressing objects: 100% (44/44), done.        
remote: Total 3418 (delta 24), reused 22 (delta 9), pack-reused 3365        
Receiving objects: 100% (3418/3418), 1.11 MiB | 2.75 MiB/s, done.
Resolving deltas: 100% (2611/2611), done.
Submodule path 'third_party/stxxl/extlib/foxxll/extlib/tlx': checked out 'ef81a598d9880cc7d242afc47de7328634f07f1d'
Sa 29. Jan 17:13:49 CET 2022
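
The `date;...;date` pattern above leaves the duration to be worked out by hand (17:12:45 to 17:13:49 is 64 seconds). As a sketch, a small wrapper can print the elapsed time directly; `timed` is a hypothetical helper name, not something provided by QLever.

```shell
#!/bin/sh
# Hypothetical helper: run a command, then report its wall-clock
# duration and exit status on stderr.
timed() {
  start=$(date +%s)
  "$@"
  status=$?
  end=$(date +%s)
  echo "elapsed: $((end - start))s (exit status ${status})" >&2
  return "${status}"
}

# Demonstration with a harmless command. The clone above could be
# wrapped the same way:
#   timed git clone --recursive https://github.com/ad-freiburg/qlever qlever-code
timed sleep 1
```

The wrapper passes the command's exit status through, so it can also be used in front of longer steps such as the docker build without hiding failures.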

docker build (15 mins)

cd qlever-code/
wf@merkur:/hd/jurob/qlever/qlever-code$ date;sudo docker build --file Dockerfiles/Dockerfile.Ubuntu20.04 -t qlever .;date
Sa 29. Jan 17:14:51 CET 2022
Sending build context to Docker daemon  126.1MB
Step 1/43 : FROM ubuntu:20.04 as base
...
Successfully tagged qlever:latest
Sa 29. Jan 17:29:27 CET 2022

Wikidata dump download (6h)

This was done on another machine two days earlier, on 27 January 2022.

date;wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2;date
Thu Jan 27 11:08:10 CET 2022
--2022-01-27 11:08:10--  https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 2620::861:1:208:80:154:7, 208.80.154.7
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|2620::861:1:208:80:154:7|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 92422961847 (86G) [application/octet-stream]
Saving to: ‘latest-all.ttl.bz2’

latest-all.ttl.bz2  100%[===================>]  86.08G  4.60MB/s    in 5h 46m  

2022-01-27 16:55:06 (4.23 MB/s) - ‘latest-all.ttl.bz2’ saved [92422961847/92422961847]

Thu Jan 27 16:55:06 CET 2022
date;wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-lexemes.ttl.bz2;date
Thu Jan 27 17:36:38 CET 2022
--2022-01-27 17:36:38--  https://dumps.wikimedia.org/wikidatawiki/entities/latest-lexemes.ttl.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 2620::861:1:208:80:154:7, 208.80.154.7
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|2620::861:1:208:80:154:7|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 315591211 (301M) [application/octet-stream]
Saving to: ‘latest-lexemes.ttl.bz2’

latest-lexemes.ttl. 100%[===================>] 300.97M  4.95MB/s    in 60s     

2022-01-27 17:37:39 (5.00 MB/s) - ‘latest-lexemes.ttl.bz2’ saved [315591211/315591211]

Thu Jan 27 17:37:39 CET 2022
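
As a cross-check of the log above, the average rate can be recomputed from the byte count and the two `date` stamps. The sketch below assumes both stamps fall on the same day; `avg_mib_per_s` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: average throughput for a transfer of <bytes>
# that ran from <HH:MM:SS> to <HH:MM:SS> on the same day.
avg_mib_per_s() {
  awk -v bytes="$1" -v start="$2" -v end="$3" 'BEGIN {
    split(start, a, ":")
    split(end, b, ":")
    secs = (b[1] * 3600 + b[2] * 60 + b[3]) - (a[1] * 3600 + a[2] * 60 + a[3])
    printf "%d s, %.2f MiB/s\n", secs, bytes / secs / 1048576
  }'
}

# latest-all.ttl.bz2: size and timestamps copied from the log above.
avg_mib_per_s 92422961847 11:08:10 16:55:06
# prints: 20816 s, 4.23 MiB/s
```

20816 seconds is 5h 46m 56s, and the result agrees with the 4.23 MB/s that wget reports. wget's figures appear to use binary units here, since 92422961847 bytes divided by 2^30 gives the 86.08G shown in the progress line.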

Indexing

./qlever --wikidata_index

Issues

see