WikiData Import 2022-06-24

From BITPlan Wiki

✅ This indexing attempt was successful.

See QLever/script, as discussed in QLever Issue #562, for the script that makes this attempt easier to reproduce.

See QLever Discussions for more details on this series of attempts.

Since https://github.com/ad-freiburg/qlever-control now has an official "qlever" script, we have renamed our own script, whose purpose is to make the import attempts reproducible, to qleverauto.

Beware of https://github.com/ad-freiburg/qlever-control/issues/4: make sure ulimit -n is set! This attempt had to be restarted because setting the value within a script did not work.
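A guard like the following could be run in the interactive shell before starting the indexer, so that a limit that is too low is noticed immediately rather than hours into the run. This is only a sketch: the function name check_open_files_limit is hypothetical and not part of qleverauto, and the optional second argument exists only to make the check easy to try out with fixed values.

```bash
# Hypothetical helper (not part of qleverauto):
# succeed only if the soft limit for open files is at least the required value.
# usage: check_open_files_limit required [current]
check_open_files_limit() {
  local required="$1"
  # default to the soft limit of the current shell
  local current="${2:-$(ulimit -Sn)}"
  if [ "$current" = "unlimited" ]; then
    return 0
  fi
  if [ "$current" -lt "$required" ]; then
    echo "soft limit for open files is $current, but at least $required is needed" >&2
    echo "run 'ulimit -n $required' in this shell before starting the indexer" >&2
    return 1
  fi
  return 0
}
```

For example, check_open_files_limit 1000000 && nohup ./qleverauto -wi & would only start the run if the limit of the calling shell is high enough.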

Preparations

Native build

The WikiData_Import_2022-05-21#Build_code steps still apply for this attempt, which uses the native (compiled) version of QLever.

Update qlever-code

qlever-code$ git pull
remote: Enumerating objects: 301, done.
remote: Counting objects: 100% (301/301), done.
remote: Compressing objects: 100% (223/223), done.
Receiving objects:  31% (94/301), 53.42 MiB | 3.75 MiB/s
...
 create mode 100644 src/util/antlr/ANTLRErrorHandling.h
 delete mode 100644 src/util/antlr/ThrowingErrorStrategy.h
 create mode 100644 toolchains/gcc12.cmake

Update submodules

git submodule update --init --recursive
Submodule path 'third_party/abseil-cpp': checked out '2617970857c46e6ec971865d54f00445c260f682'
Submodule path 'third_party/googletest': checked out '0320f517fd920866d918e564105d68fd4362040a'
From https://github.com/ad-freiburg/stxxl
 * branch              70cc597f3f76f96f036db4ffdd84a5cd7b224c7c -> FETCH_HEAD
Fetching submodule extlib/foxxll
Submodule path 'third_party/stxxl': checked out '70cc597f3f76f96f036db4ffdd84a5cd7b224c7c'
From https://github.com/ad-freiburg/foxxll
 * branch              784859bc09a3982d6545fbf1d7b698e273401703 -> FETCH_HEAD
Submodule path 'third_party/stxxl/extlib/foxxll': checked out '784859bc09a3982d6545fbf1d7b698e273401703'

Build

wf@sun:/hd/seel/qlever/qlever-code$ rm -rf build/
wf@sun:/hd/seel/qlever/qlever-code$ mkdir build
wf@sun:/hd/seel/qlever/qlever-code$ cd build
wf@sun:/hd/seel/qlever/qlever-code/build$ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_COMPILER="g++-11" -DLOGLEVEL=INFO -DUSE_PARALLEL=true -GNinja ..
-- The C compiler identification is GNU 9.4.0
-- The CXX compiler identification is GNU 11.1.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
...
-- ---
-- Configuring done
-- Generating done
-- Build files have been written to: /hd/seel/qlever/qlever-code/build

Ninja

See https://ninja-build.org/manual.html

ninja
...
[613/613] Linking CXX executable test/SparqlExpressionTest

qleverauto environment checks

./qleverauto -v
qleverauto version : 1.29 $ : 2022/05/23 06:15:28 $
./qleverauto -e
needed software
docker → /usr/bin/docker ✅
top → /usr/bin/top ✅
df → /usr/bin/df ✅
jq → /usr/bin/jq ✅
lsb_release → /usr/bin/lsb_release ✅
free → /usr/bin/free ✅
operating system
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.4 LTS
Release:	20.04
Codename:	focal
docker version
Docker version 20.10.16, build aa7e414
memory
              total        used        free      shared  buff/cache   available
Mem:          125Gi       1,3Gi       120Gi        27Mi       4,2Gi       123Gi
Swap:         2,0Gi          0B       2,0Gi
diskspace
/dev/sdb5       116G   25G   86G  23% /
tmpfs            63G   16K   63G   1% /dev/shm
/dev/sda1       3,6T  2,7T  716G  80% /hd/seel
/dev/sdb1       511M  4,0K  511M   1% /boot/efi
soft ulimit for files
1048576
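The "needed software" part of the output above can be produced by a check along the following lines. This is a minimal sketch and not the actual qleverauto code, which is not shown on this page; the function name check_installed_sketch is hypothetical.

```bash
# Hypothetical sketch of a "needed software" check (not the qleverauto original):
# print the location of a command, or fail if it is not on the PATH.
check_installed_sketch() {
  local cmd="$1"
  local location
  if location="$(command -v "$cmd")"; then
    echo "$cmd → $location ✅"
  else
    echo "$cmd ❌ not found" >&2
    return 1
  fi
}

# check the tools listed in the output above
check_needed_software_sketch() {
  local cmd
  for cmd in docker top df jq lsb_release free; do
    check_installed_sketch "$cmd" || return 1
  done
}
```

Calling check_needed_software_sketch stops at the first missing tool, so an incomplete environment is reported before any long-running step starts.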

wikidata files and index settings

Reuse the files from last month's latest attempt.

wikidata$ ls -l
total 92754792
-rw-rw-r-- 1 wf wf 94653250500 Mai 19 07:35 latest-all.ttl.bz2
-rw-rw-r-- 1 wf wf   327629685 Mai 21 01:28 latest-lexemes.ttl.bz2
-rw-r--r-- 1 wf wf         911 Mai 22 17:47 Qleverfile
drwxrwxr-x 2 wf wf        4096 Mai 23 08:07 RCS
-rw-rw-r-- 1 wf wf          40 Mai 23 14:52 wikidata.settings.json

Qleverfile

cat Qleverfile 
# Qleverfile for folder /hd/seel/qlever
# Automatically created on Sa 21. Mai 09:09:41 CEST 2022.
# Modify or expand as you see fit.

# Indexer settings
DB               = wikidata 
RDF_FILES        = "latest-all.ttl.bz2 latest-lexemes.ttl.bz2"
CAT_FILES        = "bzcat ${RDF_FILES}"
WITH_TEXT        = false
SETTINGS_JSON    = '{ "languages-internal": ["en"], "prefixes-external": [ "<http://www.wikidata.org/entity/statement", "<http://www.wikidata.org/value", "<http://www.wikidata.org/reference" ], "locale": { "language": "en", "country": "US", "ignore-punctuation": true }, "ascii-prefixes-only": true, "num-triples-per-batch": 10000000 }'
# Server settings
HOSTNAME                       = sun.bitplan.com
SERVER_PORT                    = 7001
MEMORY_FOR_QUERIES             = 10
CACHE_MAX_SIZE_GB              = 5
CACHE_MAX_SIZE_GB_SINGLE_ENTRY = 1
CACHE_MAX_NUM_ENTRIES          = 100

# QLever binaries
QLEVER_BIN_DIR          = /hd/seel/qlever/qlever-code/build/ 
USE_DOCKER              = false
QLEVER_DOCKER_IMAGE     = adfreiburg/qlever
QLEVER_DOCKER_CONTAINER = qlever.must_specify

# QLever UI
QLEVERUI_PORT   = 7000
QLEVERUI_DIR    = qlever-ui
QLEVERUI_CONFIG = default
rcsdiff -r1.2 ./Qleverfile 
===================================================================
RCS file: ./RCS/Qleverfile,v
retrieving revision 1.2
diff -r1.2 ./Qleverfile
10,11c10
< SETTINGS_JSON    = '{ "num-triples-per-batch": 10000000 }'
< 
---
> SETTINGS_JSON    = '{ "languages-internal": ["en"], "prefixes-external": [ "<http://www.wikidata.org/entity/statement", "<http://www.wikidata.org/value", "<http://www.wikidata.org/reference" ], "locale": { "language": "en", "country": "US", "ignore-punctuation": true }, "ascii-prefixes-only": true, "num-triples-per-batch": 10000000 }'

wikidata.settings.json

cat wikidata.settings.json 
{
  "languages-internal": ["en"],
  "prefixes-external": [
    "<http://www.wikidata.org/entity/statement",
    "<http://www.wikidata.org/value",
    "<http://www.wikidata.org/reference"
  ],
  "locale": {
	  "language": "en",
	  "country": "US",
	  "ignore-punctuation": true
  },
  "ascii-prefixes-only": true,
  "num-triples-per-batch" : 10000000
}
rcsdiff ./wikidata.settings.json 
===================================================================
RCS file: ./RCS/wikidata.settings.json,v
retrieving revision 1.1
diff -r1.1 ./wikidata.settings.json
2c2,14
<   "num-triples-per-batch": 10000000
---
>   "languages-internal": ["en"],
>   "prefixes-external": [
>     "<http://www.wikidata.org/entity/statement",
>     "<http://www.wikidata.org/value",
>     "<http://www.wikidata.org/reference"
>   ],
>   "locale": {
> 	  "language": "en",
> 	  "country": "US",
> 	  "ignore-punctuation": true
>   },
>   "ascii-prefixes-only": true,
>   "num-triples-per-batch" : 10000000

Indexer

Relevant part of qleverauto script

#
# build the wikidata index
#
wikidata_index() {
   cd $QLEVER_HOME/wikidata
   chmod o+w .
   show_timing "creating wikidata index" "started"
#   docker run -i --rm -v $QLEVER_HOME/qlever-indices/wikidata:/index --entrypoint bash $dockerimage  -c "cd /index && bzcat latest-all.ttl.bz2 latest-lexemes.ttl.bz2 | IndexBuilderMain -F ttl -f - -l -i wikidata -s wikidata.settings.json | tee wikidata.index-log.txt"
   . ../qlever-control/qlever
   check_installed IndexBuilderMain
   qlever index
   show_timing "creating wikidata index" "finished"
}

Symbolic link for IndexBuilderMain

~/bin$ ls -l IndexBuilderMain 
lrwxrwxrwx 1 wf wf 50 Mai 22 17:47 IndexBuilderMain -> /hd/seel/qlever/qlever-code/build/IndexBuilderMain
which IndexBuilderMain 
/home/wf/bin/IndexBuilderMain

Start

ulimit -n 1000000  
ulimit -a | grep '(-n)'
open files                      (-n) 1000000
nohup ./qleverauto -wi&
tail -f nohup.out
bzcat latest-all.ttl.bz2 latest-lexemes.ttl.bz2 | IndexBuilderMain -F ttl -K wikidata -f - -i wikidata -s wikidata.settings.json | tee wikidata.index-log.txt

2022-06-24 10:17:24.091	- INFO:  QLever IndexBuilder, compiled on Jun 24 2022 09:48:21
2022-06-24 10:17:24.092	- INFO:  You specified the input format: TTL
2022-06-24 10:17:24.092	- INFO:  You specified "locale = en_US" and "ignore-punctuation = 1"
2022-06-24 10:17:24.092	- INFO:  You specified "ascii-prefixes-only = true", which enables faster parsing for well-behaved TTL files
2022-06-24 10:17:24.092	- INFO:  You specified "num-triples-per-batch = 10,000,000", choose a lower value if the index builder runs out of memory
2022-06-24 10:17:24.092	- INFO:  Integers that cannot be represented by QLever will throw an exception (this is the default behavior)
2022-06-24 10:17:24.092	- INFO:  Processing input triples from /dev/stdin ...
2022-06-24 10:19:42.628	- INFO:  Input triples processed: 100,000,000
2022-06-24 10:21:49.591	- INFO:  Input triples processed: 200,000,000
...
2022-06-24 16:25:43.584	- INFO:  Input triples processed: 17,300,000,000
2022-06-24 16:27:12.709	- INFO:  Input triples processed: 17,400,000,000
2022-06-24 16:28:13.800	- INFO:  Done, total number of triples read: 17,460,734,729 [may contain duplicates]
2022-06-24 16:28:13.800	- INFO:  Number of QLever-internal triples created: 10,625,008,810 [may contain duplicates]
2022-06-24 16:28:13.800	- INFO:  Merging partial vocabularies in byte order (internal only) ...
2022-06-24 16:30:11.559	- INFO:  Words merged: 100,000,000
2022-06-24 16:32:56.452	- INFO:  Words merged: 200,000,000
...
2022-06-24 16:50:30.791	- INFO:  Words merged: 800,000,000
2022-06-24 16:50:32.869	- INFO:  Number of words in internal vocabulary: 803,012,009
2022-06-24 16:50:32.870	- INFO:  Merging partial vocabularies in Unicode order (internal and external) ...
2022-06-24 16:55:19.386	- INFO:  Words merged: 100,000,000
2022-06-24 17:00:53.965	- INFO:  Words merged: 200,000,000
...
2022-06-24 19:22:05.305	- INFO:  Words merged: 3,000,000,000
2022-06-24 19:25:42.043	- INFO:  Words merged: 3,100,000,000
2022-06-24 19:30:22.289	- INFO:  Number of words in external vocabulary: 2,380,290,824
2022-06-24 19:30:22.289	- INFO:  Removing temporary files ...
2022-06-24 19:30:36.250	- INFO:  Converting external vocabulary to binary format ...
2022-06-24 20:06:10.166	- INFO:  Converting triples from local IDs to global IDs ...
2022-06-24 20:06:40.823	- INFO:  Triples converted: 100,000,000
2022-06-24 20:07:05.558	- INFO:  Triples converted: 200,000,000
...
2022-06-24 22:06:47.931	- INFO:  Triples converted: 28,000,000,000
2022-06-24 22:07:11.040	- INFO:  Done, total number of triples converted: 28,085,743,539
2022-06-24 22:07:11.108	- INFO:  Building prefix tree from internal vocabulary ...
2022-06-24 22:07:31.625	- INFO:  Words processed: 100,000,000
2022-06-24 22:08:08.792	- INFO:  Words processed: 200,000,000
...
2022-06-24 22:12:21.901	- INFO:  Words processed: 800,000,000
2022-06-24 22:12:23.131	- INFO:  Computing maximally compressing prefixes (greedy algorithm) ...
2022-06-24 22:29:23.381	- INFO:  Reduction of size of internal vocabulary: 46%
2022-06-24 22:29:51.140	- INFO:  Writing compressed vocabulary to disk ...
2022-06-25 00:34:02.750	- INFO:  Creating a pair of index permutations ... 
2022-06-25 04:21:33.258	- INFO:  Statistics for PSO: #relations = 62,405, #blocks = 46,498, #triples = 23,411,086,943
2022-06-25 04:21:33.258	- INFO:  Statistics for POS: #relations = 62,405, #blocks = 46,498, #triples = 23,411,086,943
2022-06-25 04:21:33.258	- INFO:  Exchanging multiplicities for PSO and POS ...
2022-06-25 04:21:33.264	- INFO:  Writing meta data for PSO and POS ...
2022-06-25 05:43:51.603	- INFO:  Creating a pair of index permutations ... 
2022-06-25 07:36:56.397	- INFO:  Statistics for SPO: #relations = 2,779,853,585, #blocks = 29,774, #triples = 23,411,086,943
2022-06-25 07:36:56.397	- INFO:  Statistics for SOP: #relations = 2,779,853,585, #blocks = 29,774, #triples = 23,411,086,943
2022-06-25 07:36:56.397	- INFO:  Exchanging multiplicities for SPO and SOP ...
2022-06-25 08:18:51.323	- INFO:  Writing meta data for SPO and SOP ...
2022-06-25 08:19:08.547	- INFO:  Number of distinct patterns: 7,087,677
2022-06-25 08:19:08.547	- INFO:  Number of subjects with pattern: 2,779,853,585 [all]
2022-06-25 08:19:08.547	- INFO:  Total number of distinct subject-predicate pairs: 11,003,206,558
2022-06-25 08:19:08.547	- INFO:  Average number of predicates per subject: 4.0
2022-06-25 08:19:08.580	- INFO:  Average number of subjects per predicate: 236,837
2022-06-25 09:42:57.517	- INFO:  Creating a pair of index permutations ... 
creating wikidata index finished at Sa 25. Jun 11:04:33 CEST 2022 after 89230 seconds
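To put the timings in the log above into perspective, the durations of the individual phases can be derived from the timestamps. The following is a small sketch assuming GNU date (for the -d option); the function names seconds_between and format_duration are hypothetical helpers and not part of qleverauto. The fractional seconds of the log timestamps are left out.

```bash
# Hypothetical helpers for reading the index log (assumes GNU date).

# seconds between two timestamps of the form "YYYY-MM-DD hh:mm:ss"
seconds_between() {
  local start end
  start="$(date -u -d "$1" +%s)" || return 1
  end="$(date -u -d "$2" +%s)" || return 1
  echo $(( end - start ))
}

# format a number of seconds as hours, minutes and seconds
format_duration() {
  local total="$1"
  printf '%dh %02dm %02ds\n' $(( total / 3600 )) $(( total % 3600 / 60 )) $(( total % 60 ))
}

# total run time reported by qleverauto
format_duration 89230   # prints "24h 47m 10s"

# time spent parsing the input triples (10:17:24 to 16:28:13)
parse_secs="$(seconds_between "2022-06-24 10:17:24" "2022-06-24 16:28:13")"
format_duration "$parse_secs"   # prints "6h 10m 49s"

# average parsing throughput in triples per second (integer division)
echo $(( 17460734729 / parse_secs ))   # prints 784787
```

By this calculation the parsing phase accounts for roughly a quarter of the total run time; the remaining time is spent on vocabulary merging, ID conversion and building the permutations.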

Resulting files

-rwx------  1 wf  staff    94653250500 19 Mai 07:35 latest-all.ttl.bz2
-rwx------  1 wf  staff      327629685 21 Mai 01:28 latest-lexemes.ttl.bz2
-rwx------  1 wf  staff  1772814860288 25 Jun 09:42 wikidata-stxxl.disk
-rwx------  1 wf  staff          37505 25 Jun 09:42 wikidata.index-log.txt
-rwx------  1 wf  staff    28278087680 25 Jun 11:04 wikidata.index.ops
-rwx------  1 wf  staff    48982650912 25 Jun 11:04 wikidata.index.ops.meta
-rwx------  1 wf  staff    27706970112 25 Jun 11:04 wikidata.index.osp
-rwx------  1 wf  staff    48982650912 25 Jun 11:04 wikidata.index.osp.meta
-rwx------  1 wf  staff      573472800 25 Jun 11:04 wikidata.index.osp.tmp.mmap-buffer
-rwx------  1 wf  staff    15051429124 25 Jun 08:19 wikidata.index.patterns
-rwx------  1 wf  staff    56401897399 25 Jun 04:21 wikidata.index.pos
-rwx------  1 wf  staff    65137493514 25 Jun 04:21 wikidata.index.pso
-rwx------  1 wf  staff    36301340744 25 Jun 08:18 wikidata.index.sop
-rwx------  1 wf  staff   110210977824 25 Jun 08:18 wikidata.index.sop.meta
-rwx------  1 wf  staff    36699637749 25 Jun 08:18 wikidata.index.spo
-rwx------  1 wf  staff   110210977824 25 Jun 08:18 wikidata.index.spo.meta
-rwx------  1 wf  staff            274 24 Jun 22:40 wikidata.meta-data.json
-rwx------  1 wf  staff           4295 24 Jun 22:29 wikidata.prefixes
-rwx------  1 wf  staff            362 24 Jun 10:17 wikidata.settings.json
-rwx------  1 wf  staff   155452489143 24 Jun 20:06 wikidata.vocabulary.external
-rwx------  1 wf  staff    48982650912 24 Jun 20:06 wikidata.vocabulary.external.idsAndOffsets.mmap
-rwx------  1 wf  staff    27373372956 24 Jun 22:39 wikidata.vocabulary.internal

Memory usage

2022-06-25 09:30

top
...
MiB Mem : 128832,8 total,   5514,4 free,  12913,2 used, 110405,2 buff/cache

Starting server

ServerMain -i /hd/seel/qlever/wikidata/wikidata  -p 7001
2022-08-10 16:41:45.203	- INFO:  Done, number of words: 808,217,822
2022-08-10 16:41:45.206	- INFO:  Number of words in external vocabulary: 2,406,607,301
2022-08-10 16:41:45.235	- INFO:  Registered PSO permutation: #relations = 63,016, #blocks = 46,898, #triples = 23,623,846,556
2022-08-10 16:41:45.267	- INFO:  Registered POS permutation: #relations = 63,016, #blocks = 46,898, #triples = 23,623,846,556
2022-08-10 16:41:45.276	- INFO:  Registered OPS permutation: #relations = 3,179,186,092, #blocks = 39,688, #triples = 23,623,846,556
2022-08-10 16:41:45.286	- INFO:  Registered OSP permutation: #relations = 3,179,186,092, #blocks = 39,688, #triples = 23,623,846,556
2022-08-10 16:41:45.293	- INFO:  Registered SPO permutation: #relations = 2,808,258,381, #blocks = 30,045, #triples = 23,623,846,556
2022-08-10 16:41:45.302	- INFO:  Registered SOP permutation: #relations = 2,808,258,381, #blocks = 30,045, #triples = 23,623,846,556
2022-08-10 16:41:45.302	- INFO:  Reading patterns from file /hd/seel/qlever/wikidata/wikidata.index.patterns ...
2022-08-10 16:42:26.209	- INFO:  Sorting random result tables to estimate the sorting performance of this machine ...
2022-08-10 16:42:43.558	- INFO:  Access token for restricted API calls is ""
2022-08-10 16:42:43.558	- INFO:  The server is ready, listening for requests on port 7001 ...