Difference between revisions of "Wikidata Import 2023-05-15"

From BITPlan Wiki

Revision as of 07:30, 17 May 2023

Import

state  
url  https://wiki.bitplan.com/index.php/Wikidata_Import_2023-05-15
target  QLever
start  2023-05-15
end  
days  
os  Ubuntu 22.04.2 LTS
cpu  Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz
ram  256 GB
triples  
comment  

see Wikidata_Import_2023-01-24

QLever control

https://github.com/ad-freiburg/qlever-control

mkdir qlever
cd qlever
git clone https://github.com/ad-freiburg/qlever-control
Cloning into 'qlever-control'...
remote: Enumerating objects: 426, done.
remote: Counting objects: 100% (266/266), done.
remote: Compressing objects: 100% (170/170), done.
remote: Total 426 (delta 108), reused 231 (delta 95), pack-reused 160
Receiving objects: 100% (426/426), 131.00 KiB | 585.00 KiB/s, done.
Resolving deltas: 100% (163/163), done.

setup wikidata

mkdir wikidata
cd wikidata/
. ../qlever-control/qlever wikidata

QLEVER CONFIG

Checking your PATH ...
Added the directory "/hd/mantax/qlever/qlever-control" to your PATH

Setting up bash autocompletion ...
Done, number of completions: 35

Creating new Qleverfile ...
Copied pre-configured Qleverfile for "wikidata" into current directory.

Setup is complete
Type qlever and use autocompletion to see which actions are available. Add a
"show" in the end to see what an action does without executing it (for example,
qlever index show). Edit your local Qleverfile to change settings. A typical
sequence of actions if you have used a preconfigured Qleverfile is:

qlever get-data
qlever index
qlever start
qlever example-query

get-data (~7 h 30 min)

nohup qlever get-data&
tail nohup.out
440650K ...                                                   100% 6.46T=2m17s

2023-05-15 18:20:15 (3.13 MB/s) - ‘latest-lexemes.ttl.bz2’ saved [451229154/451229154]

FINISHED --2023-05-15 18:20:15--
Total wall clock time: 7h 30m 9s
Downloaded: 2 files, 95G in 7h 30m 9s (3.61 MB/s)
ls -l
-rw-rw-r-- 1 wf wf 101738463320 May 11 13:38 latest-all.ttl.bz2
-rw-rw-r-- 1 wf wf    451229154 May 13 01:33 latest-lexemes.ttl.bz2
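
The average rate reported by wget can be cross-checked from the file sizes and the wall clock time above. The following is a minimal sketch; avg_rate_mib is a hypothetical helper name, not part of qlever-control, and it assumes that wget's "MB/s" means 1024*1024 bytes per second, which is consistent with the numbers shown here.

```shell
#!/usr/bin/env bash
# avg_rate_mib (hypothetical helper): average transfer rate in MiB/s,
# given a byte count and a duration in seconds.
avg_rate_mib() {
  local bytes=$1 seconds=$2
  # LC_ALL=C keeps the decimal separator a dot regardless of locale
  LC_ALL=C awk -v b="$bytes" -v s="$seconds" \
    'BEGIN { printf "%.2f\n", b / s / 1048576 }'
}

# Sizes of the two dump files as listed by ls -l
total_bytes=$((101738463320 + 451229154))
# Total wall clock time reported by wget: 7h 30m 9s
total_seconds=$((7 * 3600 + 30 * 60 + 9))

avg_rate_mib "$total_bytes" "$total_seconds"   # prints 3.61
```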

index

update qlever docker image

docker pull adfreiburg/qlever

doindex

# collect the @prefix declarations from the head of each dump file
for F in latest-lexemes.ttl.bz2 latest-all.ttl.bz2
do
  bzcat "$F" | head -1000 | \grep '^@prefix'
done | sort -u > wikidata-latest.prefix-definitions
docker run --rm -u 10000:10000 -v /etc/localtime:/etc/localtime:ro -v /hd/mantax/qlever/wikidata:/index -w /index --entrypoint bash --name qlever.wikidata-latest.index-build adfreiburg/qlever -c "ulimit -Sn 1048576; bzcat -f wikidata-latest.prefix-definitions latest-lexemes.ttl.bz2 latest-all.ttl.bz2 | IndexBuilderMain -F ttl -f - -i wikidata-latest -s wikidata-latest.settings.json --stxxl-memory-gb 10 | tee wikidata-latest.index-log.txt"
nohup ./doindex &
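
Since the index build runs in the background for many hours, it can help to check how far the parser has come. The following is a small sketch that reads the log file written by the tee command above; last_progress is a hypothetical helper name and it only assumes the "Input triples processed" log line format shown in the Log section.

```shell
#!/usr/bin/env bash
# last_progress (hypothetical helper): print the most recent
# "Input triples processed" count found in an IndexBuilder log file.
last_progress() {
  local log=$1
  grep 'Input triples processed:' "$log" | tail -n 1 | sed 's/.*processed: *//'
}
```

Typical usage while the build is running would be: last_progress wikidata-latest.index-log.txt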

Log

2023-05-15 20:38:48.787	- INFO:  QLever IndexBuilder, compiled on Mon May  1 10:21:29 UTC 2023 using git hash 83f1e8
2023-05-15 20:38:48.787	- INFO:  You specified the input format: TTL
2023-05-15 20:38:48.788	- INFO:  You specified "locale = en_US" and "ignore-punctuation = 1"
2023-05-15 20:38:48.788	- INFO:  You specified "num-triples-per-batch = 5,000,000", choose a lower value if the index builder runs out of memory
2023-05-15 20:38:48.788	- INFO:  Integers that cannot be represented by QLever will throw an exception (this is the default behavior)
2023-05-15 20:38:48.788	- INFO:  Processing input triples from /dev/stdin ...
2023-05-15 20:40:34.263	- INFO:  Input triples processed: 100,000,000
...
2023-05-16 02:44:35.983	- INFO:  Input triples processed: 18,500,000,000
2023-05-16 02:45:56.492	- INFO:  Done, total number of triples read: 18,572,955,199 [may contain duplicates]
2023-05-16 02:45:56.492	- INFO:  Number of QLever-internal triples created: 11,318,825,076 [may contain duplicates]
2023-05-16 02:45:56.492	- INFO:  Merging partial vocabularies in byte order (internal only) ...
2023-05-16 02:47:12.954	- INFO:  Words merged: 100,000,000
...
2023-05-16 03:01:05.155	- INFO:  Words merged: 800,000,000
2023-05-16 03:01:41.669	- INFO:  Number of words in internal vocabulary: 861,507,414
2023-05-16 03:01:41.669	- INFO:  Building prefix tree from internal vocabulary ...
2023-05-16 03:02:00.943	- INFO:  Words processed: 100,000,000
...
2023-05-16 03:06:50.205	- INFO:  Words processed: 800,000,000
2023-05-16 03:07:15.366	- INFO:  Computing maximally compressing prefixes (greedy algorithm) ...
2023-05-16 03:19:52.577	- INFO:  Reduction of size of internal vocabulary: 45%
2023-05-16 03:20:17.322	- INFO:  Merging partial vocabularies in Unicode order (internal and external) ...
2023-05-16 03:23:39.976	- INFO:  Words merged: 100,000,000
...
2023-05-16 05:05:07.691	- INFO:  Words merged: 3,300,000,000
2023-05-16 05:08:47.811	- INFO:  Number of words in external vocabulary: 2,529,812,916
2023-05-16 05:08:47.811	- INFO:  Removing temporary files ...
2023-05-16 05:08:58.512	- INFO:  Converting external vocabulary to binary format ...
2023-05-16 05:29:57.049	- INFO:  Converting triples from local IDs to global IDs ...
2023-05-16 05:30:04.051	- INFO:  Triples converted: 100,000,000
...
2023-05-16 06:43:46.833	- INFO:  Triples converted: 29,800,000,000
2023-05-16 06:43:59.778	- INFO:  Done, total number of triples converted: 29,891,780,275
2023-05-16 06:43:59.816	- INFO:  Writing compressed vocabulary to disk ...
2023-05-16 06:48:30.007	- INFO:  Creating a pair of index permutations ... 
2023-05-16 09:35:45.922	- INFO:  Statistics for PSO: #relations = 70,309, #blocks = 802,770, #triples = 24,908,374,000
2023-05-16 09:35:45.930	- INFO:  Statistics for POS: #relations = 70,309, #blocks = 802,770, #triples = 24,908,374,000
2023-05-16 09:35:45.930	- INFO:  Writing meta data for PSO and POS ...
2023-05-16 09:36:00.367	- INFO:  Creating a pair of index permutations ... 
2023-05-16 10:53:58.155	- INFO:  Statistics for SPO: #relations = 2,953,527,247, #blocks = 533,058, #triples = 24,908,374,000
2023-05-16 10:53:58.157	- INFO:  Statistics for SOP: #relations = 2,953,527,247, #blocks = 533,058, #triples = 24,908,374,000
2023-05-16 10:53:58.157	- INFO:  Writing meta data for SPO and SOP ...
2023-05-16 10:54:07.257	- INFO:  Number of distinct patterns: 8,156,126
2023-05-16 10:54:07.257	- INFO:  Number of subjects with pattern: 1,953,415,853 [all]
2023-05-16 10:54:07.257	- INFO:  Total number of distinct subject-predicate pairs: 10,724,039,637
2023-05-16 10:54:07.257	- INFO:  Average number of predicates per subject: 5.5
2023-05-16 10:54:07.266	- INFO:  Average number of subjects per predicate: 207,537
2023-05-16 10:54:24.867	- INFO:  Creating a pair of index permutations ... 
2023-05-16 12:14:01.504	- INFO:  Statistics for OSP: #relations = 3,353,488,602, #blocks = 692,046, #triples = 24,908,374,000
2023-05-16 12:14:01.507	- INFO:  Statistics for OPS: #relations = 3,353,488,602, #blocks = 692,046, #triples = 24,908,374,000
2023-05-16 12:14:01.507	- INFO:  Writing meta data for OSP and OPS ...
2023-05-16 12:14:05.301	- INFO:  Index build completed
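
The duration of the whole build, or of a single phase, follows from the log timestamps. The sketch below computes the difference between two timestamps; elapsed_hms is a hypothetical helper name, fractional seconds are dropped, and GNU date (as shipped with Ubuntu) is assumed for the -d option.

```shell
#!/usr/bin/env bash
# elapsed_hms (hypothetical helper): time between two "YYYY-MM-DD HH:MM:SS"
# timestamps, printed as HH:MM:SS. Requires GNU date for the -d option.
elapsed_hms() {
  local start end diff
  # -u interprets both timestamps as UTC so DST changes cannot skew the result
  start=$(date -u -d "$1" +%s) || return 1
  end=$(date -u -d "$2" +%s) || return 1
  diff=$((end - start))
  printf '%02d:%02d:%02d\n' $((diff / 3600)) $((diff % 3600 / 60)) $((diff % 60))
}

# First and last line of the index log above
elapsed_hms "2023-05-15 20:38:48" "2023-05-16 12:14:05"   # prints 15:35:17
```

Applied to the parsing phase alone (20:38:48 until "Done, total number of triples read" at 02:45:56) the same helper gives 06:07:08.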

start

nohup qlever start&
Executing "start":

docker run -d --restart unless-stopped -u 10000:10000 -it -v /etc/localtime:/etc/localtime:ro -v /hd/mantax/qlever/wikidata:/index -p 7001:7001 -w /index --entrypoint bash --name qlever.wikidata-latest adfreiburg/qlever -c "ServerMain -i wikidata-latest -j 8 -p 7001 -m 50 -c 30 -e 5 -k 100 -a \"wikidata-latest_1432218987\" > wikidata-latest.server-log.txt" > /dev/null

Starting the QLever server in the background and waiting until it's ready (Ctrl+C will not kill it) ...

2023-05-17 08:28:38.072	- INFO:  QLever Server, compiled on Mon May  1 10:21:29 UTC 2023 using git hash 83f1e8
2023-05-17 08:28:38.091	- INFO:  Initializing server ...
2023-05-17 08:28:38.094	- INFO:  The git hash used to build this index was 83f1e8
2023-05-17 08:28:38.095	- INFO:  Reading vocabulary from file wikidata-latest.vocabulary.internal ...
2023-05-17 08:29:05.872	- INFO:  Done, number of words: 861,507,415
2023-05-17 08:29:05.889	- INFO:  Number of words in external vocabulary: 2,529,812,915
2023-05-17 08:29:06.095	- INFO:  Registered PSO permutation: #relations = 70,309, #blocks = 802,770, #triples = 24,908,374,000
2023-05-17 08:29:06.323	- INFO:  Registered POS permutation: #relations = 70,309, #blocks = 802,770, #triples = 24,908,374,000
2023-05-17 08:29:06.503	- INFO:  Registered OPS permutation: #relations = 3,353,488,602, #blocks = 692,046, #triples = 24,908,374,000
2023-05-17 08:29:06.683	- INFO:  Registered OSP permutation: #relations = 3,353,488,602, #blocks = 692,046, #triples = 24,908,374,000
2023-05-17 08:29:06.820	- INFO:  Registered SPO permutation: #relations = 2,953,527,247, #blocks = 533,058, #triples = 24,908,374,000
2023-05-17 08:29:06.958	- INFO:  Registered SOP permutation: #relations = 2,953,527,247, #blocks = 533,058, #triples = 24,908,374,000
2023-05-17 08:29:06.958	- INFO:  Reading patterns from file wikidata-latest.index.patterns ...
2023-05-17 08:29:30.494	- INFO:  Sorting random result tables to estimate the sorting performance of this machine ...
2023-05-17 08:29:34.189	- INFO:  Access token for restricted API calls is "wikidata-latest_1432218987"
2023-05-17 08:29:34.189	- INFO:  The server is ready, listening for requests on port 7001 ...
2023-05-17 08:29:35.004	- INFO:  
2023-05-17 08:29:35.004	- INFO:  Request received via GET, no content type specified
2023-05-17 08:29:35.005	- INFO:  Alive check with message "from the qlever script"
2023-05-17 08:29:35.022	- INFO:  
2023-05-17 08:29:35.022	- INFO:  Request received via GET, no content type specified
2023-05-17 08:29:35.022	- INFO:  Setting index description to: "Full Wikidata dump (latest-all.ttl.bz2 from 11.05.2023, latest-lexemes.ttl.bz2 from 13.05.2023)"
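
Whether the server has finished loading can be read from the server log. The following is a minimal sketch; ready_port is a hypothetical helper name that only relies on the "The server is ready" line shown above, and the curl call in the comment assumes that the server answers SPARQL protocol GET requests with a query parameter.

```shell
#!/usr/bin/env bash
# ready_port (hypothetical helper): print the port from the
# "The server is ready" log line; return non-zero if the line is absent.
ready_port() {
  local log=$1 port
  port=$(sed -n 's/.*The server is ready, listening for requests on port \([0-9][0-9]*\).*/\1/p' "$log" | tail -n 1)
  [ -n "$port" ] && printf '%s\n' "$port"
}

# Possible usage (assumption: SPARQL protocol GET with a "query" parameter):
#   port=$(ready_port wikidata-latest.server-log.txt) &&
#     curl -Gs "http://localhost:$port/" \
#       --data-urlencode 'query=SELECT * WHERE { ?s ?p ?o } LIMIT 1'
```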