Difference between revisions of "WikiData Import 2022-03-11"

From BITPlan Wiki
(17 intermediate revisions by the same user not shown)

Latest revision as of 07:49, 15 May 2023

❌ This attempt failed; see https://github.com/ad-freiburg/qlever/issues/630

QLever trial

Import
state  ❌
url  https://wiki.bitplan.com/index.php/WikiData_Import_2022-03-11
target  QLever
start  2022-03-12
end  2022-03-12
days  
os  
cpu  
ram  
triples  
comment  


See https://github.com/ad-freiburg/qlever/blob/master/docs/quickstart.md

See QLever/script, discussed in QLever Issue #562, for the script that makes this attempt easier to reproduce.

Environment/prerequisites

>= 64 GB RAM and a Docker environment (e.g. Ubuntu)
> 1 TB disk space (SSD preferred for speed)

./qlever -v -e
qlever version : 1.26 $ : 2022/03/12 06:34:52 $
needed software
docker → /usr/bin/docker ✅
top → /usr/bin/top ✅
df → /usr/bin/df ✅
jq → /usr/bin/jq ✅
lsb_release → /usr/bin/lsb_release ✅
free → /usr/bin/free ✅
operating system
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.4 LTS
Release:	20.04
Codename:	focal
docker version
Docker version 20.10.13, build a224086
memory
              total        used        free      shared  buff/cache   available
Mem:          125Gi       1,5Gi        30Gi        32Mi        93Gi       123Gi
Swap:         2,0Gi          0B       2,0Gi
diskspace
/dev/sdb5       116G   21G   90G  19% /
tmpfs            63G     0   63G   0% /dev/shm
/dev/sda1       3,6T  1,1T  2,4T  31% /hd/seel
/dev/sdb1       511M  4,0K  511M   1% /boot/efi
soft ulimit for files
1048576
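
The thresholds stated above (at least 64 GB RAM, more than 1 TB of disk space) can also be checked with a few lines of shell. The following is only a minimal sketch: the function names are hypothetical and not part of the qlever script, GNU `free` and `df` are assumed, and treating 1 TB as roughly 1000 GiB is an approximation.

```shell
#!/usr/bin/env bash
# Sketch only: these helpers are hypothetical, not part of the qlever script.

# Succeed if the total RAM in GiB meets the minimum (default 64).
ram_ok() {   # usage: ram_ok <total_gib> [min_gib]
  [ "$1" -ge "${2:-64}" ]
}

# Succeed if the free disk space in GiB exceeds the minimum (default 1000).
disk_ok() {  # usage: disk_ok <free_gib> [min_gib]
  [ "$1" -gt "${2:-1000}" ]
}

# Measure the current machine (GNU free/df assumed; LC_ALL=C keeps the
# "Mem:" label stable regardless of the system locale).
total_ram_gib() { LC_ALL=C free -g | awk '/^Mem:/ {print $2}'; }
free_disk_gib() { df -BG --output=avail "${1:-.}" | tail -n 1 | tr -dc '0-9'; }

# Example call (commented out so that sourcing this file has no side effects):
# ram_ok "$(total_ram_gib)" && disk_ok "$(free_disk_gib /hd/seel)" && echo "ok"
```

With the values logged above, `ram_ok 125` and `disk_ok 2400` (the /hd/seel volume) succeed, while `disk_ok 90` (the root volume) fails.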

Wikidata RDF Dump latest-all.ttl download (6h)

mkdir -p qlever-indices/wikidata
cd qlever-indices/wikidata
date;wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2;date
Fr 11. Mär 18:50:39 CET 2022
--2022-03-11 18:50:39--  https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 2620:0:861:1:208:80:154:7, 208.80.154.7
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|2620:0:861:1:208:80:154:7|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 93072933618 (87G) [application/octet-stream]
Saving to: ‘latest-all.ttl.bz2’

latest-all.ttl.bz2           100%[============================================>]  86,68G  4,47MB/s    in 5h 56m  

2022-03-12 00:47:08 (4,15 MB/s) - ‘latest-all.ttl.bz2’ saved [93072933618/93072933618]

Sa 12. Mär 00:47:08 CET 2022
date;wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-lexemes.ttl.bz2;date
Sa 12. Mär 06:35:39 CET 2022
--2022-03-12 06:35:39--  https://dumps.wikimedia.org/wikidatawiki/entities/latest-lexemes.ttl.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 2620:0:861:1:208:80:154:7, 208.80.154.7
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|2620:0:861:1:208:80:154:7|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 319665811 (305M) [application/octet-stream]
Saving to: ‘latest-lexemes.ttl.bz2’

latest-lexemes.ttl.bz2       100%[============================================>] 304,86M  4,43MB/s    in 69s     

2022-03-12 06:36:49 (4,42 MB/s) - ‘latest-lexemes.ttl.bz2’ saved [319665811/319665811]

Sa 12. Mär 06:36:49 CET 2022
ls -l
total 91203728
-rw-rw-r-- 1 wf wf 93072933618 Mär 10 04:59 latest-all.ttl.bz2
-rw-rw-r-- 1 wf wf   319665811 Mär 12 00:27 latest-lexemes.ttl.bz2
du -sm *
88762	latest-all.ttl.bz2
305	latest-lexemes.ttl.bz2
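
As a sanity check on the download, the local file size can be compared with the `Length:` value that wget reported, and the average throughput can be recomputed from the two `date` timestamps. This is a sketch under stated assumptions: both helper names are hypothetical, and the 21389 seconds are simply the 5h 56m 29s between 18:50:39 and 00:47:08.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helpers for checking a finished download.

# Succeed if the file size in bytes equals the expected Content-Length.
size_matches() {  # usage: size_matches <file> <expected_bytes>
  actual=$(wc -c < "$1" | tr -d ' ')
  [ "$actual" -eq "$2" ]
}

# Print the average throughput in bytes per second (integer division).
avg_bytes_per_sec() {  # usage: avg_bytes_per_sec <bytes> <seconds>
  echo $(( $1 / $2 ))
}

# Example calls (commented out so that sourcing this file has no side effects):
# size_matches latest-all.ttl.bz2 93072933618 && echo "size ok"
# avg_bytes_per_sec 93072933618 21389   # prints 4351439
```

4351439 bytes per second is about 4.15 MiB/s, which agrees with the average rate in wget's final status line.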

With QLever Script

make sure you run this to copy the settings

./qlever --wikidata_download
qlever-indices/wikidata already exists
wikidata.settings.json already copied to qlever-indices/wikidata
wikidata lexemes:latest-lexemes.ttl.bz2 already downloaded
wikidata dump:latest-all.ttl.bz2 already downloaded

Indexing

nohup ./qlever --pull --wikidata_index&
tail -f nohup.out 
Status: Image is up to date for adfreiburg/qlever:latest
docker.io/adfreiburg/qlever:latest
pulling qlever docker image finished at Sa 12. Mär 07:33:08 CET 2022 after 2 seconds
creating wikidata index started at Sa 12. Mär 07:33:08 CET 2022
2022-03-12 06:33:08.788	- INFO:  QLever IndexBuilder, compiled on Mar 11 2022 17:17:49
2022-03-12 06:33:08.789	- INFO:  You specified the input format: TTL
2022-03-12 06:33:08.789	- INFO:  You specified "locale = en_US" and "ignore-punctuation = 1"
2022-03-12 06:33:08.790	- INFO:  You specified "ascii-prefixes-only = true", which enables faster parsing for well-behaved TTL files
2022-03-12 06:33:08.790	- INFO:  You specified "num-triples-per-batch = 50,000,000", choose a lower value if the index builder runs out of memory
2022-03-12 06:33:08.790	- INFO:  Processing input triples from /dev/stdin ...
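
The log above suggests lowering "num-triples-per-batch" if the index builder runs out of memory. A minimal sketch of how that could be done with jq (which is on the script's list of needed software) follows. It assumes that the key in wikidata.settings.json is spelled exactly as in the log message; the helper name and the value 10000000 are purely illustrative, not a recommendation from the QLever project.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helper; the JSON key name is assumed to match
# the setting name printed by the index builder.

# Print the settings JSON with a new batch size; the input file is not modified.
with_batch_size() {  # usage: with_batch_size <settings.json> <triples_per_batch>
  jq --argjson n "$2" '."num-triples-per-batch" = $n' "$1"
}

# Example call (commented out so that sourcing this file has no side effects):
# with_batch_size wikidata.settings.json 10000000 > wikidata.settings.json.new
```

Writing to a new file first, rather than redirecting onto the input, avoids truncating the settings file before jq has read it.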