WikiData Import 2022-03-11
Latest revision as of 06:49, 15 May 2023
❌ This attempt failed; see https://github.com/ad-freiburg/qlever/issues/630
QLever trial
Import | |
---|---|
state | ❌ |
url | https://wiki.bitplan.com/index.php/WikiData_Import_2022-03-11 |
target | QLever |
start | 2022-03-12 |
end | 2022-03-12 |
days | |
os | |
cpu | |
ram | |
triples | |
comment |
See the QLever quickstart: https://github.com/ad-freiburg/qlever/blob/master/docs/quickstart.md
See QLever/script, as discussed in QLever Issue #562 (https://github.com/ad-freiburg/qlever/issues/562), for the script that makes reproducing this attempt easier.
Environment/prerequisites
At least 64 GB RAM and a Docker environment (e.g. Ubuntu)
More than 1 TB of disk space (SSD preferred for speed)
./qlever -v -e
qlever version : 1.26 $ : 2022/03/12 06:34:52 $
needed software
docker → /usr/bin/docker ✅
top → /usr/bin/top ✅
df → /usr/bin/df ✅
jq → /usr/bin/jq ✅
lsb_release → /usr/bin/lsb_release ✅
free → /usr/bin/free ✅
operating system
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.4 LTS
Release: 20.04
Codename: focal
docker version
Docker version 20.10.13, build a224086
memory
total used free shared buff/cache available
Mem: 125Gi 1,5Gi 30Gi 32Mi 93Gi 123Gi
Swap: 2,0Gi 0B 2,0Gi
diskspace
/dev/sdb5 116G 21G 90G 19% /
tmpfs 63G 0 63G 0% /dev/shm
/dev/sda1 3,6T 1,1T 2,4T 31% /hd/seel
/dev/sdb1 511M 4,0K 511M 1% /boot/efi
soft ulimit for files
1048576
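The checks reported above can also be reproduced by hand. The following is a minimal sketch, not the code of the qlever script: the function names are hypothetical, the thresholds simply restate the prerequisites (64 GB RAM, 1 TB disk), and reading /proc/meminfo assumes Linux.

```shell
# Hypothetical helper functions; names and thresholds are illustrative
# and are not taken from the qlever script.

# Print each of the given commands that cannot be found on PATH.
missing_commands() {
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
  done
}

# Succeed if the first integer is at least as large as the second.
at_least() {
  [ "$1" -ge "$2" ]
}

# Total RAM in GiB (rounded down), read from /proc/meminfo (Linux only).
mem_total_gib() {
  awk '/^MemTotal:/ { print int($2 / 1024 / 1024) }' /proc/meminfo
}

# Free space in GiB (rounded down) on the filesystem holding a directory.
free_disk_gib() {
  df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}
```

A possible use is `missing_commands docker top df jq lsb_release free`, followed by `at_least "$(mem_total_gib)" 64 || echo "not enough RAM"` and `at_least "$(free_disk_gib .)" 1024 || echo "not enough disk space"`. Note that a machine sold with 64 GB usually reports slightly less than 64 GiB, so the RAM threshold may need to be lowered a little.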
Wikidata RDF Dump latest-all.ttl download (6h)
mkdir -p qlever-indices/wikidata
cd qlever-indices/wikidata
date;wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2;date
Fr 11. Mär 18:50:39 CET 2022
--2022-03-11 18:50:39-- https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 2620:0:861:1:208:80:154:7, 208.80.154.7
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|2620:0:861:1:208:80:154:7|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 93072933618 (87G) [application/octet-stream]
Saving to: ‘latest-all.ttl.bz2’
latest-all.ttl.bz2 100%[============================================>] 86,68G 4,47MB/s in 5h 56m
2022-03-12 00:47:08 (4,15 MB/s) - ‘latest-all.ttl.bz2’ saved [93072933618/93072933618]
Sa 12. Mär 00:47:08 CET 2022
date;wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-lexemes.ttl.bz2;date
Sa 12. Mär 06:35:39 CET 2022
--2022-03-12 06:35:39-- https://dumps.wikimedia.org/wikidatawiki/entities/latest-lexemes.ttl.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 2620:0:861:1:208:80:154:7, 208.80.154.7
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|2620:0:861:1:208:80:154:7|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 319665811 (305M) [application/octet-stream]
Saving to: ‘latest-lexemes.ttl.bz2’
latest-lexemes.ttl.bz2 100%[============================================>] 304,86M 4,43MB/s in 69s
2022-03-12 06:36:49 (4,42 MB/s) - ‘latest-lexemes.ttl.bz2’ saved [319665811/319665811]
Sa 12. Mär 06:36:49 CET 2022
ls -l
total 91203728
-rw-rw-r-- 1 wf wf 93072933618 Mär 10 04:59 latest-all.ttl.bz2
-rw-rw-r-- 1 wf wf 319665811 Mär 12 00:27 latest-lexemes.ttl.bz2
du -sm *
88762 latest-all.ttl.bz2
305 latest-lexemes.ttl.bz2
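The figures in this transcript can be cross-checked with a little arithmetic. The sketch below uses hypothetical helper names. The elapsed time of 21389 seconds is derived from the two `date` outputs around the first download (18:50:39 to 00:47:08). `wc -c` is used for portability and may read the whole file on implementations that do not take the size from the file metadata.

```shell
# Hypothetical helpers for cross-checking the numbers in the transcript.

# Succeed if the file has exactly the expected size in bytes.
size_matches() {
  actual=$(wc -c < "$1" | tr -d ' ')
  [ "$actual" -eq "$2" ]
}

# Average throughput in MiB/s for a byte count and a duration in seconds.
throughput_mib_s() {
  LC_ALL=C awk -v b="$1" -v s="$2" 'BEGIN { printf "%.2f\n", b / s / 1048576 }'
}

# Size in MiB, rounded up to the next whole MiB.
bytes_to_mib_ceil() {
  LC_ALL=C awk -v b="$1" 'BEGIN { m = b / 1048576; c = int(m); if (c < m) c++; print c }'
}
```

For example, `throughput_mib_s 93072933618 21389` prints 4.15, which agrees with the average rate wget reports, and `bytes_to_mib_ceil 93072933618` prints 88762, which agrees with the `du -sm` output. `du` measures allocated disk blocks rather than the apparent size, so the two can differ slightly on other filesystems. `size_matches latest-all.ttl.bz2 93072933618` compares the local file with the Length announced by the server.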
With QLever Script
Make sure you run this so that the settings file is copied to the index directory:
./qlever --wikidata_download
qlever-indices/wikidata already exists
wikidata.settings.json already copied to qlever-indices/wikidata
wikidata lexemes:latest-lexemes.ttl.bz2 already downloaded
wikidata dump:latest-all.ttl.bz2 already downloaded
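The "already downloaded" messages indicate that the download step is skipped when the files are present. The following sketch shows one way such an idempotent step could look. It is not the actual code of the qlever script: the function name and the `.part` suffix are assumptions made for illustration.

```shell
# Hypothetical sketch of an idempotent download step; this is not the
# implementation used by the qlever script.
download_if_missing() {
  url=$1
  # Default the local name to the last path component of the URL.
  file=${2:-${url##*/}}
  if [ -f "$file" ]; then
    echo "$file already downloaded"
    return 0
  fi
  # Download to a temporary name so that an interrupted transfer is never
  # mistaken for a complete file; -c resumes a partial download.
  wget -c -O "$file.part" "$url" && mv "$file.part" "$file"
}
```

Calling `download_if_missing https://dumps.wikimedia.org/wikidatawiki/entities/latest-lexemes.ttl.bz2` twice downloads the file only once. The second call just prints the message.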
Indexing
nohup ./qlever --pull --wikidata_index&
tail -f nohup.out
Status: Image is up to date for adfreiburg/qlever:latest
docker.io/adfreiburg/qlever:latest
pulling qlever docker image finished at Sa 12. Mär 07:33:08 CET 2022 after 2 seconds
creating wikidata index started at Sa 12. Mär 07:33:08 CET 2022
2022-03-12 06:33:08.788 - INFO: QLever IndexBuilder, compiled on Mar 11 2022 17:17:49
2022-03-12 06:33:08.789 - INFO: You specified the input format: TTL
2022-03-12 06:33:08.789 - INFO: You specified "locale = en_US" and "ignore-punctuation = 1"
2022-03-12 06:33:08.790 - INFO: You specified "ascii-prefixes-only = true", which enables faster parsing for well-behaved TTL files
2022-03-12 06:33:08.790 - INFO: You specified "num-triples-per-batch = 50,000,000", choose a lower value if the index builder runs out of memory
2022-03-12 06:33:08.790 - INFO: Processing input triples from /dev/stdin ...
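The index builder echoes the settings it was started with, which is useful when comparing attempts or when deciding whether to lower num-triples-per-batch after an out-of-memory failure. The sketch below extracts these settings from the log. The function name is hypothetical, and the pattern only assumes the quoted `"key = value"` form visible in the log lines above. `grep -o` is not part of POSIX but is available in GNU, BSD and BusyBox grep.

```shell
# Hypothetical helper: list the "key = value" settings that the index
# builder echoes in its log, one per line and without the quotes.
reported_settings() {
  LC_ALL=C grep -o '"[a-z-]* = [^"]*"' "$1" | tr -d '"'
}
```

Running `reported_settings nohup.out` on the log above would list locale, ignore-punctuation, ascii-prefixes-only and num-triples-per-batch with their values.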