Difference between revisions of "Wikidata Import 2023-05-05"
Latest revision as of 06:35, 19 February 2024
Import
Import | |
---|---|
state | ✅ |
url | https://wiki.bitplan.com/index.php/Wikidata_Import_2023-05-05 |
target | blazegraph |
start | 2023-05-05 |
end | |
days | |
os | Ubuntu 22.04.2 LTS |
cpu | Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz |
ram (GB) | 256 |
triples (billion) | 14.7 |
storemode | property |
comment | |
Download ~6 h 50 min
Download Options
https://dumps.wikimedia.org/wikidatawiki/entities
dcatap.rdf               04-May-2023 18:19         84753
latest-all.json.bz2      03-May-2023 21:06   81640390615
latest-all.json.gz       03-May-2023 12:47  123885468527
latest-all.nt.bz2        04-May-2023 16:07  158382342866
latest-all.nt.gz         03-May-2023 22:23  205171447838
latest-all.ttl.bz2       04-May-2023 03:24  101606862077
latest-all.ttl.gz        03-May-2023 17:08  124093922794
latest-lexemes.json.bz2  03-May-2023 03:53     305234182
latest-lexemes.json.gz   03-May-2023 03:51     416121890
latest-lexemes.nt.bz2    28-Apr-2023 23:34     778797047
latest-lexemes.nt.gz     28-Apr-2023 23:29    1019519966
latest-lexemes.ttl.bz2   28-Apr-2023 23:30     440519100
latest-lexemes.ttl.gz    28-Apr-2023 23:26     548481488
latest-truthy.nt.bz2     28-Apr-2023 22:23   36023954950
latest-truthy.nt.gz      28-Apr-2023 19:07   59758277315
download script
cat download.sh
#!/bin/bash
# WF 2023-04-26
# download wikidata dumps
baseurl=https://dumps.wikimedia.org/wikidatawiki/entities/
for file in latest-all latest-lexemes
do
  for ext in ttl.bz2
  do
    url=$baseurl/$file.$ext
    log=$file-$ext.log
    nohup wget $url >> $log&
  done
done
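The script joins a base URL that already ends in `/` with another `/`, which is why the request URL in the download log contains `entities//`. The server accepted it, so nothing is broken. The following is only a sketch of how the join could be normalized; `dump_url` is a hypothetical helper and not part of download.sh.

```shell
#!/bin/bash
# Hypothetical helper (not part of download.sh): join base URL, file name and
# extension with exactly one slash, whether or not the base ends in "/".
dump_url() {
  local base="${1%/}"   # strip a single trailing slash, if present
  local file="$2"
  local ext="$3"
  printf '%s/%s.%s\n' "$base" "$file" "$ext"
}

# Example with the values used in download.sh
dump_url "https://dumps.wikimedia.org/wikidatawiki/entities/" latest-all ttl.bz2
```

Inside the loop this would replace the assignment with `url=$(dump_url "$baseurl" "$file" "$ext")`.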
Download logs
--2023-05-05 08:09:14-- https://dumps.wikimedia.org/wikidatawiki/entities//latest-all.ttl.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.142, 2620:0:861:2:208:80:154:142
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 101606862077 (95G) [application/octet-stream]
Saving to: ‘latest-all.ttl.bz2’
0K .......... .......... .......... .......... .......... 0% 400K 2d20h
50K .......... .......... .......... .......... .......... 0% 222K 4d0h
100K .......... .......... .......... .......... .......... 0% 399K 3d15h
99225450K . 100% 2.32T=6h50m
2023-05-05 15:00:04 (3.93 MB/s) - ‘latest-all.ttl.bz2’ saved [101606862077/101606862077]
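As a plausibility check, the rate reported by wget can be recomputed from the byte count and the two timestamps in the log. This is a minimal sketch: `hms_to_s` and `throughput_mib` are hypothetical helper names, both timestamps are assumed to fall on the same day, and the rate is computed in units of 1024*1024 bytes, which appears to be what wget labels as MB/s here.

```shell
#!/bin/bash
# Convert HH:MM:SS to seconds since midnight.
hms_to_s() {
  echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Average throughput in MiB/s for a transfer of $1 bytes that started at $2
# and ended at $3 (both HH:MM:SS, same day assumed).
throughput_mib() {
  local bytes="$1" start="$2" end="$3"
  local secs=$(( $(hms_to_s "$end") - $(hms_to_s "$start") ))
  awk -v b="$bytes" -v s="$secs" 'BEGIN { printf "%.2f\n", b / s / 1048576 }'
}

# Values taken from the download log above
throughput_mib 101606862077 08:09:14 15:00:04
```

With the log's values the elapsed time is 24650 s (6 h 50 min 50 s) and the result agrees with the 3.93 MB/s that wget printed.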
Munging ~29 h
Preparation
see Wikidata_Import_2023-04-26#Preparation_.7E20-30_min
calling munge.sh
domunge.sh
#!/bin/bash
# WF 2023-04-29
# start munge in background
bzcat latest-all.ttl.bz2 | service/munge.sh -f - -d data -- --skolemize
start domunge.sh and show nohup.out log
nohup ./domunge.sh &
tail -f nohup.out
#logback.classic pattern: %d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n
15:51:26.169 [org.wikidata.query.rdf.tool.rdf.AsyncRDFHandler$RDFActionsReplayer] INFO org.wikidata.query.rdf.tool.Munge - Switching to data/wikidump-000000001.ttl.gz
...
07:25:27.152 [org.wikidata.query.rdf.tool.rdf.AsyncRDFHandler$RDFActionsReplayer] INFO o.w.q.r.t.r.EntityMungingRdfHandler - Processed 57640000 entities at (751, 658, 702)
...
20:10:35.985 [org.wikidata.query.rdf.tool.rdf.AsyncRDFHandler$RDFActionsReplayer] INFO o.w.q.r.t.r.EntityMungingRdfHandler - Processed 105840000 entities at (2251, 1708, 1420)
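To check munge progress without following the whole log, the latest entity count can be pulled out of the "Processed N entities" messages. A minimal sketch, assuming the log format shown above; `last_processed` is a hypothetical helper that reads the log on stdin.

```shell
#!/bin/bash
# Print the entity count from the last "Processed N entities" line on stdin.
last_processed() {
  grep -o 'Processed [0-9]* entities' | tail -n 1 | awk '{ print $2 }'
}

# Example with two (shortened) lines in the format of the munge log
printf '%s\n' \
  'INFO o.w.q.r.t.r.EntityMungingRdfHandler - Processed 57640000 entities at (751, 658, 702)' \
  'INFO o.w.q.r.t.r.EntityMungingRdfHandler - Processed 105840000 entities at (2251, 1708, 1420)' \
  | last_processed
```

Against the real log the call would be `last_processed < nohup.out`.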
Loading
move files to split directory
We didn't quite follow the getting-started guide, so move the munged files to the location the loader expects
cd data
mkdir split
mv wiki* split
run blazegraph on different port
A load with a Blazegraph instance on port 9999 is already running at this time, so this import needs its own instance on a different port.
usage
service/runBlazegraph.sh -?
Usage: service/runBlazegraph.sh [-h <host>] [-d <dir>] [-c <context>] [-p <port>] [-o <blazegraph options>] [-f config.properties] [-n prefixes.conf] [-w wikibaseConceptUri] [-m commonsConceptUri] [-r -k <oauthConsumerKey> -s <oauthConsumerSecret -l <oauthSessionStoreLimit -i <oauthIndexUrl> -b <oauthNiceUrl> -g <wikiLogoutLink>]
start blazegraph
make sure we use the logback.xml from the previous import attempt
cp -p /hd/eneco/wikidata/logback.xml .
export LOG_CONFIG=/hd/mantax/wikidata/logback.xml
nohup service/runBlazegraph.sh -p 9998 > blazegraph.log 2>&1 &
ls -l service/wikidata.jnl
-rw-rw-r-- 1 wf wf 209715200 May 7 08:48 service/wikidata.jnl
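When capturing a server log, the order of the shell redirections decides whether stderr ends up in the file. The sketch below uses a stand-in function `emit` (hypothetical, not Blazegraph) to show the difference. Note that nohup redirects stderr to stdout on its own when stderr is a terminal, so under nohup both orders may produce the same file; without nohup they do not.

```shell
#!/bin/bash
# Stand-in for a program that writes to both streams.
emit() {
  echo "to stdout"
  echo "to stderr" >&2
}

tmp=$(mktemp -d)

# stderr is pointed at the current stdout first, then stdout moves to the file:
# only "to stdout" lands in the file.
emit 2>&1 > "$tmp/first.log"

# stdout moves to the file first, then stderr follows it:
# both lines land in the file.
emit > "$tmp/second.log" 2>&1

wc -l < "$tmp/first.log"
wc -l < "$tmp/second.log"
```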
start loading
cp -p /hd/eneco/wikidata/service/loadall.sh service
# patch port
diff service/loadRestAPI.sh /hd/eneco/wikidata/service/loadRestAPI.sh
4c4
< HOST=http://localhost:9998
---
> HOST=http://localhost:9999
nohup service/loadall.sh &
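The diff above shows that the only change to loadRestAPI.sh is the `HOST` line. Instead of editing by hand, the patch could be scripted roughly as follows. This is a sketch; `patch_host_port` is a hypothetical helper, and it assumes the script contains a single line starting with `HOST=`.

```shell
#!/bin/bash
# Rewrite the HOST= line of a script so it points at the given local port.
# Writes via a temp file and copies the content back, which keeps the
# permissions of the original script.
patch_host_port() {
  local script="$1" port="$2" tmp
  tmp=$(mktemp) || return 1
  if sed "s#^HOST=.*#HOST=http://localhost:${port}#" "$script" > "$tmp"; then
    cat "$tmp" > "$script"
  fi
  rm -f "$tmp"
}

# usage (run from the wikidata directory):
#   patch_host_port service/loadRestAPI.sh 9998
```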
stats
14735886663 is the total triple count as of 2023-05-08 03:50 Z
14737550155 is the total triple count as of 2023-05-09 04:51 Z
the avg triple addition per sec is 18
So on 2023-05-03 the estimated number of triples was 14.73 billion
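The growth rate and the back-extrapolated estimate can be reproduced from the two samples. A minimal sketch: `growth_rate` and `estimate_at` are hypothetical helper names, the 90060 s are the 25 h 1 min between the two samples, and the 445800 s assume that the estimate refers to 2023-05-03 00:00 Z (the source only gives the date).

```shell
#!/bin/bash
# Triples added per second between two counts ($1, $2) taken $3 seconds apart.
growth_rate() {
  awk -v a="$1" -v b="$2" -v s="$3" 'BEGIN { printf "%.0f\n", (b - a) / s }'
}

# Count in billions, extrapolated $3 seconds back from a sample $1
# at a constant rate of $2 triples per second.
estimate_at() {
  awk -v a="$1" -v rate="$2" -v s="$3" 'BEGIN { printf "%.2f\n", (a - rate * s) / 1e9 }'
}

# 2023-05-08 03:50 Z to 2023-05-09 04:51 Z is 90060 s
growth_rate 14735886663 14737550155 90060

# assumed reference point 2023-05-03 00:00 Z, i.e. 445800 s before the first sample
estimate_at 14735886663 18 445800
```

Because the rate is tiny compared to the total, the exact time of day chosen on 2023-05-03 does not change the rounded estimate.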
./stats
  #:  load s  total s  avg s  ETA h
  1:     332      332    332   97.5
...
 10:     187     3658    366  106.5
...
100:    1091    57211    572  152.2
...
200:     809   151396    757  180.4
...
300:    1390   249800    833  175.3
...
400:    2015   381781    954  174.5
...
500:    2634   520651   1041  161.4
...
600:     822   671829   1120  142.5
...
700:    3255   814348   1163  115.7
...
800:    2133  1011625   1265   90.6
...
858:    1824  1130655   1318   73.2
2023-05-07T09:15:08Z:   330154680
2023-05-09T04:59:20Z:  3138802740
2023-05-11T04:03:52Z:  5130043279
2023-05-15T05:38:05Z:  8575705308
2023-05-16T05:04:41Z:  9240046191
2023-05-17T04:32:28Z:  9999250942
2023-05-20T09:37:59Z: 11999119414
11.999 bill triples 10613 triples/s 7206 triples/s recently
ETA 4,4 Total 17,5 d
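The summary figures at the end of the stats output can be cross-checked with a little arithmetic. The sketch below is not the stats script itself (its source is not shown here). `load_eta` is a hypothetical helper, and the target of 14.73 billion triples is taken from the estimate above. The printed values happen to match this calculation.

```shell
#!/bin/bash
# $1 triples loaded so far, $2 seconds spent loading,
# $3 recent load rate in triples/s, $4 expected total number of triples
load_eta() {
  awk -v loaded="$1" -v secs="$2" -v recent="$3" -v target="$4" 'BEGIN {
    avg     = loaded / secs                        # overall average rate
    eta_d   = (target - loaded) / recent / 86400   # remaining days at the recent rate
    total_d = secs / 86400 + eta_d                 # elapsed plus remaining days
    printf "%.0f triples/s avg, ETA %.1f d, total %.1f d\n", avg, eta_d, total_d
  }'
}

# Values from the last stats row and the summary line
load_eta 11999119414 1130655 7206 14730000000
```

The ETA is based on the recent rate rather than the overall average, since the load rate drops as the journal grows.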