Wikidata Import 2023-05-05

From BITPlan Wiki

Import

state  ✅
url  https://wiki.bitplan.com/index.php/Wikidata_Import_2023-05-05
target  blazegraph
start  2023-05-05
end  
days  
os  Ubuntu 22.04.2 LTS
cpu  Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz
ram  256 GB
triples  14.7 billion
comment  


Download ~6h:50

Download Options

https://dumps.wikimedia.org/wikidatawiki/entities

dcatap.rdf                                         04-May-2023 18:19               84753
latest-all.json.bz2                                03-May-2023 21:06         81640390615
latest-all.json.gz                                 03-May-2023 12:47        123885468527
latest-all.nt.bz2                                  04-May-2023 16:07        158382342866
latest-all.nt.gz                                   03-May-2023 22:23        205171447838
latest-all.ttl.bz2                                 04-May-2023 03:24        101606862077
latest-all.ttl.gz                                  03-May-2023 17:08        124093922794
latest-lexemes.json.bz2                            03-May-2023 03:53           305234182
latest-lexemes.json.gz                             03-May-2023 03:51           416121890
latest-lexemes.nt.bz2                              28-Apr-2023 23:34           778797047
latest-lexemes.nt.gz                               28-Apr-2023 23:29          1019519966
latest-lexemes.ttl.bz2                             28-Apr-2023 23:30           440519100
latest-lexemes.ttl.gz                              28-Apr-2023 23:26           548481488
latest-truthy.nt.bz2                               28-Apr-2023 22:23         36023954950
latest-truthy.nt.gz                                28-Apr-2023 19:07         59758277315

download script

cat download.sh 
#!/bin/bash
# WF 2023-04-26
# download wikidata dumps
baseurl=https://dumps.wikimedia.org/wikidatawiki/entities/
for file in latest-all latest-lexemes
do
  for ext in ttl.bz2
  do
    url=$baseurl/$file.$ext
    log=$file-$ext.log
    # wget reports progress on stderr, so capture both streams in the log
    nohup wget "$url" >> "$log" 2>&1 &
  done
done
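
Since the downloads run unattended in the background, it can help to compare the size of each finished file with the size shown in the directory listing. The following is a minimal sketch; check_size is a hypothetical helper and not part of download.sh.

```shell
#!/bin/bash
# Sketch: compare the size of a downloaded file with an expected byte count.
# check_size is a hypothetical helper, not part of download.sh.
check_size() {
  local file=$1 expected=$2 actual
  if [ ! -f "$file" ]; then
    echo "$file: missing"
    return 2
  fi
  # wc -c counts bytes; tr strips the padding some wc versions add
  actual=$(wc -c < "$file" | tr -d ' ')
  if [ "$actual" -eq "$expected" ]; then
    echo "$file: complete ($actual bytes)"
  else
    echo "$file: incomplete ($actual of $expected bytes)"
    return 1
  fi
}

# Example, with the size taken from the listing above:
# check_size latest-all.ttl.bz2 101606862077
```

If a file turns out to be incomplete, wget -c can usually resume the transfer instead of starting over.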

Download logs

--2023-05-05 08:09:14--  https://dumps.wikimedia.org/wikidatawiki/entities//latest-all.ttl.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.142, 2620:0:861:2:208:80:154:142
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 101606862077 (95G) [application/octet-stream]
Saving to: ‘latest-all.ttl.bz2’

     0K .......... .......... .......... .......... ..........  0%  400K 2d20h
    50K .......... .......... .......... .......... ..........  0%  222K 4d0h
   100K .......... .......... .......... .......... ..........  0%  399K 3d15h
99225450K .                                                     100% 2.32T=6h50m

2023-05-05 15:00:04 (3.93 MB/s) - ‘latest-all.ttl.bz2’ saved [101606862077/101606862077]
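
The rate reported by wget can be cross-checked from the two timestamps and the byte count in the log. This sketch assumes GNU date (as shipped with Ubuntu) and reads wget's "MB/s" as 1048576 bytes per second.

```shell
#!/bin/bash
# Sketch: recompute elapsed time and average download rate from the log above.
# Requires GNU date for the -d option.
start=$(date -u -d '2023-05-05 08:09:14' +%s)
end=$(date -u -d '2023-05-05 15:00:04' +%s)
bytes=101606862077

elapsed=$((end - start))   # seconds between first and last log line
# LC_ALL=C keeps the decimal point independent of the local locale
rate=$(LC_ALL=C awk -v b="$bytes" -v s="$elapsed" \
  'BEGIN { printf "%.2f", b / s / 1048576 }')

printf 'elapsed: %dh%02dm%02ds\n' \
  $((elapsed / 3600)) $((elapsed % 3600 / 60)) $((elapsed % 60))
echo "average rate: $rate MiB/s"
```

This gives 6h50m50s and 3.93 MiB/s, in line with the final wget log line.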

Munging ~29 h

Preparation

The preparation steps are the same as for the previous import, see Wikidata_Import_2023-04-26#Preparation_.7E20-30_min

calling munge.sh

domunge.sh

#!/bin/bash
# WF 2023-04-29
# munge the dump; start this script with nohup to run it in the background
bzcat latest-all.ttl.bz2 | service/munge.sh -f - -d data -- --skolemize

start domunge.sh and show nohup.out log

nohup ./domunge.sh &
tail -f  nohup.out
#logback.classic pattern: %d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n
15:51:26.169 [org.wikidata.query.rdf.tool.rdf.AsyncRDFHandler$RDFActionsReplayer] INFO  org.wikidata.query.rdf.tool.Munge - Switching to data/wikidump-000000001.ttl.gz
...
07:25:27.152 [org.wikidata.query.rdf.tool.rdf.AsyncRDFHandler$RDFActionsReplayer] INFO  o.w.q.r.t.r.EntityMungingRdfHandler - Processed 57640000 entities at (751, 658, 702)
...
20:10:35.985 [org.wikidata.query.rdf.tool.rdf.AsyncRDFHandler$RDFActionsReplayer] INFO  o.w.q.r.t.r.EntityMungingRdfHandler - Processed 105840000 entities at (2251, 1708, 1420)
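
To follow the munge progress without reading the whole log, the latest "Processed N entities" count can be extracted from nohup.out. This is a minimal sketch based on the log format shown above; latest_entities is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: print the most recent entity count reported in the munge log.
# latest_entities is a hypothetical helper; it relies on log lines of the
# form "... Processed 57640000 entities at (...)".
latest_entities() {
  local log=${1:-nohup.out}
  grep -o 'Processed [0-9]* entities' "$log" | tail -n 1 | awk '{ print $2 }'
}

# Example:
# latest_entities nohup.out
```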

Loading

move files to split directory

We did not quite follow the getting started guide, so the munged files first have to be moved to the split directory.

# run in wikidata/data
mkdir split
mv wiki* split
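
Before loading, it may be worth checking how many munged chunk files ended up in the split directory. The sketch below assumes the wikidump-*.ttl.gz naming seen in the munge log; count_chunks is a hypothetical helper.

```shell
#!/bin/bash
# Sketch: count the munged chunk files in a directory.
# count_chunks is a hypothetical helper; the file name pattern is taken
# from the munge log (data/wikidump-000000001.ttl.gz).
count_chunks() {
  local dir=${1:-split}
  find "$dir" -maxdepth 1 -name 'wikidump-*.ttl.gz' | wc -l | tr -d ' '
}

# Example:
# count_chunks split
```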

run blazegraph on different port

A load with a Blazegraph instance on port 9999 is already running at this time, so this import needs a different port.

usage

service/runBlazegraph.sh -?
Usage: service/runBlazegraph.sh [-h <host>] [-d <dir>] [-c <context>] [-p <port>]  [-o <blazegraph options>] [-f config.properties] [-n prefixes.conf] [-w wikibaseConceptUri] [-m commonsConceptUri]  [-r -k <oauthConsumerKey> -s <oauthConsumerSecret -l <oauthSessionStoreLimit -i <oauthIndexUrl> -b <oauthNiceUrl> -g <wikiLogoutLink>]

start blazegraph

make sure we use the logback.xml from the previous import attempt

cp -p /hd/eneco/wikidata/logback.xml .
export LOG_CONFIG=/hd/mantax/wikidata/logback.xml
nohup service/runBlazegraph.sh -p 9998 > blazegraph.log 2>&1 &
ls -l service/wikidata.jnl 
-rw-rw-r-- 1 wf wf 209715200 May  7 08:48 service/wikidata.jnl
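
Blazegraph needs a moment before it accepts requests, so a small polling loop can be used before the load is started. This is a sketch only: wait_for_url and PROBE are hypothetical names, and the endpoint path in the example is an assumption based on the usual query service layout (context bigdata, namespace wdq).

```shell
#!/bin/bash
# Sketch: poll a URL until it answers or the retries run out.
# wait_for_url and PROBE are hypothetical names; PROBE defaults to curl,
# which exits non-zero on connection failure or an HTTP error (-f).
PROBE=${PROBE:-curl -sf -o /dev/null}

wait_for_url() {
  local url=$1 tries=${2:-30} delay=${3:-2} i
  for ((i = 1; i <= tries; i++)); do
    if $PROBE "$url"; then
      echo "up after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
  done
  echo "no answer after $tries attempt(s)"
  return 1
}

# Example (endpoint path is an assumption):
# wait_for_url http://localhost:9998/bigdata/namespace/wdq/sparql
```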

start loading

cp -p /hd/eneco/wikidata/service/loadall.sh service
# patch port
diff service/loadRestAPI.sh /hd/eneco/wikidata/service/loadRestAPI.sh 
4c4
< HOST=http://localhost:9998
---
> HOST=http://localhost:9999
nohup service/loadall.sh &

stats

14735886663 is the total triple count as of 2023-05-08 03:50 Z
14737550155 is the total triple count as of 2023-05-09 04:51 Z
The average triple addition per second is 18.
So on 2023-05-03 the estimated number of triples was 14.73 billion.
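
The estimate can be reproduced by a linear extrapolation from the two measurements. The sketch assumes GNU date and uses midnight UTC on 2023-05-03 as the reference time, which is an assumption since the text only gives the day.

```shell
#!/bin/bash
# Sketch: growth rate between two triple counts and back-extrapolation.
# Requires GNU date; the 00:00 reference time on 2023-05-03 is an assumption.
t1=$(date -u -d '2023-05-08 03:50' +%s); c1=14735886663
t2=$(date -u -d '2023-05-09 04:51' +%s); c2=14737550155
t0=$(date -u -d '2023-05-03 00:00' +%s)

result=$(LC_ALL=C awk -v t1="$t1" -v c1="$c1" -v t2="$t2" -v c2="$c2" -v t0="$t0" '
BEGIN {
  rate = (c2 - c1) / (t2 - t1)     # triples added per second
  est  = c1 - rate * (t1 - t0)     # count extrapolated back to t0
  printf "rate: %.0f triples/s\n", rate
  printf "estimate: %.2f billion triples\n", est / 1e9
}')
echo "$result"
```

With these inputs the sketch reports 18 triples/s and 14.73 billion triples.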

./stats 
   #: load s  total s  avg s   ETA h
   1:    332      332    332     97.5
...
  10:    187     3658    366    106.5
 ...
 100:   1091    57211    572    152.2
...
 200:    809   151396    757    180.4
...
 300:   1390   249800    833    175.3
...
 400:   2015   381781    954    174.5
...
 500:   2634   520651   1041    161.4
...
 600:    822   671829   1120    142.5
...
 700:   3255   814348   1163    115.7
...
 800:   2133  1011625   1265     90.6
...
 858:   1824  1130655   1318     73.2
2023-05-07T09:15:08Z:  330154680
2023-05-09T04:59:20Z: 3138802740
2023-05-11T04:03:52Z: 5130043279
2023-05-15T05:38:05Z: 8575705308
2023-05-16T05:04:41Z: 9240046191
2023-05-17T04:32:28Z: 9999250942
2023-05-20T09:37:59Z:11999119414
11.999 bill triples  10613 triples/s 7206 triples/s recently
ETA 4,4 Total 17,5 d
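
The ETA figures above can be recomputed from the numbers shown. The stats script itself is not listed here, so the following is only a sketch of arithmetic that matches its output: the ETA h column fits a total of 1058 files (a value derived from the table, not stated in it), and the ETA in days fits the remaining triples divided by the recent rate, assuming 14.73 billion triples in total.

```shell
#!/bin/bash
# Sketch: recompute the ETA values of the stats output.
# eta_hours and eta_days are hypothetical helpers; 1058 files and
# 14.73 billion expected triples are assumptions derived from the text.

# remaining files times average seconds per file, in hours
eta_hours() {   # args: files loaded, total seconds so far, total files
  LC_ALL=C awk -v n="$1" -v total="$2" -v files="$3" \
    'BEGIN { printf "%.1f", (files - n) * (total / n) / 3600 }'
}

# remaining triples divided by the recent load rate, in days
eta_days() {    # args: expected triples, loaded triples, recent triples/s
  LC_ALL=C awk -v expected="$1" -v loaded="$2" -v rate="$3" \
    'BEGIN { printf "%.1f", (expected - loaded) / rate / 86400 }'
}

echo "ETA after file 858: $(eta_hours 858 1130655 1058) h"
echo "ETA by triples: $(eta_days 14730000000 11999119414 7206) d"
```

This yields 73.2 h for row 858 and 4.4 d for the triple-based estimate. Adding the 1130655 s (about 13.1 d) already spent gives the total of about 17.5 d.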