Wikidata Import 2023-04-26


Wolfgang Fahl

move files to split directory

We didn't quite follow the Getting Started guide, so we fix the location of the munged files:

wikidata/data$ mkdir split
wikidata/data$ mv wiki* split

prepare log directory

sudo mkdir -p /var/log/wdqs/
sudo chown $(id -un) /var/log/wdqs/

mv service to TB harddisk

mv service s
mkdir service
mv s/* service
rmdir s

start blazegraph

nohup service/runBlazegraph.sh > blazegraph.log 2>&1 &
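
A quick sanity check that the server is up (a sketch; it assumes the default port 9999 and that the wdq namespace used below already exists) is to send a trivial ASK query to the SPARQL endpoint:

# should return a boolean result once Blazegraph answers SPARQL requests
curl -s http://localhost:9999/bigdata/namespace/wdq/sparql --data-urlencode 'query=ASK {}'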

loadall.sh

This is the script I used in 2018:

#!/usr/bin/env bash
# load all munged wikidump files via loadRestAPI.sh
START=1
END=100000
FORMAT=wikidump-%09d.ttl.gz
LOCATION=$(pwd)/data/split
BASE=$(dirname "$0")
cd "$BASE"

while getopts s:e:d:h option
do
  case "${option}" in
    s) START=${OPTARG};;
    e) END=${OPTARG};;
    d) LOCATION=${OPTARG};;
    h)
      echo "Usage: $0 [-s <start>] [-e <end>] [-d <directory>] [-h]"
      exit 1
      ;;
  esac
done

i=$START
while [ "$i" -le "$END" ]; do
  # build the nine-digit file name, e.g. wikidump-000000001.ttl.gz
  printf -v f "$FORMAT" "$i"
  if [ -f "$LOCATION/$f.good" ]
  then
    echo "File $LOCATION/$f already imported"
  else
    if [ ! -f "$LOCATION/$f" ]
    then
      echo "File $LOCATION/$f not found, terminating"
      exit 0
    else
      ts=$(date -Iseconds)
      echo "Processing $f at $ts"
      ./loadRestAPI.sh -n wdq -d "$LOCATION/$f"
    fi
  fi
  let i++
done
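
Typical invocations, using the split directory of this import as an example:

# load the first ten split files
service/loadall.sh -s 1 -e 10
# resume from file 311, giving the split directory explicitly
service/loadall.sh -s 311 -e 1058 -d /hd/seel/wikidata/data/split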

load

nohup service/loadRestAPI.sh -n wdq -d "$(pwd)/data/split" &

Logfile issue

The log file grows far too quickly with the default log settings.

For testing, I reactivated the 128 GB RAM machine that I had used for the QLever tests last year.

Download munged files

cat data/split/getall
#!/bin/bash
# WF 2023-05-02
# fetch all 1058 munged wikidump files unless they are already present
base=http://wikidata.dbis.rwth-aachen.de/downloads/split/
for i in {0001..1058}
do
  file=wikidump-00000$i.ttl.gz
  url=$base$file
  if [ ! -f "$file" ]
  then
    wget "$url"
  else
    echo "$file ✅"
  fi
done
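
Since a truncated download would only surface as an error much later during the import, it may be worth checking the gzip integrity of the fetched files first; a minimal sketch:

# gzip -t decompresses each dump to /dev/null and reports errors
for f in wikidump-*.ttl.gz
do
  gzip -t "$f" || echo "$f is corrupt - fetch it again"
done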

Run load for a single file

service/loadall.sh -s 1 -e 1
Processing wikidump-000000001.ttl.gz at 2023-05-02T18:31:50+02:00
Loading with properties...
quiet=false
verbose=0
closure=false
durableQueues=true
#Needed for quads
#defaultGraph=
com.bigdata.rdf.store.DataLoader.flush=false
com.bigdata.rdf.store.DataLoader.bufferCapacity=100000
com.bigdata.rdf.store.DataLoader.queueCapacity=10
#Namespace to load
namespace=wdq
#Files to load
fileOrDirs=/hd/seel/wikidata/data/split/wikidump-000000001.ttl.gz
#Property file (if creating a new namespace)
propertyFile=/hd/seel/wikidata/service/RWStore.properties

Check size of log file

The log file grows at about 26 MByte/s, which is far too much: this smaller 128 GB test machine only has a 2 TB SSD in the first place.

date;du -sm blazegraph.log;
Di 2. Mai 18:37:00 CEST 2023
9603	blazegraph.log
date;du -sm blazegraph.log;
Di 2. Mai 18:37:02 CEST 2023
9637	blazegraph.log
date;du -sm blazegraph.log;
Di 2. Mai 18:37:51 CEST 2023
10951	blazegraph.log
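
A small helper script (a sketch, assuming GNU du and date) automates this sampling and reports the growth rate directly:

#!/bin/bash
# measure the growth rate of blazegraph.log over a 10 second window
f=blazegraph.log
s1=$(du -sm "$f" | cut -f1)
sleep 10
s2=$(du -sm "$f" | cut -f1)
echo "$f grows at $(( (s2-s1) / 10 )) MByte/s"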

Logback.xml candidates

Generated from ERB template

Generated from https://github.com/wikimedia/operations-puppet/blob/production/modules/query_service/templates/logback.xml.erb:

logback.xml

<?xml version="1.0" encoding="UTF-8"?>
<configuration scan="true"  scanPeriod="5 minutes" packagingData="false">

    <!-- ugly trick to ensure ${HOSTNAME} is evaluated -->
    <property scope="context" name="hostname" value="${HOSTNAME}" />

    <!--
        File based logs:
        * rolling every day or when size > 100MB
    -->
    <appender name="file" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <file>PATH/TO/LOGS/rdf-query-service.log</file>
        <rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
            <!-- daily rollover -->
            <fileNamePattern>PATH/TO/LOGS/rdf-query-service.%d{yyyy-MM-dd}.%i.log.gz</fileNamePattern>
            <maxFileSize>100MB</maxFileSize>
            <maxHistory>30</maxHistory>
        </rollingPolicy>
        <filter class="org.wikidata.query.rdf.common.log.PerLoggerThrottler" />
        <encoder class="ch.qos.logback.classic.encoder.PatternLayoutEncoder">
            <pattern>%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg %mdc%n%rEx{1,QUERY_TIMEOUT,SYNTAX_ERROR}</pattern>
            <outputPatternAsHeader>true</outputPatternAsHeader>
        </encoder>
    </appender>
    <appender name="async-file" class="ch.qos.logback.classic.AsyncAppender">
        <neverBlock>true</neverBlock>
        <appender-ref ref="file" />
    </appender>

    <!--
        Console based logs:
        * per logger / message throttling is enabled
        * limited to 10 messages per second
        * level => ERROR
    -->
    <appender name="stdout" class="ch.qos.logback.core.ConsoleAppender">
        <filter class="org.wikidata.query.rdf.common.log.PerLoggerThrottler" />
        <filter class="org.wikidata.query.rdf.common.log.RateLimitFilter">
            <bucketCapacity>10</bucketCapacity>
            <refillIntervalInMillis>1000</refillIntervalInMillis>
        </filter>
        <filter class="ch.qos.logback.classic.filter.ThresholdFilter">
            <level>error</level>
        </filter>
        <encoder class="ch.qos.logback.classic.encoder.PatternLayoutEncoder">
            <pattern>%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg %mdc%n%rEx{1,QUERY_TIMEOUT,SYNTAX_ERROR}</pattern>
            <outputPatternAsHeader>true</outputPatternAsHeader>
        </encoder>
    </appender>
    <appender name="async-stdout" class="ch.qos.logback.classic.AsyncAppender">
        <neverBlock>true</neverBlock>
        <appender-ref ref="stdout" />
    </appender>

    <root level="info">
        <appender-ref ref="async-file"/>
        <appender-ref ref="async-stdout"/>
    </root>

    <logger name="org.wikidata.query.rdf" level="info"/>
    <logger name="org.wikidata.query.rdf.blazegraph.inline.literal.AbstractMultiTypeExtension" level="error"/>
    <logger name="com.bigdata" level="warn"/>
    <logger name="com.bigdata.util.concurrent.Haltable" level="off"/>
    <logger name="com.bigdata.rdf.internal.LexiconConfiguration" level="off"/> <!-- disabled temp. ref: T207643 -->

</configuration>

Sample configuration

Sample from https://logback.qos.ch/manual/configuration.html:

<configuration>

  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <!-- encoders are assigned the type
         ch.qos.logback.classic.encoder.PatternLayoutEncoder by default -->
    <encoder>
      <pattern>%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} -%kvp- %msg%n</pattern>
    </encoder>
  </appender>

  <root level="debug">
    <appender-ref ref="STDOUT" />
  </root>
</configuration>

retries

with a simple logback.xml

export LOG_CONFIG=/hd/seel/wikidata/logback.xml 
nohup service/runBlazegraph.sh > blazegraph.log 2>&1 &
service/loadall.sh -s 1 -e 1
date;du -sm blazegraph.log 
Di 2. Mai 19:15:40 CEST 2023
1630	blazegraph.log
date;du -sm blazegraph.log
Di 2. Mai 19:15:42 CEST 2023
1654	blazegraph.log

with pasted logback.xml

The data disk is mounted as /hd/seel on sun and as /hd/eneco on the dbis wikidata machine.

pgrep -fl java
66671 java
kill 66671
rm service/wikidata.jnl 
wget https://phab.wmfusercontent.org/file/data/cz4q6ew7kksba6k5ocyx/PHID-FILE-eevhah5inj6jl2sy3nas/basic_rdf_query_service_logback.xml
export LOG_CONFIG=/hd/eneco/wikidata/basic_rdf_query_service_logback.xml
ls -l $(echo $LOG_CONFIG)
-rw-rw-r-- 1 wf wf 3030 May  3 09:23 /hd/eneco/wikidata/basic_rdf_query_service_logback.xml
nohup service/runBlazegraph.sh > blazegraph.log 2>&1 &
wf@sun:/hd/seel/wikidata/data/split$ mv wikidump-000000001.ttl.gz.fail wikidump-000000001.ttl.gz
service/loadall.sh -s 1 -e 1
# in another terminal:
...
date;du -sm service/wikidata.jnl 
Di 2. Mai 19:28:12 CEST 2023
1530	service/wikidata.jnl
date;du -sm service/wikidata.jnl 
Di 2. Mai 19:28:15 CEST 2023
1557	service/wikidata.jnl
<?xml version="1.0"?><data modified="0" milliseconds="483615"/>
wikidump-000000001.ttl.gz.good

with fixed logback.xml

check fix

diff logback.xml basic_rdf_query_service_logback.xml 
12c12
<         <file>/var/log/wdqs/rdf-query-service.log</file>
---
>         <file>PATH/TO/LOGS/rdf-query-service.log</file>
15c15
<             <fileNamePattern>/var/log/wdqs/rdf-query-service.%d{yyyy-MM-dd}.%i.log.gz</fileNamePattern>
---
>             <fileNamePattern>PATH/TO/LOGS/rdf-query-service.%d{yyyy-MM-dd}.%i.log.gz</fileNamePattern>
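
The fix is thus a plain substitution of the PATH/TO/LOGS placeholder; a one-liner sketch using the log directory prepared above:

sed 's|PATH/TO/LOGS|/var/log/wdqs|g' basic_rdf_query_service_logback.xml > logback.xml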

restart blazegraph

pgrep -fla blazegraph-service
pkill -f blazegraph-service
rm blazegraph.log
export LOG_CONFIG=/hd/eneco/wikidata/logback.xml
nohup service/runBlazegraph.sh > blazegraph.log 2>&1 &
ls -l service/wikidata.jnl 
-rw-rw-r-- 1 wf wf 209715200 May  3 09:41 service/wikidata.jnl

Count triples in blazegraph

sparql query

SELECT ( COUNT( * ) AS ?count ) { ?s ?p ?o }
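
The same count can also be run directly against the local endpoint with curl (a sketch; the endpoint URL matches the endpoints.yaml entry below):

curl -s -H 'Accept: application/sparql-results+json' \
  --data-urlencode 'query=SELECT ( COUNT( * ) AS ?count ) { ?s ?p ?o }' \
  http://localhost:9999/bigdata/namespace/wdq/sparql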

queries.yaml

in $HOME/.pylodstorage

'triplecount':
    sparql: |
                SELECT ( COUNT( * ) AS ?count ) { ?s ?p ?o }

endpoints.yaml

# SPARQL endpoints for sparqlquery tool
# 2023-05-03 WF 
'wdimport':
  endpoint: http://localhost:9999/bigdata/namespace/wdq/sparql
  website: http://blazegraph.wikidata.dbis.rwth-aachen.de
  database: blazegraph
  lang: sparql
  prefixes: |
    PREFIX bd: <http://www.bigdata.com/rdf#>
    PREFIX cc: <http://creativecommons.org/ns#>
    PREFIX dct: <http://purl.org/dc/terms/>
    PREFIX geo: <http://www.opengis.net/ont/geosparql#>
    PREFIX ontolex: <http://www.w3.org/ns/lemon/ontolex#>
    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    PREFIX p: <http://www.wikidata.org/prop/>
    PREFIX pq: <http://www.wikidata.org/prop/qualifier/>
    PREFIX pqn: <http://www.wikidata.org/prop/qualifier/value-normalized/>
    PREFIX pqv: <http://www.wikidata.org/prop/qualifier/value/>
    PREFIX pr: <http://www.wikidata.org/prop/reference/>
    PREFIX prn: <http://www.wikidata.org/prop/reference/value-normalized/>
    PREFIX prov: <http://www.w3.org/ns/prov#>
    PREFIX prv: <http://www.wikidata.org/prop/reference/value/>
    PREFIX ps: <http://www.wikidata.org/prop/statement/>
    PREFIX psn: <http://www.wikidata.org/prop/statement/value-normalized/>
    PREFIX psv: <http://www.wikidata.org/prop/statement/value/>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX schema: <http://schema.org/>
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    PREFIX wd: <http://www.wikidata.org/entity/>
    PREFIX wdata: <http://www.wikidata.org/wiki/Special:EntityData/>
    PREFIX wdno: <http://www.wikidata.org/prop/novalue/>
    PREFIX wdref: <http://www.wikidata.org/reference/>
    PREFIX wds: <http://www.wikidata.org/entity/statement/>
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>
    PREFIX wdtn: <http://www.wikidata.org/prop/direct-normalized/>
    PREFIX wdv: <http://www.wikidata.org/value/>
    PREFIX wikibase: <http://wikiba.se/ontology#>
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

test sparqlquery command line tool

sparqlquery -qn triplecount -en wdimport -f json
[
  {
    "count": 457000496
  }
]

stats / ETA script

Using the sparqlquery command line tool from https://pypi.org/project/pylodstorage/:

#!/bin/bash
# WF 2023-05-03
# statistics for wikidata loading
log=/tmp/log$$
cat loadall.log > "$log"
cat nohup.out >> "$log"
isodate=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
triples=$(sparqlquery -en wdimport -qn triplecount -f json | jq ".[].count")
awk -v isodate=$isodate -v total_files=1058 -v triples=$triples '
BEGIN {
  FS="="
  printf("   #: load s  total s  avg s   ETA h\n");
}
# remember the number of the file currently being loaded
/fileOrDirs=/ {
  filenum=get_number($2)
  next
}
# one line per finished file: load time, running total, average and ETA
/<data modified/ {
  msecs=get_number($4)
  sec=msecs/1000
  totals+=sec
  avg=totals/filenum
  eta=(total_files-filenum)*avg/3600
  printf("%4d: %6d %8d %6.0f %8.1f\n",filenum,sec,totals,avg,eta);
}
# extract the first run of digits from the given string
function get_number(num_str) {
  if (match(num_str, /[0-9]+/)) {
    num=substr(num_str, RSTART, RLENGTH)
    return num
  }
}
END {
  printf("%s:%d\n",isodate,triples)
  printf("%6.3f bill triples %6.0f triples/s\n",triples/1000000000,triples/totals)
}
' "$log"
rm "$log"

progress on wikidata dbis

Triples:

2023-05-03T09:38:00Z:  288488184
2023-05-03T11:22:52Z:  476800190
2023-05-03T16:17:41Z:  874762589
2023-05-04T05:51:17Z: 1566635203
2023-05-05T04:48:47Z: 2320966288
2023-05-06T05:20:33Z: 2921548912
2023-05-07T06:59:52Z: 3406082541
2023-05-08T03:32:44Z: 3750926849
2023-05-09T04:44:11Z: 4004639109
2023-05-11T04:01:02Z:  439241844
recent: 2278 triples/s ETA: 52.5 days
./stats 
   #: load s  total s  avg s   ETA h
   1:    331      331    331     97.2
   2:    355      686    343    100.7
   3:    394     1081    360    105.6
   4:    445     1526    382    111.8
   5:    404     1931    386    113.0
   6:    387     2319    387    113.0
   7:    401     2720    389    113.5
   8:    426     3147    393    114.7
   9:    435     3582    398    116.0
  10:    200     3782    378    110.1
  11:    201     3984    362    105.4
  12:    213     4198    350    101.7
  13:    318     4517    347    100.9
  14:    278     4795    343     99.3
  15:    232     5028    335     97.1
  16:    160     5188    324     93.9
  17:    372     5560    327     94.6
  18:    425     5986    333     96.1
  19:    421     6407    337     97.3
  20:    359     6767    338     97.6
  21:    260     7027    335     96.4
  22:    311     7339    334     96.0
  23:    308     7647    333     95.6
  24:    290     7938    331     95.0
  25:    375     8314    333     95.4
  26:    274     8589    330     94.7
  27:    266     8855    328     93.9
  28:    432     9287    332     94.9
  29:    275     9563    330     94.3
  30:    210     9773    326     93.0
  31:    549    10323    333     95.0
  32:    629    10952    342     97.5
  33:    575    11527    349     99.5
  34:    429    11957    352    100.0
  35:    627    12584    360    102.2
  36:    317    12902    358    101.7
  37:    180    13082    354    100.3
...
  54:   1060    30615    567    158.1
...
 106:   2458    77958    735    194.5
...
 170:    929   160254    943    232.5
...
 200:   1756   250171   1251    298.1
...
 239:   3384   339178   1419    322.9
...
 291:   3704   503140   1729    368.4
...
 311:   8835   673480   2166    449.3
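
A rate like the "recent" figure above can be computed from any two samples of the triples list (a sketch, assuming GNU date; the sample values are taken from the progress list above):

#!/bin/bash
# triples/s between two timestamped triple counts
t1=2023-05-08T03:32:44Z; c1=3750926849
t2=2023-05-09T04:44:11Z; c2=4004639109
secs=$(( $(date -d "$t2" +%s) - $(date -d "$t1" +%s) ))
echo "$(( (c2-c1) / secs )) triples/s over $secs seconds"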