WikiData Import 2020-07-30


Environment

  1. Mac Pro Mid 2010
  2. 12 core 3.46 GHz
  3. 64 GB RAM
  4. macOS High Sierra 10.13.6
  5. Source Disk: 4 TB 7200 rpm hard disk WD Gold WDC WD4002FYYZ, Blackmagic speed rating: 175 MB/s write, 175 MB/s read
  6. Target Disk: 4 TB SSD Samsung 860 EVO, Blackmagic speed rating: 257 MB/s write, 270 MB/s read
  7. java -version
    openjdk version "11.0.5" 2019-10-15
    OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.5+10)
    OpenJDK 64-Bit Server VM AdoptOpenJDK (build 11.0.5+10, mixed mode)
    

Summary

  1. trying to replicate the success story of https://issues.apache.org/jira/projects/JENA/issues/JENA-1909
  2. download of 110 GB took some 6 h 30 min
  3. unzipping to some 2160 GB took more than 1/2 day
  4. import of some 12 billion triples is in progress and at 6.245 billion triples after 285,481 seconds (about 3.3 days)

Download and unpack

This download was done with the "latest-all.nt" dataset.

wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.nt.bz2

--2020-07-30 06:40:18--  https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.nt.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 2620::861:1:208:80:154:7, 208.80.154.7
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|2620::861:1:208:80:154:7|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 118776910150 (111G) [application/octet-stream]
Saving to: ‘latest-all.nt.bz2’
latest-all.nt.bz2     0%[                    ] 635.10M  5.02MB/s    eta 6h 19m
...
latest-all.nt.bz2   100%[===================>] 110.62G  4.88MB/s    in 6h 31m  
2020-07-30 13:11:49 (4.82 MB/s) - ‘latest-all.nt.bz2’ saved [118776910150/118776910150]
bzip2 -dk latest-all.nt.bz2
ls -l latest-all.nt
-rw-------  1 wf  admin  1980899328 Jul 30 17:56 latest-all.nt
# failed due to limited disk space
# retry next morning
nohup bzip2 -ckd latest-all.nt.bz2 > /Volumes/Torterra/wikidata2020-07-31/latest-all.nt&
...
ls -l latest-all.nt 
-rw-r--r--  1 wf  admin  2162713035569 Jul 31 22:24 latest-all.nt
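
A download of this size may get interrupted, so a resumable variant with a size check can be useful. The following is a minimal sketch that was not part of the original run: `wget -c` continues a partial file, and the hypothetical helper `size_matches` compares the local size with the length that wget reported in the transcript above (118776910150 bytes). The names `size_matches` and `fetch` are made up for this example.

```shell
#!/bin/bash
# Sketch only: size_matches and fetch are hypothetical helpers, not part of the original run.
url=https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.nt.bz2
file=latest-all.nt.bz2
expected=118776910150  # "Length:" reported by wget for the dump fetched on 2020-07-30

# size_matches FILE BYTES: succeeds when FILE exists and is exactly BYTES long
size_matches() {
  [ -f "$1" ] && [ "$(wc -c < "$1" | tr -d ' ')" = "$2" ]
}

# fetch: download (or resume) the dump, then verify its size
fetch() {
  wget -c "$url" && size_matches "$file" "$expected"
}

# only the check runs here; call "fetch" to start or resume the download
if size_matches "$file" "$expected"; then
  echo "$file is complete"
else
  echo "$file is missing or incomplete, run: fetch"
fi
```

Since the `latest-all` URL points to a different dump over time, the expected length only applies to the dump of that day and would have to be adapted for any other download.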

bzip2 issue

The first bzip2 run failed because the disk ran out of space. See the retry in the previous section for how bzip2 is used to extract from the rotating disk to the SSD.

bzip2: I/O or other error, bailing out.  Possible reason follows.
bzip2: No space left on device
	Input file = latest-all.nt.bz2, output file = latest-all.nt
bzip2: Deleting output file latest-all.nt, if it exists.
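
A failure like this can be caught before hours of decompression are lost by checking the free space on the target first. The following is a sketch under assumptions, not what was done in the original run: the helper names `free_kb` and `enough_space` are hypothetical, and the required size is the uncompressed size observed in the listing above (2162713035569 bytes), which is only known after a successful run.

```shell
#!/bin/bash
# Sketch only: free_kb and enough_space are hypothetical helpers.

# free_kb DIR: available space on the filesystem holding DIR, in 1024-byte blocks
# (df -P prints one line per filesystem; column 4 is the available space)
free_kb() {
  df -Pk "$1" | awk 'NR==2 {print $4}'
}

# enough_space DIR NEEDED_KB: succeeds when DIR has at least NEEDED_KB available
enough_space() {
  local avail
  avail=$(free_kb "$1")
  [ "$avail" -ge "$2" ]
}

target=${1:-.}
# assumption: the uncompressed dump needs 2162713035569 bytes, as observed above
needed_kb=$((2162713035569 / 1024 + 1))
if enough_space "$target" "$needed_kb"; then
  echo "enough space on $target"
else
  echo "not enough space on $target: $(free_kb "$target") KB free, $needed_kb KB needed"
fi
```

The awk column pick assumes a filesystem name without blanks. For a new dump the uncompressed size is not known in advance, so a generous estimate would have to be used instead.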

Counting and copying

date;cp -p /Volumes/Torterra/wikidata2020-07-31/latest-all.nt .;date
Sat Aug  1 07:42:32 CEST 2020
Sat Aug  1 12:21:06 CEST 2020
date;wc -l latest-all.nt;date
Sat Aug  1 07:42:43 CEST 2020
 13738317356 latest-all.nt
Sat Aug  1 10:52:57 CEST 2020
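
The separate `wc -l` pass above took a bit more than three hours (07:42 to 10:52). As a sketch of an alternative that was not used in the original run, the lines could be counted while the archive is being decompressed, so that the file of about 2 TB does not have to be read again. The function name `store_and_count` is made up for this example.

```shell
#!/bin/bash
# Sketch only: store_and_count is a hypothetical helper.

# store_and_count OUTPUT: copy stdin to OUTPUT and print the number of lines seen
store_and_count() {
  tee "$1" | wc -l | tr -d ' '   # tr strips the padding that BSD wc adds
}

archive=latest-all.nt.bz2
target=latest-all.nt
if [ -f "$archive" ]; then
  # decompress to stdout (-c) and keep the archive (-k), as in the retry above
  lines=$(bzip2 -ckd "$archive" | store_and_count "$target")
  echo "$target has $lines lines"
fi
```

Without further measures a bzip2 failure in the middle of the pipeline goes unnoticed. Adding `set -o pipefail` at the top of the script would make such a failure show up in the exit status.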

Start and progress

nohup ./wikidata2jena&
tail -f tdb2-err.log 
13:54:03 INFO  loader          :: Add: 25.000.000 latest-all.nt (Batch: 103.369 / Avg: 104.489)
13:54:03 INFO  loader          ::   Elapsed: 239,26 seconds [2020/08/01 13:54:03 MESZ]
...
19:27:43 INFO  loader          :: Add: 1.000.000.000 latest-all.nt (Batch: 17.564 / Avg: 49.359)
19:27:43 INFO  loader          ::   Elapsed: 20.259,39 seconds [2020/08/01 19:27:43 MESZ]
...
06:53:39 INFO  loader          :: Add: 2.000.000.000 latest-all.nt (Batch: 27.135 / Avg: 32.564)
06:53:39 INFO  loader          ::   Elapsed: 61.415,66 seconds [2020/08/02 06:53:39 MESZ]
...
16:46:07 INFO  loader          :: Add: 3.000.000.000 latest-all.nt (Batch: 40.943 / Avg: 30.939)
16:46:07 INFO  loader          ::   Elapsed: 96.963,08 seconds [2020/08/02 16:46:07 MESZ]
...
02:47:19 INFO  loader          :: Add: 4.000.000.000 latest-all.nt (Batch: 18.551 / Avg: 30.067)
02:47:19 INFO  loader          ::   Elapsed: 133.034,73 seconds [2020/08/03 02:47:19 MESZ]
...
17:43:29 INFO  loader          :: Add: 5.000.000.000 latest-all.nt (Batch: 11.246 / Avg: 26.765)
17:43:29 INFO  loader          ::   Elapsed: 186.805,39 seconds [2020/08/03 17:43:29 MESZ]
...
13:12:48 INFO  loader          :: Add: 6.000.000.000 latest-all.nt (Batch: 7.488 / Avg: 23.349)
13:12:48 INFO  loader          ::   Elapsed: 256.964,05 seconds [2020/08/04 13:12:48 MESZ]
...
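
The log uses German number formatting (dots as thousands separators), which makes the lines awkward to process further. The following sketch shows one way to pull the latest progress out of `tdb2-err.log` and derive a naive estimate of the remaining time. The function name `progress` is hypothetical, the total of 13738317356 is the line count from the `wc -l` run above, and the estimate simply assumes that the current average rate holds, which the declining "Avg" values in the log suggest is optimistic.

```shell
#!/bin/bash
# Sketch only: progress is a hypothetical helper for the loader log format shown above.

# progress LOGFILE TOTAL: print triples loaded so far, the average rate and a naive ETA
progress() {
  LC_ALL=C awk -v total="$2" '
    / Add: / {
      # remember the values of the most recent "Add:" line
      for (i = 1; i < NF; i++) {
        if ($i == "Add:") added = $(i + 1)
        if ($i == "Avg:") avg = $(i + 1)
      }
    }
    END {
      gsub(/[^0-9]/, "", added)   # drop the thousands separators
      gsub(/[^0-9]/, "", avg)     # drop separators and the closing parenthesis
      if (avg + 0 > 0)
        printf "%s triples loaded, avg %s/s, about %.1f h remaining at that rate\n",
               added, avg, (total - added) / avg / 3600
    }' "$1"
}

if [ -f tdb2-err.log ]; then
  progress tdb2-err.log 13738317356
fi
```

The counts are printed as strings rather than with `%d` so that values beyond the 32-bit range are not truncated by awk implementations with limited integer formatting.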

Scripts

wikidata2jena

#!/bin/bash
# WF 2020-05-10

# global settings
jena=apache-jena-3.16.0
tgz=$jena.tar.gz
jenaurl=http://mirror.easyname.ch/apache/jena/binaries/$tgz
base=/Volumes/Torterra/wikidata2020-08-01
data=$base/data
tdbloader=$jena/bin/tdb2.tdbloader

getjena() {
  # download
  if [ ! -f "$tgz" ]
  then
    echo "downloading $tgz from $jenaurl"
    wget "$jenaurl"
  else
    echo "$tgz already downloaded"
  fi
  # unpack
  if [ ! -d "$jena" ]
  then
    echo "unpacking $jena from $tgz"
    tar xvzf "$tgz"
  else
    echo "$jena already unpacked"
  fi
  # create data directory
  if [ ! -d "$data" ]
  then
    echo "creating $data directory"
    mkdir -p "$data"
  else
    echo "$data directory already created"
  fi
}

#
# show the given message with the current UTC timestamp
#
timestamp() {
  local msg="$1"
  local ts=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
  echo "$msg at $ts"
}

#
# load data for the given data dir and input
#
loaddata() {
  local data="$1"
  local input="$2"
  timestamp "start loading $input to $data"
  $tdbloader --loader=parallel --loc "$data" "$input" > tdb2-out.log 2> tdb2-err.log
  timestamp "finished loading $input to $data"
}

getjena
export TMPDIR=$base/tmp
if [ ! -d "$TMPDIR" ]
then
  echo "creating temporary directory $TMPDIR"
  mkdir -p "$TMPDIR"
else
  echo "using temporary directory $TMPDIR"
fi
loaddata "$data" latest-all.nt