WikiData Import 2020-07-30

From BITPlan Wiki
Revision as of 08:44, 11 August 2020 by Wf (talk | contribs) (→‎Summary)
Jump to navigation Jump to search

Environment

  1. Mac Pro Mid 2010
  2. 12 core 3.46 GHz
  3. 64 GB RAM
  4. macOS High Sierra 10.13.6
  5. Source Disk: 4 TB 7200 rpm hard disk WD Gold WDC WD4002FYYZ Blackmagic speed rating: 175 MB/s write 175 MB/s read
  6. Target Disk: 4 TB SSD Samsung 860 EVO Blackmagic speed rating: 257 MB/s write 270 MB/s read
  7. java -version
    openjdk version "11.0.5" 2019-10-15
    OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.5+10)
    OpenJDK 64-Bit Server VM AdoptOpenJDK (build 11.0.5+10, mixed mode)
    

Summary

  1. trying to replicate success story of https://issues.apache.org/jira/projects/JENA/issues/JENA-1909
  2. download of 110 GB took some 6 h 30 min
  3. unzipping to some 2160 GB took more than 1/2 day
  4. import of some 12 billion triples is in progress and at 11.125 billion triples after some 10 days

Download and unpack

This download was done with the "latest-all.nt" dataset.

wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.nt.bz2

--2020-07-30 06:40:18--  https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.nt.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 2620::861:1:208:80:154:7, 208.80.154.7
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|2620::861:1:208:80:154:7|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 118776910150 (111G) [application/octet-stream]
Saving to: ‘latest-all.nt.bz2’
latest-all.nt.bz2     0%[                    ] 635.10M  5.02MB/s    eta 6h 19m
...
latest-all.nt.bz2   100%[===================>] 110.62G  4.88MB/s    in 6h 31m  
2020-07-30 13:11:49 (4.82 MB/s) - ‘latest-all.nt.bz2’ saved [118776910150/118776910150]
bzip2 -dk latest-all.nt.bz2
ls -l latest-all.nt
-rw-------  1 wf  admin  1980899328 Jul 30 17:56 latest-all.nt
# failed to limited disk space
# retry next morning
nohup bzip2 -ckd latest-all.nt.bz2 > /Volumes/Torterra/wikidata2020-07-31/latest-all.nt&
...
ls -l latest-all.nt 
-rw-r--r--  1 wf  admin  2162713035569 Jul 31 22:24 latest-all.nt

bzip2 issue

the first bzip failed - see retry for how bzip is used to extract from rotating disk to SSD

bzip2: I/O or other error, bailing out.  Possible reason follows.
bzip2: No space left on device
	Input file = latest-all.nt.bz2, output file = latest-all.nt
bzip2: Deleting output file latest-all.nt, if it exists.

Counting an copying

date;cp -p /Volumes/Torterra/wikidata2020-07-31/latest-all.nt .;date
Sat Aug  1 07:42:32 CEST 2020
Sat Aug  1 12:21:06 CEST 2020
date;wc -l latest-all.nt;date
Sat Aug  1 07:42:43 CEST 2020
 13738317356 latest-all.nt
Sat Aug  1 10:52:57 CEST 2020

Start and progress

nohup ./wikidata2jena&
tail -f tdb2-err.log 
13:54:03 INFO  loader          :: Add: 25.000.000 latest-all.nt (Batch: 103.369 / Avg: 104.489)
13:54:03 INFO  loader          ::   Elapsed: 239,26 seconds [2020/08/01 13:54:03 MESZ]
...
19:27:43 INFO  loader          :: Add: 1.000.000.000 latest-all.nt (Batch: 17.564 / Avg: 49.359)
19:27:43 INFO  loader          ::   Elapsed: 20.259,39 seconds [2020/08/01 19:27:43 MESZ]
...
06:53:39 INFO  loader          :: Add: 2.000.000.000 latest-all.nt (Batch: 27.135 / Avg: 32.564)
06:53:39 INFO  loader          ::   Elapsed: 61.415,66 seconds [2020/08/02 06:53:39 MESZ]
...
16:46:07 INFO  loader          :: Add: 3.000.000.000 latest-all.nt (Batch: 40.943 / Avg: 30.939)
16:46:07 INFO  loader          ::   Elapsed: 96.963,08 seconds [2020/08/02 16:46:07 MESZ]
...
02:47:19 INFO  loader          :: Add: 4.000.000.000 latest-all.nt (Batch: 18.551 / Avg: 30.067)
02:47:19 INFO  loader          ::   Elapsed: 133.034,73 seconds [2020/08/03 02:47:19 MESZ]
...
17:43:29 INFO  loader          :: Add: 5.000.000.000 latest-all.nt (Batch: 11.246 / Avg: 26.765)
17:43:29 INFO  loader          ::   Elapsed: 186.805,39 seconds [2020/08/03 17:43:29 MESZ]
...
13:12:48 INFO  loader          :: Add: 6.000.000.000 latest-all.nt (Batch: 7.488 / Avg: 23.349)
13:12:48 INFO  loader          ::   Elapsed: 256.964,05 seconds [2020/08/04 13:12:48 MESZ]
...
11:41:39 INFO  loader          :: Add: 7.000.000.000 latest-all.nt (Batch: 11.798 / Avg: 20.716)
11:41:39 INFO  loader          ::   Elapsed: 337.895,22 seconds [2020/08/05 11:41:39 MESZ]
...
21:02:39 INFO  loader          :: Add: 8.000.000.000 latest-all.nt (Batch: 7.732 / Avg: 17.468)
21:02:39 INFO  loader          ::   Elapsed: 457.955,13 seconds [2020/08/06 21:02:39 MESZ]
...
21:04:16 INFO  loader          :: Add: 9.000.000.000 latest-all.nt (Batch: 9.469 / Avg: 16.530)
21:04:16 INFO  loader          ::   Elapsed: 544.452,69 seconds [2020/08/07 21:04:16 MESZ]
...
14:48:25 INFO  loader          :: Add: 10.000.000.000 latest-all.nt (Batch: 14.865 / Avg: 14.394)
14:48:25 INFO  loader          ::   Elapsed: 694.700,94 seconds [2020/08/09 14:48:25 MESZ]
...
02:34:34 INFO  loader          :: Add: 11.000.000.000 latest-all.nt (Batch: 6.112 / Avg: 13.358)
02:34:37 INFO  loader          ::   Elapsed: 823.470,00 seconds [2020/08/11 02:34:34 MESZ]
...
09:37:45 INFO  loader          :: Add: 11.125.000.000 latest-all.nt (Batch: 4.353 / Avg: 13.105)
09:37:45 INFO  loader          ::   Elapsed: 848.860,88 seconds [2020/08/11 09:37:45 MESZ]

Scripts

wikidata2jena

#!/bin/bash
# WF 2020-05-10

# global settings
jena=apache-jena-3.16.0
tgz=$jena.tar.gz
jenaurl=http://mirror.easyname.ch/apache/jena/binaries/$tgz
base=/Volumes/Torterra/wikidata2020-08-01
data=$base/data
tdbloader=$jena/bin/tdb2.tdbloader

getjena() {
# download
if [ ! -f $tgz ]
then
  echo "downloading $tgz from $jenaurl"
	wget $jenaurl
else
  echo "$tgz already downloaded"
fi
# unpack
if [ ! -d $jena ]
then
  echo "unpacking $jena from $tgz"
	tar xvzf $tgz
else
  echo "$jena already unpacked"
fi
# create data directory
if [ ! -d $data ]
then
  echo "creating $data directory"
  mkdir -p $data
else
  echo "$data directory already created"
fi
}

#
# show the given timestamp
#
timestamp() {
 local msg="$1"
 local ts=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
 echo "$msg at $ts"
}

#
# load data for the given data dir and input
#
loaddata() {
	local data="$1"
	local input="$2"
  timestamp "start loading $input to $data"
  $tdbloader --loader=parallel --loc "$data" "$input" > tdb2-out.log 2> tdb2-err.log
	timestamp "finished loading $input to $data"
}

getjena
export TMPDIR=$base/tmp
if [ ! -d $TMPDIR ]
then
  echo "creating temporary directory $TMPDIR"
  mkdir $TMPDIR
else
  echo "using temporary directory $TMPDIR"
fi
loaddata $data latest-all.nt