WikiData Import 2020-07-30

From BITPlan Wiki
Revision as of 13:42, 1 August 2020 by Wf (talk | contribs)

Download and unpack

This import uses the Wikidata "latest-all.nt" (N-Triples) dump, downloaded as a bzip2 archive.

wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.nt.bz2

--2020-07-30 06:40:18--  https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.nt.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 2620::861:1:208:80:154:7, 208.80.154.7
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|2620::861:1:208:80:154:7|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 118776910150 (111G) [application/octet-stream]
Saving to: ‘latest-all.nt.bz2’
latest-all.nt.bz2     0%[                    ] 635.10M  5.02MB/s    eta 6h 19m
...
latest-all.nt.bz2   100%[===================>] 110.62G  4.88MB/s    in 6h 31m  
2020-07-30 13:11:49 (4.82 MB/s) - ‘latest-all.nt.bz2’ saved [118776910150/118776910150]
bzip2 -dk latest-all.nt.bz2
ls -l latest-all.nt
-rw-------  1 wf  admin  1980899328 Jul 30 17:56 latest-all.nt
# failed due to limited disk space
# retry next morning
nohup bzip2 -ckd latest-all.nt.bz2 > /Volumes/Torterra/wikidata2020-07-31/latest-all.nt&
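The retry streams the decompressed data to a different volume instead of writing next to the archive. A minimal sketch of that pattern follows; the helper name `extract_to` is hypothetical and not part of the original notes. It relies only on the standard bzip2 flags `-c` (write to stdout), `-k` (keep the input) and `-d` (decompress).

```shell
# Hypothetical helper: stream-decompress ARCHIVE into DEST_DIR, which may
# live on another volume. The archive itself is left untouched.
extract_to() {
  archive=$1
  dest_dir=$2
  # Output file name is the archive name without its .bz2 suffix.
  out="$dest_dir/$(basename "${archive%.bz2}")"
  bzip2 -ckd "$archive" > "$out" && printf '%s\n' "$out"
}

# Example call, using the paths from the transcript above:
# nohup sh -c '. ./extract.sh; extract_to latest-all.nt.bz2 /Volumes/Torterra/wikidata2020-07-31' &
```

The commented call assumes the function has been saved in a file named `extract.sh`, which is also a hypothetical name.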

bzip2 issue

The first bzip2 run failed with the error below. The retry command shows how bzip2 is used to extract from the rotating disk to the SSD.

bzip2: I/O or other error, bailing out.  Possible reason follows.
bzip2: No space left on device
	Input file = latest-all.nt.bz2, output file = latest-all.nt
bzip2: Deleting output file latest-all.nt, if it exists.
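A "No space left on device" failure after hours of decompression can be caught earlier by checking the free space on the target volume first. The sketch below is one possible check, not what was done in the original run. The function name `has_space` and the threshold are hypothetical, and the required size must be estimated by the reader because the uncompressed size is not known in advance.

```shell
# Hypothetical helper: succeed if the filesystem holding DIR has at least
# NEEDED_KB kilobytes available.
has_space() {
  dir=$1
  needed_kb=$2
  # df -P prints one line per filesystem in the portable format;
  # column 4 is the available space, here in 1024-byte units (-k).
  # Assumes the filesystem name in column 1 contains no spaces.
  avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
  [ "$avail_kb" -ge "$needed_kb" ]
}

# Harmless demonstration against the current directory.
if has_space . 1; then
  echo "enough space"
else
  echo "not enough space"
fi
```

Before a real run, the call would name the target volume and an estimated size in kilobytes, for example `has_space /Volumes/Torterra "$estimated_kb"`, where `estimated_kb` is a placeholder.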

Counting and copying

date;cp -p /Volumes/Torterra/wikidata2020-07-31/latest-all.nt .;date
Sat Aug  1 07:42:32 CEST 2020
Sat Aug  1 12:21:06 CEST 2020
date;wc -l latest-all.nt;date
Sat Aug  1 07:42:43 CEST 2020
 13738317356 latest-all.nt
Sat Aug  1 10:52:57 CEST 2020
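The `date` stamps above can be turned into elapsed times and a rough line-counting rate. The following sketch does the arithmetic in plain POSIX shell; the function names `to_seconds` and `elapsed` are illustrative, and it assumes both timestamps fall on the same day, as they do here.

```shell
# Convert HH:MM:SS to seconds since midnight.
to_seconds() {
  h=${1%%:*}; rest=${1#*:}; m=${rest%%:*}; s=${rest#*:}
  # Strip one leading zero so "08" and "09" are not parsed as invalid octal.
  echo $(( ${h#0} * 3600 + ${m#0} * 60 + ${s#0} ))
}

# Elapsed seconds between two same-day timestamps (start, end).
elapsed() {
  echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}

cp_secs=$(elapsed 07:42:32 12:21:06)
wc_secs=$(elapsed 07:42:43 10:52:57)

echo "cp: ${cp_secs} s"
echo "wc -l: ${wc_secs} s, about $(( 13738317356 / wc_secs )) lines/s"
# cp: 16714 s
# wc -l: 11414 s, about 1203637 lines/s
```

So the copy took about 4 hours 39 minutes and the line count about 3 hours 10 minutes, or roughly 1.2 million lines per second.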