Get your own copy of WikiData/2023
First Attempt 2018-01
The start of this attempt was on 2018-01-05. I tried to follow the standard Wikidata Query Service import procedure (see the links at the end of this page):
~/wikidata/wikidata-query-rdf/dist/target/service-0.3.0-SNAPSHOT$ nohup ./munge.sh -f data/latest-all.ttl.gz -d data/split -l en,de &
#logback.classic pattern: %d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n
08:23:02.391 [main] INFO org.wikidata.query.rdf.tool.Munge - Switching to data/split/wikidump-000000001.ttl.gz
08:24:21.249 [main] INFO org.wikidata.query.rdf.tool.Munge - Processed 10000 entities at (105, 47, 33)
08:25:07.369 [main] INFO org.wikidata.query.rdf.tool.Munge - Processed 20000 entities at (162, 70, 41)
08:25:56.862 [main] INFO org.wikidata.query.rdf.tool.Munge - Processed 30000 entities at (186, 91, 50)
08:26:43.594 [main] INFO org.wikidata.query.rdf.tool.Munge - Processed 40000 entities at (203, 109, 59)
08:27:24.042 [main] INFO org.wikidata.query.rdf.tool.Munge - Processed 50000 entities at (224, 126, 67)
...
The munge run then aborted with:
java.nio.file.NoSuchFileException: ./mwservices.json
Import issues
- https://phabricator.wikimedia.org/T164773
- https://phabricator.wikimedia.org/p/Yurik/
- https://www.mediawiki.org/wiki/User:AKlapper_(WMF)
Queries after import
Number of Triples
SELECT (COUNT(*) AS ?Triples) WHERE { ?s ?p ?o }
Result on the local copy:
Triples: 3.019.914.549
For comparison, the same query on the original WikiData Query Service yields:
Triples: 10.949.664.801
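For reference, such a count query can also be sent to the local Blazegraph instance from the shell. The following is a minimal sketch, assuming the wikidata-query-rdf service is running with its default endpoint at http://localhost:9999/bigdata/namespace/wdq/sparql:
# the endpoint below is the WDQS default - adjust it if your local service differs
curl -G 'http://localhost:9999/bigdata/namespace/wdq/sparql' \
  --header 'Accept: application/sparql-results+json' \
  --data-urlencode 'query=SELECT (COUNT(*) AS ?Triples) WHERE { ?s ?p ?o }'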
TypeCount
SELECT ?type (COUNT(?type) AS ?typecount)
WHERE {
  ?subject a ?type.
}
GROUP BY ?type
ORDER BY DESC(?typecount)
LIMIT 7
type                                                  typecount
<http://wikiba.se/ontology#BestRank>                  369637917
schema:Article                                        61229687
<http://wikiba.se/ontology#GlobecoordinateValue>      5379022
<http://wikiba.se/ontology#QuantityValue>             697187
<http://wikiba.se/ontology#TimeValue>                 234556
<http://wikiba.se/ontology#GeoAutoPrecision>          101897
<http://www.wikidata.org/prop/novalue/P17>            37884
Second Attempt 2020-05
Test Environment
- Mac Pro Mid 2010
- 12 core 3.46 GHz
- 64 GB RAM
- macOS High Sierra 10.13.6
- 2 Terabyte 5400 rpm hard disk (Blackmagic speed rating: 130 MB/s write, 140 MB/s read)
Download and unpack
Sizes:
- download: 67 G
- unpacked: 552 G
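Together the compressed and the unpacked file occupy some 620 GB, and the TDB store created later needs additional space, so it is worth checking the available disk space on the target volume first:
# check the free space on the current volume
df -h .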
wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2
--2020-05-09 17:18:53-- https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 2620::861:1:208:80:154:7, 208.80.154.7
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|2620::861:1:208:80:154:7|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 71897810492 (67G) [application/octet-stream]
Saving to: ‘latest-all.ttl.bz2’
latest-all.ttl.bz2 0%[ ] 147.79M 4.82MB/s eta 3h 56m
...
latest-all.ttl.bz2 100%[===================>] 66.96G 4.99MB/s in 4h 0m
2020-05-09 21:19:25 (4.75 MB/s) - ‘latest-all.ttl.bz2’ saved [71897810492/71897810492]
bzip2 -dk latest-all.ttl.bz2
ls -l
-rw-r--r-- 1 wf admin 592585505631 May 7 08:00 latest-all.ttl
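bzip2 decompresses the 67 GB archive single-threaded, which is slow. On a multi-core machine a parallel decompressor may speed this up considerably; a sketch, assuming lbzip2 has been installed (e.g. via Homebrew):
# lbzip2 decompresses in parallel; -d decompress, -k keep the input file
lbzip2 -dk latest-all.ttl.bz2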
Test counting lines
Simply counting the 15.728.395.994 lines of the turtle file latest-all.ttl, which should roughly correspond to the number of triples in that file, takes around one hour in the test environment.
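The measurement below was produced with a simple timing wrapper around wc, along the lines of:
# print a timestamp, count the lines, print another timestamp
date; wc -l latest-all.ttl; date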
Sun May 10 07:13:45 CEST 2020
15728395994 latest-all.ttl
Sun May 10 08:12:50 CEST 2020
Test with Apache Jena
See https://muncca.com/2019/02/14/wikidata-import-in-apache-jena/ for background. After doing the manual download I decided to create the wikidata2jena script below. With that script the command
nohup ./wikidata2jena &
will start the processing of the latest-all.ttl file in the background. You might want to make sure that your computer does not go to sleep while the script is running; I am using the Amphetamine macOS app for this.
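A command line alternative (a minimal sketch, assuming a recent macOS) is the built-in caffeinate utility, which prevents idle sleep for as long as the wrapped command runs:
# caffeinate -i keeps the machine awake while the script runs
nohup caffeinate -i ./wikidata2jena &
With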
tail nohup.out
apache-jena-3.14.0.tar.gz already downloaded
apache-jena-3.14.0 already unpacked
creating data directory
creating temporary directory /Volumes/Tattu/wikidata/tmp
started load phase data at 2020-05-10T08:57:35Z
you can watch the progress of the phases; I assume the data phase will take some 1.2 days to finish.
To see more progress details you might want to call:
tail -f tdb-data-err.log
INFO Elapsed: 352,44 seconds [2020/05/10 11:03:30 MESZ]
INFO Add: 37.050.000 Data (Batch: 107.991 / Avg: 104.985)
...
INFO Elapsed: 4.464,92 seconds [2020/05/10 12:12:03 MESZ]
INFO Add: 545.550.000 Data (Batch: 120.481 / Avg: 122.174)
...
INFO Elapsed: 8.611,05 seconds [2020/05/10 13:21:09 MESZ]
INFO Add: 1.026.050.000 Data (Batch: 128.205 / Avg: 119.149)
...
INFO Elapsed: 30.653,35 seconds [2020/05/10 19:28:31 MESZ]
INFO Add: 3.430.050.000 Data (Batch: 105.042 / Avg: 111.896)
Manual download
wget -c http://mirror.easyname.ch/apache/jena/binaries/apache-jena-3.14.0.tar.gz
--2020-05-10 10:26:38-- http://mirror.easyname.ch/apache/jena/binaries/apache-jena-3.14.0.tar.gz
Resolving mirror.easyname.ch (mirror.easyname.ch)... 77.244.244.134
Connecting to mirror.easyname.ch (mirror.easyname.ch)|77.244.244.134|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20174193 (19M) [application/x-gzip]
Saving to: ‘apache-jena-3.14.0.tar.gz’
apache-jena-3.14.0. 100%[===================>] 19.24M 2.58MB/s in 7.4s
2020-05-10 10:26:45 (2.58 MB/s) - ‘apache-jena-3.14.0.tar.gz’ saved [20174193/20174193]
tar -xvzf apache-jena-3.14.0.tar.gz
wikidata2jena Script
#!/bin/bash
# WF 2020-05-10
# global settings
jena=apache-jena-3.14.0
tgz=$jena.tar.gz
jenaurl=http://mirror.easyname.ch/apache/jena/binaries/$tgz
data=data
tdbloader=$jena/bin/tdbloader2
#
# download and unpack Apache Jena and create the data directory
#
getjena() {
  # download the Jena release if it is not there yet
  if [ ! -f $tgz ]
  then
    echo "downloading $tgz from $jenaurl"
    wget $jenaurl
  else
    echo "$tgz already downloaded"
  fi
  # unpack the release if it is not unpacked yet
  if [ ! -d $jena ]
  then
    echo "unpacking $jena from $tgz"
    tar xvzf $tgz
  else
    echo "$jena already unpacked"
  fi
  # create the data directory for the TDB store
  if [ ! -d $data ]
  then
    echo "creating $data directory"
    mkdir $data
  else
    echo "$data directory already created"
  fi
}
#
# show the given message with the current UTC timestamp
#
timestamp() {
  local msg="$1"
  local ts=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
  echo "$msg at $ts"
}
#
# load data for the given phase
#
loaddata4phase() {
  local phase="$1"
  local data="$2"
  local input="$3"
  timestamp "started load phase $phase"
  # the index phase needs no input file - avoid passing an empty argument to tdbloader2
  if [ -n "$input" ]
  then
    $tdbloader --phase $phase --loc "$data" "$input" > tdb-$phase-out.log 2> tdb-$phase-err.log
  else
    $tdbloader --phase $phase --loc "$data" > tdb-$phase-out.log 2> tdb-$phase-err.log
  fi
  timestamp "finished load phase $phase"
}
#
# load data for the given data dir and input
#
loaddata() {
  local data="$1"
  local input="$2"
  loaddata4phase data "$data" "$input"
  loaddata4phase index "$data" ""
}
#
# main
#
getjena
wd=$(pwd)
# put the temporary directory on the work volume - tdbloader2 needs a lot of temporary space
export TMPDIR=$wd/tmp
if [ ! -d $TMPDIR ]
then
  echo "creating temporary directory $TMPDIR"
  mkdir $TMPDIR
else
  echo "using temporary directory $TMPDIR"
fi
loaddata $data latest-all.ttl
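Once both phases have finished, the resulting TDB store can be queried directly with Jena's tdbquery tool, e.g. to repeat the triple count from the first attempt. A sketch; the full scan will take a long time on a store of this size:
# count all triples in the TDB store created in the data directory
apache-jena-3.14.0/bin/tdbquery --loc data "SELECT (COUNT(*) AS ?Triples) WHERE { ?s ?p ?o }"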
Links
- https://www.wikidata.org/wiki/Wikidata:Database_download#RDF_dumps
- https://stackoverflow.com/questions/48020506/wikidata-on-local-blazegraph-expected-an-rdf-value-here-found-line-1/48110100
- https://stackoverflow.com/questions/56768463/wikidata-import-into-virtuoso
- https://muncca.com/2019/02/14/wikidata-import-in-apache-jena/