Get your own copy of WikiData/2023

From BITPlan Wiki
Jump to navigation Jump to search

First Attempt 2018-01

The start of this attempt was on 2018-01-05. I tried to follow the procedure at:

~/wikidata/wikidata-query-rdf/dist/target/service-0.3.0-SNAPSHOT$nohup ./munge.sh -f data/latest-all.ttl.gz -d data/split -l en,de &
#logback.classic pattern: %d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n
08:23:02.391 [main] INFO  org.wikidata.query.rdf.tool.Munge - Switching to data/split/wikidump-000000001.ttl.gz
08:24:21.249 [main] INFO  org.wikidata.query.rdf.tool.Munge - Processed 10000 entities at (105, 47, 33)
08:25:07.369 [main] INFO  org.wikidata.query.rdf.tool.Munge - Processed 20000 entities at (162, 70, 41)
08:25:56.862 [main] INFO  org.wikidata.query.rdf.tool.Munge - Processed 30000 entities at (186, 91, 50)
08:26:43.594 [main] INFO  org.wikidata.query.rdf.tool.Munge - Processed 40000 entities at (203, 109, 59)
08:27:24.042 [main] INFO  org.wikidata.query.rdf.tool.Munge - Processed 50000 entities at (224, 126, 67)
...
java.nio.file.NoSuchFileException: ./mwservices.json

Import issues

Queries after import

Number of Triples

SELECT (COUNT(*) as ?Triples) WHERE { ?s ?p ?o}
Triples
3.019.914.549

try it on original WikiData Query Service!

Triples
10.949.664.801

TypeCount

SELECT ?type (COUNT(?type) AS ?typecount)
WHERE {
  ?subject a ?type.
}
GROUP by ?type
ORDER by desc(?typecount)
LIMIT 7
<http://wikiba.se/ontology#BestRank>	369637917
schema:Article	61229687
<http://wikiba.se/ontology#GlobecoordinateValue>	5379022
<http://wikiba.se/ontology#QuantityValue>	697187
<http://wikiba.se/ontology#TimeValue>	234556
<http://wikiba.se/ontology#GeoAutoPrecision>	101897
<http://www.wikidata.org/prop/novalue/P17>	37884

Second Attempt 2020-05

Test Environment

  1. Mac Pro Mid 2010
  2. 12 core 3.46 GHz
  3. 64 GB RAM
  4. macOS High Sierra 10.13.6
  5. 2 TerraByte 5400 rpm hard disk Blackmagic speed rating: 130 MB/s write 140 MB/s read

Download and unpack

Sizes:

  • download: 67 G
  • unpacked: 552 G
wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2
--2020-05-09 17:18:53--  https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 2620::861:1:208:80:154:7, 208.80.154.7
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|2620::861:1:208:80:154:7|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 71897810492 (67G) [application/octet-stream]
Saving to: ‘latest-all.ttl.bz2’

latest-all.ttl.bz2    0%[                    ] 147.79M  4.82MB/s    eta 3h 56m 
...
latest-all.ttl.bz2  100%[===================>]  66.96G  4.99MB/s    in 4h 0m   

2020-05-09 21:19:25 (4.75 MB/s) - ‘latest-all.ttl.bz2’ saved [71897810492/71897810492]

bzip2 -dk latest-all.ttl.bz2
ls -l
-rw-r--r--  1 wf  admin  592585505631 May  7 08:00 latest-all.ttl

Test counting lines

Simply counting the 15.728.395.994 lines of latest-all.ttl the turtle file which should roughly give the number of triples in that file takes around one hour in the test environment.

Sun May 10 07:13:45 CEST 2020
 15728395994 latest-all.ttl
Sun May 10 08:12:50 CEST 2020

Test with Apache Jena

see https://muncca.com/2019/02/14/wikidata-import-in-apache-jena/

wget -c http://mirror.easyname.ch/apache/jena/binaries/apache-jena-3.14.0.tar.gz
--2020-05-10 10:26:38--  http://mirror.easyname.ch/apache/jena/binaries/apache-jena-3.14.0.tar.gz
Resolving mirror.easyname.ch (mirror.easyname.ch)... 77.244.244.134
Connecting to mirror.easyname.ch (mirror.easyname.ch)|77.244.244.134|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20174193 (19M) [application/x-gzip]
Saving to: ‘apache-jena-3.14.0.tar.gz’

apache-jena-3.14.0. 100%[===================>]  19.24M  2.58MB/s    in 7.4s    

2020-05-10 10:26:45 (2.58 MB/s) - ‘apache-jena-3.14.0.tar.gz’ saved [20174193/20174193]

Links