Difference between revisions of "Get your own copy of WikiData"

From BITPlan Wiki

Revision as of 09:50, 27 June 2020

Why would you want your own WikiData copy?

The resources behind https://query.wikidata.org/ are scarce and shared by a lot of people. You might hit the query limits (https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/query_limits) quite quickly.

See SPARQL for some examples that work online (mostly) without hitting these limits.

Prerequisites

Getting a copy of WikiData is not for the faint of heart.

You need quite a bit of patience and some hardware resources to get your own WikiData copy working. The resources you need are a moving target since WikiData is growing all the time. On this page you'll see the documentation for two attempts from

  • 2018
  • 2020

The successful 2018 attempt was done with a cheap used server from 2009 that cost 50 EUR. The server was on sale in Mönchengladbach via eBay. It originally had 32 GByte of RAM; we increased this to 64 GByte by buying a second server and adding its RAM. In 2018 a 512 GByte SSD was sufficient to speed up the import process from some 14 days to 3.8 days. Specs of the server:

ASUS KFSN5-D/IST mainboard
Brand	Quad-Core AMD Opteron(tm) Processor 2374 H
Speed	2.20GHz	NB SPEED 2.00GHz

The disadvantage of the server is that running it 24h / 365 days is more costly than the server itself. It has a power consumption of some 3 kWh per day which would cost more than 300 EUR per year to run. We decided to only switch it on when needed.
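
The yearly cost figure can be checked with a small calculation. This is only a sketch: the 3 kWh per day come from the text above, while the price of 0.30 EUR per kWh is an assumption and the function name annual_power_cost is made up for this example.

```shell
#!/bin/bash
# Sketch: yearly electricity cost of a server that runs around the clock.
# The consumption per day is taken from the text; the price per kWh is an assumption.
annual_power_cost() {
  local kwh_per_day="$1"   # energy use per day in kWh
  local eur_per_kwh="$2"   # assumed electricity price in EUR per kWh
  LC_ALL=C awk -v k="$kwh_per_day" -v p="$eur_per_kwh" \
    'BEGIN { printf "%.2f\n", k * 365 * p }'
}

# 3 kWh per day at an assumed 0.30 EUR per kWh
annual_power_cost 3 0.30
```

With these inputs the result is a little above 300 EUR per year, in line with the estimate above; a different electricity price changes the result proportionally.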

First Attempt 2018-01

The start of this attempt was on 2018-01-05. I tried to follow the procedure at:

~/wikidata/wikidata-query-rdf/dist/target/service-0.3.0-SNAPSHOT$nohup ./munge.sh -f data/latest-all.ttl.gz -d data/split -l en,de &
#logback.classic pattern: %d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n
08:23:02.391 [main] INFO  org.wikidata.query.rdf.tool.Munge - Switching to data/split/wikidump-000000001.ttl.gz
08:24:21.249 [main] INFO  org.wikidata.query.rdf.tool.Munge - Processed 10000 entities at (105, 47, 33)
08:25:07.369 [main] INFO  org.wikidata.query.rdf.tool.Munge - Processed 20000 entities at (162, 70, 41)
08:25:56.862 [main] INFO  org.wikidata.query.rdf.tool.Munge - Processed 30000 entities at (186, 91, 50)
08:26:43.594 [main] INFO  org.wikidata.query.rdf.tool.Munge - Processed 40000 entities at (203, 109, 59)
08:27:24.042 [main] INFO  org.wikidata.query.rdf.tool.Munge - Processed 50000 entities at (224, 126, 67)
...
java.nio.file.NoSuchFileException: ./mwservices.json

Import issues

Success

With a 512 GByte SSD and careful monitoring of the progress, the import succeeded after some 3.8 days.

Queries after import

Number of Triples

SELECT (COUNT(*) as ?Triples) WHERE { ?s ?p ?o}
Triples
3.019.914.549

The same query on the original WikiData Query Service returns:

Triples
10.949.664.801

TypeCount

SELECT ?type (COUNT(?type) AS ?typecount)
WHERE {
  ?subject a ?type.
}
GROUP by ?type
ORDER by desc(?typecount)
LIMIT 7
<http://wikiba.se/ontology#BestRank>	369637917
schema:Article	61229687
<http://wikiba.se/ontology#GlobecoordinateValue>	5379022
<http://wikiba.se/ontology#QuantityValue>	697187
<http://wikiba.se/ontology#TimeValue>	234556
<http://wikiba.se/ontology#GeoAutoPrecision>	101897
<http://www.wikidata.org/prop/novalue/P17>	37884

Second Attempt 2020-05

The second attempt started in 2020-05. As of 2020-06-09 it had not been successful yet: three import attempts with Apache Jena had failed so far. As of 2020-06-09 another attempt with Blazegraph is being prepared.

Test Environment

  1. Mac Pro Mid 2010
  2. 12 core 3.46 GHz
  3. 64 GB RAM
  4. macOS High Sierra 10.13.6
  5. 2 Terabyte 5400 rpm hard disk Seagate Barracuda ST2000DM001, Blackmagic speed rating: 130 MB/s write, 140 MB/s read
  6. 4 Terabyte 7200 rpm hard disk WD Gold WDC WD4002FYYZ, Blackmagic speed rating: 175 MB/s write, 175 MB/s read
  7. java -version 
    openjdk version "1.8.0_232"
    OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_232-b09)
    OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.232-b09, mixed mode)
    

Download and unpack

wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2

latest-all.json.bz2 (for blazegraph)

Sizes: latest-all.json.bz2 03-Jun-2020 07:19 57017252630

  • download: 53 G
  • unpacked: n/a
wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2
--2020-06-09 06:40:09--  https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 2620::861:1:208:80:154:7, 208.80.154.7
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|2620::861:1:208:80:154:7|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 57017252630 (53G) [application/octet-stream]
Saving to: ‘latest-all.json.bz2’

latest-all.json.bz2   1%[                    ] 677.93M  4.84MB/s    eta 3h 5m

latest-all.ttl.bz2 (for Apache Jena)

Sizes:

  • download: 67 G
  • unpacked: 552 G
wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2
--2020-05-09 17:18:53--  https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 2620::861:1:208:80:154:7, 208.80.154.7
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|2620::861:1:208:80:154:7|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 71897810492 (67G) [application/octet-stream]
Saving to: ‘latest-all.ttl.bz2’

latest-all.ttl.bz2    0%[                    ] 147.79M  4.82MB/s    eta 3h 56m 
...
latest-all.ttl.bz2  100%[===================>]  66.96G  4.99MB/s    in 4h 0m   

2020-05-09 21:19:25 (4.75 MB/s) - ‘latest-all.ttl.bz2’ saved [71897810492/71897810492]

bzip2 -dk latest-all.ttl.bz2
ls -l
-rw-r--r--  1 wf  admin  592585505631 May  7 08:00 latest-all.ttl
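
The sizes given as "G" above appear to be binary gigabytes (GiB), which can be checked against the byte counts from wget and ls. A minimal sketch; the helper name bytes_to_gib is made up for this example.

```shell
#!/bin/bash
# Sketch: convert a size in bytes to GiB (1 GiB = 1024^3 bytes).
bytes_to_gib() {
  local bytes="$1"
  LC_ALL=C awk -v b="$bytes" 'BEGIN { printf "%.1f\n", b / (1024 * 1024 * 1024) }'
}

bytes_to_gib 71897810492     # size of latest-all.ttl.bz2 from the wget output
bytes_to_gib 592585505631    # size of latest-all.ttl from the ls output
```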

Test counting lines

Simply counting the 15.728.395.994 lines of the Turtle file latest-all.ttl, which should roughly give the number of triples in that file, takes around one hour in the test environment.

Sun May 10 07:13:45 CEST 2020
 15728395994 latest-all.ttl
Sun May 10 08:12:50 CEST 2020
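
A timed line count like the one above could be scripted as follows. This is a sketch, not the command that was actually used; the function name count_lines_timed is made up for this example.

```shell
#!/bin/bash
# Sketch: count the lines of a file and report how long the count took.
count_lines_timed() {
  local file="$1"
  local start end lines
  start=$(date +%s)
  # reading via stdin makes wc print only the number, without the file name
  lines=$(wc -l < "$file")
  end=$(date +%s)
  # strip the padding that some wc implementations add
  lines=$(echo "$lines" | tr -d ' ')
  echo "$lines lines in $((end - start)) seconds"
}

# usage, assuming latest-all.ttl is in the current directory:
# count_lines_timed latest-all.ttl
```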

Test with Apache Jena

See https://muncca.com/2019/02/14/wikidata-import-in-apache-jena/. After doing the manual download I decided to create the wikidata2jena script below. With that script the command

nohup ./wikidata2jena&

will start the processing of the latest-all.ttl file in the background. Make sure that your computer does not go to sleep while the script is running. I am using the Amphetamine macOS app for this. With

tail nohup.out 
apache-jena-3.14.0.tar.gz already downloaded
apache-jena-3.14.0 already unpacked
creating data directory
creating temporary directory /Volumes/Tattu/wikidata/tmp
started load phase data at 2020-05-11T16:45:11Z

you can watch the progress of the phases. I assume that the data phase will take some 2 days to finish.

To see more progress details you might want to call:

tail -f tdb-data-err.log

Following the hint from https://stackoverflow.com/questions/61813248/jena-tdbloader2-performance-and-limits I modified the script to use tdb2.tdbloader instead of tdbloader2 and started a second import. Since for the time being the disk and memory are shared by both import processes, the performance might be lower than what my hardware could deliver.

fourth attempt to load with tdb2.tdbloader

  1. changed hardware back to the Mac Pro
  2. trying to replicate the success story of https://issues.apache.org/jira/projects/JENA/issues/JENA-1909
  3. since the target is a moving (rotating) hard disk, the smaller input file latest-truthy.nt.bz2 is tried
wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-truthy.nt.bz2
--2020-06-12 17:19:43--  https://dumps.wikimedia.org/wikidatawiki/entities/latest-truthy.nt.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 2620::861:1:208:80:154:7, 208.80.154.7
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|2620::861:1:208:80:154:7|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25480099958 (24G) [application/octet-stream]
Saving to: ‘latest-truthy.nt.bz2’

latest-truthy.nt.bz   2%[                    ] 691.48M  4.80MB/s    eta 87m 52s
latest-truthy.nt.bz2        100%[===========================================>]  23.73G  4.63MB/s    in 89m 3s  

2020-06-12 19:33:52 (4.55 MB/s) - ‘latest-truthy.nt.bz2’ saved [25480099958/25480099958]
bzip2 -dk latest-truthy.nt.bz2 
ls -l latest-truthy.nt
-rw-r--r--  1 wf  admin  661598113609 Jun  6 20:11 latest-truthy.nt
sudo port install openjdk11
java -version
openjdk version "11.0.5" 2019-10-15
OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.5+10)
OpenJDK 64-Bit Server VM AdoptOpenJDK (build 11.0.5+10, mixed mode)
Start and progress
nohup ./wikidata2jena&
[1] 18313
appending output to nohup.out
tail -f tdb2-out.log 
08:11:53 INFO  loader     :: Loader = LoaderParallel
08:11:53 INFO  loader     :: Start: latest-truthy.nt
08:11:57 INFO  loader     :: Add: 500.000 latest-truthy.nt (Batch: 156.887 / Avg: 156.887)
08:11:59 INFO  loader     :: Add: 1.000.000 latest-truthy.nt (Batch: 247.035 / Avg: 191.901)
08:12:01 INFO  loader     :: Add: 1.500.000 latest-truthy.nt (Batch: 177.367 / Avg: 186.799)
08:12:09 INFO  loader     :: Add: 2.000.000 latest-truthy.nt (Batch: 68.212 / Avg: 130.208)
...
09:49:07 INFO  loader     :: Add: 500.000.000 latest-truthy.nt (Batch: 54.259 / Avg: 85.713)
09:49:07 INFO  loader     ::   Elapsed: 5.833,40 seconds [2020/06/13 09:49:07 MESZ]
...
00:17:13 INFO  loader     :: Add: 1.000.000.000 latest-truthy.nt (Batch: 4.020 / Avg: 17.265)
00:17:13 INFO  loader     ::   Elapsed: 57.919,50 seconds [2020/06/14 00:17:13 MESZ]
...
10:37:44 INFO  loader     :: Add: 1.500.000.000 latest-truthy.nt (Batch: 8.740 / Avg: 8.262)
10:37:44 INFO  loader     ::   Elapsed: 181.550,47 seconds [2020/06/15 10:37:44 MESZ]
...
08:08:47 INFO  loader     :: Add: 2.000.000.000 latest-truthy.nt (Batch: 1.604 / Avg: 3.859)
08:08:48 INFO  loader     ::   Elapsed: 518.214,03 seconds [2020/06/19 08:08:48 MESZ]
...
12:36:01 INFO  loader     :: Add: 2.250.000.000 latest-truthy.nt (Batch: 683 / Avg: 2.835)
12:36:01 INFO  loader     ::   Elapsed: 793.447,44 seconds [2020/06/22 12:36:01 MESZ]
...
03:56:21 INFO  loader     :: Add: 2.500.000.000 latest-truthy.nt (Batch: 1.030 / Avg: 2.093)
03:56:21 INFO  loader     ::   Elapsed: 1.194.267,75 seconds [2020/06/27 03:56:21 MESZ]
...
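
The progress lines above show how strongly the average rate drops over time. A rough estimate of the remaining time can be derived from the most recent "Add:" line. The following is only a sketch under assumptions: the function name eta_days is made up, the total number of triples is not part of the log and must be supplied by the caller (the 3000000000 below is a placeholder, not the real size of latest-truthy.nt), and the estimate assumes the current average rate stays constant, which the log suggests is optimistic.

```shell
#!/bin/bash
# Sketch: estimate the remaining load time in days from loader progress lines.
# Reads log lines on stdin; the expected total number of triples is an
# assumption that the caller has to supply, it is not part of the log.
eta_days() {
  local total_triples="$1"
  LC_ALL=C awk -v total="$total_triples" '
    /Add:/ {
      # remember count and average rate of the most recent progress line
      for (i = 1; i < NF; i++) {
        if ($i == "Add:") loaded = $(i + 1)
        if ($i == "Avg:") rate = $(i + 1)
      }
    }
    END {
      # "." and "," are thousands separators here, ")" closes the rate
      gsub(/[.,)]/, "", loaded)
      gsub(/[.,)]/, "", rate)
      if (rate + 0 > 0) printf "%.1f\n", (total - loaded) / rate / 86400
    }'
}

# demo with one progress line from the log above and a placeholder total
printf '%s\n' \
  '03:56:21 INFO  loader     :: Add: 2.500.000.000 latest-truthy.nt (Batch: 1.030 / Avg: 2.093)' \
  | eta_days 3000000000
```

For a running import the input would come from the log file instead, for example `tail -n 100 tdb2-out.log | eta_days <total>`.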

third attempt to load with tdb2.tdbloader

  1. changed hardware to the original Linux box from 2018
  2. use the same 4 Terabyte target hard disk
  3. source disk is now also a 4 Terabyte hard disk, Western Digital WD40EZRX Caviar Green
  4. updated Apache Jena version to 3.15
java -version
java version "1.8.0_201"
Java(TM) SE Runtime Environment (build 1.8.0_201-b09)
Java HotSpot(TM) 64-Bit Server VM (build 25.201-b09, mixed mode)
tail -f tdb2--out.log 
...
20:40:38 INFO  loader     :: Add: 10,000,000 latest-all.ttl (Batch: 68,813 / Avg: 68,559)
20:40:38 INFO  loader     ::   Elapsed: 145.86 seconds [2020/05/19 20:40:38 CEST]
...
20:49:43 INFO  loader     :: Add: 50,000,000 latest-all.ttl (Batch: 69,803 / Avg: 72,422)
20:49:43 INFO  loader     ::   Elapsed: 690.40 seconds [2020/05/19 20:49:43 CEST]
...
07:27:38 INFO  loader     :: Add: 555,000,000 latest-all.ttl (Batch: 7,331 / Avg: 14,243)
07:27:38 INFO  loader     ::   Elapsed: 38,965.83 seconds [2020/05/20 07:27:38 CEST]
...
08:37:48 INFO  loader     :: Add: 805,000,000 latest-all.ttl (Batch: 1,204 / Avg: 6,212)
08:37:48 INFO  loader     ::   Elapsed: 129,575.23 seconds [2020/05/21 08:37:48 CEST]
...
08:57:25 INFO  loader     :: Add: 900,000,000 latest-all.ttl (Batch: 929 / Avg: 4,144)
08:57:25 INFO  loader     ::   Elapsed: 217,152.75 seconds [2020/05/22 08:57:25 CEST]
...
06:25:08 INFO  loader     :: Add: 1,090,000,000 latest-all.ttl (Batch: 684 / Avg: 2,332)
06:25:08 INFO  loader     ::   Elapsed: 467,215.09 seconds [2020/05/25 06:25:08 CEST]
...
12:36:02 INFO  loader     :: Add: 1,730,000,000 latest-all.ttl (Batch: 523 / Avg: 1,465)
12:36:02 INFO  loader     ::   Elapsed: 1,180,669.63 seconds [2020/06/02 12:36:02 CEST]
...
java.lang.IllegalArgumentException: null
	at java.nio.Buffer.position(Buffer.java:244) ~[?:1.8.0_201]
	at org.apache.jena.dboe.base.record.RecordFactory.lambda$static$0(RecordFactory.java:111) ~[jena-dboe-base-3.15.0.jar:3.15.0]
	at org.apache.jena.dboe.base.record.RecordFactory.buildFrom(RecordFactory.java:127) ~[jena-dboe-base-3.15.0.jar:3.15.0]
	at org.apache.jena.dboe.base.buffer.RecordBuffer._get(RecordBuffer.java:102) ~[jena-dboe-base-3.15.0.jar:3.15.0]
	at org.apache.jena.dboe.base.buffer.RecordBuffer.get(RecordBuffer.java:52) ~[jena-dboe-base-3.15.0.jar:3.15.0]
	at org.apache.jena.dboe.trans.bplustree.BPTreeRecords.getSplitKey(BPTreeRecords.java:195) ~[jena-dboe-trans-data-3.15.0.jar:3.15.0]
	at org.apache.jena.dboe.trans.bplustree.BPTreeNode.split(BPTreeNode.java:562) ~[jena-dboe-trans-data-3.15.0.jar:3.15.0]
	at org.apache.jena.dboe.trans.bplustree.BPTreeNode.internalInsert(BPTreeNode.java:509) ~[jena-dboe-trans-data-3.15.0.jar:3.15.0]
	at org.apache.jena.dboe.trans.bplustree.BPTreeNode.internalInsert(BPTreeNode.java:522) ~[jena-dboe-trans-data-3.15.0.jar:3.15.0]
	at org.apache.jena.dboe.trans.bplustree.BPTreeNode.internalInsert(BPTreeNode.java:522) ~[jena-dboe-trans-data-3.15.0.jar:3.15.0]
	at org.apache.jena.dboe.trans.bplustree.BPTreeNode.internalInsert(BPTreeNode.java:522) ~[jena-dboe-trans-data-3.15.0.jar:3.15.0]
	at org.apache.jena.dboe.trans.bplustree.BPTreeNode.insert(BPTreeNode.java:203) ~[jena-dboe-trans-data-3.15.0.jar:3.15.0]
	at org.apache.jena.dboe.trans.bplustree.BPlusTree.insertAndReturnOld(BPlusTree.java:278) ~[jena-dboe-trans-data-3.15.0.jar:3.15.0]
	at org.apache.jena.dboe.trans.bplustree.BPlusTree.insert(BPlusTree.java:271) ~[jena-dboe-trans-data-3.15.0.jar:3.15.0]
	at org.apache.jena.tdb2.store.tupletable.TupleIndexRecord.performAdd(TupleIndexRecord.java:94) ~[jena-tdb2-3.15.0.jar:3.15.0]
	at org.apache.jena.tdb2.store.tupletable.TupleIndexBase.add(TupleIndexBase.java:66) ~[jena-tdb2-3.15.0.jar:3.15.0]
	at org.apache.jena.tdb2.loader.main.Indexer.lambda$loadTuples$1(Indexer.java:133) ~[jena-tdb2-3.15.0.jar:3.15.0]
	at org.apache.jena.tdb2.loader.main.Indexer.stageIndex(Indexer.java:115) ~[jena-tdb2-3.15.0.jar:3.15.0]
	at org.apache.jena.tdb2.loader.main.Indexer.lambda$startBulk$0(Indexer.java:92) ~[jena-tdb2-3.15.0.jar:3.15.0]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_201]
14:58:42 ERROR Indexer    :: Interrupted

second attempt to load with tdb2.tdbloader

Changed the Java version, rebooted the machine and emptied the target disk so that 4 Terabyte of space are available.

java -version
openjdk version "1.8.0_232"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_232-b09)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.232-b09, mixed mode)
tail -f tdb2--err.log 
...
12:29:24 INFO  loader               ::   Elapsed: 159,53 seconds [2020/05/17 12:29:24 MESZ]
12:29:29 INFO  loader               :: Add: 15.500.000 latest-all.ttl (Batch: 89.477 / Avg: 93.873
...
12:48:05 INFO  loader               :: Add: 125.000.000 latest-all.ttl (Batch: 100.563 / Avg: 97.596)
12:48:05 INFO  loader               ::   Elapsed: 1.280,78 seconds [2020/05/17 12:48:05 MESZ]
...
16:34:18 INFO  loader               :: Add: 870.000.000 latest-all.ttl (Batch: 20.284 / Avg: 58.569)
16:34:18 INFO  loader               ::   Elapsed: 14.854,12 seconds [2020/05/17 16:34:18 MESZ]
...
22:06:25 INFO  loader               ::   Elapsed: 34.780,33 seconds [2020/05/17 22:06:25 MESZ]
22:07:58 INFO  loader               :: Add: 1.085.500.000 latest-all.ttl (Batch: 5.340 / Avg: 31.126)
...
07:30:38 INFO  loader               :: Add: 1.515.000.000 latest-all.ttl (Batch: 9.037 / Avg: 22.073)
07:30:38 INFO  loader               ::   Elapsed: 68.633,35 seconds [2020/05/18 07:30:38 MESZ]
...
12:24:35 INFO  loader               :: Add: 1.655.000.000 latest-all.ttl (Batch: 2.463 / Avg: 19.183)
12:24:35 INFO  loader               ::   Elapsed: 86.270,45 seconds [2020/05/18 12:24:35 MESZ]
Shortly afterwards the Java VM crashed with the same symptoms as in the first crashing attempt.
Memory and CPU usage
top -o cpu
PID   COMMAND      %CPU  TIME     #TH    #WQ  #PORT MEM    PURG   CMPRS  PGRP PPID STATE
796   java         333.6 97:30.06 50/1   1    138   8466M  0B     0B     787  791  running
...
PID    COMMAND      %CPU  TIME     #TH    #WQ  #PORT MEM    PURG   CMPRS  PGRP  PPID STATE
796    java         67.0  11:05:54 50/1   1    138   4990M+ 0B     5959M- 787   791  running

first attempt to load with tdb2.tdbloader

At about 1 billion triples the Java VM crashed:

20:20:54 INFO  loader               :: Add: 30.000.000 latest-all.ttl (Batch: 48.118 / Avg: 40.320)
20:20:54 INFO  loader               ::   Elapsed: 744,05 seconds [2020/05/15 20:20:54 MESZ]
...
20:39:38 INFO  loader               :: Add: 80.000.000 latest-all.ttl (Batch: 49.578 / Avg: 42.829)
20:39:38 INFO  loader               ::   Elapsed: 1.867,87 seconds [2020/05/15 20:39:38 MESZ]
...
06:57:27 INFO  loader               ::   Elapsed: 38.936,91 seconds [2020/05/16 06:57:27 MESZ]
06:58:19 INFO  loader               :: Add: 880.500.000 latest-all.ttl (Batch: 9.717 / Avg: 22.583
....
12:19:51 INFO  loader               ::   Elapsed: 58.280,51 seconds [2020/05/16 12:19:51 MESZ]
12:22:21 INFO  loader               :: Add: 1.000.500.000 latest-all.ttl (Batch: 3.322 / Avg: 17.122)
12:24:46 INFO  loader               :: Add: 1.001.000.000 latest-all.ttl (Batch: 3.459 / Avg: 17.089)
12:26:51 INFO  loader               :: Add: 1.001.500.000 latest-all.ttl (Batch: 3.993 / Avg: 17.061)
apache-jena-3.14.0/bin/tdb2.tdbloader: line 89:  9626 Abort trap: 6           java $JVM_ARGS $LOGGING -cp "$JENA_CP" tdb2.tdbloader "$@"

java -version
java version "1.8.0_191"
Java(TM) SE Runtime Environment (build 1.8.0_191-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)

head -50 hs_err_pid9626.log  
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (safepoint.cpp:310), pid=9626, tid=0x0000000000004703
#  guarantee(PageArmed == 0) failed: invariant
#
# JRE version: Java(TM) SE Runtime Environment (8.0_191-b12) (build 1.8.0_191-b12)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.191-b12 mixed mode bsd-amd64 compressed oops)
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#

---------------  T H R E A D  ---------------

Current thread (0x00007f8f6e013000):  VMThread [stack: 0x000070000db5f000,0x000070000dc5f000] [id=18179]

Stack: [0x000070000db5f000,0x000070000dc5f000],  sp=0x000070000dc5e910,  free space=1022k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.dylib+0x5b55da]
V  [libjvm.dylib+0x1e8ec9]
V  [libjvm.dylib+0x4e5d46]
V  [libjvm.dylib+0x5bbc2f]
V  [libjvm.dylib+0x5bb55d]
V  [libjvm.dylib+0x48e0be]
C  [libsystem_pthread.dylib+0x3661]  _pthread_body+0x154
C  [libsystem_pthread.dylib+0x350d]  _pthread_body+0x0
C  [libsystem_pthread.dylib+0x2bf9]  thread_start+0xd

VM_Operation (0x000070000ef973e8): ParallelGCFailedAllocation, mode: safepoint, requested by thread 0x00007f8f6e991000

May be related to https://bugs.openjdk.java.net/browse/JDK-8164292, which was never reproducible and therefore not fixed ...

attempt to load with TDB 1 tdbloader2

INFO    Elapsed: 15,97 seconds [2020/05/11 18:45:29 MESZ]
INFO  Add: 1.550.000 Data (Batch: 112.359 / Avg: 94.414)
...
INFO    Elapsed: 505,57 seconds [2020/05/11 18:53:39 MESZ]
INFO  Add: 54.550.000 Data (Batch: 111.607 / Avg: 107.803)
...
INFO    Elapsed: 5.371,68 seconds [2020/05/11 20:14:45 MESZ]
INFO  Add: 665.050.000 Data (Batch: 83.333 / Avg: 123.792)
...
INFO    Elapsed: 44.571,30 seconds [2020/05/12 07:08:05 MESZ]
INFO  Add: 5.439.050.000 Data (Batch: 163.398 / Avg: 122.029)
...
INFO    Elapsed: 50.578,97 seconds [2020/05/12 08:48:12 MESZ]
INFO  Add: 6.189.550.000 Data (Batch: 112.612 / Avg: 122.372)
...
INFO    Elapsed: 61.268,32 seconds [2020/05/12 11:46:22 MESZ]
INFO  Add: 7.470.050.000 Data (Batch: 138.121 / Avg: 121.922)
...
INFO    Elapsed: 73.785,44 seconds [2020/05/12 15:14:59 MESZ]
INFO  Add: 8.729.050.000 Data (Batch: 63.532 / Avg: 118.301)
...
INFO    Elapsed: 89.876,01 seconds [2020/05/12 19:43:09 MESZ]
INFO  Add: 9.409.050.000 Data (Batch: 33.222 / Avg: 104.687)
...
INFO    Elapsed: 97.888,94 seconds [2020/05/12 21:56:42 MESZ]
INFO  Add: 9.584.550.000 Data (Batch: 17.692 / Avg: 97.909)
...
INFO    Elapsed: 130.683,99 seconds [2020/05/13 07:03:17 MESZ]
INFO  Add: 9.947.050.000 Data (Batch: 4.505 / Avg: 76.108)
...
INFO  Add: 9.999.950.000 Data (Batch: 10.273 / Avg: 72.642)
INFO  Add: 10.000.000.000 Data (Batch: 11.088 / Avg: 72.639)
INFO    Elapsed: 137.665,23 seconds [2020/05/13 08:59:38 MESZ]
INFO  Add: 10.000.050.000 Data (Batch: 10.397 / Avg: 72.637)
...
INFO    Elapsed: 158.494,22 seconds [2020/05/13 14:46:47 MESZ]
INFO  Add: 10.149.550.000 Data (Batch: 5.196 / Avg: 64.033)
...
INFO    Elapsed: 187.192,20 seconds [2020/05/13 22:45:05 MESZ]
INFO  Add: 10.309.050.000 Data (Batch: 6.067 / Avg: 55.069)
...
INFO    Elapsed: 215.463,73 seconds [2020/05/14 06:36:17 MESZ]
INFO  Add: 10.490.050.000 Data (Batch: 5.209 / Avg: 48.683)
...
INFO    Elapsed: 241.567,91 seconds [2020/05/14 13:51:21 MESZ]
INFO  Add: 10.625.050.000 Data (Batch: 6.687 / Avg: 43.982)
...
INFO    Elapsed: 265.019,81 seconds [2020/05/14 20:22:13 MESZ]
INFO  Add: 10.772.050.000 Data (Batch: 5.812 / Avg: 40.644)
...
INFO    Elapsed: 309.162,91 seconds [2020/05/15 08:37:56 MESZ]
INFO  Add: 11.000.000.000 Data (Batch: 3.858 / Avg: 35.565)
...
INFO    Elapsed: 349.881,19 seconds [2020/05/15 19:56:34 MESZ]
INFO  Add: 11.175.550.000 Data (Batch: 4.792 / Avg: 31.940)

Output of a failed attempt with insufficient disk space

INFO    Elapsed: 352,44 seconds [2020/05/10 11:03:30 MESZ]
INFO  Add: 37.050.000 Data (Batch: 107.991 / Avg: 104.985)
...
INFO    Elapsed: 4.464,92 seconds [2020/05/10 12:12:03 MESZ]
INFO  Add: 545.550.000 Data (Batch: 120.481 / Avg: 122.174)
...
INFO    Elapsed: 8.611,05 seconds [2020/05/10 13:21:09 MESZ]
INFO  Add: 1.026.050.000 Data (Batch: 128.205 / Avg: 119.149)
...
INFO    Elapsed: 30.653,35 seconds [2020/05/10 19:28:31 MESZ]
INFO  Add: 3.430.050.000 Data (Batch: 105.042 / Avg: 111.896)
...
INFO    Elapsed: 70.746,18 seconds [2020/05/11 06:36:44 MESZ]
INFO  Add: 6.149.050.000 Data (Batch: 49.358 / Avg: 86.915)
...
INFO    Elapsed: 90.976,65 seconds [2020/05/11 12:13:54 MESZ]
INFO  Add: 7.674.050.000 Data (Batch: 12.124 / Avg: 84.348)
...
INFO    Elapsed: 96.770,91 seconds [2020/05/11 13:50:29 MESZ]
INFO  Add: 7.979.050.000 Data (Batch: 51.334 / Avg: 82.452)
org.apache.jena.atlas.AtlasException: java.io.IOException: No space left on device

So the first attempt failed in phase "data" after some 80% of the data had been loaded, because the hard disk was full. Unfortunately only 1.2 TB of the 2 TB disk had been available.
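
A failure like this can be caught before starting a multi-day import by checking the free space on the target disk first. The following is a sketch: the function name has_free_space is made up, and the threshold is an assumption. The log above only shows that 1.2 TB was not enough for this input.

```shell
#!/bin/bash
# Sketch: check whether a directory's file system has enough free space.
# Returns success (0) if at least needed_kb KiB are available.
has_free_space() {
  local dir="$1"        # directory on the target disk
  local needed_kb="$2"  # required free space in KiB (value chosen by the caller)
  local avail_kb
  # df -P prints one line per file system; column 4 is the available space
  avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
  [ "$avail_kb" -ge "$needed_kb" ]
}

# usage with a hypothetical threshold of 2000000000 KiB (roughly 1.9 TiB):
# has_free_space /Volumes/owei/wikidata 2000000000 || echo "not enough disk space"
```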

Manual download

wget -c http://mirror.easyname.ch/apache/jena/binaries/apache-jena-3.14.0.tar.gz
--2020-05-10 10:26:38--  http://mirror.easyname.ch/apache/jena/binaries/apache-jena-3.14.0.tar.gz
Resolving mirror.easyname.ch (mirror.easyname.ch)... 77.244.244.134
Connecting to mirror.easyname.ch (mirror.easyname.ch)|77.244.244.134|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20174193 (19M) [application/x-gzip]
Saving to: ‘apache-jena-3.14.0.tar.gz’

apache-jena-3.14.0. 100%[===================>]  19.24M  2.58MB/s    in 7.4s    

2020-05-10 10:26:45 (2.58 MB/s) - ‘apache-jena-3.14.0.tar.gz’ saved [20174193/20174193]
tar -xvzf apache-jena-3.14.0.tar.gz

wikidata2jena Script

tdb2.tdbloader version

#!/bin/bash
# WF 2020-05-10

# global settings
jena=apache-jena-3.15.0
tgz=$jena.tar.gz
jenaurl=http://mirror.easyname.ch/apache/jena/binaries/$tgz
base=/Volumes/owei/wikidata
data=$base/data
tdbloader=$jena/bin/tdb2.tdbloader

getjena() {
# download
if [ ! -f $tgz ]
then
  echo "downloading $tgz from $jenaurl"
	wget $jenaurl
else
  echo "$tgz already downloaded"
fi
# unpack
if [ ! -d $jena ]
then
  echo "unpacking $jena from $tgz"
	tar xvzf $tgz
else
  echo "$jena already unpacked"
fi
# create data directory
if [ ! -d $data ]
then
  echo "creating $data directory"
  mkdir -p $data
else
  echo "$data directory already created"
fi
}

#
# show the given timestamp
#
timestamp() {
 local msg="$1"
 local ts=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
 echo "$msg at $ts"
}

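#
# load data for the given data dir and input
#
loaddata() {
  local data="$1"
  local input="$2"
  timestamp "start loading $input to $data"
  # tdb2.tdbloader has no phases, so the log file names carry no phase part
  # (the previous tdb2-$phase-out.log expanded to tdb2--out.log because $phase was never set)
  $tdbloader --loader=parallel --loc "$data" "$input" > tdb2-out.log 2> tdb2-err.log
  timestamp "finished loading $input to $data"
}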

getjena
export TMPDIR=$base/tmp
if [ ! -d $TMPDIR ]
then
  echo "creating temporary directory $TMPDIR"
  mkdir $TMPDIR
else
  echo "using temporary directory $TMPDIR"
fi
loaddata $data latest-all.nt

tdbloader2 version (deprecated)

#!/bin/bash
# WF 2020-05-10

# global settings
jena=apache-jena-3.14.0
tgz=$jena.tar.gz
jenaurl=http://mirror.easyname.ch/apache/jena/binaries/$tgz
data=data
tdbloader=$jena/bin/tdbloader2

getjena() {
# download
if [ ! -f $tgz ]
then
  echo "downloading $tgz from $jenaurl"
	wget $jenaurl
else
  echo "$tgz already downloaded"
fi
# unpack
if [ ! -d $jena ]
then
  echo "unpacking $jena from $tgz"
	tar xvzf $tgz
else
  echo "$jena already unpacked"
fi
# create data directory
if [ ! -d $data ]
then
  echo "creating $data directory"
  mkdir $data
else
  echo "$data directory already created"
fi
}

#
# show the given timestamp
#
timestamp() {
 local msg="$1"
 local ts=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
 echo "$msg at $ts"
}

#
# load data for the given phase
#
loaddata4phase() {
  local phase="$1"
	local data="$2"
	local input="$3"
	timestamp "started load phase $phase"
  $tdbloader --phase $phase --loc "$data" "$input" > tdb-$phase-out.log 2> tdb-$phase-err.log
	timestamp "finished load phase $phase"
}

#
# load data for the given data dir and input 
#
loaddata() {
	local data="$1"
	local input="$2"
	loaddata4phase data "$data" "$input"
	loaddata4phase index "$data" ""
}

getjena
wd=$(pwd)
export TMPDIR=$wd/tmp
if [ ! -d $TMPDIR ]
then
  echo "creating temporary directory $TMPDIR"
  mkdir $TMPDIR
else
  echo "using temporary directory $TMPDIR"
fi
loaddata $data latest-all.ttl

Test with wikidata-query-rdf (blazegraph)

See

Clone the project

git clone  --recurse-submodules https://gerrit.wikimedia.org/r/wikidata/query/rdf wikidata-query-rdf
Cloning into 'wikidata-query-rdf'...
remote: Counting objects: 22, done
remote: Total 16238 (delta 0), reused 16238 (delta 0)
Receiving objects: 100% (16238/16238), 2.75 MiB | 2.31 MiB/s, done.
Resolving deltas: 100% (8067/8067), done.

Build it

cd wikidata-query-rdf
$ mvn package
...
[INFO] Results:
[INFO] 
[WARNING] Tests run: 181, Failures: 0, Errors: 0, Skipped: 1
[INFO] 
...
[INFO] Reactor Summary for Wikidata Query Service 0.3.36-SNAPSHOT:
[INFO] 
[INFO] Wikidata Query Service ............................. SUCCESS [ 13.905 s]
[INFO] Shared code ........................................ SUCCESS [ 16.985 s]
[INFO] Wikidata Query RDF Testing Tools ................... SUCCESS [ 10.553 s]
[INFO] Blazegraph extension to improve performance for Wikibase SUCCESS [01:37 min]
[INFO] Blazegraph Service Package ......................... SUCCESS [01:01 min]
[INFO] Wikidata Query RDF Tools ........................... SUCCESS [ 48.598 s]
[INFO] Wikidata Query Service Streaming Updater ........... FAILURE [02:14 min]
[INFO] rdf-spark-tools .................................... SKIPPED
[INFO] Wikibase RDF Query Service ......................... SKIPPED

dyld: lazy symbol binding failed: Symbol not found: ____chkstk_darwin

[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  06:26 min
[INFO] Finished at: 2020-06-09T06:52:22+02:00
[INFO] -----------------------------------------------------------------------

The tests seem to fail due to a binary incompatibility with macOS 10.13.6.

Build without tests

mvn package -D skipTests
[INFO] Reactor Summary for Wikidata Query Service 0.3.36-SNAPSHOT:
[INFO] 
[INFO] Wikidata Query Service ............................. SUCCESS [  2.340 s]
[INFO] Shared code ........................................ SUCCESS [  1.545 s]
[INFO] Wikidata Query RDF Testing Tools ................... SUCCESS [  0.412 s]
[INFO] Blazegraph extension to improve performance for Wikibase SUCCESS [  5.488 s]
[INFO] Blazegraph Service Package ......................... SUCCESS [  2.184 s]
[INFO] Wikidata Query RDF Tools ........................... SUCCESS [ 14.088 s]
[INFO] Wikidata Query Service Streaming Updater ........... SUCCESS [ 32.272 s]
[INFO] rdf-spark-tools .................................... SUCCESS [02:55 min]
[INFO] Wikibase RDF Query Service ......................... SUCCESS [ 21.837 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  04:16 min
[INFO] Finished at: 2020-06-09T07:14:45+02:00
[INFO] ------------------------------------------------------------------------

Links

Performance Reports

Date	Source	Target	Triples	Duration	Link
2017-12	latest-truthy.nt.gz	Apache Jena	?	8 hours	Andy Seaborne, Apache Jena mailing list
2018-01	wikidata-20180101-all-BETA.ttl	Blazegraph	3 billion	4 days	Wolfgang Fahl, BITPlan wiki
2019-02	latest-all.ttl.gz	Apache Jena	?	> 2 days	corsin, muncca blog
2019-05	wikidata-20190513-all-BETA.ttl	Blazegraph	?	10.2 days	Adam Sanchez, WikiData mailing list
2019-05	wikidata-20190513-all-BETA.ttl	Virtuoso	?	43 hours	Adam Sanchez, WikiData mailing list
2020-06	latest-all.ttl (2020-04-28)	Apache Jena	12.9 billion	6 d 16 h	Jonas Sourlier, Jena Issue 1909
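
To compare such reports with the loader logs on this page it helps to convert them to an average rate in triples per second. A minimal sketch; the function name avg_rate is made up, and the 12.9 billion figure is the rounded value from the table.

```shell
#!/bin/bash
# Sketch: average load rate in triples per second.
avg_rate() {
  local triples="$1"   # number of triples loaded
  local seconds="$2"   # elapsed time in seconds
  LC_ALL=C awk -v t="$triples" -v s="$seconds" 'BEGIN { printf "%.0f\n", t / s }'
}

# 12.9 billion triples in 6 d 16 h (= 576000 s), last row of the table
avg_rate 12900000000 576000
# 2.5 billion triples after 1.194.267,75 s, from the latest-truthy.nt log on this page
avg_rate 2500000000 1194267.75
```

The second call reproduces the "Avg" value that the loader itself reports at that point of the log, so the two sources can be compared on the same scale.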

Open Questions