The resources behind https://query.wikidata.org/ are scarce and shared by a lot of people, so you may hit the query limits (https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/query_limits) quite quickly.
See SPARQL for some examples that (mostly) work online without hitting these limits.
Date | Source | Target | Triples | Time | RAM GB | CPU Cores | Speed | Link
---|---|---|---|---|---|---|---|---
2017-12 | latest-truthy.nt.gz | Apache Jena | ? | 8 hours | ? | | | Andy Seaborne - Apache Jena Mailing list
2018-01 | wikidata-20180101-all-BETA.ttl | Blazegraph | 3 billion | 4 days | 32 | 4 | 2.2 GHz | Wolfgang Fahl - BITPlan wiki
2019-02 | latest-all.ttl.gz | Apache Jena | ? | > 2 days | ? | | | corsin - muncca blog
2019-05 | wikidata-20190513-all-BETA.ttl | Blazegraph | ? | 10.2 days | | | | Adam Sanchez - WikiData mailing list
2019-05 | wikidata-20190513-all-BETA.ttl | Virtuoso | ? | 43 hours | ? | | | -
2019-09 | latest-all.ttl (2019-09) | Virtuoso | 9.5 billion | 9.1 hours | ? | | | Adam Sanchez - WikiData mailing list
2019-10 | ? | Blazegraph | ~10 billion | 5.5 d | 104 | 16 | | Adam Shoreland - Wikimedia Foundation
2020-03 | latest-all.nt.bz2 (2020-03-01) | Virtuoso | ~11.8 billion | 10 hours + 1 day prep | 248 | | | Hugh Williams - Virtuoso
2020-06 | latest-all.ttl (2020-04-28) | Apache Jena | 12.9 billion | 6 d 16 h | ? | | | Jonas Sourlier - Jena Issue 1909
2020-07 | latest-truthy.nt (2020-07-15) | Apache Jena | 5.2 billion | 4 d 14 h | 64 | | | Wolfgang Fahl - BITPlan Wiki
2020-08 | latest-all.nt (2020-08-15) | Apache Jena | 13.8 billion | 9 d 21 h | 64 | | | Wolfgang Fahl - BITPlan Wiki
2022-02 | latest-all.nt (2022-01-29) | QLever | 16.9 billion | 4 d 2 h | 127 | 8 | 1.8 GHz | Wolfgang Fahl - BITPlan Wiki
2022-02 | latest-all.nt (2022-02) | stardog | 16.7 billion | 9 h | | | | Evren Sirin - stardog
2022-05 | latest-all.ttl.bz2 (2022-05-29) | QLever | ~17 billion | 14 h | 128 | 12/24 | 4.8 GHz boost | Hannah Bast - QLever
2022-06 | latest-all.nt (2022-06-25) | QLever | 17.2 billion | 1 d 2 h | 128 | 8 | 1.8 GHz | Wolfgang Fahl - BITPlan Wiki
2022-07 | latest-all.ttl (2022-07-12) | stardog | 17.2 billion | 1 d 19 h | 253 | | | Tim Holzheim - BITPlan Wiki
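For a rough comparison across such different hardware, it helps to normalize each run to triples per second. As a sketch, taking the 2022-06 QLever row (17.2 billion triples in 1 d 2 h = 26 hours):

```shell
# normalize a table row to triples per second
# numbers taken from the 2022-06 QLever row: 17.2e9 triples in 26 h
awk 'BEGIN { printf "%.0f triples/s\n", 17.2e9 / (26 * 3600) }'
```

The same formula applied to the 2018 Blazegraph row (3 billion triples in 4 days) gives roughly 8,700 triples/s, which shows how much loader and hardware improvements compounded over the years.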
Getting a copy of WikiData is not for the faint of heart.
You need quite a bit of patience and some hardware resources to get your own WikiData copy working, and the resources you need are a moving target since WikiData is growing all the time. On this page you'll see the documentation for two attempts, one from 2018 and one from 2020.
The successful 2018 attempt was done with a cheap 50 EUR used server from 2009, bought via eBay in Mönchengladbach. The server originally had 32 GByte of RAM; we increased that to 64 GByte by buying a second identical server and adding its RAM. In 2018 a 512 GByte SSD was sufficient to speed up the import process from some 14 days to 3.8 days. Specs of the server:
ASUS KFSN5-D/IST mainboard, Quad-Core AMD Opteron(tm) Processor 2374 HE, speed 2.20 GHz, NB speed 2.00 GHz
The disadvantage of the server is that running it 24/7 costs more than the server itself: with a power consumption of some 3 kWh per day it would cost more than 300 EUR per year to run. We decided to only switch it on when needed.
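The cost estimate can be checked with a quick calculation; the electricity price of 0.28 EUR/kWh is my assumption for a typical German household rate, it is not stated above:

```shell
# yearly running cost: 3 kWh/day for 365 days at an assumed 0.28 EUR/kWh
awk 'BEGIN { printf "%.0f EUR/year\n", 3 * 365 * 0.28 }'
```

which is consistent with the "more than 300 EUR per year" figure.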
The second attempt started in 2020-05; as of 2020-07-30 it was successful on the fifth try, see WikiData Import 2020-07-15. See below for all tries, including the ones that failed. Since August we have been documenting the trials via Category:WikiData, where you might find the latest ones.
java -version
openjdk version "1.8.0_232"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_232-b09)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.232-b09, mixed mode)
wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2
Sizes: latest-all.json.bz2 (03-Jun-2020 07:19): 57017252630 bytes (53 GB)
wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2
--2020-06-09 06:40:09-- https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 2620::861:1:208:80:154:7, 208.80.154.7
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|2620::861:1:208:80:154:7|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 57017252630 (53G) [application/octet-stream]
Saving to: ‘latest-all.json.bz2’
latest-all.json.bz2 1%[ ] 677.93M 4.84MB/s eta 3h 5m
Sizes:
wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2
--2020-05-09 17:18:53-- https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 2620::861:1:208:80:154:7, 208.80.154.7
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|2620::861:1:208:80:154:7|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 71897810492 (67G) [application/octet-stream]
Saving to: ‘latest-all.ttl.bz2’
latest-all.ttl.bz2 0%[ ] 147.79M 4.82MB/s eta 3h 56m
...
latest-all.ttl.bz2 100%[===================>] 66.96G 4.99MB/s in 4h 0m
2020-05-09 21:19:25 (4.75 MB/s) - ‘latest-all.ttl.bz2’ saved [71897810492/71897810492]
bzip2 -dk latest-all.ttl.bz2
ls -l
-rw-r--r-- 1 wf admin 592585505631 May 7 08:00 latest-all.ttl
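Note that -d decompresses and -k keeps the compressed original, so both files stay around (the decompressed Turtle file is almost 600 GB here). Before spending hours on decompression it can also be worth verifying the archive with -t. A minimal sketch on a toy file:

```shell
# bzip2 flag semantics demonstrated on a small sample file
printf 'line1\nline2\n' > sample.txt
bzip2 -k sample.txt        # compress, -k keeps sample.txt
rm sample.txt              # simulate having only the downloaded archive
bzip2 -tv sample.txt.bz2   # -t tests archive integrity without extracting
bzip2 -dk sample.txt.bz2   # -d decompress, -k keep the .bz2 as well
ls -l sample.txt sample.txt.bz2
```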
Simply counting the 15,728,395,994 lines of latest-all.ttl (the Turtle file), which should roughly give the number of triples in the file, takes around one hour in the test environment:
Sun May 10 07:13:45 CEST 2020
15728395994 latest-all.ttl
Sun May 10 08:12:50 CEST 2020
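The measurement above is just a date / wc -l / date sandwich; wrapped as a small helper it looks like this (a sketch, shown on a tiny stand-in file):

```shell
# print start time, line count and end time for a file
timed_count() {
  date
  wc -l "$1"
  date
}
printf 'a\nb\nc\n' > sample.nt   # tiny stand-in for latest-all.ttl
timed_count sample.nt
```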
See https://muncca.com/2019/02/14/wikidata-import-in-apache-jena/. After doing the manual download I decided to create the wikidata2jena script below. With that script the command
nohup ./wikidata2jena&
will start processing the latest-all.ttl file in the background. You might want to make sure that your computer does not go to sleep while the script runs; I am using the Amphetamine macOS app for this. With
tail nohup.out
apache-jena-3.14.0.tar.gz already downloaded
apache-jena-3.14.0 already unpacked
creating data directory
creating temporary directory /Volumes/Tattu/wikidata/tmp
started load phase data at 2020-05-11T16:45:11Z
You can watch the progress of the phases; I assume the data phase will take some 2 days to finish.
To see more progress details you might want to call:
tail -f tdb-data-err.log
With the hint from https://stackoverflow.com/questions/61813248/jena-tdbloader2-performance-and-limits I modified the script to use tdb2.tdbloader instead of tdbloader2 and started a second import. Since for the time being the disk and memory are shared by the two import processes, the performance might be lower than what the hardware allows.
wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-truthy.nt.bz2
--2020-07-15 15:24:25-- https://dumps.wikimedia.org/wikidatawiki/entities/latest-truthy.nt.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 2620::861:1:208:80:154:7, 208.80.154.7
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|2620::861:1:208:80:154:7|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25868964531 (24G) [application/octet-stream]
Saving to: ‘latest-truthy.nt.bz2’
latest-truthy.nt.bz 2%[ ] 546.01M 4.66MB/s eta 85m 57s
ls -l latest-truthy.nt.bz2
-rw-r--r-- 1 wf admin 25868964531 Jul 11 23:33 latest-truthy.nt.bz2
bzip2 -dk latest-truthy.nt.bz2
ls -l latest-truthy.nt
-rw------- 1 wf admin 671598919680 Jul 15 21:15 latest-truthy.nt
zeus:wikidata wf$ls -l latest-truthy.nt
-rw-r--r-- 1 wf admin 671749317281 Jul 11 23:33 latest-truthy.nt
nohup ./wikidata2jena&
tail -f tdb2-err.log
21:16:57 INFO loader :: Loader = LoaderParallel
21:16:57 INFO loader :: Start: latest-truthy.nt
21:17:00 INFO loader :: Add: 500.000 latest-truthy.nt (Batch: 151.883 / Avg: 151.883)
21:17:02 INFO loader :: Add: 1.000.000 latest-truthy.nt (Batch: 243.309 / Avg: 187.020)
...
21:33:21 INFO loader :: Add: 100.000.000 latest-truthy.nt (Batch: 209.292 / Avg: 101.592)
21:33:21 INFO loader :: Elapsed: 984,33 seconds [2020/07/15 21:33:21 MESZ]
...
22:41:36 INFO loader :: Add: 500.000.000 latest-truthy.nt (Batch: 54.153 / Avg: 98.446)
22:41:36 INFO loader :: Elapsed: 5.078,89 seconds [2020/07/15 22:41:36 MESZ]
...
02:55:36 INFO loader :: Add: 1.000.000.000 latest-truthy.nt (Batch: 21.504 / Avg: 49.215)
02:55:36 INFO loader :: Elapsed: 20.318,94 seconds [2020/07/16 02:55:36 MESZ]
...
13:47:17 INFO loader :: Add: 2.000.000.000 latest-truthy.nt (Batch: 32.036 / Avg: 33.658)
13:47:17 INFO loader :: Elapsed: 59.420,03 seconds [2020/07/16 13:47:17 MESZ]
...
06:10:12 INFO loader :: Add: 3.000.000.000 latest-truthy.nt (Batch: 10.900 / Avg: 25.338)
06:10:13 INFO loader :: Elapsed: 118.395,31 seconds [2020/07/17 06:10:12 MESZ]
...
09:33:14 INFO loader :: Add: 4.000.000.000 latest-truthy.nt (Batch: 11.790 / Avg: 18.435)
09:33:14 INFO loader :: Elapsed: 216.976,77 seconds [2020/07/18 09:33:14 MESZ]
...
00:55:21 INFO loader :: Add: 5.000.000.000 latest-truthy.nt (Batch: 4.551 / Avg: 13.939)
00:55:21 INFO loader :: Elapsed: 358.703,75 seconds [2020/07/20 00:55:21 MESZ]
...
11:02:06 INFO loader :: Add: 5.253.500.000 latest-truthy.nt (Batch: 10.555 / Avg: 13.296)
11:02:38 INFO loader :: Finished: latest-truthy.nt: 5.253.753.313 tuples in 395140,91s (Avg: 13.295)
11:05:27 INFO loader :: Finish - index SPO
11:08:24 INFO loader :: Finish - index POS
11:08:59 INFO loader :: Finish - index OSP
11:08:59 INFO loader :: Time = 395.522,378 seconds : Triples = 5.253.753.313 : Rate = 13.283 /s
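The reported rate can be cross-checked from the totals in the log (the loader prints numbers with '.' as thousands separator and ',' as decimal separator):

```shell
# cross-check the loader's final rate: 5,253,753,313 tuples in 395,522 seconds
awk 'BEGIN { printf "%.0f triples/s\n", 5253753313 / 395522 }'
```

matching the "Rate = 13.283 /s" line.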
wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-truthy.nt.bz2
--2020-06-12 17:19:43-- https://dumps.wikimedia.org/wikidatawiki/entities/latest-truthy.nt.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 2620::861:1:208:80:154:7, 208.80.154.7
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|2620::861:1:208:80:154:7|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25480099958 (24G) [application/octet-stream]
Saving to: ‘latest-truthy.nt.bz2’
latest-truthy.nt.bz 2%[ ] 691.48M 4.80MB/s eta 87m 52s
latest-truthy.nt.bz2 100%[===========================================>] 23.73G 4.63MB/s in 89m 3s
2020-06-12 19:33:52 (4.55 MB/s) - ‘latest-truthy.nt.bz2’ saved [25480099958/25480099958]
bzip2 -dk latest-truthy.nt.bz2
ls -l latest-truthy.nt
-rw-r--r-- 1 wf admin 661598113609 Jun 6 20:11 latest-truthy.nt
sudo port install openjdk11
java -version
openjdk version "11.0.5" 2019-10-15
OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.5+10)
OpenJDK 64-Bit Server VM AdoptOpenJDK (build 11.0.5+10, mixed mode)
This import was unsuccessful due to its long runtime. After 3 weeks only 25% of the data had been imported, with performance degrading from 150 K triples per second down to 0.5 K triples per second (a factor of 300). At that pace the import would have taken many months to finish.
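A back-of-the-envelope extrapolation makes the "many months" plausible: the log below ends at about 3.0 of the expected ~5.2 billion triples, with the batch rate down to a few hundred triples per second. Even at a steady 336 triples/s (the last batch value in the log), the remaining ~2.2 billion triples alone would need:

```shell
# remaining load time at the degraded rate: ~2.2e9 triples at ~336 triples/s
awk 'BEGIN { printf "%.0f days\n", 2.2e9 / 336 / 86400 }'
```

and the rate was still falling.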
nohup ./wikidata2jena&
[1] 18313
appending output to nohup.out
tail -f tdb2-out.log
08:11:53 INFO loader :: Loader = LoaderParallel
08:11:53 INFO loader :: Start: latest-truthy.nt
08:11:57 INFO loader :: Add: 500.000 latest-truthy.nt (Batch: 156.887 / Avg: 156.887)
08:11:59 INFO loader :: Add: 1.000.000 latest-truthy.nt (Batch: 247.035 / Avg: 191.901)
08:12:01 INFO loader :: Add: 1.500.000 latest-truthy.nt (Batch: 177.367 / Avg: 186.799)
08:12:09 INFO loader :: Add: 2.000.000 latest-truthy.nt (Batch: 68.212 / Avg: 130.208)
...
09:49:07 INFO loader :: Add: 500.000.000 latest-truthy.nt (Batch: 54.259 / Avg: 85.713)
09:49:07 INFO loader :: Elapsed: 5.833,40 seconds [2020/06/13 09:49:07 MESZ]
...
00:17:13 INFO loader :: Add: 1.000.000.000 latest-truthy.nt (Batch: 4.020 / Avg: 17.265)
00:17:13 INFO loader :: Elapsed: 57.919,50 seconds [2020/06/14 00:17:13 MESZ]
...
10:37:44 INFO loader :: Add: 1.500.000.000 latest-truthy.nt (Batch: 8.740 / Avg: 8.262)
10:37:44 INFO loader :: Elapsed: 181.550,47 seconds [2020/06/15 10:37:44 MESZ]
...
08:08:47 INFO loader :: Add: 2.000.000.000 latest-truthy.nt (Batch: 1.604 / Avg: 3.859)
08:08:48 INFO loader :: Elapsed: 518.214,03 seconds [2020/06/19 08:08:48 MESZ]
...
12:36:01 INFO loader :: Add: 2.250.000.000 latest-truthy.nt (Batch: 683 / Avg: 2.835)
12:36:01 INFO loader :: Elapsed: 793.447,44 seconds [2020/06/22 12:36:01 MESZ]
...
03:56:21 INFO loader :: Add: 2.500.000.000 latest-truthy.nt (Batch: 1.030 / Avg: 2.093)
03:56:21 INFO loader :: Elapsed: 1.194.267,75 seconds [2020/06/27 03:56:21 MESZ]
...
05:15:29 INFO loader :: Add: 2.830.000.000 latest-truthy.nt (Batch: 390 / Avg: 1.832)
05:15:29 INFO loader :: Elapsed: 1.544.615,50 seconds [2020/07/01 05:15:29 MESZ]
...
04:12:50 INFO loader :: Elapsed: 1.972.857,00 seconds [2020/07/06 04:12:50 MESZ]
04:37:35 INFO loader :: Add: 2.990.500.000 latest-truthy.nt (Batch: 336 / Avg: 1.514)
java -version
java version "1.8.0_201"
Java(TM) SE Runtime Environment (build 1.8.0_201-b09)
Java HotSpot(TM) 64-Bit Server VM (build 25.201-b09, mixed mode)
tail -f tdb2-out.log
...
20:40:38 INFO loader :: Add: 10,000,000 latest-all.ttl (Batch: 68,813 / Avg: 68,559)
20:40:38 INFO loader :: Elapsed: 145.86 seconds [2020/05/19 20:40:38 CEST]
...
20:49:43 INFO loader :: Add: 50,000,000 latest-all.ttl (Batch: 69,803 / Avg: 72,422)
20:49:43 INFO loader :: Elapsed: 690.40 seconds [2020/05/19 20:49:43 CEST]
...
07:27:38 INFO loader :: Add: 555,000,000 latest-all.ttl (Batch: 7,331 / Avg: 14,243)
07:27:38 INFO loader :: Elapsed: 38,965.83 seconds [2020/05/20 07:27:38 CEST]
...
08:37:48 INFO loader :: Add: 805,000,000 latest-all.ttl (Batch: 1,204 / Avg: 6,212)
08:37:48 INFO loader :: Elapsed: 129,575.23 seconds [2020/05/21 08:37:48 CEST]
...
08:57:25 INFO loader :: Add: 900,000,000 latest-all.ttl (Batch: 929 / Avg: 4,144)
08:57:25 INFO loader :: Elapsed: 217,152.75 seconds [2020/05/22 08:57:25 CEST]
...
06:25:08 INFO loader :: Add: 1,090,000,000 latest-all.ttl (Batch: 684 / Avg: 2,332)
06:25:08 INFO loader :: Elapsed: 467,215.09 seconds [2020/05/25 06:25:08 CEST]
...
12:36:02 INFO loader :: Add: 1,730,000,000 latest-all.ttl (Batch: 523 / Avg: 1,465)
12:36:02 INFO loader :: Elapsed: 1,180,669.63 seconds [2020/06/02 12:36:02 CEST]
...
java.lang.IllegalArgumentException: null
at java.nio.Buffer.position(Buffer.java:244) ~[?:1.8.0_201]
at org.apache.jena.dboe.base.record.RecordFactory.lambda$static$0(RecordFactory.java:111) ~[jena-dboe-base-3.15.0.jar:3.15.0]
at org.apache.jena.dboe.base.record.RecordFactory.buildFrom(RecordFactory.java:127) ~[jena-dboe-base-3.15.0.jar:3.15.0]
at org.apache.jena.dboe.base.buffer.RecordBuffer._get(RecordBuffer.java:102) ~[jena-dboe-base-3.15.0.jar:3.15.0]
at org.apache.jena.dboe.base.buffer.RecordBuffer.get(RecordBuffer.java:52) ~[jena-dboe-base-3.15.0.jar:3.15.0]
at org.apache.jena.dboe.trans.bplustree.BPTreeRecords.getSplitKey(BPTreeRecords.java:195) ~[jena-dboe-trans-data-3.15.0.jar:3.15.0]
at org.apache.jena.dboe.trans.bplustree.BPTreeNode.split(BPTreeNode.java:562) ~[jena-dboe-trans-data-3.15.0.jar:3.15.0]
at org.apache.jena.dboe.trans.bplustree.BPTreeNode.internalInsert(BPTreeNode.java:509) ~[jena-dboe-trans-data-3.15.0.jar:3.15.0]
at org.apache.jena.dboe.trans.bplustree.BPTreeNode.internalInsert(BPTreeNode.java:522) ~[jena-dboe-trans-data-3.15.0.jar:3.15.0]
at org.apache.jena.dboe.trans.bplustree.BPTreeNode.internalInsert(BPTreeNode.java:522) ~[jena-dboe-trans-data-3.15.0.jar:3.15.0]
at org.apache.jena.dboe.trans.bplustree.BPTreeNode.internalInsert(BPTreeNode.java:522) ~[jena-dboe-trans-data-3.15.0.jar:3.15.0]
at org.apache.jena.dboe.trans.bplustree.BPTreeNode.insert(BPTreeNode.java:203) ~[jena-dboe-trans-data-3.15.0.jar:3.15.0]
at org.apache.jena.dboe.trans.bplustree.BPlusTree.insertAndReturnOld(BPlusTree.java:278) ~[jena-dboe-trans-data-3.15.0.jar:3.15.0]
at org.apache.jena.dboe.trans.bplustree.BPlusTree.insert(BPlusTree.java:271) ~[jena-dboe-trans-data-3.15.0.jar:3.15.0]
at org.apache.jena.tdb2.store.tupletable.TupleIndexRecord.performAdd(TupleIndexRecord.java:94) ~[jena-tdb2-3.15.0.jar:3.15.0]
at org.apache.jena.tdb2.store.tupletable.TupleIndexBase.add(TupleIndexBase.java:66) ~[jena-tdb2-3.15.0.jar:3.15.0]
at org.apache.jena.tdb2.loader.main.Indexer.lambda$loadTuples$1(Indexer.java:133) ~[jena-tdb2-3.15.0.jar:3.15.0]
at org.apache.jena.tdb2.loader.main.Indexer.stageIndex(Indexer.java:115) ~[jena-tdb2-3.15.0.jar:3.15.0]
at org.apache.jena.tdb2.loader.main.Indexer.lambda$startBulk$0(Indexer.java:92) ~[jena-tdb2-3.15.0.jar:3.15.0]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_201]
14:58:42 ERROR Indexer :: Interrupted
Changed the Java version, rebooted the machine and cleaned the target disk, leaving 4 terabytes of available space.
java -version
openjdk version "1.8.0_232"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_232-b09)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.232-b09, mixed mode)
tail -f tdb2-err.log
...
12:29:24 INFO loader :: Elapsed: 159,53 seconds [2020/05/17 12:29:24 MESZ]
12:29:29 INFO loader :: Add: 15.500.000 latest-all.ttl (Batch: 89.477 / Avg: 93.873
...
12:48:05 INFO loader :: Add: 125.000.000 latest-all.ttl (Batch: 100.563 / Avg: 97.596)
12:48:05 INFO loader :: Elapsed: 1.280,78 seconds [2020/05/17 12:48:05 MESZ]
...
16:34:18 INFO loader :: Add: 870.000.000 latest-all.ttl (Batch: 20.284 / Avg: 58.569)
16:34:18 INFO loader :: Elapsed: 14.854,12 seconds [2020/05/17 16:34:18 MESZ]
...
22:06:25 INFO loader :: Elapsed: 34.780,33 seconds [2020/05/17 22:06:25 MESZ]
22:07:58 INFO loader :: Add: 1.085.500.000 latest-all.ttl (Batch: 5.340 / Avg: 31.126)
...
07:30:38 INFO loader :: Add: 1.515.000.000 latest-all.ttl (Batch: 9.037 / Avg: 22.073)
07:30:38 INFO loader :: Elapsed: 68.633,35 seconds [2020/05/18 07:30:38 MESZ]
...
12:24:35 INFO loader :: Add: 1.655.000.000 latest-all.ttl (Batch: 2.463 / Avg: 19.183)
12:24:35 INFO loader :: Elapsed: 86.270,45 seconds [2020/05/18 12:24:35 MESZ]
Shortly after that, the Java VM crashed with the same symptoms as in the first crashing attempt.
top -o cpu
PID COMMAND %CPU TIME #TH #WQ #PORT MEM PURG CMPRS PGRP PPID STATE
796 java 333.6 97:30.06 50/1 1 138 8466M 0B 0B 787 791 running
...
COMMAND %CPU TIME #TH #WQ #PORT MEM PURG CMPRS PGRP PPID STATE
796 java 67.0 11:05:54 50/1 1 138 4990M+ 0B 5959M- 787 791 running
At 1 billion triples the Java VM crashed:
20:20:54 INFO loader :: Add: 30.000.000 latest-all.ttl (Batch: 48.118 / Avg: 40.320)
20:20:54 INFO loader :: Elapsed: 744,05 seconds [2020/05/15 20:20:54 MESZ]
...
20:39:38 INFO loader :: Add: 80.000.000 latest-all.ttl (Batch: 49.578 / Avg: 42.829)
20:39:38 INFO loader :: Elapsed: 1.867,87 seconds [2020/05/15 20:39:38 MESZ]
...
06:57:27 INFO loader :: Elapsed: 38.936,91 seconds [2020/05/16 06:57:27 MESZ]
06:58:19 INFO loader :: Add: 880.500.000 latest-all.ttl (Batch: 9.717 / Avg: 22.583
....
12:19:51 INFO loader :: Elapsed: 58.280,51 seconds [2020/05/16 12:19:51 MESZ]
12:22:21 INFO loader :: Add: 1.000.500.000 latest-all.ttl (Batch: 3.322 / Avg: 17.122)
12:24:46 INFO loader :: Add: 1.001.000.000 latest-all.ttl (Batch: 3.459 / Avg: 17.089)
12:26:51 INFO loader :: Add: 1.001.500.000 latest-all.ttl (Batch: 3.993 / Avg: 17.061)
apache-jena-3.14.0/bin/tdb2.tdbloader: line 89: 9626 Abort trap: 6 java $JVM_ARGS $LOGGING -cp "$JENA_CP" tdb2.tdbloader "$@"
java -version
java version "1.8.0_191"
Java(TM) SE Runtime Environment (build 1.8.0_191-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)
head -50 hs_err_pid9626.log
#
# A fatal error has been detected by the Java Runtime Environment:
#
# Internal Error (safepoint.cpp:310), pid=9626, tid=0x0000000000004703
# guarantee(PageArmed == 0) failed: invariant
#
# JRE version: Java(TM) SE Runtime Environment (8.0_191-b12) (build 1.8.0_191-b12)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.191-b12 mixed mode bsd-amd64 compressed oops)
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# If you would like to submit a bug report, please visit:
# http://bugreport.java.com/bugreport/crash.jsp
#
--------------- T H R E A D ---------------
Current thread (0x00007f8f6e013000): VMThread [stack: 0x000070000db5f000,0x000070000dc5f000] [id=18179]
Stack: [0x000070000db5f000,0x000070000dc5f000], sp=0x000070000dc5e910, free space=1022k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [libjvm.dylib+0x5b55da]
V [libjvm.dylib+0x1e8ec9]
V [libjvm.dylib+0x4e5d46]
V [libjvm.dylib+0x5bbc2f]
V [libjvm.dylib+0x5bb55d]
V [libjvm.dylib+0x48e0be]
C [libsystem_pthread.dylib+0x3661] _pthread_body+0x154
C [libsystem_pthread.dylib+0x350d] _pthread_body+0x0
C [libsystem_pthread.dylib+0x2bf9] thread_start+0xd
VM_Operation (0x000070000ef973e8): ParallelGCFailedAllocation, mode: safepoint, requested by thread 0x00007f8f6e991000
This may be related to https://bugs.openjdk.java.net/browse/JDK-8164292, which was never reproducible and therefore not fixed ...
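The crash report above notes that core dumps were disabled. To capture one on a retry, the limit can be raised in the shell that launches Java (a sketch; whether a core file is actually written also depends on the OS configuration):

```shell
# allow core dumps in the current shell before starting Java again
ulimit -c unlimited
ulimit -c   # show the effective limit
```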
INFO Elapsed: 15,97 seconds [2020/05/11 18:45:29 MESZ]
INFO Add: 1.550.000 Data (Batch: 112.359 / Avg: 94.414)
...
INFO Elapsed: 505,57 seconds [2020/05/11 18:53:39 MESZ]
INFO Add: 54.550.000 Data (Batch: 111.607 / Avg: 107.803)
...
INFO Elapsed: 5.371,68 seconds [2020/05/11 20:14:45 MESZ]
INFO Add: 665.050.000 Data (Batch: 83.333 / Avg: 123.792)
...
INFO Elapsed: 44.571,30 seconds [2020/05/12 07:08:05 MESZ]
INFO Add: 5.439.050.000 Data (Batch: 163.398 / Avg: 122.029)
...
INFO Elapsed: 50.578,97 seconds [2020/05/12 08:48:12 MESZ]
INFO Add: 6.189.550.000 Data (Batch: 112.612 / Avg: 122.372)
...
INFO Elapsed: 61.268,32 seconds [2020/05/12 11:46:22 MESZ]
INFO Add: 7.470.050.000 Data (Batch: 138.121 / Avg: 121.922)
...
INFO Elapsed: 73.785,44 seconds [2020/05/12 15:14:59 MESZ]
INFO Add: 8.729.050.000 Data (Batch: 63.532 / Avg: 118.301)
...
INFO Elapsed: 89.876,01 seconds [2020/05/12 19:43:09 MESZ]
INFO Add: 9.409.050.000 Data (Batch: 33.222 / Avg: 104.687)
...
INFO Elapsed: 97.888,94 seconds [2020/05/12 21:56:42 MESZ]
INFO Add: 9.584.550.000 Data (Batch: 17.692 / Avg: 97.909)
...
INFO Elapsed: 130.683,99 seconds [2020/05/13 07:03:17 MESZ]
INFO Add: 9.947.050.000 Data (Batch: 4.505 / Avg: 76.108)
...
INFO Add: 9.999.950.000 Data (Batch: 10.273 / Avg: 72.642)
INFO Add: 10.000.000.000 Data (Batch: 11.088 / Avg: 72.639)
INFO Elapsed: 137.665,23 seconds [2020/05/13 08:59:38 MESZ]
INFO Add: 10.000.050.000 Data (Batch: 10.397 / Avg: 72.637)
...
INFO Elapsed: 158.494,22 seconds [2020/05/13 14:46:47 MESZ]
INFO Add: 10.149.550.000 Data (Batch: 5.196 / Avg: 64.033)
...
INFO Elapsed: 187.192,20 seconds [2020/05/13 22:45:05 MESZ]
INFO Add: 10.309.050.000 Data (Batch: 6.067 / Avg: 55.069)
...
INFO Elapsed: 215.463,73 seconds [2020/05/14 06:36:17 MESZ]
INFO Add: 10.490.050.000 Data (Batch: 5.209 / Avg: 48.683)
...
INFO Elapsed: 241.567,91 seconds [2020/05/14 13:51:21 MESZ]
INFO Add: 10.625.050.000 Data (Batch: 6.687 / Avg: 43.982)
...
INFO Elapsed: 265.019,81 seconds [2020/05/14 20:22:13 MESZ]
INFO Add: 10.772.050.000 Data (Batch: 5.812 / Avg: 40.644)
...
INFO Elapsed: 309.162,91 seconds [2020/05/15 08:37:56 MESZ]
INFO Add: 11.000.000.000 Data (Batch: 3.858 / Avg: 35.565)
...
INFO Elapsed: 349.881,19 seconds [2020/05/15 19:56:34 MESZ]
INFO Add: 11.175.550.000 Data (Batch: 4.792 / Avg: 31.940)
INFO Elapsed: 352,44 seconds [2020/05/10 11:03:30 MESZ]
INFO Add: 37.050.000 Data (Batch: 107.991 / Avg: 104.985)
...
INFO Elapsed: 4.464,92 seconds [2020/05/10 12:12:03 MESZ]
INFO Add: 545.550.000 Data (Batch: 120.481 / Avg: 122.174)
...
INFO Elapsed: 8.611,05 seconds [2020/05/10 13:21:09 MESZ]
INFO Add: 1.026.050.000 Data (Batch: 128.205 / Avg: 119.149)
...
INFO Elapsed: 30.653,35 seconds [2020/05/10 19:28:31 MESZ]
INFO Add: 3.430.050.000 Data (Batch: 105.042 / Avg: 111.896)
...
INFO Elapsed: 70.746,18 seconds [2020/05/11 06:36:44 MESZ]
INFO Add: 6.149.050.000 Data (Batch: 49.358 / Avg: 86.915)
...
INFO Elapsed: 90.976,65 seconds [2020/05/11 12:13:54 MESZ]
INFO Add: 7.674.050.000 Data (Batch: 12.124 / Avg: 84.348)
...
INFO Elapsed: 96.770,91 seconds [2020/05/11 13:50:29 MESZ]
INFO Add: 7.979.050.000 Data (Batch: 51.334 / Avg: 82.452)
org.apache.jena.atlas.AtlasException: java.io.IOException: No space left on device
So the first attempt failed in the "data" phase after some 80% of the data had been loaded, with the hard disk full. Unfortunately only 1.2 TB of the 2 TB disk had been available.
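A pre-flight free-space check would have caught this before days of loading. A minimal sketch (the required size has to be estimated; based on this failure it is well over 1.2 TB, so e.g. 2000 GB for a full latest-all import):

```shell
# fail early unless the target volume has enough free space
require_space() {
  local dir="$1" need_gb="$2"
  local avail_kb
  # df -Pk prints the available 1K-blocks in column 4 (POSIX portable)
  avail_kb=$(df -Pk "$dir" | awk 'NR==2 { print $4 }')
  if [ "$avail_kb" -lt $((need_gb * 1024 * 1024)) ]; then
    echo "only $((avail_kb / 1024 / 1024)) GB free on $dir, need ${need_gb} GB" >&2
    return 1
  fi
}
# e.g. in wikidata2jena before loaddata: require_space "$data" 2000 || exit 1
require_space . 0 && echo "space check passed"
```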
wget -c http://mirror.easyname.ch/apache/jena/binaries/apache-jena-3.14.0.tar.gz
--2020-05-10 10:26:38-- http://mirror.easyname.ch/apache/jena/binaries/apache-jena-3.14.0.tar.gz
Resolving mirror.easyname.ch (mirror.easyname.ch)... 77.244.244.134
Connecting to mirror.easyname.ch (mirror.easyname.ch)|77.244.244.134|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20174193 (19M) [application/x-gzip]
Saving to: ‘apache-jena-3.14.0.tar.gz’
apache-jena-3.14.0. 100%[===================>] 19.24M 2.58MB/s in 7.4s
2020-05-10 10:26:45 (2.58 MB/s) - ‘apache-jena-3.14.0.tar.gz’ saved [20174193/20174193]
tar -xvzf apache-jena-3.14.0.tar.gz
#!/bin/bash
# WF 2020-05-10
# global settings
jena=apache-jena-3.15.0
tgz=$jena.tar.gz
jenaurl=http://mirror.easyname.ch/apache/jena/binaries/$tgz
base=/Volumes/owei/wikidata
data=$base/data
tdbloader=$jena/bin/tdb2.tdbloader
getjena() {
# download
if [ ! -f $tgz ]
then
echo "downloading $tgz from $jenaurl"
wget $jenaurl
else
echo "$tgz already downloaded"
fi
# unpack
if [ ! -d $jena ]
then
echo "unpacking $jena from $tgz"
tar xvzf $tgz
else
echo "$jena already unpacked"
fi
# create data directory
if [ ! -d $data ]
then
echo "creating $data directory"
mkdir -p $data
else
echo "$data directory already created"
fi
}
#
# show the given timestamp
#
timestamp() {
local msg="$1"
local ts=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
echo "$msg at $ts"
}
#
# load data for the given data dir and input
#
loaddata() {
local data="$1"
local input="$2"
timestamp "start loading $input to $data"
$tdbloader --loader=parallel --loc "$data" "$input" > tdb2-out.log 2> tdb2-err.log
timestamp "finished loading $input to $data"
}
getjena
export TMPDIR=$base/tmp
if [ ! -d $TMPDIR ]
then
echo "creating temporary directory $TMPDIR"
mkdir $TMPDIR
else
echo "using temporary directory $TMPDIR"
fi
loaddata $data latest-all.nt
#!/bin/bash
# WF 2020-05-10
# global settings
jena=apache-jena-3.14.0
tgz=$jena.tar.gz
jenaurl=http://mirror.easyname.ch/apache/jena/binaries/$tgz
data=data
tdbloader=$jena/bin/tdbloader2
getjena() {
# download
if [ ! -f $tgz ]
then
echo "downloading $tgz from $jenaurl"
wget $jenaurl
else
echo "$tgz already downloaded"
fi
# unpack
if [ ! -d $jena ]
then
echo "unpacking $jena from $tgz"
tar xvzf $tgz
else
echo "$jena already unpacked"
fi
# create data directory
if [ ! -d $data ]
then
echo "creating $data directory"
mkdir $data
else
echo "$data directory already created"
fi
}
#
# show the given timestamp
#
timestamp() {
local msg="$1"
local ts=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
echo "$msg at $ts"
}
#
# load data for the given phase
#
loaddata4phase() {
local phase="$1"
local data="$2"
local input="$3"
timestamp "started load phase $phase"
$tdbloader --phase $phase --loc "$data" "$input" > tdb-$phase-out.log 2> tdb-$phase-err.log
timestamp "finished load phase $phase"
}
#
# load data for the given data dir and input
#
loaddata() {
local data="$1"
local input="$2"
loaddata4phase data "$data" "$input"
loaddata4phase index "$data" ""
}
getjena
wd=$(pwd)
export TMPDIR=$wd/tmp
if [ ! -d $TMPDIR ]
then
echo "creating temporary directory $TMPDIR"
mkdir $TMPDIR
else
echo "using temporary directory $TMPDIR"
fi
loaddata $data latest-all.ttl
See
git clone --recurse-submodules https://gerrit.wikimedia.org/r/wikidata/query/rdf wikidata-query-rdf
Cloning into 'wikidata-query-rdf'...
remote: Counting objects: 22, done
remote: Total 16238 (delta 0), reused 16238 (delta 0)
Receiving objects: 100% (16238/16238), 2.75 MiB | 2.31 MiB/s, done.
Resolving deltas: 100% (8067/8067), done.
cd wikidata-query-rdf
$ mvn package
...
[INFO] Results:
[INFO]
[WARNING] Tests run: 181, Failures: 0, Errors: 0, Skipped: 1
[INFO]
...
[INFO] Reactor Summary for Wikidata Query Service 0.3.36-SNAPSHOT:
[INFO]
[INFO] Wikidata Query Service ............................. SUCCESS [ 13.905 s]
[INFO] Shared code ........................................ SUCCESS [ 16.985 s]
[INFO] Wikidata Query RDF Testing Tools ................... SUCCESS [ 10.553 s]
[INFO] Blazegraph extension to improve performance for Wikibase SUCCESS [01:37 min]
[INFO] Blazegraph Service Package ......................... SUCCESS [01:01 min]
[INFO] Wikidata Query RDF Tools ........................... SUCCESS [ 48.598 s]
[INFO] Wikidata Query Service Streaming Updater ........... FAILURE [02:14 min]
[INFO] rdf-spark-tools .................................... SKIPPED
[INFO] Wikibase RDF Query Service ......................... SKIPPED
dyld: lazy symbol binding failed: Symbol not found: ____chkstk_darwin
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 06:26 min
[INFO] Finished at: 2020-06-09T06:52:22+02:00
[INFO] -----------------------------------------------------------------------
The tests seem to fail due to a binary incompatibility with Mac OS 10.13.6
mvn package -DskipTests
[INFO] Reactor Summary for Wikidata Query Service 0.3.36-SNAPSHOT:
[INFO]
[INFO] Wikidata Query Service ............................. SUCCESS [ 2.340 s]
[INFO] Shared code ........................................ SUCCESS [ 1.545 s]
[INFO] Wikidata Query RDF Testing Tools ................... SUCCESS [ 0.412 s]
[INFO] Blazegraph extension to improve performance for Wikibase SUCCESS [ 5.488 s]
[INFO] Blazegraph Service Package ......................... SUCCESS [ 2.184 s]
[INFO] Wikidata Query RDF Tools ........................... SUCCESS [ 14.088 s]
[INFO] Wikidata Query Service Streaming Updater ........... SUCCESS [ 32.272 s]
[INFO] rdf-spark-tools .................................... SUCCESS [02:55 min]
[INFO] Wikibase RDF Query Service ......................... SUCCESS [ 21.837 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 04:16 min
[INFO] Finished at: 2020-06-09T07:14:45+02:00
[INFO] ------------------------------------------------------------------------