= Why would you want your own Wikidata copy? =
The resources behind https://query.wikidata.org/ are scarce and shared by a lot of people, so you might hit the [https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/query_limits query limits] quite quickly.
See {{Link|target=SPARQL}} for some examples that work online (mostly) without hitting these limits.

== What are alternative endpoints? ==
* [https://qlever.cs.uni-freiburg.de/wikidata QLever Uni Freiburg CS]
* [http://qlever.wikidata.dbis.rwth-aachen.de/wikidata QLever RWTH Aachen i5]
* [https://wikidata.demo.openlinksw.com/sparql Virtuoso OpenLink Software]

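All of these endpoints speak the standard SPARQL protocol, so you can also use them (or your own local copy) from scripts. Below is a minimal sketch in Python using the requests library; the endpoint URL is an assumption and should be replaced with the API URL of the service or local instance you actually use.

<syntaxhighlight lang="python">
import requests

# assumption: replace with the SPARQL API URL of your chosen endpoint or local copy
ENDPOINT = "https://qlever.cs.uni-freiburg.de/api/wikidata"

# a deliberately small query: five instances of "house cat" (Q146)
QUERY = """
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?item WHERE { ?item wdt:P31 wd:Q146 } LIMIT 5
"""

response = requests.get(
    ENDPOINT,
    params={"query": QUERY},
    headers={"Accept": "application/sparql-results+json"},
    timeout=60,
)
response.raise_for_status()
for binding in response.json()["results"]["bindings"]:
    print(binding["item"]["value"])
</syntaxhighlight>
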
= Prerequisites =
Getting a copy of Wikidata is not for the faint of heart.
You need quite a bit of patience and some hardware resources to get your own Wikidata copy working.
The resources you need are a moving target since Wikidata is growing all the time.

On the other hand, solutions such as {{Link|target=https://github.com/ad-freiburg/qlever/wiki/Using-QLever-for-Wikidata|title=QLever by Hannah Bast}} are making progress, so you can run your own copy of Wikidata on commodity hardware: with an AMD Ryzen 9 16-core processor, 32 GB of RAM and a 2 TB SSD, the indexing takes less than 5 hours. Together with a download time of some 6 hours, that is less than half a day to get a current copy, so a daily update is feasible these days.
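The dump download itself is easy to script. Here is a minimal sketch that streams the compressed dump to disk; the URL assumes the standard Wikimedia dump location for the full RDF dump (see the RDF dumps link in the Links section below for current file names).

<syntaxhighlight lang="python">
import requests

# assumption: standard Wikimedia dump location for the full RDF dump;
# check the "RDF dumps" link in the Links section for current file names
DUMP_URL = "https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2"

with requests.get(DUMP_URL, stream=True, timeout=60) as response:
    response.raise_for_status()
    total = 0
    with open("latest-all.ttl.bz2", "wb") as dump_file:
        # stream in 16 MiB chunks so the very large dump never has to fit in RAM
        for chunk in response.iter_content(chunk_size=16 * 1024 * 1024):
            dump_file.write(chunk)
            total += len(chunk)
            print(f"\r{total / 1e9:.1f} GB downloaded", end="")
print()
</syntaxhighlight>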

= Papers =
[http://ceurspt.wikidata.dbis.rwth-aachen.de/Vol-3262/paper9.html Getting and hosting your own copy of Wikidata. Proceedings of the 3rd Wikidata Workshop 2022, co-located with the 21st International Semantic Web Conference (ISWC 2022), CEUR-WS Volume 3262]
= Successes =
Please contact [https://scholia.toolforge.org/author/Q110462723 Wolfgang Fahl] if you'd like to see your own success report added to the "Reports" table below.
The imports table is generated from the documentation of our own Wikidata import trials in this Semantic MediaWiki.
== Reports ==
{| class="wikitable"
|-
! Date !! Source !! Target !! Triples !! Duration !! RAM (GB) !! CPU Cores !! Speed !! Link
|-
| 2024-01-29 || latest-all.nt (2024-01-29) || [https://qlever.cs.uni-freiburg.de/wikidata QLever] || 19.1 billion || 4.5 hours || 32 || 16 || AMD Ryzen 9 || {{Link|target=https://github.com/ad-freiburg/qlever/wiki/Using-QLever-for-Wikidata|title=Hannah Bast - QLever}}
|-
| 2023-01 || [https://harej.co/posts/2023/01/loading-wikidata-into-different-graph-databases-blazegraph-qlever/ James Hare - Blazegraph & QLever] || Blazegraph & QLever || ? || ? || 384 || ? || || [https://github.com/scatter-llc/private-wikidata-query James Hare Scatter LLC]
|-
| 2022-07 || latest-all.ttl (2022-07-12) || [https://www.stardog.com stardog] || 17.2 billion || 1 d 19 h || 253 || || || {{Link|target=WikiData_Import_2022-07-12|title=Tim Holzheim - BITPlan Wiki}}
|-
| 2022-06 || latest-all.nt (2022-06-25) || [https://qlever.cs.uni-freiburg.de/wikidata QLever] || 17.2 billion || 1 d 2 h || 128 || 8 || 1.8 GHz || Wolfgang Fahl - BITPlan Wiki
|-
| 2022-05 || latest-all.ttl.bz2 (2022-05-29) || [https://qlever.cs.uni-freiburg.de/wikidata QLever] || ~17 billion || 14 h || 128 || 12/24 || 4.8 GHz boost || Hannah Bast - QLever
|-
| 2022-02 || latest-all.nt (2022-02) || [https://www.stardog.com stardog] || 16.7 billion || 9 h || || || || Evren Sirin - stardog
|-
| 2022-02 || latest-all.nt (2022-01-29) || [https://qlever.cs.uni-freiburg.de/wikidata QLever] || 16.9 billion || 4 d 2 h || 127 || 8 || 1.8 GHz || Wolfgang Fahl - BITPlan Wiki
|-
| 2020-08 || latest-all.nt (2020-08-15) || Apache Jena || 13.8 billion || 9 d 21 h || 64 || || || Wolfgang Fahl - BITPlan Wiki
|-
| 2020-07 || latest-truthy.nt (2020-07-15) || Apache Jena || 5.2 billion || 4 d 14 h || 64 || || || Wolfgang Fahl - BITPlan Wiki
|-
| 2020-06 || latest-all.ttl (2020-04-28) || Apache Jena || 12.9 billion || 6 d 16 h || ? || || || Jonas Sourlier - Jena Issue 1909
|-
| 2020-03 || latest-all.nt.bz2 (2020-03-01) || Virtuoso || ~11.8 billion || 10 hours + 1 day prep || 248 || || || Hugh Williams - Virtuoso
|-
| 2019-10 || || Blazegraph || ~10 billion || 5.5 d || 104 || 16 || || Adam Shoreland Wikimedia Foundation
|-
| 2019-09 || latest-all.ttl (2019-09) || Virtuoso || 9.5 billion || 9.1 hours || ? || || || Adam Sanchez - WikiData mailing list
|-
| 2019-05 || wikidata-20190513-all-BETA.ttl || Virtuoso || ? || 43 hours || ? || || || -
|-
| 2019-05 || wikidata-20190513-all-BETA.ttl || Blazegraph || ? || 10.2 days || || || || Adam Sanchez WikiData mailing list
|-
| 2019-02 || latest-all.ttl.gz || Apache Jena || ? || > 2 days || ? || || || corsin - muncca blog
|-
| 2018-01 || wikidata-20180101-all-BETA.ttl || Blazegraph || 3 billion || 4 days || 32 || 4 || 2.2 GHz || Wolfgang Fahl - BITPlan wiki
|-
| 2017-12 || latest-truthy.nt.gz || Apache Jena || ? || 8 hours || ? || || || [https://lists.apache.org/thread.html/70dde8e3d99ce3d69de613b5013c3f4c583d96161dec494ece49a412%40%3Cusers.jena.apache.org%3E Andy Seaborne Apache Jena Mailinglist]
|}

== Imports on our own hardware ==
{{#ask: [[Concept:Import]]
|mainlabel=Import
|?Import state = state
|?Import url = url
|?Import target = target
|?Import start = start
|?Import end = end
|?Import days = days
|?Import os = os
|?Import cpu = cpu
|?Import ram = ram
|?Import triples = triples
|limit=200
|sort=Import start
|order=desc
}}
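The import records above can also be retrieved programmatically via Semantic MediaWiki's ask API. Below is a minimal sketch, assuming the wiki exposes the standard api.php endpoint; the base URL is a placeholder you need to adapt to the wiki hosting the records.

<syntaxhighlight lang="python">
import requests

# assumption: replace with the api.php URL of the wiki hosting the import records
API_URL = "https://wiki.example.org/api.php"

params = {
    "action": "ask",
    "format": "json",
    # same condition, printouts and parameters as the #ask query above
    "query": (
        "[[Concept:Import]]"
        "|?Import state|?Import url|?Import target"
        "|?Import start|?Import end|?Import days"
        "|?Import os|?Import cpu|?Import ram|?Import triples"
        "|sort=Import start|order=desc|limit=200"
    ),
}

response = requests.get(API_URL, params=params, timeout=30)
response.raise_for_status()
results = response.json()["query"]["results"]
for page, record in results.items():
    print(page, record["printouts"])
</syntaxhighlight>
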
= Links =
* https://github.com/mmayers12/wikidata
* https://www.wikidata.org/wiki/Wikidata:Database_download#RDF_dumps
* https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual
* https://muncca.com/2019/02/14/wikidata-import-in-apache-jena/
* https://users.jena.apache.narkive.com/J1gsFHRk/tdb2-tdbloader-performance
* https://stackoverflow.com/questions/61813248/jena-tdbloader2-performance-and-limits
* https://github.com/maxlath/import-wikidata-dump-to-couchdb
* https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8169477
* https://lists.wikimedia.org/pipermail/wikidata/2019-December/013716.html
* https://akbaritabar.netlify.app/how_to_use_a_wikidata_dump
* https://addshore.com/2019/10/your-own-wikidata-query-service-with-no-limits-part-1/
* https://topicseed.com/blog/importing-wikidata-dumps

= Questions =
* https://opendata.stackexchange.com/questions/107/how-can-i-download-the-complete-wikidata-database
* How do you keep your copy of Wikidata up to date without going through the whole import process again? See https://phabricator.wikimedia.org/T244590 and the sketch after this list.
* https://stackoverflow.com/questions/56769098/bulk-loading-of-wikidata-dump-in-virtuoso
* https://stackoverflow.com/questions/47885637/failed-to-install-wikidata-query-rdf-blazegraph
* https://stackoverflow.com/questions/48020506/wikidata-on-local-blazegraph-expected-an-rdf-value-here-found-line-1/48110100
* https://stackoverflow.com/questions/56768463/wikidata-import-into-virtuoso
* https://stackoverflow.com/questions/14494449/virtuoso-system-requirements
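Regarding the update question above: one common approach, short of a full re-import, is to poll Wikidata's recent changes and re-fetch the RDF of the touched entities via Special:EntityData. The following is a rough sketch of that idea, not a complete updater; how the fetched triples get applied depends on your triple store.

<syntaxhighlight lang="python">
import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

# fetch the most recent changes in the main (item) namespace
params = {
    "action": "query",
    "list": "recentchanges",
    "rcnamespace": 0,
    "rclimit": 50,
    "rcprop": "title|timestamp",
    "format": "json",
}
changes = requests.get(WIKIDATA_API, params=params, timeout=30).json()
changed_ids = {rc["title"] for rc in changes["query"]["recentchanges"]}

for qid in sorted(changed_ids):
    # Special:EntityData serves the current RDF of a single entity
    rdf_url = f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.ttl"
    turtle = requests.get(rdf_url, timeout=30).text
    # here you would delete the entity's old triples and insert the new ones
    # into your local store, e.g. via SPARQL UPDATE if your store supports it
    print(qid, len(turtle), "bytes of Turtle fetched")
</syntaxhighlight>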

[[Category:Wikidata]]