Difference between revisions of "SPARQL"
Line 651: | Line 651: | ||
The Southwest corner of Scotland would in fact be in Ireland if you try to create a rectangle so [https://en.wikipedia.org/wiki/Sligo in the western province Connacht of Ireland might qualify] https://www.wikidata.org/wiki/Q190002 | The Southwest corner of Scotland would in fact be in Ireland if you try to create a rectangle so [https://en.wikipedia.org/wiki/Sligo in the western province Connacht of Ireland might qualify] https://www.wikidata.org/wiki/Q190002 | ||
− | Wikidata query may use the wikibase box service to filter coordinates that lie within a box. We are using this to find distilleries, where the location is in this box. | + | A Wikidata query may use the wikibase box service to filter coordinates that lie within a box. We are using this to find distilleries, where the location is in this box. |
Given that the south west corner of the box is in Ireland we might catch some Irish distilleries but as of 2018-01 we seem to be lucky :-) | Given that the south west corner of the box is in Ireland we might catch some Irish distilleries but as of 2018-01 we seem to be lucky :-) |
Revision as of 23:09, 5 January 2018
Content
What is SPARQL
SPARQL is a query language for semantic databases using the Resource Description Framework (RDF) format
Tutorial
There are quite a few tutorials out there for SPARQL e.g.
The W3C tutorial is somewhat outdated and mostly didn't work for me. The Apache Jena tutorial mostly also works with the Blazegraph database which we'll use in this tutorial. Just the output looks different and some examples won't work as shown in the tutorial.
This tutorial is for people which are new to semantic concepts but would like to use an example with a fair amount of data but not too much of complexity in the structure of the data.
Semantic Concepts
Personally I learned Semantic Concepts using Semantic MediaWiki see
When using SPARQL a tutorial needs to get a slightly different touch, so for those who know the talk above I'll explain some key concepts based on an example using:
- Countries
- Towns
- Municipal Units
Triples
A semantic statement has the form
<subject> <predicate> <object>
e.g.
Dubai is-located-in AE
is such a semantic statement which is also called a Triple.
The natural language statement "Dubai is located in United Arab Emirates" is purposely slightly modified to a more "computer-ready" form. The predicate has been written as is-located-in to make it a proper Identifier. The country-name "United Arab Emirates" has been replaced by its two letter United Nations Location Code AE. A triple like this has a natural graph representation:
TripleStore
A Triplestore is a database that can store and query triples. In fact for educational purposes I have written a simple Triplestore myself:
For that simple triplestore the triples are supplied in Simple Data Interchange Format. Again that format is mostly for educational purposes although it can also be used for small usecases with just a few thousand triples. Please also note that there is no SPARQL support in that project.
For more than a non-educational use a Triplestore is needed that can handle larger amounts of data and support SPARQL. The Wikipedia List of Subject-Predicate-Object Databases shows you some options. For this tutorial we'll use Blazegraph.
Setting up the Blazegraph Triple Store
You need Java to be installed on you machine.
Download the blazegraph.jar file from https://www.blazegraph.com/download/ and start it with
java -jar blazegraph.jar
In fact it's better if you start the jar file with an option to allow bigger xml files to be handled:
java -Djdk.xml.entityExpansionLimit=0 -jar blazegraph.jar
otherwise you might later run into the error:
org.openrdf.rio.RDFParseException: JAXP00010001: The parser has encountered more than "64000" entity expansions in this document; this is the limit imposed by the JDK
you should see
Welcome to the Blazegraph(tm) Database. Go to http://localhost:9999/blazegraph/ to get started.
And you might want to do just that and click that link.
Where Blazegraph stores it's data
The default setting for Blazegraphs journal file is to use blazegraph.jnl in the directory where you started the jar file. On my Mac OS Laptop the initial file size is some 200 MBytes.
ls -l blazegraph.jnl
-rw-r--r-- 1 wf staff 209715200 4 Jan 11:50 blazegraph.jnl
The Blazegraph Web UI
The Web-UI shows the Tabs:
- WELCOME
- QUERY
- UPDATE
- EXPLORE
- NAMESPACES
- STATUS
- PERFORMANCE
Let's start with the UPDATE tab to load some sample data.
The sample Data
The human readable form of some of our sample data and their description is available at:
- https://en.wikipedia.org/wiki/UN/LOCODE
- http://www.unece.org/cefact/locode/service/location
- https://www.unece.org/fileadmin/DAM/cefact/locode/ae.htm
RDF Version of the data
- https://old.datahub.io/dataset/rkb-explorer-unlocode
- http://unlocode.rkbexplorer.com/models/dump.tgz
You might want to download and unzip http://unlocode.rkbexplorer.com/models/dump.tgz. The result should be a directory with the following content:
pan:models wf$ls -l
total 31832
-rw-r--r-- 1 wf staff 265 4 Jan 07:27 catalog-v001.xml
-rw-r--r--@ 1 wf staff 42194 18 Feb 2009 unlocode-countries.rdf
-rw-r--r--@ 1 wf staff 228389 18 Feb 2009 unlocode-municipalunits.rdf
-rw-r--r--@ 1 wf staff 16017733 18 Feb 2009 unlocode-towns.rdf
Now drag and drop the three files:
- unlocode-countries.rdf
- unlocode-municipalunits.rdf
- unlocode-towns.rdf
one after another into the field with the text
(Type in or drag a file containing RDF data, ...
and click the update button below the field after each drag&drop operation. The output will be
Modified: 484 Milliseconds: 430 Modified: 1917 Milliseconds: ... Running update: 287 Modified: 239567 Milliseconds: 2260
The Milliseconds may vary on your machine. If you run into the 64000 entity limit you may need to restart your blazegraph.jar file with the Java VM options outlined above.
Structure of the Example Data
The following diagram shows the structure of the Example Data. Please note the use of UML diagrams which are usually not used to show Ontology structures.
UML Diagram
Example Data
Direct Town - Country Link
Two step - Town - Municipal-Unit - Country
SPARQL Queries
Select all Triples
Now our environment should be ready to hit the "QUERY" tab and enter our first SPARQL query. You might want to simply cut&paste the code from the SPARQL Query descriptions for each example below into the field with the text
(Input a SPARQL query)
and then hit the "Execute" Button below this field.
SPARQL Query
SELECT *
WHERE {
?subject ?predicate ?object
}
Result
Query running ...
Will be visible shortly than you'll see the result table, which will have total results of 242375 triples, displaying the first 50:
subject predicate object <http://unlocode.rkbexplorer.com/id/AEDHF> <http://www.aktors.org/ontology/portal#has-longitude> 54.5333333 <http://unlocode.rkbexplorer.com/id/AEDHF> <http://www.aktors.org/ontology/portal#is-located-in> <http://unlocode.rkbexplorer.com/id/AE> <http://unlocode.rkbexplorer.com/id/AEDHF> <http://www.aktors.org/ontology/support#has-pretty-name> Al Dhafra <http://unlocode.rkbexplorer.com/id/AEDHF> rdf:type <http://www.aktors.org/ontology/portal#Town> <http://unlocode.rkbexplorer.com/id/AEDUY> <http://www.aktors.org/ontology/portal#has-latitude> 25.7780637 <http://unlocode.rkbexplorer.com/id/AEDUY> <http://www.aktors.org/ontology/portal#has-longitude> 55.9310912 <http://unlocode.rkbexplorer.com/id/AEDUY> <http://www.aktors.org/ontology/portal#is-located-in> <http://unlocode.rkbexplorer.com/id/AE> <http://unlocode.rkbexplorer.com/id/AEDUY> <http://www.aktors.org/ontology/support#has-pretty-name> Ras Zubbaya (Ras Dubayyah) <http://unlocode.rkbexplorer.com/id/AEDUY> rdf:type <http://www.aktors.org/ontology/portal#Town> <http://unlocode.rkbexplorer.com/id/AEDXB> <http://www.aktors.org/ontology/portal#has-latitude> 25.2500000 <http://unlocode.rkbexplorer.com/id/AEDXB> <http://www.aktors.org/ontology/portal#has-longitude> 55.2666666 <http://unlocode.rkbexplorer.com/id/AEDXB> <http://www.aktors.org/ontology/portal#is-located-in> <http://unlocode.rkbexplorer.com/id/AE> <http://unlocode.rkbexplorer.com/id/AEDXB> <http://www.aktors.org/ontology/support#has-pretty-name> Dubai <http://unlocode.rkbexplorer.com/id/AEDXB> rdf:type <http://www.aktors.org/ontology/portal#Town>
Explanation
SELECT *
asked for a selection
WHERE {
?subject ?predicate ?object
}
specified a condition. Since we used question marks for the three triple parts we made all three parts of the triple variable so any/each triple in the database will fulfill the condition.
The query shows all triples you uploaded from the RDF files "as is".
Now you can see that RDF unlike SiDiF mostly uses lenghty URLs to express things. So the Triple for Dubai being in AE gets to be:
<http://unlocode.rkbexplorer.com/id/AEDXB> <http://www.aktors.org/ontology/portal#is-located-in> <http://unlocode.rkbexplorer.com/id/AE>
And there are multiple triples for the subject <http://unlocode.rkbexplorer.com/id/AEDXB> So lets select only those.
Select by subject
SPARQL Query
SELECT *
WHERE {
<http://unlocode.rkbexplorer.com/id/AEDXB> ?predicate ?object
}
Result
We get 5 results:
predicate object <http://www.aktors.org/ontology/portal#has-latitude> 25.2500000 <http://www.aktors.org/ontology/portal#has-longitude> 55.2666666 <http://www.aktors.org/ontology/portal#is-located-in> <http://unlocode.rkbexplorer.com/id/AE> <http://www.aktors.org/ontology/support#has-pretty-name> Dubai rdf:type <http://www.aktors.org/ontology/portal#Town>
Explanation
This time the subject in the condition was not variable anymore but fixed to <http://unlocode.rkbexplorer.com/id/AEDXB>. We already new that such a subject existed from the query that selected all triples.
Select multiple predicates of one subject in one query
SPARQL Query
SELECT ?lat ?lon
WHERE {
<http://unlocode.rkbexplorer.com/id/AEDXB> <http://www.aktors.org/ontology/portal#has-latitude> ?lat.
<http://unlocode.rkbexplorer.com/id/AEDXB> <http://www.aktors.org/ontology/portal#has-longitude> ?lon.
}
result
We get one result.
lat lon 25.2500000 55.2666666
explanation
Instead of the asterisk * we had used for the SELECT so far this time we specified two variables:
SELECT ?lat ?lon
and these where used for two conditions:
<http://unlocode.rkbexplorer.com/id/AEDXB> <http://www.aktors.org/ontology/portal#has-latitude> ?lat.
<http://unlocode.rkbexplorer.com/id/AEDXB> <http://www.aktors.org/ontology/portal#has-longitude> ?lon.
keeping the subject fixed at <http://unlocode.rkbexplorer.com/id/AEDXB> but varying the predicate:
- <http://www.aktors.org/ontology/portal#has-latitude> assigning the result to the variable lat
- <http://www.aktors.org/ontology/portal#has-longitude> assigning the result to the variable lon
Using prefixes
SPARQL Query
PREFIX unlocode:<http://unlocode.rkbexplorer.com/id/>
SELECT *
WHERE {
unlocode:AEDXB ?predicate ?object
}
result
We get five results again
predicate object <http://www.aktors.org/ontology/portal#has-latitude> 25.2500000 <http://www.aktors.org/ontology/portal#has-longitude> 55.2666666 <http://www.aktors.org/ontology/portal#is-located-in> <http://unlocode.rkbexplorer.com/id/AE> <http://www.aktors.org/ontology/support#has-pretty-name> Dubai rdf:type <http://www.aktors.org/ontology/portal#Town>
explanation
The Prefix specification:
PREFIX unlocode:<http://unlocode.rkbexplorer.com/id/>
will replace each prefix use of:
unlocode:
with the specified url and then appending the value after the colon
AEXDB
so
unlocode:AEXDB
is as if we had written:
<http://unlocode.rkbexplorer.com/id/AEXDB>
Using PREFIX is very useful to make your SPARQL queries a lot more readable.
Specifying the type and selecting values
SPARQL Query
PREFIX unlocode:<http://unlocode.rkbexplorer.com/id/>
PREFIX portal: <http://www.aktors.org/ontology/portal#>
PREFIX support: <http://www.aktors.org/ontology/support#>
SELECT *
WHERE {
?subject rdf:type portal:Town.
?subject portal:has-latitude ?lat.
?subject portal:has-longitude ?lon.
?subject portal:is-located-in ?locatedIn.
?subject support:has-pretty-name ?name.
}
result
We get 48517 results
predicate object subject lat lon locatedIn name <http://unlocode.rkbexplorer.com/id/ARANA> -28.4666666 -62.8333333 <http://unlocode.rkbexplorer.com/id/AR-G> Anatuya <http://unlocode.rkbexplorer.com/id/ARAND> -27.6000000 -66.3166666 <http://unlocode.rkbexplorer.com/id/AR-K> Andalgala <http://unlocode.rkbexplorer.com/id/ARCCP> -27.3333333 -65.5833333 <http://unlocode.rkbexplorer.com/id/AR-T> Conception <http://unlocode.rkbexplorer.com/id/ARCTC> -26.8265906 -65.2203670 <http://unlocode.rkbexplorer.com/id/AR-K> Catamarca <http://unlocode.rkbexplorer.com/id/ARELB> -27.9166666 -65.8833333 <http://unlocode.rkbexplorer.com/id/AR-K> El Bolson ...
explanation
The condition
?subject rdf:type portal:Town.
made sure we only get the triples for subject of the rdf type "Town". Basically this is the set of triples that we imported from unlcode-towns.rdf in the first place.
LIMIT, ORDER and "a" keyword"
SPARQL Query
PREFIX unlocode:<http://unlocode.rkbexplorer.com/id/>
PREFIX portal: <http://www.aktors.org/ontology/portal#>
PREFIX support: <http://www.aktors.org/ontology/support#>
SELECT *
WHERE {
?subject a portal:Town.
?subject portal:has-latitude ?lat.
?subject portal:has-longitude ?lon.
?subject portal:is-located-in ?locatedIn.
?subject support:has-pretty-name ?name.
}
ORDER BY ?name
LIMIT 10
result
We get 10 results
<http://unlocode.rkbexplorer.com/id/TOEUA> -21.3666666 -174.9333333 <http://unlocode.rkbexplorer.com/id/TO> 'Eua Island <http://unlocode.rkbexplorer.com/id/BEGVO> 50.7500000 5.7500000 <http://unlocode.rkbexplorer.com/id/BE-VLI> 's Gravenvoeren <http://unlocode.rkbexplorer.com/id/NLSGL> 52.2333333 5.1166666 <http://unlocode.rkbexplorer.com/id/NL> 's-Graveland <http://unlocode.rkbexplorer.com/id/NLGRA> 51.7833333 4.6166666 <http://unlocode.rkbexplorer.com/id/NL> 's-Gravendeel <http://unlocode.rkbexplorer.com/id/NLHAG> 52.0833333 4.3000000 <http://unlocode.rkbexplorer.com/id/NL> 's-Gravenhage (Den Haag) <http://unlocode.rkbexplorer.com/id/NLSGM> 51.6666666 4.8000000 <http://unlocode.rkbexplorer.com/id/NL> 's-Gravenmoer <http://unlocode.rkbexplorer.com/id/NLSGP> 51.4500000 3.9000000 <http://unlocode.rkbexplorer.com/id/NL> 's-Gravenpolder <http://unlocode.rkbexplorer.com/id/BEGWE> 51.2666666 4.5500000 <http://unlocode.rkbexplorer.com/id/BE-VAN> 's-Gravenwezel <http://unlocode.rkbexplorer.com/id/NLGRZ> 52.0000000 4.1666666 <http://unlocode.rkbexplorer.com/id/NL> 's-Gravenzande <http://unlocode.rkbexplorer.com/id/NLSHB> 52.5333333 6.0166666 <http://unlocode.rkbexplorer.com/id/NL> 's-Heerenbroek
explanation
?subject a portal:Town.
is a shortcut for
?subject rdf:type portal:Town.
ORDER BY ?name
sorts the result by the support-has-pretty-name predicate that we put in the ?name variable
LIMIT 10
Sets the maximum amount of results we get to 10.
lookup relation information
SPARQL Query
PREFIX unlocode:<http://unlocode.rkbexplorer.com/id/>
PREFIX portal: <http://www.aktors.org/ontology/portal#>
PREFIX support: <http://www.aktors.org/ontology/support#>
SELECT *
WHERE {
?town a portal:Town.
?town support:has-pretty-name ?townname.
?town portal:has-latitude ?lat.
?town portal:has-longitude ?lon.
?town portal:is-located-in ?loc.
?loc support:has-pretty-name ?locname.
}
ORDER by ?locname ?townname
LIMIT 7
result
town townname lat lon loc locname <http://unlocode.rkbexplorer.com/id/AFBAG> Bagram 34.9500000 69.2500000 <http://unlocode.rkbexplorer.com/id/AF> AFGHANISTAN <http://unlocode.rkbexplorer.com/id/AFBIN> Bamian 34.8200748 67.8112337 <http://unlocode.rkbexplorer.com/id/AF> AFGHANISTAN <http://unlocode.rkbexplorer.com/id/AFBST> Bost 31.5832023 64.3602926 <http://unlocode.rkbexplorer.com/id/AF> AFGHANISTAN <http://unlocode.rkbexplorer.com/id/AFCCN> Chakcharan 33.6500012 62.3166972 <http://unlocode.rkbexplorer.com/id/AF> AFGHANISTAN <http://unlocode.rkbexplorer.com/id/AFDAZ> Darwaz 31.8372189 67.7843650 <http://unlocode.rkbexplorer.com/id/AF> AFGHANISTAN <http://unlocode.rkbexplorer.com/id/AFISQ> Eslam Qal'eh 34.6666666 61.0666666 <http://unlocode.rkbexplorer.com/id/AF> AFGHANISTAN <http://unlocode.rkbexplorer.com/id/AFFBD> Faizabad 37.1214418 70.5785208 <http://unlocode.rkbexplorer.com/id/AF> AFGHANISTAN
explanation
The information in the ?loc variable is taken to lookup the has-pretty-name information of the country code "pointer" supplied.
double lookup of relation information with FILTER
SPARQL query
PREFIX unlocode:<http://unlocode.rkbexplorer.com/id/>
PREFIX portal: <http://www.aktors.org/ontology/portal#>
PREFIX support: <http://www.aktors.org/ontology/support#>
SELECT ?townname ?lat ?lon ?regionname ?countryname
WHERE {
?town a portal:Town.
?town support:has-pretty-name ?townname.
?town portal:has-latitude ?lat.
?town portal:has-longitude ?lon.
?town portal:is-located-in ?region.
?region support:has-pretty-name ?regionname.
?region portal:is-part-of ?country.
?country support:has-pretty-name ?countryname.
FILTER regex(?regionname,"bayern","i")
}
ORDER by ?townname
LIMIT 7
result
townname lat lon regionname countryname Abensberg 48.8157530 11.8473832 Bayern GERMANY Abtswind 49.7707076 10.3745558 Bayern GERMANY Adelsdorf 49.7000000 10.8833333 Bayern GERMANY Affalterbach 49.6166666 11.2000000 Bayern GERMANY Ahorn 50.2166666 10.9333333 Bayern GERMANY Aichach 48.4579674 11.1301884 Bayern GERMANY Ainring 47.8166666 12.9333333 Bayern GERMANY
explanation
This time we have left out the linking variables and only selected the has-pretty-name lookup results. Also we have filtered the result with a regular expression that matches any region that has "bayern" in it's pretty name. "i" is regular expression option to ignore upper and lower case letters.
Group by
PREFIX unlocode:<http://unlocode.rkbexplorer.com/id/>
PREFIX portal: <http://www.aktors.org/ontology/portal#>
PREFIX support: <http://www.aktors.org/ontology/support#>
SELECT ?type (COUNT(?type) AS ?typecount)
WHERE {
?subject a ?type.
}
GROUP by ?type
result
type typecount <http://www.aktors.org/ontology/portal#Town> 47998 <http://www.aktors.org/ontology/portal#Country> 237 <http://www.aktors.org/ontology/portal#Municipal-Unit> 969 owl:Ontology 2
explanation
SELECT ?type (COUNT(?type) AS ?typecount)
Is an aggregate selection together with a corresponding:
GROUP by ?type
If you do not specify the GROUP by clause you get a
MalformedQueryException: Bad aggregate
WikiData
The WikiData Project offers a SPARQL based query service at https://query.wikidata.org/
See the Concise list of SPARQL query examples for a multitude of further examples.
Query Examples
Whisky Distilleries worldwide
SPARQL Query
#defaultView:Map
SELECT ?item ?coord
WHERE
{
?item wdt:P31 wd:Q10373548.
?item wdt:P625 ?coord.
}
result
explanation
#defaultView:Map
A comment that instructs the results of the query to be shown as a map by the Wikidata Query service
?item wdt:P31 wd:Q10373548.
- the predicate is https://www.wikidata.org/wiki/Property:P31 (instance of)
- the object is https://www.wikidata.org/wiki/Q10373548 (whisky distillery)
Whisky Distilleries in Scotland
SPARQL Query
#defaultView:Map
SELECT ?item ?coord
WHERE
{
?item wdt:P31 wd:Q10373548.
?item wdt:P131 ?located_in .
?located_in wdt:P131* wd:Q22.
?item wdt:P625 ?coord.
}
result
explanation
The condition
?located_in wdt:P131* wd:Q22.
is a transitive link condition. The asterisk symbol "*" states that the property https://www.wikidata.org/wiki/Property:P131 (located in the administrative territorial entity) should eventually lead to scotland. As of 2018-01 the map shows less entries then in the previous query since some distilleries do not have a link-chain via https://www.wikidata.org/wiki/Property:P131 (located in the administrative territorial entity) to the entity https://www.wikidata.org/wiki/Q22 (Scotland).
As an example look at https://www.wikidata.org/wiki/Q49646 (Jura distillery). It has a statement https://www.wikidata.org/wiki/Property:P131 (located in the administrative territorial entity) linking to https://www.wikidata.org/wiki/Q111509 (Jura - island in the inner Hebrides of Scotland). Another https://www.wikidata.org/wiki/Property:P131 (located in the administrative territorial entity) links to https://www.wikidata.org/wiki/Q202174 (Argyll and Bute - unitary authority council area in Scotland) which eventually links to https://www.wikidata.org/wiki/Q22 (Scotland - country in North-West Europe, part of the United Kingdom)
Link-Chain
The following graph explains the link chain from the Jura distillery Entity to the Scotland Entity. The nodes are clickable.
The link from https://www.wikidata.org/wiki/Q111509 to https://www.wikidata.org/wiki/Q202174 was added manually by me on 2018-01-04. Quite a few other links to make the query work are still missing.
Whisky Distilleries in Scotland by geographic coordinates
SPARQL Query
# Whisky distilleries in the box between
# Hardoldswick / Unst / Shetland Islands
# and Sligo Northisland which is a rough estimate of a scotland bounding box
# and might capture some Irish distilleries as well
#defaultView:Map
SELECT ?item ?coord
WHERE
{
wd:Q2642797 wdt:P625 ?NECorner .
wd:Q190002 wdt:P625 ?SWCorner .
?item wdt:P31 wd:Q10373548.
SERVICE wikibase:box {
?item wdt:P625 ?coord .
bd:serviceParam wikibase:cornerSouthWest ?SWCorner .
bd:serviceParam wikibase:cornerNorthEast ?NECorner .
}
}
result
explanation
The Northeast corner of Scotland might be defined by Haroldswick on Unst, Shettland Islands https://www.wikidata.org/wiki/Q2642797
The Southwest corner of Scotland would in fact be in Ireland if you try to create a rectangle so in the western province Connacht of Ireland might qualify https://www.wikidata.org/wiki/Q190002
A Wikidata query may use the wikibase box service to filter coordinates that lie within a box. We are using this to find distilleries, where the location is in this box.
Given that the south west corner of the box is in Ireland we might catch some Irish distilleries but as of 2018-01 we seem to be lucky :-)