SPARQL

From BITPlan Wiki
Revision as of 13:52, 8 January 2018 by Wf (talk | contribs) (→‎Triples)
Jump to navigation Jump to search

Content

What is SPARQL

SPARQL is a query language for semantic databases using the Resource Description Framework (RDF) format

Tutorial

There are quite a few tutorials out there for SPARQL e.g.

  1. W3C SPARQL By Example
  2. Apache Jena SPARQL

The W3C tutorial is somewhat outdated and mostly didn't work for me. The Apache Jena tutorial mostly also works with the Blazegraph database which we'll use in this tutorial. Just the output looks different and some examples won't work as shown in the tutorial.

This tutorial is for people which are new to semantic concepts but would like to use an example with a fair amount of data but not too much of complexity in the structure of the data.

Semantic Concepts

Personally I learned Semantic Concepts using Semantic MediaWiki see

  1. Wolfgang Fahl's User page at www.semantic-mediawiki.org
  2. Semantic Concepts Talk at SMWCon 2015

When using SPARQL a tutorial needs to get a slightly different touch, so for those who know the talk above I'll explain some key concepts based on an example using:

  1. Countries
  2. Towns
  3. Municipal Units

Triples

A semantic statement has the form

<subject> <predicate> <object>

e.g.

Dubai is-located-in AE

is such a semantic statement which is also called a Triple.

The natural language statement "Dubai is located in United Arab Emirates" is purposely slightly modified to a more "computer-ready" form. The predicate has been written as is-located-in to make it a proper Identifier. The country-name "United Arab Emirates" has been replaced by its two letter United Nations Location Code AE. A triple like this has a natural graph representation:

TripleStore

A Triplestore is a database that can store and query triples. In fact for educational purposes I have written a simple Triplestore myself:

For that simple triplestore the triples are supplied in Simple Data Interchange Format. Again that format is mostly for educational purposes although it can also be used for small usecases with just a few thousand triples. Please also note that there is no SPARQL support in that project.

For more than a non-educational use a Triplestore is needed that can handle larger amounts of data and support SPARQL. The Wikipedia List of Subject-Predicate-Object Databases shows you some options. For this tutorial we'll use Blazegraph.

Setting up the Blazegraph Triple Store

logo.png

You need Java to be installed on you machine.

Download the blazegraph.jar file from https://www.blazegraph.com/download/ and start it with

java -jar blazegraph.jar

In fact it's better if you start the jar file with an option to allow bigger xml files to be handled:

java -Djdk.xml.entityExpansionLimit=0 -jar blazegraph.jar

otherwise you might later run into the error:

org.openrdf.rio.RDFParseException: JAXP00010001: The parser has encountered more than "64000" entity expansions in this document; this is the limit imposed by the JDK

you should see

Welcome to the Blazegraph(tm) Database.

Go to http://localhost:9999/blazegraph/ to get started.

And you might want to do just that and click that link.

Where Blazegraph stores it's data

The default setting for Blazegraphs journal file is to use blazegraph.jnl in the directory where you started the jar file. On my Mac OS Laptop the initial file size is some 200 MBytes.

ls -l blazegraph.jnl 
-rw-r--r--  1 wf  staff  209715200  4 Jan 11:50 blazegraph.jnl

The Blazegraph Web UI

Blazegraph1.png

The Web-UI shows the Tabs:

  1. WELCOME
  2. QUERY
  3. UPDATE
  4. EXPLORE
  5. NAMESPACES
  6. STATUS
  7. PERFORMANCE

Let's start with the UPDATE tab to load some sample data.

The sample Data

The human readable form of some of our sample data and their description is available at:

RDF Version of the data

You might want to download and unzip http://unlocode.rkbexplorer.com/models/dump.tgz. The result should be a directory with the following content:

pan:models wf$ls -l
total 31832
-rw-r--r--  1 wf  staff       265  4 Jan 07:27 catalog-v001.xml
-rw-r--r--@ 1 wf  staff     42194 18 Feb  2009 unlocode-countries.rdf
-rw-r--r--@ 1 wf  staff    228389 18 Feb  2009 unlocode-municipalunits.rdf
-rw-r--r--@ 1 wf  staff  16017733 18 Feb  2009 unlocode-towns.rdf

Now drag and drop the three files:

  1. unlocode-countries.rdf
  2. unlocode-municipalunits.rdf
  3. unlocode-towns.rdf

one after another into the field with the text

(Type in or drag a file containing RDF data, ...

and click the update button below the field after each drag&drop operation. The output will be

Modified: 484
Milliseconds: 430
Modified: 1917
Milliseconds: ...
Running update: 287
Modified: 239567
Milliseconds: 2260

The Milliseconds may vary on your machine. If you run into the 64000 entity limit you may need to restart your blazegraph.jar file with the Java VM options outlined above.

Structure of the Example Data

The following diagram shows the structure of the Example Data. Please note the use of UML diagrams which are usually not used to show Ontology structures.

UML Diagram

Example Data

Direct Town - Country Link

Two step - Town - Municipal-Unit - Country

SPARQL Queries

Select all Triples

Now our environment should be ready to hit the "QUERY" tab and enter our first SPARQL query. You might want to simply cut&paste the code from the SPARQL Query descriptions for each example below into the field with the text

(Input a SPARQL query)

and then hit the "Execute" Button below this field.

SPARQL Query

SELECT * 
WHERE {
  ?subject ?predicate ?object
}

Result

Query running ...

Will be visible shortly than you'll see the result table, which will have total results of 242375 triples, displaying the first 50:

subject	                                        predicate	                                        object
<http://unlocode.rkbexplorer.com/id/AEDHF>	<http://www.aktors.org/ontology/portal#has-longitude>	54.5333333
<http://unlocode.rkbexplorer.com/id/AEDHF>	<http://www.aktors.org/ontology/portal#is-located-in>	<http://unlocode.rkbexplorer.com/id/AE>
<http://unlocode.rkbexplorer.com/id/AEDHF>	<http://www.aktors.org/ontology/support#has-pretty-name>	Al Dhafra
<http://unlocode.rkbexplorer.com/id/AEDHF>	rdf:type	<http://www.aktors.org/ontology/portal#Town>
<http://unlocode.rkbexplorer.com/id/AEDUY>	<http://www.aktors.org/ontology/portal#has-latitude>	25.7780637
<http://unlocode.rkbexplorer.com/id/AEDUY>	<http://www.aktors.org/ontology/portal#has-longitude>	55.9310912
<http://unlocode.rkbexplorer.com/id/AEDUY>	<http://www.aktors.org/ontology/portal#is-located-in>	<http://unlocode.rkbexplorer.com/id/AE>
<http://unlocode.rkbexplorer.com/id/AEDUY>	<http://www.aktors.org/ontology/support#has-pretty-name>	Ras Zubbaya (Ras Dubayyah)
<http://unlocode.rkbexplorer.com/id/AEDUY>	rdf:type	<http://www.aktors.org/ontology/portal#Town>
<http://unlocode.rkbexplorer.com/id/AEDXB>	<http://www.aktors.org/ontology/portal#has-latitude>	25.2500000
<http://unlocode.rkbexplorer.com/id/AEDXB>	<http://www.aktors.org/ontology/portal#has-longitude>	55.2666666
<http://unlocode.rkbexplorer.com/id/AEDXB>	<http://www.aktors.org/ontology/portal#is-located-in>	<http://unlocode.rkbexplorer.com/id/AE>
<http://unlocode.rkbexplorer.com/id/AEDXB>	<http://www.aktors.org/ontology/support#has-pretty-name>	Dubai
<http://unlocode.rkbexplorer.com/id/AEDXB>	rdf:type	<http://www.aktors.org/ontology/portal#Town>

Explanation

SELECT *

asked for a selection

WHERE {
  ?subject ?predicate ?object
}

specified a condition. Since we used question marks for the three triple parts we made all three parts of the triple variable so any/each triple in the database will fulfill the condition.

The query shows all triples you uploaded from the RDF files "as is".

Now you can see that RDF unlike SiDiF mostly uses lenghty URLs to express things. So the Triple for Dubai being in AE gets to be:

<http://unlocode.rkbexplorer.com/id/AEDXB>	<http://www.aktors.org/ontology/portal#is-located-in>	<http://unlocode.rkbexplorer.com/id/AE>

And there are multiple triples for the subject <http://unlocode.rkbexplorer.com/id/AEDXB> So lets select only those.

Select by subject

SPARQL Query

SELECT * 
WHERE {
  <http://unlocode.rkbexplorer.com/id/AEDXB> ?predicate ?object
}

Result

We get 5 results:

predicate	                                          object
<http://www.aktors.org/ontology/portal#has-latitude>	  25.2500000
<http://www.aktors.org/ontology/portal#has-longitude>	  55.2666666
<http://www.aktors.org/ontology/portal#is-located-in>	  <http://unlocode.rkbexplorer.com/id/AE>
<http://www.aktors.org/ontology/support#has-pretty-name>  Dubai
rdf:type	                                          <http://www.aktors.org/ontology/portal#Town>

Explanation

This time the subject in the condition was not variable anymore but fixed to <http://unlocode.rkbexplorer.com/id/AEDXB>. We already new that such a subject existed from the query that selected all triples.

Select multiple predicates of one subject in one query

SPARQL Query

SELECT ?lat ?lon
WHERE {
  <http://unlocode.rkbexplorer.com/id/AEDXB> <http://www.aktors.org/ontology/portal#has-latitude> ?lat.
  <http://unlocode.rkbexplorer.com/id/AEDXB> <http://www.aktors.org/ontology/portal#has-longitude> ?lon.                                            
}

result

We get one result.

lat	        lon
25.2500000	55.2666666

explanation

Instead of the asterisk * we had used for the SELECT so far this time we specified two variables:

SELECT ?lat ?lon

and these where used for two conditions:

  <http://unlocode.rkbexplorer.com/id/AEDXB> <http://www.aktors.org/ontology/portal#has-latitude> ?lat.
  <http://unlocode.rkbexplorer.com/id/AEDXB> <http://www.aktors.org/ontology/portal#has-longitude> ?lon.

keeping the subject fixed at <http://unlocode.rkbexplorer.com/id/AEDXB> but varying the predicate:

  1. <http://www.aktors.org/ontology/portal#has-latitude> assigning the result to the variable lat
  2. <http://www.aktors.org/ontology/portal#has-longitude> assigning the result to the variable lon

Using prefixes

SPARQL Query

PREFIX unlocode:<http://unlocode.rkbexplorer.com/id/>
SELECT *
WHERE {
  unlocode:AEDXB ?predicate ?object                                           
}

result

We get five results again

predicate	                                          object
<http://www.aktors.org/ontology/portal#has-latitude>	  25.2500000
<http://www.aktors.org/ontology/portal#has-longitude>	  55.2666666
<http://www.aktors.org/ontology/portal#is-located-in>	  <http://unlocode.rkbexplorer.com/id/AE>
<http://www.aktors.org/ontology/support#has-pretty-name>  Dubai
rdf:type	                                          <http://www.aktors.org/ontology/portal#Town>

explanation

The Prefix specification:

PREFIX unlocode:<http://unlocode.rkbexplorer.com/id/>

will replace each prefix use of:

unlocode:

with the specified url and then appending the value after the colon

AEXDB

so

unlocode:AEXDB

is as if we had written:

<http://unlocode.rkbexplorer.com/id/AEXDB>

Using PREFIX is very useful to make your SPARQL queries a lot more readable.

Specifying the type and selecting values

SPARQL Query

PREFIX unlocode:<http://unlocode.rkbexplorer.com/id/>
PREFIX portal: <http://www.aktors.org/ontology/portal#>
PREFIX support: <http://www.aktors.org/ontology/support#>

SELECT *
WHERE {
  ?subject rdf:type portal:Town.
  ?subject portal:has-latitude   ?lat.
  ?subject portal:has-longitude  ?lon.
  ?subject portal:is-located-in  ?locatedIn.
  ?subject support:has-pretty-name ?name.
}

result

We get 48517 results

predicate	                                          object
subject	lat	lon	locatedIn	name
<http://unlocode.rkbexplorer.com/id/ARANA>	-28.4666666	-62.8333333	<http://unlocode.rkbexplorer.com/id/AR-G>	Anatuya
<http://unlocode.rkbexplorer.com/id/ARAND>	-27.6000000	-66.3166666	<http://unlocode.rkbexplorer.com/id/AR-K>	Andalgala
<http://unlocode.rkbexplorer.com/id/ARCCP>	-27.3333333	-65.5833333	<http://unlocode.rkbexplorer.com/id/AR-T>	Conception
<http://unlocode.rkbexplorer.com/id/ARCTC>	-26.8265906	-65.2203670	<http://unlocode.rkbexplorer.com/id/AR-K>	Catamarca
<http://unlocode.rkbexplorer.com/id/ARELB>	-27.9166666	-65.8833333	<http://unlocode.rkbexplorer.com/id/AR-K>	El Bolson
...

explanation

The condition

?subject rdf:type portal:Town.

made sure we only get the triples for subject of the rdf type "Town". Basically this is the set of triples that we imported from unlcode-towns.rdf in the first place.

LIMIT, ORDER and "a" keyword"

SPARQL Query

PREFIX unlocode:<http://unlocode.rkbexplorer.com/id/>
PREFIX portal: <http://www.aktors.org/ontology/portal#>
PREFIX support: <http://www.aktors.org/ontology/support#>

SELECT *
WHERE {
  ?subject a portal:Town.
  ?subject portal:has-latitude   ?lat.
  ?subject portal:has-longitude  ?lon.
  ?subject portal:is-located-in  ?locatedIn.
  ?subject support:has-pretty-name ?name.
}
ORDER BY ?name
LIMIT 10

result

We get 10 results

<http://unlocode.rkbexplorer.com/id/TOEUA>	-21.3666666	-174.9333333	<http://unlocode.rkbexplorer.com/id/TO>	'Eua Island
<http://unlocode.rkbexplorer.com/id/BEGVO>	50.7500000	5.7500000	<http://unlocode.rkbexplorer.com/id/BE-VLI>	's Gravenvoeren
<http://unlocode.rkbexplorer.com/id/NLSGL>	52.2333333	5.1166666	<http://unlocode.rkbexplorer.com/id/NL>	's-Graveland
<http://unlocode.rkbexplorer.com/id/NLGRA>	51.7833333	4.6166666	<http://unlocode.rkbexplorer.com/id/NL>	's-Gravendeel
<http://unlocode.rkbexplorer.com/id/NLHAG>	52.0833333	4.3000000	<http://unlocode.rkbexplorer.com/id/NL>	's-Gravenhage (Den Haag)
<http://unlocode.rkbexplorer.com/id/NLSGM>	51.6666666	4.8000000	<http://unlocode.rkbexplorer.com/id/NL>	's-Gravenmoer
<http://unlocode.rkbexplorer.com/id/NLSGP>	51.4500000	3.9000000	<http://unlocode.rkbexplorer.com/id/NL>	's-Gravenpolder
<http://unlocode.rkbexplorer.com/id/BEGWE>	51.2666666	4.5500000	<http://unlocode.rkbexplorer.com/id/BE-VAN>	's-Gravenwezel
<http://unlocode.rkbexplorer.com/id/NLGRZ>	52.0000000	4.1666666	<http://unlocode.rkbexplorer.com/id/NL>	's-Gravenzande
<http://unlocode.rkbexplorer.com/id/NLSHB>	52.5333333	6.0166666	<http://unlocode.rkbexplorer.com/id/NL>	's-Heerenbroek

explanation

?subject a portal:Town.

is a shortcut for

?subject rdf:type portal:Town.
ORDER BY ?name

sorts the result by the support-has-pretty-name predicate that we put in the ?name variable

LIMIT 10

Sets the maximum amount of results we get to 10.

lookup relation information

SPARQL Query

PREFIX unlocode:<http://unlocode.rkbexplorer.com/id/>
PREFIX portal: <http://www.aktors.org/ontology/portal#>
PREFIX support: <http://www.aktors.org/ontology/support#>

SELECT *
WHERE {
  ?town a portal:Town.
  ?town support:has-pretty-name ?townname.
  ?town portal:has-latitude ?lat.
  ?town portal:has-longitude ?lon.
  ?town portal:is-located-in ?loc.
  ?loc support:has-pretty-name ?locname.
}
ORDER by ?locname ?townname
LIMIT 7

result

town	                                        townname	lat	        lon	        loc	                           locname
<http://unlocode.rkbexplorer.com/id/AFBAG>	Bagram	        34.9500000	69.2500000	<http://unlocode.rkbexplorer.com/id/AF>	AFGHANISTAN
<http://unlocode.rkbexplorer.com/id/AFBIN>	Bamian	        34.8200748	67.8112337	<http://unlocode.rkbexplorer.com/id/AF>	AFGHANISTAN
<http://unlocode.rkbexplorer.com/id/AFBST>	Bost	        31.5832023	64.3602926	<http://unlocode.rkbexplorer.com/id/AF>	AFGHANISTAN
<http://unlocode.rkbexplorer.com/id/AFCCN>	Chakcharan	33.6500012	62.3166972	<http://unlocode.rkbexplorer.com/id/AF>	AFGHANISTAN
<http://unlocode.rkbexplorer.com/id/AFDAZ>	Darwaz	        31.8372189	67.7843650	<http://unlocode.rkbexplorer.com/id/AF>	AFGHANISTAN
<http://unlocode.rkbexplorer.com/id/AFISQ>	Eslam Qal'eh	34.6666666	61.0666666	<http://unlocode.rkbexplorer.com/id/AF>	AFGHANISTAN
<http://unlocode.rkbexplorer.com/id/AFFBD>	Faizabad	37.1214418	70.5785208	<http://unlocode.rkbexplorer.com/id/AF>	AFGHANISTAN

explanation

The information in the ?loc variable is taken to lookup the has-pretty-name information of the country code "pointer" supplied.

double lookup of relation information with FILTER

SPARQL query

PREFIX unlocode:<http://unlocode.rkbexplorer.com/id/>
PREFIX portal: <http://www.aktors.org/ontology/portal#>
PREFIX support: <http://www.aktors.org/ontology/support#>

SELECT ?townname ?lat ?lon ?regionname ?countryname
WHERE {
  ?town a portal:Town.
  ?town support:has-pretty-name ?townname.
  ?town portal:has-latitude ?lat.
  ?town portal:has-longitude ?lon.
  ?town portal:is-located-in ?region.

  ?region support:has-pretty-name ?regionname.
  ?region portal:is-part-of ?country.

  ?country support:has-pretty-name ?countryname.
 
   FILTER regex(?regionname,"bayern","i")
}
ORDER by ?townname
LIMIT 7

result

townname	lat	        lon	        regionname  countryname
Abensberg	48.8157530	11.8473832	Bayern	    GERMANY
Abtswind	49.7707076	10.3745558	Bayern	    GERMANY
Adelsdorf	49.7000000	10.8833333	Bayern	    GERMANY
Affalterbach	49.6166666	11.2000000	Bayern	    GERMANY
Ahorn	        50.2166666	10.9333333	Bayern	    GERMANY
Aichach	        48.4579674	11.1301884	Bayern	    GERMANY
Ainring	        47.8166666	12.9333333	Bayern	    GERMANY

explanation

This time we have left out the linking variables and only selected the has-pretty-name lookup results. Also we have filtered the result with a regular expression that matches any region that has "bayern" in it's pretty name. "i" is regular expression option to ignore upper and lower case letters.

Group by

PREFIX unlocode:<http://unlocode.rkbexplorer.com/id/>
PREFIX portal: <http://www.aktors.org/ontology/portal#>
PREFIX support: <http://www.aktors.org/ontology/support#>

SELECT ?type (COUNT(?type) AS ?typecount)
WHERE {
  ?subject a ?type.
}
GROUP by ?type

result

type	                                                typecount
<http://www.aktors.org/ontology/portal#Town>	        47998
<http://www.aktors.org/ontology/portal#Country>	          237
<http://www.aktors.org/ontology/portal#Municipal-Unit>	  969
owl:Ontology	                                            2

explanation

SELECT ?type (COUNT(?type) AS ?typecount)

Is an aggregate selection together with a corresponding:

GROUP by ?type

If you do not specify the GROUP by clause you get a

MalformedQueryException: Bad aggregate

WikiData

The WikiData Project offers a SPARQL based query service at https://query.wikidata.org/

See the Concise list of SPARQL query examples for a multitude of further examples.

Query Examples

Whisky Distilleries worldwide

SPARQL Query

# Find World wide whisky distilleries using the Wikidata Query service
# taken from
# http://wiki.bitplan.com/index.php/SPARQL#Whisky_Distilleries_worldwide
# Created 2018-01 by Wolfgang Fahl BITPlan GmbH
# 
# display the results as a map
#defaultView:Map
#
# select the distillery and its coordinates
SELECT ?distillery ?coord 
WHERE 
{
  # any subject
  # which is an instance of
  # https://www.wikidata.org/wiki/Property:P31
  # Whisky distillery
  # https://www.wikidata.org/wiki/Q10373548
  ?distillery wdt:P31 wd:Q10373548.
  # get the coordinate location value for any found distillery
  # https://www.wikidata.org/wiki/Property:P625
  ?distillery wdt:P625 ?coord.
}

link to this query

result

Wdworld2018-01.png Clipped to Scotland

explanation

#defaultView:Map

A comment that instructs the results of the query to be shown as a map by the Wikidata Query service

?item wdt:P31 wd:Q10373548.

Whisky Distilleries in Scotland

SPARQL Query

#defaultView:Map
SELECT ?item ?coord 
WHERE 
{
  ?item wdt:P31 wd:Q10373548.
  ?item wdt:P131 ?located_in .
  ?located_in wdt:P131* wd:Q22.
  ?item wdt:P625 ?coord.
}

Link to this query

result

Wdscotland2018-01.png

explanation

The condition

?located_in wdt:P131* wd:Q22.

is a transitive link condition. The asterisk symbol "*" states that the property https://www.wikidata.org/wiki/Property:P131 (located in the administrative territorial entity) should eventually lead to scotland. As of 2018-01 the map shows less entries then in the previous query since some distilleries do not have a link-chain via https://www.wikidata.org/wiki/Property:P131 (located in the administrative territorial entity) to the entity https://www.wikidata.org/wiki/Q22 (Scotland).

As an example look at https://www.wikidata.org/wiki/Q49646 (Jura distillery). It has a statement https://www.wikidata.org/wiki/Property:P131 (located in the administrative territorial entity) linking to https://www.wikidata.org/wiki/Q111509 (Jura - island in the inner Hebrides of Scotland). Another https://www.wikidata.org/wiki/Property:P131 (located in the administrative territorial entity) links to https://www.wikidata.org/wiki/Q202174 (Argyll and Bute - unitary authority council area in Scotland) which eventually links to https://www.wikidata.org/wiki/Q22 (Scotland - country in North-West Europe, part of the United Kingdom)

Link-Chain

The following graph explains the link chain from the Jura distillery Entity to the Scotland Entity. The nodes are clickable.

The link from https://www.wikidata.org/wiki/Q111509 to https://www.wikidata.org/wiki/Q202174 was added manually by me on 2018-01-04. Quite a few other links to make the query work are still missing.

Whisky Distilleries in Scotland by geographic coordinates

SPARQL Query

# Whisky distilleries in the box between
# Hardoldswick / Unst / Shetland Islands
# and Sligo Northisland which is a rough estimate of a scotland bounding box
# and might capture some Irish distilleries as well
#defaultView:Map
SELECT ?item ?coord 
WHERE 
{
  wd:Q2642797 wdt:P625 ?NECorner .
  wd:Q190002  wdt:P625 ?SWCorner .
  ?item wdt:P31 wd:Q10373548.
  SERVICE wikibase:box {
      ?item wdt:P625 ?coord .
      bd:serviceParam wikibase:cornerSouthWest ?SWCorner .
      bd:serviceParam wikibase:cornerNorthEast ?NECorner .
    }
}

link to this query

result

Wdscotland2018-01bycoordinates.png

explanation

The Northeast corner of Scotland might be defined by Haroldswick on Unst, Shettland Islands https://www.wikidata.org/wiki/Q2642797

The Southwest corner of Scotland would in fact be in Ireland if you try to create a rectangle so in the western province Connacht of Ireland might qualify https://www.wikidata.org/wiki/Q190002

A Wikidata query may use the wikibase box service to filter coordinates that lie within a box. We are using this to find distilleries, where the location is in this box.

Given that the south west corner of the box is in Ireland we might catch some Irish distilleries but unti of 2018-01-05 none showed. Until I added the Bushmills distillery in Country Antrim, Norhern Island https://www.wikidata.org/wiki/Q268267 :-)

Links