SPARQL

From BITPlan Wiki

What is SPARQL

SPARQL is a query language for semantic databases that store their data in the Resource Description Framework (RDF) format.

Tutorial

There are quite a few SPARQL tutorials out there, e.g.

  1. W3C SPARQL By Example
  2. Apache Jena SPARQL

The W3C tutorial is somewhat outdated and mostly didn't work for me. Most of the Apache Jena tutorial also works with the Blazegraph database, which we'll use in this tutorial; only the output looks different, and some examples won't work as shown there.

This tutorial is for people who are new to semantic concepts and would like to work through an example with a fair amount of data but not too much complexity in the structure of the data.

Semantic Concepts

Personally, I learned semantic concepts using Semantic MediaWiki; see

  1. Wolfgang Fahl's User page at www.semantic-mediawiki.org
  2. Semantic Concepts Talk at SMWCon 2015

A SPARQL tutorial needs a slightly different touch, so for those who know the talk above I'll explain some key concepts based on an example using:

  1. Countries
  2. Towns
  3. Municipal Units

Triples

A semantic statement has the form

<subject> <predicate> <object>

e.g.

Dubai is-located-in AE

is such a semantic statement which is also called a Triple.

The natural language statement "Dubai is located in United Arab Emirates" is purposely slightly modified to a more "computer-ready" form. The predicate is written as is-located-in to make it a proper identifier. The country name "United Arab Emirates" is replaced by AE, the two-letter country code used in the United Nations Location Code (UN/LOCODE). A triple like this has a natural graph representation: the subject and the object are nodes, and the predicate is a directed edge from the subject to the object.
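The graph view of a triple can be sketched in a few lines of Python. This is only an illustration, not part of the tutorial's tooling: the function name to_graph and the dictionary layout are assumptions chosen for readability.

```python
from collections import defaultdict

# The example triple from above: (subject, predicate, object)
triple = ("Dubai", "is-located-in", "AE")


def to_graph(triples):
    """Map each subject node to its outgoing (predicate, object) edges."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return dict(graph)


# One node "Dubai" with one edge labelled "is-located-in" pointing to "AE"
graph = to_graph([triple])
```

Every further triple about Dubai would add another edge to the same node.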

Triplestore

A Triplestore is a database that can store and query triples. In fact for educational purposes I have written a simple Triplestore myself:

For that simple triplestore the triples are supplied in Simple Data Interchange Format. Again, that format is mostly for educational purposes, although it can also be used for small use cases with just a few thousand triples. Please also note that there is no SPARQL support in that project.

For anything beyond educational use, a triplestore is needed that can handle larger amounts of data and supports SPARQL. The Wikipedia List of Subject-Predicate-Object Databases shows you some options. For this tutorial we'll use Blazegraph.

Setting up the Blazegraph Triple Store

You need Java to be installed on your machine.

Download the blazegraph.jar file from https://www.blazegraph.com/download/ and start it with

java -jar blazegraph.jar

In fact, it's better to start the jar file with an option that allows bigger XML files to be handled:

java -Djdk.xml.entityExpansionLimit=0 -jar blazegraph.jar

otherwise you might later run into the error:

org.openrdf.rio.RDFParseException: JAXP00010001: The parser has encountered more than "64000" entity expansions in this document; this is the limit imposed by the JDK

After startup you should see

Welcome to the Blazegraph(tm) Database.

Go to http://localhost:9999/blazegraph/ to get started.

And you might want to do just that and click that link.

Where Blazegraph stores its data

By default, Blazegraph's journal file is blazegraph.jnl in the directory where you started the jar file. On my macOS laptop the initial file size is some 200 MBytes.

ls -l blazegraph.jnl 
-rw-r--r--  1 wf  staff  209715200  4 Jan 11:50 blazegraph.jnl

The Blazegraph Web UI

The Web UI shows the tabs:

  1. WELCOME
  2. QUERY
  3. UPDATE
  4. EXPLORE
  5. NAMESPACES
  6. STATUS
  7. PERFORMANCE

Let's start with the UPDATE tab to load some sample data.

The sample Data

The human readable form of some of our sample data and their description is available at:

RDF Version of the data

You might want to download and unpack http://unlocode.rkbexplorer.com/models/dump.tgz. The result should be a directory with the following content:

pan:models wf$ ls -l
total 31832
-rw-r--r--  1 wf  staff       265  4 Jan 07:27 catalog-v001.xml
-rw-r--r--@ 1 wf  staff     42194 18 Feb  2009 unlocode-countries.rdf
-rw-r--r--@ 1 wf  staff    228389 18 Feb  2009 unlocode-municipalunits.rdf
-rw-r--r--@ 1 wf  staff  16017733 18 Feb  2009 unlocode-towns.rdf

Now drag and drop the three files:

  1. unlocode-countries.rdf
  2. unlocode-municipalunits.rdf
  3. unlocode-towns.rdf

one after another into the field with the text

(Type in or drag a file containing RDF data, ...

and click the "Update" button below the field after each drag-and-drop operation. The output will be

Modified: 484
Milliseconds: 430
Modified: 1917
Milliseconds: ...
Running update: 287
Modified: 239567
Milliseconds: 2260

The milliseconds will vary on your machine. If you run into the 64000 entity expansion limit, restart blazegraph.jar with the Java VM option shown above.

SPARQL Queries

Select all Triples

Now our environment should be ready, so switch to the "QUERY" tab and enter our first SPARQL query. You might want to simply copy and paste the code from the SPARQL Query description of each example below into the field with the text

(Input a SPARQL query)

and then hit the "Execute" Button below this field.
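The Web UI is not the only way to run a query. As a minimal sketch, a query can also be sent over HTTP following the SPARQL protocol. The endpoint URL below is an assumption (the default namespace kb of a locally started blazegraph.jar), the helper name build_query_request is made up for illustration, and the LIMIT 10 clause is added only to keep the response small. The code only builds the request; the commented lines show how it could be sent once the server is running.

```python
from urllib.parse import parse_qs, urlencode, urlsplit
from urllib.request import Request

# Assumption: default namespace "kb" of a locally started blazegraph.jar
ENDPOINT = "http://localhost:9999/blazegraph/namespace/kb/sparql"

QUERY = "SELECT * WHERE { ?subject ?predicate ?object } LIMIT 10"


def build_query_request(query, endpoint=ENDPOINT):
    """Build (but do not send) a SPARQL protocol GET request."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for the standard JSON result format
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_query_request(QUERY)

# To actually run the query against a started server:
# import json
# from urllib.request import urlopen
# with urlopen(request) as response:
#     rows = json.load(response)["results"]["bindings"]
```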

SPARQL Query

SELECT * 
WHERE {
  ?subject ?predicate ?object
}

Result

Query running ...

will be visible briefly. Then you'll see the result table, which reports a total of 242375 triples and displays the first 50:

subject	                                        predicate	                                        object
<http://unlocode.rkbexplorer.com/id/AEDHF>	<http://www.aktors.org/ontology/portal#has-longitude>	54.5333333
<http://unlocode.rkbexplorer.com/id/AEDHF>	<http://www.aktors.org/ontology/portal#is-located-in>	<http://unlocode.rkbexplorer.com/id/AE>
<http://unlocode.rkbexplorer.com/id/AEDHF>	<http://www.aktors.org/ontology/support#has-pretty-name>	Al Dhafra
<http://unlocode.rkbexplorer.com/id/AEDHF>	rdf:type	<http://www.aktors.org/ontology/portal#Town>
<http://unlocode.rkbexplorer.com/id/AEDUY>	<http://www.aktors.org/ontology/portal#has-latitude>	25.7780637
<http://unlocode.rkbexplorer.com/id/AEDUY>	<http://www.aktors.org/ontology/portal#has-longitude>	55.9310912
<http://unlocode.rkbexplorer.com/id/AEDUY>	<http://www.aktors.org/ontology/portal#is-located-in>	<http://unlocode.rkbexplorer.com/id/AE>
<http://unlocode.rkbexplorer.com/id/AEDUY>	<http://www.aktors.org/ontology/support#has-pretty-name>	Ras Zubbaya (Ras Dubayyah)
<http://unlocode.rkbexplorer.com/id/AEDUY>	rdf:type	<http://www.aktors.org/ontology/portal#Town>
<http://unlocode.rkbexplorer.com/id/AEDXB>	<http://www.aktors.org/ontology/portal#has-latitude>	25.2500000
<http://unlocode.rkbexplorer.com/id/AEDXB>	<http://www.aktors.org/ontology/portal#has-longitude>	55.2666666
<http://unlocode.rkbexplorer.com/id/AEDXB>	<http://www.aktors.org/ontology/portal#is-located-in>	<http://unlocode.rkbexplorer.com/id/AE>
<http://unlocode.rkbexplorer.com/id/AEDXB>	<http://www.aktors.org/ontology/support#has-pretty-name>	Dubai
<http://unlocode.rkbexplorer.com/id/AEDXB>	rdf:type	<http://www.aktors.org/ontology/portal#Town>

Explanation

SELECT *

asked for a selection

WHERE {
  ?subject ?predicate ?object
}

specified a condition. Since we used question marks for the three triple parts, all three parts of the triple are variables, so every triple in the database fulfills the condition.

The query shows all triples you uploaded from the RDF files "as is".

Now you can see that RDF, unlike SiDiF, mostly uses lengthy URLs to express things. So the triple for Dubai being in AE becomes:

<http://unlocode.rkbexplorer.com/id/AEDXB>	<http://www.aktors.org/ontology/portal#is-located-in>	<http://unlocode.rkbexplorer.com/id/AE>

And there are multiple triples for the subject <http://unlocode.rkbexplorer.com/id/AEDXB>. So let's select only those.

Select by subject

SPARQL Query

SELECT * 
WHERE {
  <http://unlocode.rkbexplorer.com/id/AEDXB> ?predicate ?object
}

Result

We get 5 triples:

predicate	                                          object
<http://www.aktors.org/ontology/portal#has-latitude>	  25.2500000
<http://www.aktors.org/ontology/portal#has-longitude>	  55.2666666
<http://www.aktors.org/ontology/portal#is-located-in>	  <http://unlocode.rkbexplorer.com/id/AE>
<http://www.aktors.org/ontology/support#has-pretty-name>  Dubai
rdf:type	                                          <http://www.aktors.org/ontology/portal#Town>

Explanation

This time the subject in the condition was no longer variable but fixed to <http://unlocode.rkbexplorer.com/id/AEDXB>. We already knew that such a subject existed from the query that selected all triples.
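The difference between a variable and a fixed part of a condition can be sketched in Python. This is a toy illustration and not how Blazegraph evaluates queries: the shortened names stand in for the full URLs, and None plays the role of a ?variable.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]

# A few triples from the results above, with names shortened for readability
TRIPLES: List[Triple] = [
    ("AEDXB", "has-latitude", "25.2500000"),
    ("AEDXB", "has-longitude", "55.2666666"),
    ("AEDXB", "is-located-in", "AE"),
    ("AEDHF", "is-located-in", "AE"),
]


def match(triples: List[Triple],
          subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None) -> List[Triple]:
    """Return the triples that fit the pattern; None matches anything."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


# Like "?subject ?predicate ?object": every triple matches
all_triples = match(TRIPLES)

# Like "<.../AEDXB> ?predicate ?object": only triples about that subject
dubai_triples = match(TRIPLES, subject="AEDXB")
```

Fixing more parts of the pattern narrows the result further, for example match(TRIPLES, predicate="is-located-in", obj="AE").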

Select multiple predicates of one subject in one query

SPARQL Query

SELECT ?lat ?lon
WHERE {
  <http://unlocode.rkbexplorer.com/id/AEDXB> <http://www.aktors.org/ontology/portal#has-latitude> ?lat.
  <http://unlocode.rkbexplorer.com/id/AEDXB> <http://www.aktors.org/ontology/portal#has-longitude> ?lon.                                            
}

Result

We get one result.

lat	        lon
25.2500000	55.2666666

Explanation

Instead of the asterisk * that we used in the SELECT so far, this time we specified two variables:

SELECT ?lat ?lon

and these were used in two conditions:

  <http://unlocode.rkbexplorer.com/id/AEDXB> <http://www.aktors.org/ontology/portal#has-latitude> ?lat.
  <http://unlocode.rkbexplorer.com/id/AEDXB> <http://www.aktors.org/ontology/portal#has-longitude> ?lon.

keeping the subject fixed at <http://unlocode.rkbexplorer.com/id/AEDXB> but varying the predicate:

  1. <http://www.aktors.org/ontology/portal#has-latitude> assigning the result to the variable lat
  2. <http://www.aktors.org/ontology/portal#has-longitude> assigning the result to the variable lon

Using prefixes

SPARQL Query

PREFIX unlocode:<http://unlocode.rkbexplorer.com/id/>
SELECT *
WHERE {
  unlocode:AEDXB ?predicate ?object                                           
}

Result

We get five results again:

predicate	                                          object
<http://www.aktors.org/ontology/portal#has-latitude>	  25.2500000
<http://www.aktors.org/ontology/portal#has-longitude>	  55.2666666
<http://www.aktors.org/ontology/portal#is-located-in>	  <http://unlocode.rkbexplorer.com/id/AE>
<http://www.aktors.org/ontology/support#has-pretty-name>  Dubai
rdf:type	                                          <http://www.aktors.org/ontology/portal#Town>

Explanation

The Prefix specification:

PREFIX unlocode:<http://unlocode.rkbexplorer.com/id/>

will replace each use of the prefix:

unlocode:

with the specified URL and then append the value after the colon:

AEDXB

so

unlocode:AEDXB

is as if we had written:

<http://unlocode.rkbexplorer.com/id/AEDXB>

Using PREFIX is very useful to make your SPARQL queries a lot more readable.
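The substitution can be sketched as plain string handling in Python. The function name expand and the dictionary PREFIXES are made up for this illustration; a real SPARQL parser does more, such as validating the names.

```python
# Prefix label -> namespace URL, as declared with PREFIX in the query above
PREFIXES = {"unlocode": "http://unlocode.rkbexplorer.com/id/"}


def expand(prefixed_name, prefixes):
    """Replace the prefix label with its URL and append the local part."""
    label, local_part = prefixed_name.split(":", 1)
    return "<" + prefixes[label] + local_part + ">"


full_url = expand("unlocode:AEDXB", PREFIXES)
```

An undeclared prefix raises a KeyError in this sketch, which mirrors the fact that a prefix has to be declared before it can be used.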

Specifying the type and selecting values

SPARQL Query

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX unlocode: <http://unlocode.rkbexplorer.com/id/>
PREFIX portal: <http://www.aktors.org/ontology/portal#>
PREFIX support: <http://www.aktors.org/ontology/support#>

SELECT *
WHERE {
  ?subject rdf:type portal:Town.
  ?subject portal:has-latitude   ?lat.
  ?subject portal:has-longitude  ?lon.
  ?subject portal:is-located-in  ?locatedIn.
  ?subject support:has-pretty-name ?name.
}

Result

We get 48517 results:

subject	lat	lon	locatedIn	name
<http://unlocode.rkbexplorer.com/id/ARANA>	-28.4666666	-62.8333333	<http://unlocode.rkbexplorer.com/id/AR-G>	Anatuya
<http://unlocode.rkbexplorer.com/id/ARAND>	-27.6000000	-66.3166666	<http://unlocode.rkbexplorer.com/id/AR-K>	Andalgala
<http://unlocode.rkbexplorer.com/id/ARCCP>	-27.3333333	-65.5833333	<http://unlocode.rkbexplorer.com/id/AR-T>	Conception
<http://unlocode.rkbexplorer.com/id/ARCTC>	-26.8265906	-65.2203670	<http://unlocode.rkbexplorer.com/id/AR-K>	Catamarca
<http://unlocode.rkbexplorer.com/id/ARELB>	-27.9166666	-65.8833333	<http://unlocode.rkbexplorer.com/id/AR-K>	El Bolson
...

Explanation

The condition

?subject rdf:type portal:Town.

made sure we only get the triples for subjects of the rdf type "Town". Basically this is the set of triples that we imported from unlocode-towns.rdf in the first place.
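Because all five conditions share the variable ?subject, a result row only appears for a subject that fulfills every one of them. The following Python sketch mimics that behaviour on a handful of triples. It rests on stated assumptions: names are shortened, the type portal:Country for AE is hypothetical and only there to show a non-matching subject, and each subject is assumed to have at most one value per predicate.

```python
from collections import defaultdict

TRIPLES = [
    ("ARANA", "rdf:type", "portal:Town"),
    ("ARANA", "portal:has-latitude", "-28.4666666"),
    ("ARANA", "portal:has-longitude", "-62.8333333"),
    ("ARANA", "portal:is-located-in", "AR-G"),
    ("ARANA", "support:has-pretty-name", "Anatuya"),
    ("AE", "rdf:type", "portal:Country"),  # hypothetical type, not a Town
]

# Predicate -> variable name, as in the query above
WANTED = {
    "portal:has-latitude": "lat",
    "portal:has-longitude": "lon",
    "portal:is-located-in": "locatedIn",
    "support:has-pretty-name": "name",
}


def select_towns(triples):
    """Return one row per Town that has a value for every wanted predicate."""
    by_subject = defaultdict(dict)
    for s, p, o in triples:
        by_subject[s][p] = o  # assumes one value per subject and predicate
    rows = []
    for subject, props in by_subject.items():
        if props.get("rdf:type") != "portal:Town":
            continue  # fails the type condition
        if not all(p in props for p in WANTED):
            continue  # every condition must match for a row to appear
        row = {"subject": subject}
        row.update({var: props[p] for p, var in WANTED.items()})
        rows.append(row)
    return rows


rows = select_towns(TRIPLES)
```

A town without a latitude, for example, would be dropped entirely, which is also what the SPARQL query does.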