Truly Tabular RDF

From BITPlan Wiki
Revision as of 16:41, 2 August 2022 by Wf (talk | contribs)
Jump to navigation Jump to search

Info

The columns of the truly tabular tool are explained at Truly Tabular RDF/Info

Querying tabular data from Wikidata

Game of Thrones / Jon Snow Example

A starting point for analyzing the data in a triplestore such as wikidata might be a single item of interest such as the 160px-Game_of_Thrones_2011_logo.svg.png character Jon Snow instance ofGame of thrones character

SELECT (COUNT(DISTINCT ?item) AS ?count) WHERE 
{
  # Q20086263:Game of Thrones character
  ?item wdt:P31 wd:Q20086263;
}

try it! The above SPARQL query will let us know the total amount of similar items we might find in Wikidata. which is 117 at the time of writing this work.

A quite natural idea is to create a query now to get a table of all Game of thrones characters with columns for the most interesting aspects.

As a starting point we might use the Wikidata Query Service and select the sample "Cats" and refine it for our purposes with some properties we find on Jon Snows page to create a "naive" query:

Naive Game of Thrones Character tabular query

# Game of Thrones Characters
SELECT
  ?item ?itemLabel
  ?genderLabel
  ?countryLabel
  ?placeOfBirthLabel
  ?placeOfDeathLabel
WHERE
{
  ?item wdt:P31 wd:Q20086263. # must be a game of thrones character
  ?item wdt:P21 ?gender.
  ?item wdt:P27 ?country.
  ?item wdt:P19 ?placeOfBirth.
  ?item wdt:P20 ?placeOfDeath.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}

try it!

result
item itemLabel genderLabel countryLabel placeOfBirthLabel placeOfDeathLabel
http://www.wikidata.org/entity/Q259818 Eddard Stark male Seven Kingdoms Winterfell King's Landing
http://www.wikidata.org/entity/Q1120793 Theon Greyjoy male Seven Kingdoms Pyke Winterfell
http://www.wikidata.org/entity/Q2941743 Catelyn Stark female Seven Kingdoms Riverrun Twins
http://www.wikidata.org/entity/Q3183235 Jon Snow male Seven Kingdoms Tower of Joy Castle Black
http://www.wikidata.org/entity/Q3665163 Cersei Lannister female Seven Kingdoms Casterly Rock Red Keep
http://www.wikidata.org/entity/Q3806180 Jaime Lannister male Seven Kingdoms Casterly Rock Red Keep
http://www.wikidata.org/entity/Q11680202 Beric Dondarrion male Seven Kingdoms The Stormlands Winterfell
http://www.wikidata.org/entity/Q12056060 Stannis Baratheon male Seven Kingdoms Storm's End The North
http://www.wikidata.org/entity/Q12815803 Rickon Stark male Seven Kingdoms Winterfell The North
http://www.wikidata.org/entity/Q12900597 Joffrey Baratheon male Seven Kingdoms King's Landing King's Landing
http://www.wikidata.org/entity/Q12900933 Margaery Tyrell female Seven Kingdoms Highgarden Great Sept of Baelor
http://www.wikidata.org/entity/Q12902445 Tywin Lannister male Seven Kingdoms Casterly Rock King's Landing
http://www.wikidata.org/entity/Q12902594 Viserys Targaryen male Seven Kingdoms King's Landing Vaes Dothrak
http://www.wikidata.org/entity/Q13634884 Robb Stark male Seven Kingdoms Riverrun Twins
http://www.wikidata.org/entity/Q13634885 Robert Baratheon male Seven Kingdoms Storm's End King's Landing
http://www.wikidata.org/entity/Q17355556 Janos Slynt male Seven Kingdoms King's Landing The North
http://www.wikidata.org/entity/Q18920137 Loras Tyrell male Seven Kingdoms Highgarden Great Sept of Baelor
http://www.wikidata.org/entity/Q19791067 Lysa Arryn female Seven Kingdoms Riverrun Eyrie
http://www.wikidata.org/entity/Q23730371 Hoster Tully male Seven Kingdoms Riverrun Winterfell
http://www.wikidata.org/entity/Q23747242 Shireen Baratheon female Seven Kingdoms Dragonstone The North
http://www.wikidata.org/entity/Q23759020 Rickard Stark male Seven Kingdoms Winterfell King's Landing
http://www.wikidata.org/entity/Q23759067 Benjen Stark male Seven Kingdoms Winterfell Beyond the Wall
http://www.wikidata.org/entity/Q23759071 Lyanna Stark female Seven Kingdoms Winterfell Tower of Joy
http://www.wikidata.org/entity/Q23782360 Brynden Tully male Seven Kingdoms Riverrun Riverrun
http://www.wikidata.org/entity/Q23782368 Walder Frey male Seven Kingdoms Twins Twins
http://www.wikidata.org/entity/Q24814689 Aemon Targaryen male Seven Kingdoms King's Landing Castle Black
http://www.wikidata.org/entity/Q28468107 Doran Martell male Seven Kingdoms Sunspear Dorne
http://www.wikidata.org/entity/Q55433701 Jojen Reed male Seven Kingdoms Greywater Watch Beyond the Wall

The result of this query is nicely tabular and can be exported to CSV format to be imported to our favorite relational database or spreadsheet - but it does not contain 117 entries any more but only 28. SPARQL queries will not return "NULL" values as our relational database query language SQL would. The query needs to be amended with the "OPTIONAL" keyword as shown below.

Naive Game of Thrones Character tabular query with OPTIONAL

# Game of Thrones Characters
SELECT
  ?item ?itemLabel
  ?genderLabel
  ?countryLabel
  ?placeOfBirthLabel
  ?placeOfDeathLabel
WHERE
{
  ?item wdt:P31 wd:Q20086263. # must be a game of thrones character
  OPTIONAL { ?item wdt:P21 ?gender}.
  OPTIONAL { ?item wdt:P27 ?country}.
  OPTIONAL { ?item wdt:P19 ?placeOfBirth}.
  OPTIONAL { ?item wdt:P20 ?placeOfDeath}.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}

try it!

This table looks promising it has 117 entries and the columns are filled as expected.

The real surprise for a novice will be the result of the Query below when three more columns have been added. All over sudden the query result has 304 rows for our 117 characters. There are duplicate rows for each character and strange combinations of entries. A notorious example seems to be Jorah Mormont having 8 different values for the "occupation" column.

Naive Game of Thrones Character tabular query with duplicate rows for some characters

# Game of Thrones Characters
SELECT
  ?item ?itemLabel
  ?genderLabel
  ?countryLabel
  ?placeOfBirthLabel
  ?placeOfDeathLabel
  ?killedByLabel
  ?familyLabel
  ?occupationLabel
WHERE
{
  ?item wdt:P31 wd:Q20086263. # must be a game of thrones character
  OPTIONAL { ?item wdt:P21 ?gender}.
  OPTIONAL { ?item wdt:P27 ?country}.
  OPTIONAL { ?item wdt:P19 ?placeOfBirth}.
  OPTIONAL { ?item wdt:P20 ?placeOfDeath}.
  OPTIONAL { ?item wdt:P157 ?killedBy}.
  OPTIONAL { ?item wdt:P53 ?family}.
  OPTIONAL { ?item wdt:P106 ?occupation}.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}

try it!

Naive SPARQL Query

  1. Start with a wikidata item your are intested in e.g. International Semantic Web Conference ISWC 2022
  2. use the instance of property to find similar items of the same class academic conference
  3. straight-forward select further properties by adding statements similar to
    OPTIONAL { ?conference wdt:P1813 ?short_name }
    
    to the WHERE clause.
    1. P1813 short name
    2. P17 country
    3. P1476 title

This naive approach will lead to more results for Step 3 (e.g. 7730) than for step 2 (e.g. 7695) which is a surprise for most novices since this effect would not happen with a similar SQL query

SELECT short_name,country,title from academic_conference

Result of Step #2

# Academic conference wikidata query
# WF 2021-01-30
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?conference ?conferenceLabel 
WHERE
{
  #  academic conference series (Q2020153)
  ?conference wdt:P31 wd:Q2020153.
  # label
  ?conference rdfs:label ?conferenceLabel filter (lang(?conferenceLabel) = "en").
}

try it!

conference conferenceLabel
http://www.wikidata.org/entity/Q75698988 The 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
http://www.wikidata.org/entity/Q75707991 Digital Humanities 2020
http://www.wikidata.org/entity/Q75709854 Digital Humanities 2018
...

Result of Step 3

# Academic conference wikidata query
# WF 2021-01-30
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT 
  ?conference ?conferenceLabel
  ?short_name
  ?country
  ?title
WHERE
{
  #  academic conference series (Q2020153)
  ?conference wdt:P31 wd:Q2020153.
  # label
  ?conference rdfs:label ?conferenceLabel filter (lang(?conferenceLabel) = "en").
  # short name
  OPTIONAL { ?conference wdt:P1813 ?short_name }
  # country
  OPTIONAL { ?conference wdt:P17 ?country }
  # title
  OPTIONAL { ?conference wdt:P1476 ?title }
}

try it!

More elaborate example: novel series

  1. start with Lord of the Rings
  2. find similar Novel Series

Naive SPARQL Query

# truly tabular query for 
# Q1667921:novel series
# generated by trulytabular.py on 2022-07-27T17:33:43.681991
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?novel_series ?novel_seriesLabel
  ?instance_of
  ?language_of_work_or_name
  ?genre
  ?author
  ?country_of_origin
  ?has_part_s_
  ?publication_date
  ?Freebase_ID
  ?ISFDB_series_ID
  ?title
  ?Google_Knowledge_Graph_ID
WHERE {
  # instanceof Q1667921:novel series
  ?novel_series wdt:P31 wd:Q1667921.
  # label
  ?novel_series rdfs:label ?novel_seriesLabel  
  FILTER (LANG(?novel_seriesLabel) = "en").
  # instance of (P31)
  OPTIONAL { ?novel_series wdt:P31 ?instance_of. }
  # language of work or name (P407)
  OPTIONAL { ?novel_series wdt:P407 ?language_of_work_or_name. }
  # genre (P136)
  OPTIONAL { ?novel_series wdt:P136 ?genre. }
  # author (P50)
  OPTIONAL { ?novel_series wdt:P50 ?author. }
  # country of origin (P495)
  OPTIONAL { ?novel_series wdt:P495 ?country_of_origin. }
  # has part(s) (P527)
  OPTIONAL { ?novel_series wdt:P527 ?has_part_s_. }
  # publication date (P577)
  OPTIONAL { ?novel_series wdt:P577 ?publication_date. }
  # Freebase ID (P646)
  OPTIONAL { ?novel_series wdt:P646 ?Freebase_ID. }
  # ISFDB series ID (P1235)
  OPTIONAL { ?novel_series wdt:P1235 ?ISFDB_series_ID. }
  # title (P1476)
  OPTIONAL { ?novel_series wdt:P1476 ?title. }
  # Google Knowledge Graph ID (P2671)
  OPTIONAL { ?novel_series wdt:P2671 ?Google_Knowledge_Graph_ID. }
}

try it!

Aggregate SPARQL Query with SAMPLE

# truly tabular query for 
# Q1667921:novel series
# generated by trulytabular.py on 2022-07-27T17:33:43.681991
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?novel_series ?novel_seriesLabel
  (SAMPLE (?instance_of) AS ?instance_of  )
  (SAMPLE (?language_of_work_or_name) AS ?language_of_work_or_name)
  (SAMPLE (?genre) AS ?genre)
  (SAMPLE (?author) AS ?author)
  (SAMPLE (?country_of_origin) AS ?country_of_origin)
  (SAMPLE (?has_part_s_) AS ?has_part_s_)
  (SAMPLE (?publication_date) AS ?publication_date)
  (SAMPLE (?Freebase_ID) AS ?Freebase_ID)
  (SAMPLE (?ISFDB_series_ID) AS ?ISFDB_series_ID)
  (SAMPLE (?title) AS ?title )
  (SAMPLE (?Google_Knowledge_Graph_ID) AS ?Google_Knowledge_Graph_ID)
WHERE {
  # instanceof Q1667921:novel series
  ?novel_series wdt:P31 wd:Q1667921.
  # label
  ?novel_series rdfs:label ?novel_seriesLabel  
  FILTER (LANG(?novel_seriesLabel) = "en").
  # instance of (P31)
  OPTIONAL { ?novel_series wdt:P31 ?instance_of. }
  # language of work or name (P407)
  OPTIONAL { ?novel_series wdt:P407 ?language_of_work_or_name. }
  # genre (P136)
  OPTIONAL { ?novel_series wdt:P136 ?genre. }
  # author (P50)
  OPTIONAL { ?novel_series wdt:P50 ?author. }
  # country of origin (P495)
  OPTIONAL { ?novel_series wdt:P495 ?country_of_origin. }
  # has part(s) (P527)
  OPTIONAL { ?novel_series wdt:P527 ?has_part_s_. }
  # publication date (P577)
  OPTIONAL { ?novel_series wdt:P577 ?publication_date. }
  # Freebase ID (P646)
  OPTIONAL { ?novel_series wdt:P646 ?Freebase_ID. }
  # ISFDB series ID (P1235)
  OPTIONAL { ?novel_series wdt:P1235 ?ISFDB_series_ID. }
  # title (P1476)
  OPTIONAL { ?novel_series wdt:P1476 ?title. }
  # Google Knowledge Graph ID (P2671)
  OPTIONAL { ?novel_series wdt:P2671 ?Google_Knowledge_Graph_ID. }
} GROUP BY ?novel_series ?novel_seriesLabe

How tabular are the Academic Conference entries in wikidata?

Result as of 2022-03

property total f1 total% non tabular non tabular% f2 f3 f14 f4 f7 f5 f9
7518
short name 6750 6741 89.8 9 0.1 9
country 7077 7077 94.1 0 0
title 6718 6700 89.4 18 0.3 10 8
part of the series 7139 7120 95 19 0.3 15 4
VIAF ID 2096 2092 27.9 4 0.2 3 1
GND ID 3049 3043 40.6 6 0.2 4 2
location 7209 7180 95.9 29 0.4 24 4 1
start time 6916 6914 92 2 0 2
end time 6912 6909 91.9 3 0 3
official website 596 586 7.9 10 1.7 9 1
main subject 1882 1722 25 160 8.5 131 23 2 2 1 1
described at URL 6512 6510 86.6 2 0 1 1
language used 87 84 1.2 3 3.4 3
is proceedings from 921 901 12.3 20 2.2 16 3 1
WikiCFP event ID 98 98 1.3 0 0

Truly tabular examples