Difference between revisions of "Geograpy"
Revision as of 12:25, 10 October 2020
OsProject
OsProject | |
---|---|
id | geograpy3 |
state | |
owner | somnathrakshit |
title | geograpy |
url | https://github.com/somnathrakshit/geograpy3 |
version | 0.1.16 |
description | |
date | 2020/09/26 |
since | |
until |
What is it?
Geograpy3 is a Python library to extract geographic details like:
- country
- region
- city
from plaintext and websites.
Examples
Example 1 - London 2012 Olympic torch relay route
Let's take the Wikipedia article on the 2012 Summer Olympics torch relay. It mentions quite a few countries, regions and cities. Let's extract that information using geograpy3.
Code
import geograpy
url='https://en.wikipedia.org/wiki/2012_Summer_Olympics_torch_relay'
places = geograpy.get_geoPlace_context(url = url)
print(places)
Result
python example1.py
countries=['Jersey', 'Guernsey', 'Greece', 'Belarus', 'South Africa', 'Australia', 'New Zealand', 'United Kingdom', 'Ireland', 'United States', 'Canada']
regions=['Newcastle', 'Bristol', 'Oxford', 'Southampton', 'Greek', 'Sheffield', 'Greece', 'Media', 'Land', 'Cornwall', 'June', 'Nottingham', 'London', 'Dublin', 'Belfast', 'Guernsey', 'Locog', 'Olympia', 'Shetland', 'Jersey', 'Cardiff']
cities=['Dublin', 'Newcastle', 'Belfast', 'Sheffield', 'Cardiff', 'Oxford', 'Southampton', 'Nottingham', 'London', 'Bristol', 'Media', 'Olympia', 'Guernsey', 'Cornwall']
other=[]
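The result above contains a few obvious false positives (for example 'June', 'Media' and 'Land' among the regions) and several names that appear in more than one list. The following is only a minimal sketch of how such a result could be post-processed. The lists are abbreviated copies of the output above, and the `NOISE` set and both helper functions are hypothetical illustrations, not part of geograpy3.

```python
# Lists abbreviated from the result above.
countries = ['Jersey', 'Guernsey', 'Greece', 'United Kingdom', 'Ireland']
regions = ['Newcastle', 'Greek', 'Greece', 'Media', 'Land', 'Cornwall',
           'June', 'London', 'Guernsey', 'Jersey']
cities = ['Dublin', 'Newcastle', 'London', 'Media', 'Guernsey', 'Cornwall']

# Hypothetical, hand-picked set of false positives (not part of geograpy3).
NOISE = {'June', 'Media', 'Land', 'Greek'}


def drop_noise(names, noise=NOISE):
    """Return names without the known false positives, keeping the order."""
    return [name for name in names if name not in noise]


def ambiguous(*lists):
    """Return the sorted names that occur in more than one of the lists."""
    seen, dupes = set(), set()
    for names in lists:
        for name in set(names):
            if name in seen:
                dupes.add(name)
            else:
                seen.add(name)
    return sorted(dupes)


clean_regions = drop_noise(regions)
overlap = ambiguous(countries, clean_regions, drop_noise(cities))
print(clean_regions)
print(overlap)
```

Names reported by `ambiguous` (such as 'Guernsey' or 'London') are the ones where the country/region/city classification deserves a manual check.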
Getting the source code
git clone https://github.com/somnathrakshit/geograpy3
cd geograpy3
scripts/install
History
first geograpy (2013)
The name "geograpy" was coined by Chris Albon.
Angela Oduor Lungat, Brunobg, Jonathon Morgan, Romina Suarez and other contributors from Ushahidi, Nairobi, Kenya created the first, popular version of geograpy. It was forked more than a hundred times and had more than 200 stars on GitHub.
This version was restricted to Python 2, and as of 2020-09 there are still some 29 open issues in this project. The project is officially archived, so you might want to use geograpy3 instead.
geograpy2 (2014)
The geograpy2 fork was created in 2014. It solves several problems (such as support for UTF-8, place names with multiple words, confusion over homonyms etc.).
The project has not moved forward much since 2015, so you might want to use geograpy3 instead. https://github.com/Corollarium/geograpy2
geograpy3 (2018)
geograpy3 was forked from geograpy2 in 2018 by Somnath Rakshit. It added Python 3 compatibility. In 2020 Wolfgang Fahl joined the project because he needed it for the Proceedings Title Parser, which is part of the ConfIDent project.
Data used
Overview
Cities
The cities table is derived from the GeoLite2 database by MaxMind.
Countries
The countries table is derived from Wikidata:
# get a list of countries
# for geograpy3 library
# see https://github.com/somnathrakshit/geograpy3/issues/15
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>
PREFIX pq: <http://www.wikidata.org/prop/qualifier/>
# get country details
SELECT DISTINCT ?country ?countryLabel ?countryIsoCode ?countryPopulation ?countryGDP_perCapita ?coord WHERE {
# instance of sovereign state (Q3624078)
?country wdt:P31/wdt:P279* wd:Q3624078 .
# label for the country
?country rdfs:label ?countryLabel filter (lang(?countryLabel) = "en").
# get the coordinates
?country wdt:P625 ?coord.
# https://www.wikidata.org/wiki/Property:P297 ISO 3166-1 alpha-2 code
?country wdt:P297 ?countryIsoCode.
# population of country
?country wdt:P1082 ?countryPopulation.
# https://www.wikidata.org/wiki/Property:P2132
# nominal GDP per capita
?country wdt:P2132 ?countryGDP_perCapita.
}
try it! - 190 results in some 1.3 s as of 2020-09
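A query like the one above can also be run from Python. The following is a minimal sketch using only the standard library. It assumes the public Wikidata endpoint at `https://query.wikidata.org/sparql` and the standard SPARQL JSON results format. The function names and the User-Agent string are hypothetical, and this is not necessarily how geograpy3 itself imports the data.

```python
import json
import urllib.parse
import urllib.request

WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"


def run_sparql(query, endpoint=WIKIDATA_ENDPOINT):
    """Send a SPARQL query to the endpoint and return the decoded JSON."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(
        endpoint + "?" + params,
        # Placeholder User-Agent; replace it with one describing your tool.
        headers={"User-Agent": "geograpy3-wiki-example/0.1"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def bindings_to_rows(result):
    """Flatten the SPARQL JSON results format into a list of plain dicts."""
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in result["results"]["bindings"]
    ]


# Canned single-row response in the SPARQL JSON results format, so that the
# parsing step can be tried without network access.
sample = {
    "head": {"vars": ["country", "countryLabel", "countryIsoCode"]},
    "results": {"bindings": [{
        "country": {"type": "uri", "value": "http://www.wikidata.org/entity/Q183"},
        "countryLabel": {"type": "literal", "xml:lang": "en", "value": "Germany"},
        "countryIsoCode": {"type": "literal", "value": "DE"},
    }]},
}
rows = bindings_to_rows(sample)
print(rows)
# With network access: rows = bindings_to_rows(run_sparql(query_text))
```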
Regions
The regions list is derived from Wikidata
# get a list of regions
# for geograpy3 library
# see https://github.com/somnathrakshit/geograpy3/issues/15
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
SELECT DISTINCT ?country ?countryLabel ?countryIsoCode ?region (max(?regionAlpha2) as ?regionIsoCode) ?regionLabel (max(?population) as ?regionPopulation) ?location
WHERE
{
# administrative unit of first order
?region wdt:P31/wdt:P279* wd:Q10864048.
OPTIONAL {
?region rdfs:label ?regionLabel filter (lang(?regionLabel) = "en").
}
# filter historic regions
# FILTER NOT EXISTS {?region wdt:P576 ?end}
# get the population
# https://www.wikidata.org/wiki/Property:P1082
OPTIONAL { ?region wdt:P1082 ?population. }
# https://www.wikidata.org/wiki/Property:P297
OPTIONAL {
?region wdt:P17 ?country.
# label for the country
?country rdfs:label ?countryLabel filter (lang(?countryLabel) = "en").
?country wdt:P297 ?countryIsoCode.
}
# ISO 3166-2 code of the state/province
?region wdt:P300 ?regionAlpha2.
# https://www.wikidata.org/wiki/Property:P625
OPTIONAL { ?region wdt:P625 ?location. }
} GROUP BY ?country ?countryLabel ?countryIsoCode ?region ?regionIsoCode ?regionLabel ?location
ORDER BY ?regionIsoCode
try it! - 3727 results in 7.9 s as of 2020-09
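Two of the returned columns usually need a little post-processing: the coordinate (P625) comes back as a WKT literal of the form `Point(longitude latitude)`, and the region code (P300) is an ISO 3166-2 code such as `US-CA`. The helpers below are a hedged sketch with hypothetical names. The sample values are illustrative only, and the regular expression deliberately ignores exotic number formats such as scientific notation.

```python
import re

# Matches 'Point(<lon> <lat>)' with plain decimal numbers.
_POINT = re.compile(
    r"Point\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)",
    re.IGNORECASE,
)


def parse_point(wkt):
    """Parse 'Point(lon lat)' and return (lat, lon) as floats, or None."""
    match = _POINT.fullmatch(wkt.strip())
    if match is None:
        return None
    lon, lat = map(float, match.groups())
    return lat, lon


def split_region_code(code):
    """Split an ISO 3166-2 code such as 'US-CA' into (country, subdivision)."""
    country, _, subdivision = code.partition("-")
    return country, subdivision


print(parse_point("Point(-119.5 37.25)"))  # illustrative value
print(split_region_code("US-CA"))
```

Note that WKT puts the longitude first, which is why `parse_point` swaps the order before returning.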
Adding city details from Wikidata
Query
# get a list of human settlements having a geoName identifier
# to add to geograpy3 library
# see https://github.com/somnathrakshit/geograpy3/issues/15
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
SELECT ?city ?cityLabel ?cityPop ?geoNameId ?country ?countryLabel ?countryIsoCode ?countryPopulation
WHERE {
# geoName Identifier
?city wdt:P1566 ?geoNameId.
# instance of human settlement https://www.wikidata.org/wiki/Q486972
?city wdt:P31/wdt:P279* wd:Q486972 .
# population of city
OPTIONAL { ?city wdt:P1082 ?cityPop.}
# label of the City
?city rdfs:label ?cityLabel filter (lang(?cityLabel) = "en").
# country this city belongs to
?city wdt:P17 ?country .
# label for the country
?country rdfs:label ?countryLabel filter (lang(?countryLabel) = "en").
# https://www.wikidata.org/wiki/Property:P297 ISO 3166-1 alpha-2 code
?country wdt:P297 ?countryIsoCode.
# population of country
?country wdt:P1082 ?countryPopulation.
OPTIONAL {
?country wdt:P2132 ?countryGdpPerCapita.
}
}
try it! - you will probably experience a timeout on this query. It takes about 1 min on a local Wikidata copy based on Blazegraph.
If you are interested in the result, you can download the SQLite version of the query result and, e.g., inspect it with the DB Browser for SQLite.
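As an alternative to a GUI tool, the database can be inspected with Python's built-in `sqlite3` module. The sketch below lists all tables with their columns. It uses an in-memory stand-in instead of the real download. The `cityPops` column set is an assumption taken from the queries on this page, and `describe` is a hypothetical helper.

```python
import sqlite3


def describe(connection):
    """Return {table name: [column names]} for all tables in the database."""
    tables = [row[0] for row in connection.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
    return {
        table: [column[1] for column in
                connection.execute(f'PRAGMA table_info("{table}")')]
        for table in tables
    }


# In-memory stand-in for the downloaded file (assumed columns).
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE cityPops "
    "(city TEXT, cityLabel TEXT, cityPop REAL, geoNameId TEXT)")
schema = describe(connection)
print(schema)
```

For the real data, replace `":memory:"` with the path of the downloaded SQLite file and skip the `CREATE TABLE` statement.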
CityPops Stats
Here are some statistical queries on the data imported from Wikidata:
select count(*) from cityPops where cityPop is not Null
164503
select count(*) from cityPops
453306
select count(distinct geoNameId) from cityPops
414198
select count(*)
from cities c
join cityPops cp on c.geoname_id =cp.geoNameId
90482
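The gap between the total row count and the number of distinct geoNameIds suggests that some settlements appear in more than one row, for instance with several population values. This is an interpretation, not something stated by the data itself. The toy example below runs the same four queries on made-up rows to show how each count reacts to duplicates, missing populations and ids that are absent from `cities`. The table layout is an assumption based on the column names used above.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE cities (geoname_id INTEGER, city_name TEXT);
CREATE TABLE cityPops (geoNameId INTEGER, cityLabel TEXT, cityPop REAL);
INSERT INTO cities VALUES (1, 'Alpha'), (2, 'Beta'), (4, 'Delta');
INSERT INTO cityPops VALUES
  (1, 'Alpha', 1000),
  (1, 'Alpha', 1200),  -- second population value for the same settlement
  (2, 'Beta', NULL),   -- no population known
  (3, 'Gamma', 500);   -- id not present in cities
""")


def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return con.execute(sql).fetchone()[0]


with_pop = scalar("select count(*) from cityPops where cityPop is not null")
total = scalar("select count(*) from cityPops")
distinct_ids = scalar("select count(distinct geoNameId) from cityPops")
joined = scalar(
    "select count(*) from cities c "
    "join cityPops cp on c.geoname_id = cp.geoNameId")
print(with_pop, total, distinct_ids, joined)
```

In the toy data the join counts id 1 twice, so a join count is an upper bound on the number of matched cities rather than the exact number.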
Difference in Name/Label
17499 differences:
select c.city_name as name,cp.cityLabel,c.*,city as wikidataurl,cityPop
from cities c
join cityPops cp
on c.geoname_id=cp.geoNameId
where not c.city_name =cp.cityLabel
group by geoNameId
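The `group by geoNameId` clause in the query above collapses multiple Wikidata rows for the same settlement into one reported difference. A minimal sketch on made-up rows illustrates this. The table layout and the `example:` identifiers are placeholders, and the query is shortened to the columns that are identical within a group.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE cities (geoname_id INTEGER, city_name TEXT);
CREATE TABLE cityPops (city TEXT, geoNameId INTEGER, cityLabel TEXT, cityPop REAL);
INSERT INTO cities VALUES (1, 'Alpha'), (2, 'Beta City');
INSERT INTO cityPops VALUES
  ('example:alpha', 1, 'Alpha', 1000),  -- same name in both tables
  ('example:beta',  2, 'Beta',  200),   -- label differs from city_name
  ('example:beta',  2, 'Beta',  250);   -- duplicate row for the same id
""")

differences = con.execute("""
select c.city_name as name, cp.cityLabel, cp.city as wikidataurl
from cities c
join cityPops cp on c.geoname_id = cp.geoNameId
where not c.city_name = cp.cityLabel
group by geoNameId
""").fetchall()
print(differences)
```

Without the `group by`, the 'Beta City' mismatch would be listed once per duplicate row.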