Geograpy

From BITPlan Wiki
Revision as of 09:59, 26 September 2020 by Wf (talk | contribs) (→‎Data used)

OsProject

id  geograpy3
state  
owner  somnathrakshit
title  geograpy
url  https://github.com/somnathrakshit/geograpy3
version  0.1.15
description  
date  2020/09/26
since  
until  

What is it?

Geograpy3 is a Python library to extract geographic details like:

  • country
  • region
  • city

from plaintext and websites.

Examples

Let's take the BBC News article of May 2011, 'London 2012 Olympic torch relay route revealed'. The article mentions quite a few countries, regions and cities. Let's extract that information using geograpy3.

Code

example1.py

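import geograpy

# URL of the web page to analyze
url = 'https://www.bbc.com/news/av/world-africa-54272558'
# fetch the page and extract the countries, regions and cities it mentions
places = geograpy.get_geoPlace_context(url=url)
print(places)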

Result

countries=['Jersey', 'Guernsey', 'Greece', 'Belarus', 'South Africa', 'Australia', 'New Zealand', 'United Kingdom', 'Ireland', 'United States', 'Canada']
regions=['Newcastle', 'Bristol', 'Oxford', 'Southampton', 'Greek', 'Sheffield', 'Greece', 'Media', 'Land', 'Cornwall', 'June', 'Nottingham', 'London', 'Dublin', 'Belfast', 'Guernsey', 'Locog', 'Olympia', 'Shetland', 'Jersey', 'Cardiff']
cities=['Dublin', 'Newcastle', 'Belfast', 'Sheffield', 'Cardiff', 'Oxford', 'Southampton', 'Nottingham', 'London', 'Bristol', 'Media', 'Olympia', 'Guernsey', 'Cornwall']
other=[]

Getting the source code

git clone https://github.com/somnathrakshit/geograpy3
cd geograpy3
scripts/install

History

first geograpy (2013)

The name "geograpy" was coined by Chris Albon.

Angela Oduor Lungat, Brunobg, Jonathon Morgan, Romina Suarez and other contributors from Ushahidi, Nairobi, Kenya created the first, popular version of geograpy. It was forked more than a hundred times and gained more than 200 stars on GitHub.

This version was restricted to Python 2, and as of 2020-09 there are still some 29 open issues in this project. The project is officially archived, so you might want to use geograpy3 instead.

geograpy2 (2014)

The geograpy2 fork was created in 2014. It solves several problems, such as support for UTF-8, place names with multiple words and confusion over homonyms.

Since 2015 the project has not moved forward much, so you might want to use geograpy3 instead. See https://github.com/Corollarium/geograpy2

geograpy3 (2018)

geograpy3 was forked from geograpy2 in 2018 by Somnath Rakshit and added Python 3 compatibility. In 2020 Wolfgang Fahl joined the project because he needed it for the Proceedings Title Parser, which is part of the ConfIDent project.

Data used

title
geograpy Tables 2020-09-26
[© 2020 geograpy3 project]
end title
package geograpy3 {

 class cities << Entity >> {
  city_name : TEXT 
  continent_code : TEXT 
  continent_name : TEXT 
  country_iso_code : TEXT 
  country_name : TEXT 
  geoname_id : TEXT <<PK>>
  is_in_european_union : TEXT 
  locale_code : TEXT 
  metro_code : TEXT 
  subdivision_1_iso_code : TEXT 
  subdivision_1_name : TEXT 
  subdivision_2_iso_code : TEXT 
  subdivision_2_name : TEXT 
  time_zone : TEXT 
 }
 class countries << Entity >> {
  coord : TEXT 
  country : TEXT 
  countryGDP_perCapita : FLOAT 
  countryIsoCode : TEXT <<PK>>
  countryLabel : TEXT 
  countryPopulation : FLOAT 
 }
 class regions << Entity >> {
  country : TEXT 
  countryIsoCode : TEXT 
  countryLabel : TEXT 
  location : TEXT 
  region : TEXT 
  regionIsoCode : TEXT 
  regionLabel : TEXT 
  regionPopulation : FLOAT 
 }
 class City_wikidata << Entity >> {
  cityPopulation : FLOAT 
  coord : TEXT 
  country : TEXT 
  countryGDP_perCapita : FLOAT 
  countryIsoCode : TEXT 
  countryLabel : TEXT 
  countryPopulation : FLOAT 
  date : TIMESTAMP 
  name : TEXT 
  ratio : TEXT 
  region : TEXT 
  regionIsoCode : TEXT 
  regionLabel : TEXT 
  wikidataurl : TEXT 
 }
 class prefixes << Entity >> {
  count : INTEGER 
  level : INTEGER 
  prefix : TEXT <<PK>>
 }
 class ambiguous << Entity >> {
  name : TEXT <<PK>>
 }
 class cityPops << Entity >> {
  city : TEXT 
  cityLabel : TEXT 
  cityPop : FLOAT 
  country : TEXT 
  countryIsoCode : TEXT 
  countryLabel : TEXT 
  countryPopulation : FLOAT 
  geoNameId : TEXT 
 }
 class citiesWithPopulation << Entity >> {
  cityPop : FLOAT 
  city_name : TEXT 
  continent_code : TEXT 
  continent_name : TEXT 
  country_iso_code : TEXT 
  country_name : TEXT 
  geoname_id : TEXT 
  is_in_european_union : TEXT 
  locale_code : TEXT 
  metro_code : TEXT 
  subdivision_1_iso_code : TEXT 
  subdivision_1_name : TEXT 
  subdivision_2_iso_code : TEXT 
  subdivision_2_name : TEXT 
  time_zone : TEXT 
  wikidataurl : TEXT 
 }

}

Adding city details from Wikidata

Query

# get a list of human settlements having a geoName identifier
# to add to geograpy3 library
# see https://github.com/somnathrakshit/geograpy3/issues/15
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
SELECT ?city ?cityLabel ?cityPop ?geoNameId ?country ?countryLabel ?countryIsoCode ?countryPopulation
WHERE {
  # geoName Identifier
  ?city wdt:P1566 ?geoNameId.
  # instance of human settlement https://www.wikidata.org/wiki/Q486972
  ?city wdt:P31/wdt:P279* wd:Q486972 .
  # population of city
  OPTIONAL { ?city wdt:P1082 ?cityPop.}

  # label of the City
  ?city rdfs:label ?cityLabel filter (lang(?cityLabel) = "en").
  # country this city belongs to
  ?city wdt:P17 ?country .
  # label for the country
  ?country rdfs:label ?countryLabel filter (lang(?countryLabel) = "en").
  # https://www.wikidata.org/wiki/Property:P297 ISO 3166-1 alpha-2 code
  ?country wdt:P297 ?countryIsoCode.
  # population of country
  ?country wdt:P1082 ?countryPopulation.
  OPTIONAL {
     ?country wdt:P2132 ?countryGdpPerCapita.
  }
}

Try it! You will probably experience a timeout on this query. It takes about 1 minute on a local Wikidata copy based on Blazegraph.

If you are interested in the result, you can download the SQLite version of the query result and, e.g., inspect it with the DB Browser for SQLite.
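
How a query result could end up in an SQLite table can be sketched as follows. This is a minimal illustration, not the import code actually used by geograpy3. It assumes the result is available in the standard SPARQL JSON results format, where each binding maps a variable name to a dictionary with a "value" entry. The helper names binding_to_row and store_city_pops are hypothetical, and the two sample bindings are invented placeholders, not real Wikidata entries.

```python
import sqlite3

# Variables selected by the SPARQL query above
COLUMNS = ["city", "cityLabel", "cityPop", "geoNameId",
           "country", "countryLabel", "countryIsoCode", "countryPopulation"]
# Variables that hold numbers and are stored as FLOAT
NUMERIC = {"cityPop", "countryPopulation"}


def binding_to_row(binding):
    """Flatten one SPARQL JSON binding into a tuple ordered like COLUMNS."""
    row = []
    for col in COLUMNS:
        cell = binding.get(col)  # OPTIONAL variables such as cityPop may be absent
        if cell is None:
            row.append(None)
        elif col in NUMERIC:
            row.append(float(cell["value"]))
        else:
            row.append(cell["value"])
    return tuple(row)


def store_city_pops(conn, bindings):
    """Create the cityPops table if needed and insert all bindings."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cityPops ("
        "city TEXT, cityLabel TEXT, cityPop FLOAT, geoNameId TEXT, "
        "country TEXT, countryLabel TEXT, countryIsoCode TEXT, "
        "countryPopulation FLOAT)"
    )
    placeholders = ",".join("?" for _ in COLUMNS)
    conn.executemany(
        f"INSERT INTO cityPops ({','.join(COLUMNS)}) VALUES ({placeholders})",
        [binding_to_row(b) for b in bindings],
    )
    conn.commit()


# Invented sample bindings, for illustration only
sample_bindings = [
    {
        "city": {"value": "http://example.org/entity/city1"},
        "cityLabel": {"value": "Exampleville"},
        "cityPop": {"value": "1234"},
        "geoNameId": {"value": "900001"},
        "country": {"value": "http://example.org/entity/country1"},
        "countryLabel": {"value": "Exampleland"},
        "countryIsoCode": {"value": "XX"},
        "countryPopulation": {"value": "5000000"},
    },
    {
        # no cityPop: the OPTIONAL clause did not match for this settlement
        "city": {"value": "http://example.org/entity/city2"},
        "cityLabel": {"value": "Samplecity"},
        "geoNameId": {"value": "900002"},
        "country": {"value": "http://example.org/entity/country1"},
        "countryLabel": {"value": "Exampleland"},
        "countryIsoCode": {"value": "XX"},
        "countryPopulation": {"value": "5000000"},
    },
]

conn = sqlite3.connect(":memory:")
store_city_pops(conn, sample_bindings)
```

Because the population is requested in an OPTIONAL clause, a settlement without a population statement still yields a row; its cityPop column is simply NULL.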

CityPops Stats

Here are some statistics queries about the data imported from Wikidata. Each query is followed by its result:

select count(*) from cityPops where cityPop is not null
164503
select count(*) from cityPops
453306
select count(distinct geoNameId) from cityPops
414198
select count(*)
from cities c
join cityPops cp on c.geoname_id = cp.geoNameId
90482
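
The four statistics queries can be tried out on a tiny in-memory database with Python's sqlite3 module. The sketch below is only an illustration: the tables are reduced to the columns the queries need, and the rows are invented, so the counts differ from the real numbers above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced versions of the cities and cityPops tables (assumption: only the
# columns needed by the statistics queries are kept)
conn.executescript("""
CREATE TABLE cities (geoname_id TEXT PRIMARY KEY, city_name TEXT);
CREATE TABLE cityPops (city TEXT, cityLabel TEXT, cityPop FLOAT, geoNameId TEXT);
""")
# Invented sample rows
conn.executemany("INSERT INTO cities VALUES (?, ?)", [
    ("1", "A-town"),
    ("2", "B-ville"),   # has no counterpart in cityPops
])
conn.executemany("INSERT INTO cityPops VALUES (?, ?, ?, ?)", [
    ("c1", "A-town", 1000.0, "1"),
    ("c1", "A-town", None, "1"),    # second row for the same geoNameId
    ("c3", "C-burg", 500.0, "3"),   # geoNameId not present in cities
])

STATS = {
    "with_population": "select count(*) from cityPops where cityPop is not null",
    "total": "select count(*) from cityPops",
    "distinct_geonames": "select count(distinct geoNameId) from cityPops",
    "matched": "select count(*) from cities c "
               "join cityPops cp on c.geoname_id = cp.geoNameId",
}
stats = {name: conn.execute(sql).fetchone()[0] for name, sql in STATS.items()}
print(stats)
```

In this toy data the total row count (3) exceeds the number of distinct geoNameId values (2) because one identifier occurs in two rows, and the join counts every matching cityPops row, so duplicates are counted there as well.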