ProceedingsTitleParser

From BITPlan Wiki
Revision as of 06:38, 10 July 2020 by Wf
OsProject
id  ProceedingsTitleParser
state  
owner  WolfgangFahl
title  Shallow Semantic Parser to extract metadata from scientific proceedings titles
url  https://github.com/WolfgangFahl/ProceedingsTitleParser
version  0.0.1
description  
date  2020-07-02
since  
until  

Prerequisites

  1. git
  2. python 3 (>=3.6)
  3. some unix command line tools like curl, grep, wc

Tested on Linux (Ubuntu bionic/Travis) and macOS High Sierra 10.13.6 (using MacPorts)

Installation

git clone https://github.com/WolfgangFahl/ProceedingsTitleParser
cd ProceedingsTitleParser
./install

Getting the sample data

./getsamples

Testing

./test

Usage

The Proceedings Title Parser currently has three modes:

  1. Proceedings Title Parsing
  2. Named Entity Recognition (NER)
  3. Extract/Scrape

All three modes expect one or more lines of text as input. The mode is selected automatically for each line, based on the content of that line.

Example

The input:

Proceedings of the 2020 International Conference on High Performance Computing & Simulation (HPCS 2020)
BIR 2019
http://ceur-ws.org/Vol-2635/
  • will trigger Proceedings Title Parsing mode for the first line
  • Named Entity Recognition mode for the second line
  • Extract/Scrape mode for the third line
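
The per-line mode selection could be sketched as follows. This is a minimal illustration, not the project's code: the function name `detect_mode`, the mode constants, and the two heuristics (a URL prefix selects Extract/Scrape, a leading "Proceedings of" selects title parsing) are assumptions made for this example.

```python
import re

# Hypothetical mode names; the real parser may use different identifiers.
MODE_TITLE = "Proceedings Title Parsing"
MODE_NER = "Named Entity Recognition"
MODE_EXTRACT = "Extract/Scrape"


def detect_mode(line: str) -> str:
    """Guess the mode for one input line (assumed heuristic)."""
    text = line.strip()
    # Assumption: a line that starts like a URL selects Extract/Scrape mode.
    if re.match(r"https?://", text, re.IGNORECASE):
        return MODE_EXTRACT
    # Assumption: a line starting with "Proceedings of" is a full title.
    if text.lower().startswith("proceedings of"):
        return MODE_TITLE
    # Everything else is treated as free-form words to look up.
    return MODE_NER


lines = [
    "Proceedings of the 2020 International Conference on "
    "High Performance Computing & Simulation (HPCS 2020)",
    "BIR 2019",
    "http://ceur-ws.org/Vol-2635/",
]
for line in lines:
    print(detect_mode(line), "<-", line)
```

Checking for a URL first keeps the other two heuristics simple, since a URL can never be mistaken for a title or a list of words.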

Proceedings Title Parsing

In Proceedings Title Parsing mode, the content of a line is parsed word by word to find typical elements of proceedings titles, such as:

  • ordinal (First, III., 6th, 23rd)
  • city (Paris, New York, London, Berlin, Barcelona)
  • country (USA, Italy, Germany, France, UK)
  • province (California, Texas, Florida, Ontario, NRW)
  • scope (International, European, Czech, Italian, National)
  • year (2016, 1988, 2017, 2018, 2019, 2020)

While the elements above can be found by comparing each word with lists of known cities, countries and provinces, finding the acronym requires a lookup in a corpus of proceedings. This lookup is performed automatically to check whether a known acronym such as ISWC, ICEIS or SIGSPATIAL is present. If the acronym is found, a link to the matching source record is shown.
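
The word-by-word classification against known lists could look roughly like the sketch below. The lookup tables hold only a few of the sample values listed above, and the names `LOOKUPS`, `classify` and `parse_title` as well as the regular expressions are assumptions for illustration, not the parser's actual implementation.

```python
import re

# Tiny illustrative lookup tables; a real parser would load much larger lists.
LOOKUPS = {
    "city": {"Paris", "London", "Berlin", "Barcelona"},
    "country": {"USA", "Italy", "Germany", "France", "UK"},
    "province": {"California", "Texas", "Florida", "Ontario", "NRW"},
    "scope": {"International", "European", "Czech", "Italian", "National"},
}
ORDINAL_WORDS = {"First", "Second", "Third"}
# Matches numeric ordinals such as 6th or 23rd and roman numerals such as III.
ORDINAL_PATTERN = re.compile(r"^(\d+(st|nd|rd|th)|[IVXLC]+\.)$")
YEAR_PATTERN = re.compile(r"^(19|20)\d{2}$")


def classify(word):
    """Return the category of a single word, or None if it is unknown."""
    token = word.strip(",()")
    if YEAR_PATTERN.match(token):
        return "year"
    if token in ORDINAL_WORDS or ORDINAL_PATTERN.match(token):
        return "ordinal"
    for category, known in LOOKUPS.items():
        if token in known:
            return category
    return None


def parse_title(title):
    """Collect the first match per category while scanning word by word."""
    found = {}
    for word in title.split():
        category = classify(word)
        if category and category not in found:
            found[category] = word.strip(",()")
    return found


print(parse_title("Proceedings of the 6th International Workshop, Berlin, Germany, 2019"))
```

A pure single-word scan like this cannot match multi-word names such as "New York"; handling those would need a lookahead over neighbouring words, which the sketch leaves out.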

Example

Input

Proceedings of the 2020 International Conference on High Performance Computing & Simulation (HPCS 2020)

Result

Source: OPEN RESEARCH
Acronym: HPCS 2020
Url: https://www.openresearch.org/wiki/HPCS%202020
Found by: HPCS 2020

{'year': '2020', 'scope': 'International', 'event': 'Conference', 'topic': 'High Performance Computing & Simulation', 'acronym': 'HPCS 2020', 'title': 'Proceedings of the 2020 International Conference on High Performance Computing & Simulation (HPCS 2020)', 'source': 'line', 'publish': 'Proceedings', 'syntax': 'on', 'delimiter': '&'}
{ "acronym": "HPCS 2020", "city": "Barcelona", "country": "Spain", "creation_date": "2020-03-26 06:01:12", "end_date": "2020-07-24 00:00:00", "event": "HPCS 2020", "foundBy": "HPCS 2020", "homePage": null, "homepage": "http://conf.cisedu.info/rp/hpcs20/", "modification_date": "2020-03-26 08:16:33", "series": "HPCS", "source": "OPEN RESEARCH", "start_date": "2020-07-20 00:00:00", "title": "2020 International Conference on High Performance Computing & Simulation", "url": "https://www.openresearch.org/wiki/HPCS 2020" }
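
The acronym lookup described for this mode could be sketched as follows. The `CORPUS` dictionary is a hypothetical stand-in for the proceedings corpus and contains only the one record from the example above. The rule that the acronym is the last parenthesised group of the title is also an assumption.

```python
import re

# Hypothetical stand-in for the proceedings corpus: acronym -> source record.
# The single entry is taken from the example on this page.
CORPUS = {
    "HPCS 2020": {
        "source": "OPEN RESEARCH",
        "url": "https://www.openresearch.org/wiki/HPCS%202020",
    },
}

# Assumption: the acronym is the last parenthesised group of the title.
ACRONYM_PATTERN = re.compile(r"\(([^()]+)\)\s*$")


def find_acronym(title):
    """Return the text of the trailing parenthesised group, if any."""
    match = ACRONYM_PATTERN.search(title)
    return match.group(1).strip() if match else None


def lookup(title):
    """Return the corpus record for the title's acronym, or None."""
    acronym = find_acronym(title)
    record = CORPUS.get(acronym)
    if record is None:
        return None
    return {"acronym": acronym, "foundBy": acronym, **record}


title = (
    "Proceedings of the 2020 International Conference on "
    "High Performance Computing & Simulation (HPCS 2020)"
)
print(lookup(title))
```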

Named Entity Recognition mode (NER)

In named entity recognition mode, the words to be looked up can be entered directly without following the patterns of typical proceedings titles. Syntax elements like "Proceedings of ..." may be left out. This mode often gives good results as well, but it cannot use the information provided by syntactic elements like "at". For example, for the title "Proceedings of the 1st conference of the history of Rome at Paris" the NER equivalent "1 conference history Rome Paris" is ambiguous: without "at" it is unclear whether Rome or Paris is the location.
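
The ambiguity can be made concrete with a small sketch. The city list and both helper functions are hypothetical and exist only to illustrate why the syntax element "at" matters. They are not part of the parser.

```python
# Illustrative city list; both names occur in the example title.
KNOWN_CITIES = {"Rome", "Paris"}


def city_candidates(words):
    """NER-style lookup: every known city is an equally valid candidate."""
    return [word for word in words if word in KNOWN_CITIES]


def location_from_syntax(words):
    """Title-style lookup: a known city directly after 'at' is the venue."""
    for index, word in enumerate(words[:-1]):
        if word == "at" and words[index + 1] in KNOWN_CITIES:
            return words[index + 1]
    return None


title = "Proceedings of the 1st conference of the history of Rome at Paris".split()
ner = "1 conference history Rome Paris".split()

print(city_candidates(ner))          # two candidates, no way to rank them
print(location_from_syntax(title))   # the word after "at" settles it
print(location_from_syntax(ner))     # no syntax element, so no answer
```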

Example

Input

BIR 2019

Result

#  Source         Acronym   Url                                           Found by
1  OPEN RESEARCH  BIR 2019  https://www.openresearch.org/wiki/BIR%202019  BIR 2019
2  BIR 2019       BIR 2019  http://ceur-ws.org/Vol-2345                   BIR 2019
{'title': 'BIR 2019', 'source': 'line', 'year': '2019'}
{ "acronym": "BIR 2019", "city": null, "country": null, "creation_date": "2020-03-09 11:01:20", "end_date": "2019-09-25 00:00:00", "event": "BIR 2019", "foundBy": "BIR 2019", "homePage": null, "homepage": "https://bir2019.ue.katowice.pl/", "modification_date": "2020-07-06 11:47:07", "series": "BIR", "source": "OPEN RESEARCH", "start_date": "2019-09-23 00:00:00", "title": "18th International Conference on Perspectives in Business Informatics Research", "url": "https://www.openresearch.org/wiki/BIR 2019" }
{ "acronym": "BIR 2019", "city": "Cologne", "country": "Germany", "enum": "8th", "event": "BIR 2019", "eventId": "Vol-2345", "eventType": "Workshop", "foundBy": "BIR 2019", "homePage": null, "month": "April", "ordinal": 14, "publish": "Proceedings", "scope": "International", "source": "CEUR-WS", "syntax": "on", "title": "Proceedings of the 8th International Workshop on Bibliometric-enhanced Information Retrieval (BIR 2019)co-located with the 41st European Conference on Information Retrieval (ECIR 2019),Cologne, Germany, April 14th, 2019.Submitted by: Guillaume Cabanac", "topic": "Bibliometric-enhanced Information Retrieval", "url": "http://ceur-ws.org/Vol-2345", "year": "2019" }

Extract/Scrape mode

If the line contains a URL of a known source of conference or proceedings title information, the page is visited automatically and the metadata is extracted.
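
Recognising a known source from a URL might be sketched as below. The `KNOWN_SOURCES` registry and the function `identify_source` are assumptions for illustration. The sketch only derives the source name and event id from the URL and does not fetch or scrape the page.

```python
from urllib.parse import urlparse

# Hypothetical registry of known sources, keyed by host name.
KNOWN_SOURCES = {"ceur-ws.org": "CEUR-WS"}


def identify_source(line):
    """Return source, eventId and normalised url for a known URL, else None."""
    parsed = urlparse(line.strip())
    if parsed.scheme not in ("http", "https"):
        return None
    source = KNOWN_SOURCES.get(parsed.hostname)
    if source is None:
        return None
    # Assumption: the event id is the URL path without surrounding slashes.
    event_id = parsed.path.strip("/")
    return {
        "source": source,
        "eventId": event_id,
        "url": f"{parsed.scheme}://{parsed.hostname}/{event_id}",
    }


print(identify_source("http://ceur-ws.org/Vol-2599/"))
```

A real implementation would follow this step with an HTTP request and page-specific extraction of the title and metadata.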

Example

Input

http://ceur-ws.org/Vol-2599/

Result

#  Source   Acronym  Url                          Found by
1  CEUR-WS  BlockSW  http://ceur-ws.org/Vol-2599  BlockSW
{'prefix': 'Blockchain enabled Semantic Web', 'event': 'Workshop', 'acronym': 'BlockSW', 'title': 'Proceedings of the Blockchain enabled Semantic Web Workshop (BlockSW) and Contextualized Knowledge Graphs (CKG) Workshop', 'source': 'CEUR-WS', 'eventId': 'Vol-2599', 'publish': 'Proceedings', 'syntax': 'and'}
{ "acronym": "BlockSW", "event": "BlockSW", "eventId": "Vol-2599", "eventType": "Workshop", "foundBy": "BlockSW", "homePage": null, "month": "October", "prefix": "Blockchain enabled Semantic Web", "publish": "Proceedings", "source": "CEUR-WS", "syntax": "New", "title": "Proceedings of the Blockchain enabled Semantic Web Workshop (BlockSW) and Contextualized Knowledge Graphs (CKG) Workshop (BlockSW-CKG 2019),Auckland, New Zealand, October 27, 2019.Submitted by: Reza Samavi", "url": "http://ceur-ws.org/Vol-2599", "year": "2019" }