ProceedingsTitleParser

OsProject
id  ProceedingsTitleParser
state  
owner  WolfgangFahl
title  Shallow Semantic Parser to extract metadata from scientific proceedings titles
url  https://github.com/WolfgangFahl/ProceedingsTitleParser
version  0.0.1
description  
date  2020-07-02
since  
until  

Usage

What is it?

The Proceedings Title Parser Service is a specialized search engine for scientific proceedings and events. It searches a corpus/database built from data sourced from

  1. http://www.openresearch.org
  2. http://ceur-ws.org
  3. http://www.wikidata.org
  4. http://confref.org
  5. http://crossref.org
  6. https://dblp.org/
  7. GND (in progress)
  8. http://www.wikicfp.com/cfp/
  9. ...

see Data Source statistics

Search Modes

The Proceedings Title Parser currently has three modes:

  1. Proceedings Title Parsing
  2. Named Entity Recognition (NER)
  3. Extract/Scrape

All three modes expect one or more lines of text as input. The mode is automatically detected based on the content of each input line.
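As a rough illustration of how such per-line detection might work, here is a hedged Python sketch; the patterns are assumptions for illustration, not the service's actual heuristics:

import re

def guess_mode(line: str) -> str:
    """Illustrative guess of the processing mode for one input line."""
    line = line.strip()
    if re.match(r"https?://", line):
        return "Extract/Scrape"             # a URL of a known source
    if re.search(r"\bProceedings\b", line, re.IGNORECASE):
        return "Proceedings Title Parsing"  # looks like a full proceedings title
    return "Named Entity Recognition"       # short free text such as "BIR 2019"

for line in ["Proceedings of the 2020 International Conference on High Performance Computing & Simulation (HPCS 2020)",
             "BIR 2019",
             "http://ceur-ws.org/Vol-2599/"]:
    print(f"{guess_mode(line)}: {line}")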

Example

The input:

Proceedings of the 2020 International Conference on High Performance Computing & Simulation (HPCS 2020)
BIR 2019
http://ceur-ws.org/Vol-2599/
  • will trigger Proceedings Title Parsing mode for the first line
  • Named Entity Recognition mode for the second line
  • Extract/Scrape mode for the third line

Try it!

Proceedings Title Parsing mode

In Proceedings Title Parsing mode the content of a line is parsed word by word to find typical elements of Proceedings titles like

  • ordinal (First, III., 6th, 23rd)
  • city (Paris, New York, London, Berlin, Barcelona)
  • country (USA, Italy, Germany, France, UK)
  • provinces (California, Texas, Florida, Ontario, NRW)
  • scope (International, European, Czech, Italian, National)
  • year (2016,1988,2017,2018,2019,2020)
  • ...

While the above elements can be found by comparing against lists of known cities, countries, provinces, etc., finding the event acronym needs a lookup in a corpus/database of proceedings and events. This lookup is performed automatically to check whether a known event acronym like ISWC, ICEIS, SIGSPATIAL, ... can be found in the given context, e.g. together with the year. If the acronym (with year) is found, a link to the matching source record is shown.
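A minimal sketch of such dictionary-based token matching could look as follows; the word lists and regular expressions are illustrative assumptions, not the parser's actual dictionaries or grammar:

import re

LOOKUPS = {
    "city": {"Paris", "New York", "London", "Berlin", "Barcelona"},
    "country": {"USA", "Italy", "Germany", "France", "UK"},
    "scope": {"International", "European", "National"},
}

def classify(token: str) -> str:
    """Classify one title token by regex and dictionary lookup (sketch only)."""
    if re.fullmatch(r"(19|20)\d\d", token):
        return "year"
    if re.fullmatch(r"\d+(st|nd|rd|th)", token):
        return "ordinal"
    for category, words in LOOKUPS.items():
        if token in words:
            return category
    return "word"

title = "Proceedings of the 2020 International Conference on High Performance Computing"
print([(token, classify(token)) for token in title.split()])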

Example

Input

Proceedings of the 2020 International Conference on High Performance Computing & Simulation (HPCS 2020)

Try it!

# Source Acronym Url Found by
1 OPEN RESEARCH HPCS 2020 https://www.openresearch.org/wiki/HPCS%202020 HPCS 2020
{'year': '2020', 'scope': 'International', 'event': 'Conference', 'topic': 'High Performance Computing & Simulation', 'acronym': 'HPCS 2020', 'title': 'Proceedings of the 2020 International Conference on High Performance Computing & Simulation (HPCS 2020)', 'source': 'line', 'publish': 'Proceedings', 'syntax': 'on', 'delimiter': '&'}
{ "acronym": "HPCS 2020", "city": "Barcelona", "country": "Spain", "creation_date": "2020-03-26 06:01:12", "end_date": "2020-07-24 00:00:00", "event": "HPCS 2020", "foundBy": "HPCS 2020", "homePage": null, "homepage": "http://conf.cisedu.info/rp/hpcs20/", "modification_date": "2020-03-26 08:16:33", "series": "HPCS", "source": "OPEN RESEARCH", "start_date": "2020-07-20 00:00:00", "title": "2020 International Conference on High Performance Computing & Simulation", "url": "https://www.openresearch.org/wiki/HPCS 2020" }

Named Entity Recognition mode (NER)

In Named Entity Recognition mode the words to be looked up can be entered directly without following the patterns of typical proceedings titles. Syntax elements like "Proceedings of ..." may be left out. This mode will often also give good results, but it cannot use the information provided by syntactic elements like "at". For example, for "Proceedings of the 1st conference of the history of Rome at Paris" the NER equivalent "1 conference history Rome Paris" will obviously be ambiguous.

Example

Input

BIR 2019

Try it!

Result

# Source Acronym Url Found by
1 OPEN RESEARCH BIR 2019 https://www.openresearch.org/wiki/BIR%202019 BIR 2019
2 BIR 2019 BIR 2019 http://ceur-ws.org/Vol-2345 BIR 2019
3 confref BIR 2019 http://portal.confref.org/list/bir2019 BIR 2019
{'title': 'BIR 2019', 'source': 'line', 'year': '2019'}
{ "acronym": "BIR 2019", "city": null, "country": null, "creation_date": "2020-03-09 11:01:20", "end_date": "2019-09-25 00:00:00", "event": "BIR 2019", "foundBy": "BIR 2019", "homepage": "https://bir2019.ue.katowice.pl/", "modification_date": "2020-07-06 11:47:07", "series": "BIR", "source": "OPEN RESEARCH", "start_date": "2019-09-23 00:00:00", "title": "18th International Conference on Perspectives in Business Informatics Research", "url": "https://www.openresearch.org/wiki/BIR 2019" }
{ "acronym": "BIR 2019", "city": "Cologne", "country": "Germany", "enum": "8th", "event": "BIR 2019", "eventId": "Vol-2345", "eventType": "Workshop", "foundBy": "BIR 2019", "homePage": null, "month": "April", "ordinal": 14, "publish": "Proceedings", "scope": "International", "source": "CEUR-WS", "syntax": "on", "title": "Proceedings of the 8th International Workshop on Bibliometric-enhanced Information Retrieval (BIR 2019)co-located with the 41st European Conference on Information Retrieval (ECIR 2019),Cologne, Germany, April 14th, 2019.Submitted by: Guillaume Cabanac", "topic": "Bibliometric-enhanced Information Retrieval", "url": "http://ceur-ws.org/Vol-2345", "year": "2019" }
{ "acronym": "BIR 2019", "address": null, "area": { "id": 2, "value": "Computer Science" }, "cameraReadyDate": null, "city": "Katowice", "confSeries": { "dblpId": "https://dblp.org/db/conf/bir/", "description": null, "eissn": null, "id": "bir", "issn": null, "name": "Business Informatics Research" }, "country": "Poland", "description": null, "endDate": "2019-09-25", "event": "BIR 2019", "foundBy": "BIR 2019", "homepage": null, "id": "bir2019", "keywords": [ "Data mining", "Enterprise architecture", "Business process model", "Aggregation", "Acceptance", "Banking", "Blockchain", "Comparison", "Digital learning", "e-Health", "Agile modelling method engineering", "Literature review", "Digitalization", "ecosystem", "Barriers of change", "Bing", "Business process management system", "digital Workplace Health Promotion (dWHP)", "Dual video cast", "e-Lecture" ], "name": "Business Informatics Research", "notificationDate": null, "ranks": [], "shortDescription": null, "source": "confref", "startDate": "2019-09-23", "submissionDate": null, "submissionExtended": false, "url": "http://portal.confref.org/list/bir2019", "year": 2019 }

Extract / scrape mode

If a line contains a URL of a known source of conference or proceedings title information, the page is automatically visited and the metadata extracted.
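A hedged sketch of the idea: recognize a known source by its URL pattern, fetch the page, and extract metadata. Here only the HTML title is pulled out as a stand-in for the real extraction logic; the URL pattern and the regular expression are assumptions:

import re
import urllib.request

def scrape(url: str) -> dict:
    """Fetch illustrative metadata for a URL of a known proceedings source."""
    if re.match(r"https?://ceur-ws\.org/Vol-\d+", url):
        html = urllib.request.urlopen(url).read().decode("utf-8", errors="ignore")
        match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
        title = match.group(1).strip() if match else None
        return {"source": "CEUR-WS", "url": url, "title": title}
    raise ValueError(f"{url} is not a known proceedings source")

print(scrape("http://ceur-ws.org/Vol-2599/"))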

Example

Input

http://ceur-ws.org/Vol-2599/

Try it!

Result:

# Source Acronym Url Found by
1 CEUR-WS BlockSW http://ceur-ws.org/Vol-2599 BlockSW
{'prefix': 'Blockchain enabled Semantic Web', 'event': 'Workshop', 'acronym': 'BlockSW', 'title': 'Proceedings of the Blockchain enabled Semantic Web Workshop (BlockSW) and Contextualized Knowledge Graphs (CKG) Workshop', 'source': 'CEUR-WS', 'eventId': 'Vol-2599', 'publish': 'Proceedings', 'syntax': 'and'}
{ "acronym": "BlockSW", "event": "BlockSW", "eventId": "Vol-2599", "eventType": "Workshop", "foundBy": "BlockSW", "homePage": null, "month": "October", "prefix": "Blockchain enabled Semantic Web", "publish": "Proceedings", "source": "CEUR-WS", "syntax": "New", "title": "Proceedings of the Blockchain enabled Semantic Web Workshop (BlockSW) and Contextualized Knowledge Graphs (CKG) Workshop (BlockSW-CKG 2019),Auckland, New Zealand, October 27, 2019.Submitted by: Reza Samavi", "url": "http://ceur-ws.org/Vol-2599", "year": "2019" }

Result formats / Content Negotiation

The following result formats are supported:

  • html (default)
  • csv
  • json
  • xml

To select a result format you can either add the "format" query parameter to your URL or specify the corresponding Accept header in your request.
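Both variants can also be exercised programmatically. The following sketch uses the Python requests package against the public service, with the parameters taken from the examples below (it assumes the service at ptp.bitplan.com is reachable):

import requests

base = "http://ptp.bitplan.com/parse"
params = {"examples": "example2", "titles": "BIR 2019"}

# 1) select JSON via the format query parameter
viaParam = requests.get(base, params={**params, "format": "json"})

# 2) select JSON via content negotiation with an Accept header
viaAccept = requests.get(base, params=params, headers={"Accept": "application/json"})

print(viaParam.json()["count"], viaAccept.json()["count"])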

Example format parameter queries

csv

http://ptp.bitplan.com/parse?examples=example2&titles=PAKM+2000&format=csv

Try it!

The result is the same as with content negotiation for text/csv.

json

http://ptp.bitplan.com/parse?examples=example2&titles=BIR+2019&format=json

Try it!

The result is the same as with content negotiation for application/json.

xml

http://ptp.bitplan.com/parse?examples=example2&titles=EuroPar+2020&format=xml

Try it!

The result is the same as with content negotiation for application/xml or text/xml.

wikison

To get the WikiSon format use:

http://ptp.bitplan.com/parse?examples=example2&titles=EuroPar+2020&format=wikison

Try it!

Result:

{{Event
|homepage=https://2020.euro-par.org/
|event=EuroPar 2020
|series=EuroPar
|acronym=EuroPar 2020
|title=International European Conference on Parallel and Distributed Computing
|city=Warsaw
|country=Poland
|start_date=2020-08-24 00:00:00
|end_date=2020-08-28 00:00:00
|url=https://www.openresearch.org/wiki/EuroPar 2020
}}

Examples with accept header

csv

curl -H "Accept: text/csv" "http://ptp.bitplan.com/parse?examples=example2&titles=PAKM+2000"
Result for text/csv
"month","homePage","eventType","country","acronym","ordinal","url","year","event","eventId","source","syntax","enum","foundBy","location","publish","scope","title"
"October","","Conference","Switzerland","PAKM 2000",3,"http://ceur-ws.org/Vol-34","2000","PAKM 2000","Vol-34","CEUR-WS","the","Third","PAKM 2000","Basel","Proceedings","International","Proceedings of the Third International Conference (PAKM 2000), Basel, Switzerland, October 30-31, 2000.Submitted by: Ulrich Reimer"

json

curl -H "Accept: application/json" "http://ptp.bitplan.com/parse?examples=example2&titles=BIR+2019"
Result for application/json
{
	"count": 3,
	"events": [{
		"acronym": "BIR 2019",
		"city": null,
		"country": null,
		"creation_date": "2020-03-09T11:01:20+00:00",
		"end_date": "2019-09-25T00:00:00+00:00",
		"event": "BIR 2019",
		"foundBy": "BIR 2019",
		"homePage": null,
		"homepage": "https://bir2019.ue.katowice.pl/",
		"modification_date": "2020-07-06T11:47:07+00:00",
		"series": "BIR",
		"source": "OPEN RESEARCH",
		"start_date": "2019-09-23T00:00:00+00:00",
		"title": "18th International Conference on Perspectives in Business Informatics Research",
		"url": "https://www.openresearch.org/wiki/BIR 2019"
	}, {
		"acronym": "BIR 2019",
		"city": "Cologne",
		"country": "Germany",
		"enum": "8th",
		"event": "BIR 2019",
		"eventId": "Vol-2345",
		"eventType": "Workshop",
		"foundBy": "BIR 2019",
		"homePage": null,
		"month": "April",
		"ordinal": 14,
		"publish": "Proceedings",
		"scope": "International",
		"source": "CEUR-WS",
		"syntax": "on",
		"title": "Proceedings of the 8th International Workshop on Bibliometric-enhanced Information Retrieval (BIR 2019)co-located with the 41st European Conference on Information Retrieval (ECIR 2019),Cologne, Germany, April 14th, 2019.Submitted by: Guillaume Cabanac",
		"topic": "Bibliometric-enhanced Information Retrieval",
		"url": "http://ceur-ws.org/Vol-2345",
		"year": "2019"
	}, {
		"acronym": "BIR 2019",
		"address": null,
		"area": {
			"value": "Computer Science",
			"id": 2
		},
		"cameraReadyDate": null,
		"city": "Katowice",
		"confSeries": {
			"id": "bir",
			"issn": null,
			"eissn": null,
			"dblpId": "https://dblp.org/db/conf/bir/",
			"name": "Business Informatics Research",
			"description": null
		},
		"country": "Poland",
		"description": null,
		"endDate": "2019-09-25",
		"event": "BIR 2019",
		"foundBy": "BIR 2019",
		"homepage": null,
		"id": "bir2019",
		"keywords": ["Data mining", "Enterprise architecture", "Business process model", "Aggregation", "Acceptance", "Banking", "Blockchain", "Comparison", "Digital learning", "e-Health", "Agile modelling method engineering", "Literature review", "Digitalization", "ecosystem", "Barriers of change", "Bing", "Business process management system", "digital Workplace Health Promotion (dWHP)", "Dual video cast", "e-Lecture"],
		"name": "Business Informatics Research",
		"notificationDate": null,
		"ranks": [],
		"shortDescription": null,
		"source": "confref",
		"startDate": "2019-09-23",
		"submissionDate": null,
		"submissionExtended": false,
		"url": "http://portal.confref.org/list/bir2019",
		"year": 2019
	}]
}

xml

As an alternative to application/xml, the MIME type text/xml is also accepted and yields the same result.

curl -H "Accept: application/xml" "http://ptp.bitplan.com/parse?examples=example2&titles=EuroPar+2020"
Result for application/xml
<?xml version="1.0" ?>
<events>
	<event>
		<foundBy>EuroPar 2020</foundBy>
		<homepage>https://2020.euro-par.org/</homepage>
		<event>EuroPar 2020</event>
		<series>EuroPar</series>
		<acronym>EuroPar 2020</acronym>
		<title>International European Conference on Parallel and Distributed Computing</title>
		<city>Warsaw</city>
		<country>Poland</country>
		<start_date>2020-08-24T00:00:00</start_date>
		<end_date>2020-08-28T00:00:00</end_date>
		<creation_date>2020-02-27T14:44:52</creation_date>
		<modification_date>2020-02-27T14:44:52</modification_date>
		<url>https://www.openresearch.org/wiki/EuroPar 2020</url>
		<source>OPEN RESEARCH</source>
	</event>
</events>

Running your own service

PreRequisites

If you'd like to run your own copy of this service you'll need:

  1. git
  2. python 3 (>=3.6)
  3. python3-pip
  4. jq
  5. some unix command line tools like curl, grep, wc

Tested on Linux (Ubuntu bionic/Travis CI) and macOS High Sierra 10.13.6 (using MacPorts) as well as in a Windows Subsystem for Linux / Ubuntu 20.04 environment.

Installation

git clone https://github.com/WolfgangFahl/ProceedingsTitleParser
./install

Windows Environment

Under Windows 10 you might want to use the Windows Subsystem for Linux with the Ubuntu 20.04 LTS environment. See https://docs.microsoft.com/windows/wsl/install-win10

Getting the sample data

Getting the sample data from the different sources may take a few minutes. You'll need some 60 MB of disk space as of 2020-07-11.

scripts/getsamples

Updating the sample data

dblp

see https://dblp.uni-trier.de/xml/ for the xml file input

#!/bin/bash
# WF 2020-07-17
# get proceedings xml nodes from dblp xml download
xml=$HOME/downloads/dblp.xml
tmpxml=/tmp/proceedings-dblp.xml
json=sampledata/dblp.json
head -3 $xml > $tmpxml
# make sure there are newlines before and after end tags
# of type proceeding with sed and then filter with awk
# https://stackoverflow.com/a/24707372/1497139
cat $xml | sed $'s/<proceedings/\\\n&/g' | sed $'s/<.proceedings>/&\\\n/g' | awk '
# select proceedings nodes (which should be clearly separated by now)
/<proceedings/,/<\/proceedings>/ {
  print
}' >> $tmpxml
echo "</dblp>" >> $tmpxml
xq . $tmpxml > $json

Testing

./test

Implementation

Structure

Sources of Proceedings Titles

Wikidata

There are some 36 million scholarly articles in Wikidata as of 2020-10. Even trying to count them times out on the official Wikidata Query Service:

Query number of scholarly articles

# scholarly articles
SELECT ?item ?itemLabel 
WHERE 
{
  ?item wdt:P31 wd:Q13442814.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
} LIMIT 10

try it!
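For reference, an explicit count query can be run from Python via the SPARQLWrapper package; whether it completes depends on the endpoint's timeout, so this is only a sketch (the user agent string is an arbitrary example):

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://query.wikidata.org/sparql", agent="ptp-example/0.1")
sparql.setQuery("""
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
# count scholarly articles - usually exceeds the public endpoint's timeout
SELECT (COUNT(?item) AS ?count)
WHERE { ?item wdt:P31 wd:Q13442814 . }
""")
sparql.setReturnFormat(JSON)
try:
    result = sparql.query().convert()
    print(result["results"]["bindings"][0]["count"]["value"])
except Exception as e:
    print("query failed (a timeout is expected):", e)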

Query links between proceedings and events

SELECT ?article ?articleLabel ?event ?eventLabel  WHERE {
  ?article wdt:P4745 ?event.  
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}

try it!

Dump

To mitigate the timeout it's possible to create an RDF dump of the instances:

Links between event and proceedings

example:

Data Analysis

Year

How many event records for a given year?

select year,count(*) from event 
group by year
order by 1 desc

year	count(*)
19670	1
2109	1
2106	1
2105	1
2091	1
2088	1
2081	1
2026	3
2025	1
2024	2
2022	3
2021	1069
2020	8318
2019	19032
2018	19546
2017	17618
2016	15697
2015	14221
2014	13831
2013	12621
2012	12292
2011	11926
2010	10416
2009	9198
2008	9024
2007	5569
2006	4365
2005	3943
2004	3438
2003	3092
2002	2782
2001	2519
2000	2400
1999	2111
1998	2063
1997	1986
1996	1732
1995	1612
1994	1578
1993	1474
1992	1308
1991	1194
1990	1063
1989	1061
1988	1029
1987	704
1986	715
1985	608
1984	486
1983	460
1982	405
1981	346
1980	314
1979	302
1978	276
1977	197
1976	152
1975	144
1974	113
1973	102
1972	82
1971	78
1970	66
1969	66
1968	54
1967	55
1966	41
1965	32
1964	31
1963	29
1962	28
1961	19
1960	17
1959	10
1958	9
1957	8
1956	8
1955	7
1954	3
1953	5
1952	4
1951	5
1950	4
1949	2
1948	2
1947	1
1941	1
1938	1
1935	1
1932	1
1930	1
1929	1
1926	1
1923	1
1922	1
1920	2
1914	1
1913	1
1912	1
1911	1
1910	1
1908	1
1905	1
1904	1
1901	1
1900	146
1895	1
1894	1
1890	1
1889	1
1880	1
1862	1
35	1
0	14
(null)	24389

Year records that need attention

select year,source,count(*) from event 
where year>2030 or year<1850 or year=1900
group by year
order by 1 desc
year	source	count(*)
19670	confref	1
2109	gnd	1
2106	gnd	1
2105	gnd	1
2091	wikicfp	1
2088	gnd	1
2081	wikicfp	1
1900	crossref	146
35	wikicfp	1
0	confref	14

Choice of Database/Storage system

The following candidates for a database/storage system were considered:

  1. Python native solutions
    1. using JSON files as caches
    2. ruruki
  2. Graph databases
    1. Gremlin python
    2. Dgraph
    3. Weaviate
  3. RDF TripleStore
    1. Apache Jena
  4. SQL Database
    1. sqlite

To investigate the feasibility of these approaches, the open source project DgraphAndWeaviateTest was created; issues were filed with the different providers and questions were asked (and answered) on https://stackoverflow.com.

Evaluation results

Native JSON Files

Native JSON files have very good performance and are simple to handle; e.g. reading the Crossref data takes approximately 1 second:

read 45601 events in   1.1 s

The disadvantage is that the query capabilities of the JSON-only approach are limited. In the prototype, only lookup by eventId and eventAcronym was implemented via hash tables/dicts.
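The prototype's lookup idea can be sketched as follows; the record layout is illustrative and not the EventManager's actual schema:

# build two dicts over a list of event records for O(1) lookup
events = [
    {"eventId": "Vol-2345", "acronym": "BIR 2019",
     "title": "8th International Workshop on Bibliometric-enhanced Information Retrieval"},
    {"eventId": "Vol-2599", "acronym": "BlockSW",
     "title": "Blockchain enabled Semantic Web Workshop"},
]
byId = {event["eventId"]: event for event in events}
byAcronym = {event["acronym"]: event for event in events}

print(byId["Vol-2599"]["title"])
print(byAcronym["BIR 2019"]["eventId"])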

Ruruki

Searching for an "in-memory graph database solution for Python" we found ruruki. The performance was disappointing, see testSpeed:

creating 1000 vertices took   1.7 s = 598.3 v/s

Gremlin python

Using Gremlin would have been our favorite choice based on our experience with the SimpleGraph project. In our experience, in-memory Gremlin works excellently with fewer than 100,000 vertices when using Java and is still usable for around 1 million vertices. With many more vertices it is necessary to back Gremlin with a proper graph database, and we have not been successful with any of the graph databases yet. The situation gets worse when Python is used as the programming language, see Gremlin python: Python is only supported as a language variant and it is very awkward to work in that environment. From our point of view this is not production ready, so we didn't bother to investigate further.

DGraph

Dgraph looks like a good candidate for a graph database. It is simple to handle, has a Docker-based installation and a Python interface. The tests looked very promising, but in our development environment there has been some instability, as reported in the unreliability issue for Dgraph. It is not clear what causes this problem, but it shows up nondeterministically in the Travis CI tests and is a showstopper at this time. We will certainly reconsider Dgraph later.

Weaviate

Weaviate looks like an excellent fit for our use case since it has NLP support built in and comes with a "contextionary", e.g. for English, out of the box. That means the dictionary approach of the Proceedings Title Parser would have GloVe support immediately.

Apache Jena

Apache Jena provides an RDF triplestore with SPARQL query capabilities. The question was whether the integration with Python and the "List of Dicts" based approach of the EventManager would be feasible. A few GitHub issues had to be fixed along the way.

sqlite

sqlite outperforms all other approaches, especially if it is used as an in-memory database. But even reading from disk is not much slower.

sqlite and JSON are similar in read and write performance. With sqlite, SQL queries can be used, which is not directly possible with JSON.

Using SQL instead of JSON has some disadvantages, e.g. a "schemaless" working mode is not possible. We worked around this by adding "ListOfDict" support, where the columns are automatically detected so that the DDL commands to create a schema need not be maintained manually.
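A minimal sketch of such column auto-detection for a list of dicts, assuming a simple type mapping (the real ListOfDict support may differ in detail):

import sqlite3

records = [
    {"acronym": "BIR 2019", "year": 2019, "city": "Katowice"},
    {"acronym": "HPCS 2020", "year": 2020, "city": "Barcelona"},
]

# detect columns from the records and map Python types to SQLite types
columns = {}
for record in records:
    for key, value in record.items():
        columns.setdefault(key, "INTEGER" if isinstance(value, int) else "TEXT")

ddl = "CREATE TABLE event (%s)" % ", ".join(f"{col} {typ}" for col, typ in columns.items())
db = sqlite3.connect(":memory:")
db.execute(ddl)
db.executemany(
    "INSERT INTO event VALUES (%s)" % ",".join("?" * len(columns)),
    [tuple(record.get(col) for col in columns) for record in records],
)
print(db.execute("select year,count(*) from event group by year").fetchall())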

Using SQL instead of a graph database or triplestore has some disadvantages as well, e.g. graph queries are not directly possible but need to be translated into SQL JOIN-based relational queries. Given the low number of entities to be stored we hope this will not be too troublesome.

Performance result for the "cities" example:

see https://github.com/WolfgangFahl/DgraphAndWeaviateTest/blob/master/tests/testSqlite3.py

adding 128769 City records took 0.300 s => 428910 records/s
selecting 128769 City records took 0.300 s => 428910 records/s

Other Related Services

History of Library Catalogs

Digitization of library catalogs started with punched cards (Dewey1959) and continued to modern online catalogs.