ProceedingsTitleParser
OsProject | |
---|---|
id | ProceedingsTitleParser |
state | |
owner | WolfgangFahl |
title | Shallow Semantic Parser to extract metadata from scientific proceedings titles |
url | https://github.com/WolfgangFahl/ProceedingsTitleParser |
version | 0.0.1 |
description | |
date | 2020-07-02 |
since | |
until |
Usage
What is it?
The Proceedings Title Parser Service is a specialized search engine for scientific proceedings and events. It searches in a corpus/database based on data from
- http://www.openresearch.org ✓
- http://ceur-ws.org ✓
- http://www.wikidata.org (in progress)
- http://confref.org (in progress)
- http://crossref.org (in progress)
- https://dblp.org/ (in progress)
- GND (in progress)
- http://www.wikicfp.com/cfp/ (planned)
- ...
Search Modes
The Proceedings Title Parser currently has three modes:
- Proceedings Title Parsing
- Named Entity Recognition (NER)
- Extract/Scrape
All three modes expect some lines of text as input. The mode is automatically detected/selected by the content of the lines given as input.
Example
The input:
Proceedings of the 2020 International Conference on High Performance Computing & Simulation (HPCS 2020)
BIR 2019
http://ceur-ws.org/Vol-2599/
will trigger
- Proceedings Title Parsing mode for the first line
- Named Entity Recognition mode for the second line
- Extract/Scrape mode for the third line
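The per-line mode selection can be illustrated with a small sketch. The rules here are an assumption made for illustration (a URL selects Extract/Scrape, a line starting with "Proceedings" selects title parsing, anything else falls back to NER); the service's actual detection logic is not documented on this page, and `detect_mode` is a hypothetical name.

```python
import re

# Hypothetical heuristic; the real service may use different rules.
URL_PATTERN = re.compile(r"^https?://\S+$")


def detect_mode(line):
    """Guess the search mode for a single input line."""
    text = line.strip()
    if URL_PATTERN.match(text):
        return "extract"
    if text.lower().startswith("proceedings"):
        return "parse"
    return "ner"


lines = [
    "Proceedings of the 2020 International Conference on High Performance "
    "Computing & Simulation (HPCS 2020)",
    "BIR 2019",
    "http://ceur-ws.org/Vol-2599/",
]
modes = [detect_mode(line) for line in lines]
print(modes)  # ['parse', 'ner', 'extract']
```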
Proceedings Title Parsing mode
In Proceedings Title Parsing mode the content of a line is parsed word by word to find typical elements of Proceedings titles like
- ordinal (First, III., 6th, 23rd)
- city (Paris, New York, London, Berlin, Barcelona)
- country (USA, Italy, Germany, France, UK)
- provinces (California, Texas, Florida, Ontario, NRW)
- scope (International, European, Czech, Italian, National)
- year (2016, 1988, 2017, 2018, 2019, 2020)
- ...
While the above elements can be found by comparing each word with lists of known cities, countries, provinces, and so on, finding the event acronym requires a lookup in a corpus/database of proceedings and events. This lookup is performed automatically to check whether a known event acronym such as ISWC, ICEIS, or SIGSPATIAL can be found in the given context, e.g. the year. If the acronym (with year) is found, a link to the resulting source record is shown.
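The word-by-word comparison against known lists might look roughly like the following sketch. The lookup lists are tiny illustrative stand-ins, `classify_tokens` is a hypothetical name, and the acronym lookup in the corpus is left out; this is not the project's actual parser.

```python
import re

# Tiny illustrative lookup lists (assumption); a real corpus would be far larger.
LOOKUP = {
    "city": {"Paris", "London", "Berlin", "Barcelona"},
    "country": {"USA", "Italy", "Germany", "France", "UK"},
    "scope": {"International", "European", "National"},
    "event": {"Conference", "Workshop", "Symposium"},
}
ORDINAL = re.compile(r"^(\d+(st|nd|rd|th)|First|Second|Third)$")
YEAR = re.compile(r"^(19|20)\d{2}$")


def classify_tokens(title):
    """Map each category to the first word of the title that matches it."""
    found = {}
    for token in re.findall(r"\w+", title):
        if YEAR.match(token):
            category = "year"
        elif ORDINAL.match(token):
            category = "ordinal"
        else:
            category = next(
                (name for name, words in LOOKUP.items() if token in words), None
            )
        if category is not None and category not in found:
            found[category] = token
    return found


parsed = classify_tokens(
    "Proceedings of the 2020 International Conference on High Performance "
    "Computing & Simulation (HPCS 2020)"
)
print(parsed)  # {'year': '2020', 'scope': 'International', 'event': 'Conference'}
```

Splitting on single words keeps the sketch short but cannot recognise multi-word names such as "New York"; a fuller implementation would have to match phrases as well.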
Example
Input
Proceedings of the 2020 International Conference on High Performance Computing & Simulation (HPCS 2020)
Result
# | Source | Acronym | Url | Found by |
---|---|---|---|---|
1 | OPEN RESEARCH | HPCS 2020 | https://www.openresearch.org/wiki/HPCS%202020 | HPCS 2020 |
{'year': '2020', 'scope': 'International', 'event': 'Conference', 'topic': 'High Performance Computing & Simulation', 'acronym': 'HPCS 2020', 'title': 'Proceedings of the 2020 International Conference on High Performance Computing & Simulation (HPCS 2020)', 'source': 'line', 'publish': 'Proceedings', 'syntax': 'on', 'delimiter': '&'}
{ "acronym": "HPCS 2020", "city": "Barcelona", "country": "Spain", "creation_date": "2020-03-26 06:01:12", "end_date": "2020-07-24 00:00:00", "event": "HPCS 2020", "foundBy": "HPCS 2020", "homePage": null, "homepage": "http://conf.cisedu.info/rp/hpcs20/", "modification_date": "2020-03-26 08:16:33", "series": "HPCS", "source": "OPEN RESEARCH", "start_date": "2020-07-20 00:00:00", "title": "2020 International Conference on High Performance Computing & Simulation", "url": "https://www.openresearch.org/wiki/HPCS 2020" }
Named Entity Recognition mode (NER)
In Named Entity Recognition mode the words to be looked up can be entered directly without following the patterns of typical proceedings titles. Syntax elements like "Proceedings of ... " may be left out. This mode often also gives good results but cannot use the information provided by syntactic elements like "at". For example, for "Proceedings of the 1st conference of the history of Rome at Paris" the NER equivalent "1 conference history Rome Paris" is ambiguous: without "of" and "at" it is unclear whether Rome or Paris is the venue.
Example
Input
BIR 2019
Result
# | Source | Acronym | Url | Found by |
---|---|---|---|---|
1 | OPEN RESEARCH | BIR 2019 | https://www.openresearch.org/wiki/BIR%202019 | BIR 2019 |
2 | BIR 2019 | BIR 2019 | http://ceur-ws.org/Vol-2345 | BIR 2019 |
3 | confref | BIR 2019 | http://portal.confref.org/list/bir2019 | BIR 2019 |
{'title': 'BIR 2019', 'source': 'line', 'year': '2019'}
{ "acronym": "BIR 2019", "city": null, "country": null, "creation_date": "2020-03-09 11:01:20", "end_date": "2019-09-25 00:00:00", "event": "BIR 2019", "foundBy": "BIR 2019", "homepage": "https://bir2019.ue.katowice.pl/", "modification_date": "2020-07-06 11:47:07", "series": "BIR", "source": "OPEN RESEARCH", "start_date": "2019-09-23 00:00:00", "title": "18th International Conference on Perspectives in Business Informatics Research", "url": "https://www.openresearch.org/wiki/BIR 2019" }
{ "acronym": "BIR 2019", "city": "Cologne", "country": "Germany", "enum": "8th", "event": "BIR 2019", "eventId": "Vol-2345", "eventType": "Workshop", "foundBy": "BIR 2019", "homePage": null, "month": "April", "ordinal": 14, "publish": "Proceedings", "scope": "International", "source": "CEUR-WS", "syntax": "on", "title": "Proceedings of the 8th International Workshop on Bibliometric-enhanced Information Retrieval (BIR 2019)co-located with the 41st European Conference on Information Retrieval (ECIR 2019),Cologne, Germany, April 14th, 2019.Submitted by: Guillaume Cabanac", "topic": "Bibliometric-enhanced Information Retrieval", "url": "http://ceur-ws.org/Vol-2345", "year": "2019" }
{ "acronym": "BIR 2019", "address": null, "area": { "id": 2, "value": "Computer Science" }, "cameraReadyDate": null, "city": "Katowice", "confSeries": { "dblpId": "https://dblp.org/db/conf/bir/", "description": null, "eissn": null, "id": "bir", "issn": null, "name": "Business Informatics Research" }, "country": "Poland", "description": null, "endDate": "2019-09-25", "event": "BIR 2019", "foundBy": "BIR 2019", "homepage": null, "id": "bir2019", "keywords": [ "Data mining", "Enterprise architecture", "Business process model", "Aggregation", "Acceptance", "Banking", "Blockchain", "Comparison", "Digital learning", "e-Health", "Agile modelling method engineering", "Literature review", "Digitalization", "ecosystem", "Barriers of change", "Bing", "Business process management system", "digital Workplace Health Promotion (dWHP)", "Dual video cast", "e-Lecture" ], "name": "Business Informatics Research", "notificationDate": null, "ranks": [], "shortDescription": null, "source": "confref", "startDate": "2019-09-23", "submissionDate": null, "submissionExtended": false, "url": "http://portal.confref.org/list/bir2019", "year": 2019 }
Extract / scrape mode
If the line contains the URL of a known source of conference or proceedings title information, the page is visited automatically and the metadata is extracted.
Example
Input
http://ceur-ws.org/Vol-2599/
Result:
# | Source | Acronym | Url | Found by |
---|---|---|---|---|
1 | CEUR-WS | BlockSW | http://ceur-ws.org/Vol-2599 | BlockSW |
{'prefix': 'Blockchain enabled Semantic Web', 'event': 'Workshop', 'acronym': 'BlockSW', 'title': 'Proceedings of the Blockchain enabled Semantic Web Workshop (BlockSW) and Contextualized Knowledge Graphs (CKG) Workshop', 'source': 'CEUR-WS', 'eventId': 'Vol-2599', 'publish': 'Proceedings', 'syntax': 'and'}
{ "acronym": "BlockSW", "event": "BlockSW", "eventId": "Vol-2599", "eventType": "Workshop", "foundBy": "BlockSW", "homePage": null, "month": "October", "prefix": "Blockchain enabled Semantic Web", "publish": "Proceedings", "source": "CEUR-WS", "syntax": "New", "title": "Proceedings of the Blockchain enabled Semantic Web Workshop (BlockSW) and Contextualized Knowledge Graphs (CKG) Workshop (BlockSW-CKG 2019),Auckland, New Zealand, October 27, 2019.Submitted by: Reza Samavi", "url": "http://ceur-ws.org/Vol-2599", "year": "2019" }
Result formats / Content Negotiation
The following result formats are supported:
- html (default)
- csv
- json
- xml
- wikison
To select a result format you can either add the "format" query parameter to your URL or send the corresponding Accept header with your request.
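As a rough sketch, both ways of selecting the format can be prepared from Python with the standard library. `build_request` is a hypothetical helper, and the `examples=example2` parameter is simply copied from the sample URLs on this page since its role is not documented here. The code only builds the requests; passing one to `urllib.request.urlopen` would actually send it.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://ptp.bitplan.com/parse"


def build_request(titles, fmt=None, accept=None):
    """Build a request that selects the format by parameter and/or Accept header."""
    # "examples" is copied from the sample URLs; its role is an open assumption.
    params = {"examples": "example2", "titles": titles}
    if fmt:
        params["format"] = fmt
    headers = {"Accept": accept} if accept else {}
    return Request(f"{BASE_URL}?{urlencode(params)}", headers=headers)


by_param = build_request("BIR 2019", fmt="json")
by_header = build_request("BIR 2019", accept="application/json")

print(by_param.full_url)
print(by_header.get_header("Accept"))
```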
Example format parameter queries
csv
http://ptp.bitplan.com/parse?examples=example2&titles=PAKM+2000&format=csv
The result is the same as with content negotiation for text/csv.
json
http://ptp.bitplan.com/parse?examples=example2&titles=BIR+2019&format=json
The result is the same as with content negotiation for application/json.
xml
http://ptp.bitplan.com/parse?examples=example2&titles=EuroPar+2020&format=xml
The result is the same as with content negotiation for application/xml or text/xml.
wikison
To get the result in WikiSon format:
http://ptp.bitplan.com/parse?examples=example2&titles=EuroPar+2020&format=wikison
Result:
{{Event
|homepage=https://2020.euro-par.org/
|event=EuroPar 2020
|series=EuroPar
|acronym=EuroPar 2020
|title=International European Conference on Parallel and Distributed Computing
|city=Warsaw
|country=Poland
|start_date=2020-08-24 00:00:00
|end_date=2020-08-28 00:00:00
|url=https://www.openresearch.org/wiki/EuroPar 2020
}}
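The WikiSon markup is a template call with one `|key=value` line per field, so a record can be turned into it with a few lines of code. This is a minimal sketch; `to_wikison` is a hypothetical name, and skipping fields whose value is `None` is an assumption, not confirmed behaviour of the service.

```python
def to_wikison(entity, record):
    """Render a dict as WikiSon template markup, one |key=value line per field."""
    lines = ["{{" + entity]
    # Assumption: fields without a value are left out of the markup.
    lines += [f"|{key}={value}" for key, value in record.items() if value is not None]
    lines.append("}}")
    return "\n".join(lines)


markup = to_wikison(
    "Event",
    {
        "acronym": "EuroPar 2020",
        "city": "Warsaw",
        "country": "Poland",
        "homePage": None,
    },
)
print(markup)
```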
Examples with accept header
csv
curl -H "Accept: text/csv" "http://ptp.bitplan.com/parse?examples=example2&titles=PAKM+2000"
Result for text/csv
"month","homePage","eventType","country","acronym","ordinal","url","year","event","eventId","source","syntax","enum","foundBy","location","publish","scope","title"
"October","","Conference","Switzerland","PAKM 2000",3,"http://ceur-ws.org/Vol-34","2000","PAKM 2000","Vol-34","CEUR-WS","the","Third","PAKM 2000","Basel","Proceedings","International","Proceedings of the Third International Conference (PAKM 2000), Basel, Switzerland, October 30-31, 2000.Submitted by: Ulrich Reimer"
json
curl -H "Accept: application/json" "http://ptp.bitplan.com/parse?examples=example2&titles=BIR+2019"
Result for application/json
{
"count": 3,
"events": [{
"acronym": "BIR 2019",
"city": null,
"country": null,
"creation_date": "2020-03-09T11:01:20+00:00",
"end_date": "2019-09-25T00:00:00+00:00",
"event": "BIR 2019",
"foundBy": "BIR 2019",
"homePage": null,
"homepage": "https://bir2019.ue.katowice.pl/",
"modification_date": "2020-07-06T11:47:07+00:00",
"series": "BIR",
"source": "OPEN RESEARCH",
"start_date": "2019-09-23T00:00:00+00:00",
"title": "18th International Conference on Perspectives in Business Informatics Research",
"url": "https://www.openresearch.org/wiki/BIR 2019"
}, {
"acronym": "BIR 2019",
"city": "Cologne",
"country": "Germany",
"enum": "8th",
"event": "BIR 2019",
"eventId": "Vol-2345",
"eventType": "Workshop",
"foundBy": "BIR 2019",
"homePage": null,
"month": "April",
"ordinal": 14,
"publish": "Proceedings",
"scope": "International",
"source": "CEUR-WS",
"syntax": "on",
"title": "Proceedings of the 8th International Workshop on Bibliometric-enhanced Information Retrieval (BIR 2019)co-located with the 41st European Conference on Information Retrieval (ECIR 2019),Cologne, Germany, April 14th, 2019.Submitted by: Guillaume Cabanac",
"topic": "Bibliometric-enhanced Information Retrieval",
"url": "http://ceur-ws.org/Vol-2345",
"year": "2019"
}, {
"acronym": "BIR 2019",
"address": null,
"area": {
"value": "Computer Science",
"id": 2
},
"cameraReadyDate": null,
"city": "Katowice",
"confSeries": {
"id": "bir",
"issn": null,
"eissn": null,
"dblpId": "https://dblp.org/db/conf/bir/",
"name": "Business Informatics Research",
"description": null
},
"country": "Poland",
"description": null,
"endDate": "2019-09-25",
"event": "BIR 2019",
"foundBy": "BIR 2019",
"homepage": null,
"id": "bir2019",
"keywords": ["Data mining", "Enterprise architecture", "Business process model", "Aggregation", "Acceptance", "Banking", "Blockchain", "Comparison", "Digital learning", "e-Health", "Agile modelling method engineering", "Literature review", "Digitalization", "ecosystem", "Barriers of change", "Bing", "Business process management system", "digital Workplace Health Promotion (dWHP)", "Dual video cast", "e-Lecture"],
"name": "Business Informatics Research",
"notificationDate": null,
"ranks": [],
"shortDescription": null,
"source": "confref",
"startDate": "2019-09-23",
"submissionDate": null,
"submissionExtended": false,
"url": "http://portal.confref.org/list/bir2019",
"year": 2019
}]
}
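A client could process such a JSON result along these lines. The sketch uses an abbreviated copy of the response with only three fields per event, so it runs without network access; the variable names are illustrative.

```python
import json

# Abbreviated copy of the response shown above (only three fields per event).
response_text = """
{
  "count": 3,
  "events": [
    {"acronym": "BIR 2019", "source": "OPEN RESEARCH",
     "url": "https://www.openresearch.org/wiki/BIR 2019"},
    {"acronym": "BIR 2019", "source": "CEUR-WS",
     "url": "http://ceur-ws.org/Vol-2345"},
    {"acronym": "BIR 2019", "source": "confref",
     "url": "http://portal.confref.org/list/bir2019"}
  ]
}
"""

data = json.loads(response_text)

# Index the record URLs by the source that provided them.
url_by_source = {event["source"]: event["url"] for event in data["events"]}

for source, url in url_by_source.items():
    print(f"{source}: {url}")
```

Note that the records from different sources do not share one schema (for example `start_date` versus `startDate`), so code that reads further fields should look them up defensively, e.g. with `dict.get`.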
xml
As an alternative to application/xml the mime-type text/xml is also accepted with the same result.
curl -H "Accept: application/xml" "http://ptp.bitplan.com/parse?examples=example2&titles=EuroPar+2020"
Result for application/xml
<?xml version="1.0" ?>
<events>
<event>
<foundBy>EuroPar 2020</foundBy>
<homepage>https://2020.euro-par.org/</homepage>
<event>EuroPar 2020</event>
<series>EuroPar</series>
<acronym>EuroPar 2020</acronym>
<title>International European Conference on Parallel and Distributed Computing</title>
<city>Warsaw</city>
<country>Poland</country>
<start_date>2020-08-24T00:00:00</start_date>
<end_date>2020-08-28T00:00:00</end_date>
<creation_date>2020-02-27T14:44:52</creation_date>
<modification_date>2020-02-27T14:44:52</modification_date>
<url>https://www.openresearch.org/wiki/EuroPar 2020</url>
<source>OPEN RESEARCH</source>
</event>
</events>
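The XML result can be read with Python's standard `xml.etree.ElementTree` module. The sketch below works on a shortened copy of the response and turns each outer `event` element into a dict; it is one possible way to consume the output, not part of the service itself.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the XML response shown above.
xml_text = """<?xml version="1.0" ?>
<events>
<event>
<foundBy>EuroPar 2020</foundBy>
<event>EuroPar 2020</event>
<series>EuroPar</series>
<city>Warsaw</city>
<country>Poland</country>
<source>OPEN RESEARCH</source>
</event>
</events>"""

root = ET.fromstring(xml_text)

# Each direct <event> child of <events> is one record; its children are the fields.
records = [{field.tag: field.text for field in record} for record in root.findall("event")]
print(records[0]["city"])
```

Because the record element and one of its fields are both named `event`, `root.findall("event")` is used to select only the direct children of `events`.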
Running your own service
Prerequisites
If you'd like to run your own copy of this service you'll need:
- git
- python 3 (>=3.6)
- some unix command line tools like curl, grep, wc
Tested on Linux (Ubuntu bionic/Travis) and macOS High Sierra 10.13.6 (using MacPorts).
Installation
git clone https://github.com/WolfgangFahl/ProceedingsTitleParser
cd ProceedingsTitleParser
./install
Getting the sample data
Getting the sample data from the different sources may take a few minutes. You'll need some 60 MB of disk space (as of 2020-07-11).
scripts/getsamples
Testing
./test
Implementation
Choice of Database/Storage system
The following candidates for a database/storage system were considered:
- Python native solutions
- using JSON files as caches
- ruruki
- Graph databases
- RDF TripleStore
To investigate the feasibility of these approaches, the open source project DgraphAndWeaviateTest was created. Issues were filed with the different providers, and questions were asked and answers given, e.g.:
Issues
- https://github.com/semi-technologies/weaviate/issues/1215
- https://discuss.dgraph.io/t/dgraph-v20-07-0-v20-03-0-unreliability-in-mac-os-environment/9376/14
- https://discuss.dgraph.io/t/input-for-predicate-location-of-type-scalar-is-uid/9381
Stackoverflow questions
- https://stackoverflow.com/questions/63486767/how-can-i-get-the-fuseki-api-via-sparqlwrapper-to-properly-report-a-detailed-err
- https://stackoverflow.com/questions/63435157/listofdict-to-rdf-conversion-in-python-targeting-apache-jena-fuseki
- https://stackoverflow.com/questions/63358495/how-to-delete-all-nodes-with-a-given-type
- https://stackoverflow.com/questions/63260073/starting-zero-alpha-and-ratel-in-a-single-command-e-g-in-macosx-and-other-envir
- https://stackoverflow.com/questions/63098344/weaviate-error-code-400-parsing-body-from-failed-invalid-character-g-looki
- https://stackoverflow.com/questions/63075787/translating-sidif-to-weaviate
Stackoverflow answers
- https://stackoverflow.com/questions/63435157/listofdict-to-rdf-conversion-in-python-targeting-apache-jena-fuseki/63440396#63440396
- https://stackoverflow.com/questions/63358495/how-to-delete-all-nodes-with-a-given-type/63358827#63358827
- https://stackoverflow.com/questions/63260073/starting-zero-alpha-and-ratel-in-a-single-command-e-g-in-macosx-and-other-envir/63265154#63265154