ProceedingsTitleParser
OsProject | 
---|---
id | ProceedingsTitleParser
owner | WolfgangFahl
title | Shallow Semantic Parser to extract metadata from scientific proceedings titles
url | https://github.com/WolfgangFahl/ProceedingsTitleParser
version | 0.0.1
date | 2020-07-02
Usage
What is it?
The Proceedings Title Parser Service is a specialized search engine for scientific proceedings and events. It searches in a corpus/database based on data sourced from
- http://www.openresearch.org ✓
- http://ceur-ws.org ✓
- http://www.wikidata.org ✓
- http://confref.org ✓
- http://crossref.org ✓
- https://dblp.org/ ✓
- GND (in progress)
- http://www.wikicfp.com/cfp/ ✓
- ...
See http://ptp.bitplan.com/settings for statistics on the data sources.
Search Modes
The Proceedings Title Parser currently has three modes:
- Proceedings Title Parsing
- Named Entity Recognition (NER)
- Extract/Scrape
All three modes expect one or more lines of text as input; the mode is selected automatically based on the content of each line.
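The mode detection can be pictured as a simple per-line decision: a URL of a known source triggers Extract/Scrape mode, a line that looks like a proceedings title is parsed as such, and anything else is treated as NER input. The following Python sketch only illustrates that idea and is an assumption, not the service's actual detection code.
<source lang='python'>
# Illustrative sketch of per-line mode selection (an assumption, not the real implementation).

def detect_mode(line: str) -> str:
    """Guess the processing mode for a single input line."""
    stripped = line.strip()
    if stripped.startswith(("http://", "https://")):
        return "Extract/Scrape"
    if "proceedings" in stripped.lower():
        return "Proceedings Title Parsing"
    return "Named Entity Recognition"

lines = [
    "Proceedings of the 2020 International Conference on High Performance Computing & Simulation (HPCS 2020)",
    "BIR 2019",
    "http://ceur-ws.org/Vol-2599/",
]
for line in lines:
    print(detect_mode(line), "<-", line)
</source>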
Example
The input:
Proceedings of the 2020 International Conference on High Performance Computing & Simulation (HPCS 2020)
BIR 2019
http://ceur-ws.org/Vol-2599/
- will trigger Proceedings Title Parsing mode for the first line
- Named Entity Recognition mode for the second line
- Extract/Scrape mode for the third line
Proceedings Title Parsing mode
In Proceedings Title Parsing mode the content of a line is parsed word by word to find typical elements of proceedings titles such as
- ordinal (First, III., 6th, 23rd)
- city (Paris, New York, London, Berlin, Barcelona)
- country (USA, Italy, Germany, France, UK)
- provinces (California, Texas, Florida, Ontario, NRW)
- scope (International, European, Czech, Italian, National)
- year (2016,1988,2017,2018,2019,2020)
- ...
While the above elements can be found by comparison with lists of known cities, countries, provinces, ..., finding the event acronym requires a lookup in a corpus/database of proceedings and events. This lookup is performed automatically to check whether a known event acronym like ISWC, ICEIS or SIGSPATIAL can be found in the given context, e.g. together with the year. If the acronym (with year) is found, a link to the matching source record is shown.
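To make the word-by-word idea concrete, the following Python sketch tags title tokens by looking them up in small dictionaries of known cities, countries, scopes and event types. The word lists and the tokenizer are illustrative assumptions; the real parser uses much larger lookup tables plus the event database for the acronym lookup.
<source lang='python'>
# Simplified sketch of dictionary-based token tagging (not the actual parser).
import re

LOOKUPS = {
    "city": {"Paris", "London", "Berlin", "Barcelona", "Basel"},
    "country": {"USA", "Italy", "Germany", "France", "UK", "Spain"},
    "scope": {"International", "European", "National"},
    "eventType": {"Conference", "Workshop", "Symposium"},
}

def tag_tokens(title: str) -> list:
    """Return (token, category) pairs for a proceedings title."""
    tagged = []
    for token in re.findall(r"[A-Za-z&]+|\d{4}|\d+(?:st|nd|rd|th)", title):
        if re.fullmatch(r"\d{4}", token):
            category = "year"
        elif re.fullmatch(r"\d+(?:st|nd|rd|th)", token):
            category = "ordinal"
        else:
            category = next((name for name, words in LOOKUPS.items() if token in words), "unknown")
        tagged.append((token, category))
    return tagged

title = "Proceedings of the 2020 International Conference on High Performance Computing & Simulation (HPCS 2020)"
print(tag_tokens(title))
</source>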
Example
Input
Proceedings of the 2020 International Conference on High Performance Computing & Simulation (HPCS 2020)
# | Source | Acronym | Url | Found by |
---|---|---|---|---|
1 | OPEN RESEARCH | HPCS 2020 | https://www.openresearch.org/wiki/HPCS%202020 | HPCS 2020 |
{'year': '2020', 'scope': 'International', 'event': 'Conference', 'topic': 'High Performance Computing & Simulation', 'acronym': 'HPCS 2020', 'title': 'Proceedings of the 2020 International Conference on High Performance Computing & Simulation (HPCS 2020)', 'source': 'line', 'publish': 'Proceedings', 'syntax': 'on', 'delimiter': '&'}
{ "acronym": "HPCS 2020", "city": "Barcelona", "country": "Spain", "creation_date": "2020-03-26 06:01:12", "end_date": "2020-07-24 00:00:00", "event": "HPCS 2020", "foundBy": "HPCS 2020", "homePage": null, "homepage": "http://conf.cisedu.info/rp/hpcs20/", "modification_date": "2020-03-26 08:16:33", "series": "HPCS", "source": "OPEN RESEARCH", "start_date": "2020-07-20 00:00:00", "title": "2020 International Conference on High Performance Computing & Simulation", "url": "https://www.openresearch.org/wiki/HPCS 2020" }
Named Entity Recognition mode (NER)
In Named Entity Recognition mode the words to be looked up can be entered directly without following the patterns of typical proceedings titles; syntax elements like "Proceedings of ..." may be left out. This mode will often also give good results, but it cannot use the information provided by syntactic elements like "at". For example, for "Proceedings of the 1st conference of the history of Rome at Paris" the NER equivalent "1 conference history Rome Paris" will obviously be ambiguous.
Example
Input
BIR 2019
Result
# | Source | Acronym | Url | Found by |
---|---|---|---|---|
1 | OPEN RESEARCH | BIR 2019 | https://www.openresearch.org/wiki/BIR%202019 | BIR 2019 |
2 | CEUR-WS | BIR 2019 | http://ceur-ws.org/Vol-2345 | BIR 2019 |
3 | confref | BIR 2019 | http://portal.confref.org/list/bir2019 | BIR 2019 |
'title': 'BIR 2019', 'source': 'line', 'year': '2019'}
{ "acronym": "BIR 2019", "city": null, "country": null, "creation_date": "2020-03-09 11:01:20", "end_date": "2019-09-25 00:00:00", "event": "BIR 2019", "foundBy": "BIR 2019", "homepage": "https://bir2019.ue.katowice.pl/", "modification_date": "2020-07-06 11:47:07", "series": "BIR", "source": "OPEN RESEARCH", "start_date": "2019-09-23 00:00:00", "title": "18th International Conference on Perspectives in Business Informatics Research", "url": "https://www.openresearch.org/wiki/BIR 2019" }
{ "acronym": "BIR 2019", "city": "Cologne", "country": "Germany", "enum": "8th", "event": "BIR 2019", "eventId": "Vol-2345", "eventType": "Workshop", "foundBy": "BIR 2019", "homePage": null, "month": "April", "ordinal": 14, "publish": "Proceedings", "scope": "International", "source": "CEUR-WS", "syntax": "on", "title": "Proceedings of the 8th International Workshop on Bibliometric-enhanced Information Retrieval (BIR 2019)co-located with the 41st European Conference on Information Retrieval (ECIR 2019),Cologne, Germany, April 14th, 2019.Submitted by: Guillaume Cabanac", "topic": "Bibliometric-enhanced Information Retrieval", "url": "http://ceur-ws.org/Vol-2345", "year": "2019" }
{ "acronym": "BIR 2019", "address": null, "area": { "id": 2, "value": "Computer Science" }, "cameraReadyDate": null, "city": "Katowice", "confSeries": { "dblpId": "https://dblp.org/db/conf/bir/", "description": null, "eissn": null, "id": "bir", "issn": null, "name": "Business Informatics Research" }, "country": "Poland", "description": null, "endDate": "2019-09-25", "event": "BIR 2019", "foundBy": "BIR 2019", "homepage": null, "id": "bir2019", "keywords": [ "Data mining", "Enterprise architecture", "Business process model", "Aggregation", "Acceptance", "Banking", "Blockchain", "Comparison", "Digital learning", "e-Health", "Agile modelling method engineering", "Literature review", "Digitalization", "ecosystem", "Barriers of change", "Bing", "Business process management system", "digital Workplace Health Promotion (dWHP)", "Dual video cast", "e-Lecture" ], "name": "Business Informatics Research", "notificationDate": null, "ranks": [], "shortDescription": null, "source": "confref", "startDate": "2019-09-23", "submissionDate": null, "submissionExtended": false, "url": "http://portal.confref.org/list/bir2019", "year": 2019 }
Extract / scrape mode
If the line contains a URL of a known source of conference or proceedings title information, the page will be visited automatically and the metadata extracted.
Example
Input
http://ceur-ws.org/Vol-2599/
Result:
# | Source | Acronym | Url | Found by |
---|---|---|---|---|
1 | CEUR-WS | BlockSW | http://ceur-ws.org/Vol-2599 | BlockSW |
{'prefix': 'Blockchain enabled Semantic Web', 'event': 'Workshop', 'acronym': 'BlockSW', 'title': 'Proceedings of the Blockchain enabled Semantic Web Workshop (BlockSW) and Contextualized Knowledge Graphs (CKG) Workshop', 'source': 'CEUR-WS', 'eventId': 'Vol-2599', 'publish': 'Proceedings', 'syntax': 'and'}
{ "acronym": "BlockSW", "event": "BlockSW", "eventId": "Vol-2599", "eventType": "Workshop", "foundBy": "BlockSW", "homePage": null, "month": "October", "prefix": "Blockchain enabled Semantic Web", "publish": "Proceedings", "source": "CEUR-WS", "syntax": "New", "title": "Proceedings of the Blockchain enabled Semantic Web Workshop (BlockSW) and Contextualized Knowledge Graphs (CKG) Workshop (BlockSW-CKG 2019),Auckland, New Zealand, October 27, 2019.Submitted by: Reza Samavi", "url": "http://ceur-ws.org/Vol-2599", "year": "2019" }
Result formats / Content Negotiation
The following result formats are supported:
- html (default)
- csv
- json
- xml
To select a result format you can either add the "&format" query parameter to your URL or specify the corresponding Accept header in your request.
Example format parameter queries
csv
http://ptp.bitplan.com/parse?examples=example2&titles=PAKM+2000&format=csv
The result is the same as with content negotiation using text/csv (see below).
json
http://ptp.bitplan.com/parse?examples=example2&titles=BIR+2019&format=json
The result is the same as with content negotiation using application/json (see below).
xml
http://ptp.bitplan.com/parse?examples=example2&titles=EuroPar+2020&format=xml
The result is the same as with content negotiation using application/xml or text/xml (see below).
wikison
To get WikiSon format (see http://wiki.bitplan.com/index.php/WikiSon):
http://ptp.bitplan.com/parse?examples=example2&titles=EuroPar+2020&format=wikison
Result:
{{Event
|homepage=https://2020.euro-par.org/
|event=EuroPar 2020
|series=EuroPar
|acronym=EuroPar 2020
|title=International European Conference on Parallel and Distributed Computing
|city=Warsaw
|country=Poland
|start_date=2020-08-24 00:00:00
|end_date=2020-08-28 00:00:00
|url=https://www.openresearch.org/wiki/EuroPar 2020
}}
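The WikiSon output is a MediaWiki template call with one |key=value line per event attribute. A minimal Python sketch of how such a record could be rendered from an event dict (an illustration of the format, not the service's code):
<source lang='python'>
# Sketch: render an event dict in WikiSon notation as an {{Event}} template call.
def to_wikison(record: dict, entity: str = "Event") -> str:
    lines = ["{{%s" % entity]
    for key, value in record.items():
        if value is not None:
            lines.append("|%s=%s" % (key, value))
    lines.append("}}")
    return "\n".join(lines)

event = {
    "acronym": "EuroPar 2020",
    "city": "Warsaw",
    "country": "Poland",
    "start_date": "2020-08-24 00:00:00",
    "end_date": "2020-08-28 00:00:00",
}
print(to_wikison(event))
</source>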
Examples with accept header
csv
curl -H "Accept: text/csv" "http://ptp.bitplan.com/parse?examples=example2&titles=PAKM+2000"
Result for text/csv
"month","homePage","eventType","country","acronym","ordinal","url","year","event","eventId","source","syntax","enum","foundBy","location","publish","scope","title" "October","","Conference","Switzerland","PAKM 2000",3,"http://ceur-ws.org/Vol-34","2000","PAKM 2000","Vol-34","CEUR-WS","the","Third","PAKM 2000","Basel","Proceedings","International","Proceedings of the Third International Conference (PAKM 2000), Basel, Switzerland, October 30-31, 2000.Submitted by: Ulrich Reimer"
json
curl -H "Accept: application/json" "http://ptp.bitplan.com/parse?examples=example2&titles=BIR+2019"
Result for application/json
{
"count": 3,
"events": [{
"acronym": "BIR 2019",
"city": null,
"country": null,
"creation_date": "2020-03-09T11:01:20+00:00",
"end_date": "2019-09-25T00:00:00+00:00",
"event": "BIR 2019",
"foundBy": "BIR 2019",
"homePage": null,
"homepage": "https://bir2019.ue.katowice.pl/",
"modification_date": "2020-07-06T11:47:07+00:00",
"series": "BIR",
"source": "OPEN RESEARCH",
"start_date": "2019-09-23T00:00:00+00:00",
"title": "18th International Conference on Perspectives in Business Informatics Research",
"url": "https://www.openresearch.org/wiki/BIR 2019"
}, {
"acronym": "BIR 2019",
"city": "Cologne",
"country": "Germany",
"enum": "8th",
"event": "BIR 2019",
"eventId": "Vol-2345",
"eventType": "Workshop",
"foundBy": "BIR 2019",
"homePage": null,
"month": "April",
"ordinal": 14,
"publish": "Proceedings",
"scope": "International",
"source": "CEUR-WS",
"syntax": "on",
"title": "Proceedings of the 8th International Workshop on Bibliometric-enhanced Information Retrieval (BIR 2019)co-located with the 41st European Conference on Information Retrieval (ECIR 2019),Cologne, Germany, April 14th, 2019.Submitted by: Guillaume Cabanac",
"topic": "Bibliometric-enhanced Information Retrieval",
"url": "http://ceur-ws.org/Vol-2345",
"year": "2019"
}, {
"acronym": "BIR 2019",
"address": null,
"area": {
"value": "Computer Science",
"id": 2
},
"cameraReadyDate": null,
"city": "Katowice",
"confSeries": {
"id": "bir",
"issn": null,
"eissn": null,
"dblpId": "https://dblp.org/db/conf/bir/",
"name": "Business Informatics Research",
"description": null
},
"country": "Poland",
"description": null,
"endDate": "2019-09-25",
"event": "BIR 2019",
"foundBy": "BIR 2019",
"homepage": null,
"id": "bir2019",
"keywords": ["Data mining", "Enterprise architecture", "Business process model", "Aggregation", "Acceptance", "Banking", "Blockchain", "Comparison", "Digital learning", "e-Health", "Agile modelling method engineering", "Literature review", "Digitalization", "ecosystem", "Barriers of change", "Bing", "Business process management system", "digital Workplace Health Promotion (dWHP)", "Dual video cast", "e-Lecture"],
"name": "Business Informatics Research",
"notificationDate": null,
"ranks": [],
"shortDescription": null,
"source": "confref",
"startDate": "2019-09-23",
"submissionDate": null,
"submissionExtended": false,
"url": "http://portal.confref.org/list/bir2019",
"year": 2019
}]
}
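The JSON result is a dict with a "count" field and an "events" list whose entries vary by data source. A short Python sketch for consuming such a response; the HTTP call mirrors the curl example above, the field handling is illustrative.
<source lang='python'>
# Sketch: request the JSON result via content negotiation and list the matches.
import json
import urllib.request

request = urllib.request.Request(
    "http://ptp.bitplan.com/parse?examples=example2&titles=BIR+2019",
    headers={"Accept": "application/json"},
)
with urllib.request.urlopen(request) as response:
    result = json.load(response)

print("found %d events" % result["count"])
for event in result["events"]:
    # not every source fills every field, so use .get() with a default
    print(event.get("source"), event.get("acronym"), event.get("url"))
</source>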
xml
As an alternative to application/xml the mime-type text/xml is also accepted with the same result.
curl -H "Accept: application/xml" "http://ptp.bitplan.com/parse?examples=example2&titles=EuroPar+2020"
Result for application/xml
<?xml version="1.0" ?>
<events>
<event>
<foundBy>EuroPar 2020</foundBy>
<homepage>https://2020.euro-par.org/</homepage>
<event>EuroPar 2020</event>
<series>EuroPar</series>
<acronym>EuroPar 2020</acronym>
<title>International European Conference on Parallel and Distributed Computing</title>
<city>Warsaw</city>
<country>Poland</country>
<start_date>2020-08-24T00:00:00</start_date>
<end_date>2020-08-28T00:00:00</end_date>
<creation_date>2020-02-27T14:44:52</creation_date>
<modification_date>2020-02-27T14:44:52</modification_date>
<url>https://www.openresearch.org/wiki/EuroPar 2020</url>
<source>OPEN RESEARCH</source>
</event>
</events>
Running your own service
PreRequisites
If you'd like to run your own copy of this service you'll need:
- git
- python 3 (>=3.6)
- python3-pip
- jq
- some unix command line tools like curl, grep, wc
Tested on Linux (Ubuntu bionic/Travis), macOS High Sierra 10.13.6 (using MacPorts) and in a Windows Subsystem for Linux / Ubuntu 20.04 environment.
Installation
git clone https://github.com/WolfgangFahl/ProceedingsTitleParser
./install
Windows Environment
Under Windows 10 you might want to use the Windows Subsystem for Linux with the Ubuntu 20.04 LTS environment. See https://docs.microsoft.com/windows/wsl/install-win10
Getting the sample data
Getting the sample data from the different sources may take a few minutes. You'll need some 60 MBytes of disk space as of 2020-07-11.
scripts/getsamples
Updating the sample data
dblp
See https://dblp.uni-trier.de/xml/ for the XML file input.
#!/bin/bash
# WF 2020-07-17
# get proceedings xml nodes from dblp xml download
xml=$HOME/downloads/dblp.xml
tmpxml=/tmp/proceedings-dblp.xml
json=sampledata/dblp.json
head -3 $xml > $tmpxml
# make sure there are newlines before and after end tags
# of type proceeding with sed and then filter with awk
# https://stackoverflow.com/a/24707372/1497139
cat $xml | sed $'s/<proceedings/\\\n&/g' | sed $'s/<.proceedings>/&\\\n/g' | awk '
# select proceedings nodes (which should be clearly separated by now)
/<proceedings/,/<\/proceedings>/ {
print
}' >> $tmpxml
echo "</dblp>" >> $tmpxml
xq . $tmpxml > $json
Testing
./test
Implementation
Structure
The parser decomposes a proceedings title into tagged tokens. For the title "Proceedings of The 18th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE 2019), Rhodes, Greece" the tokens are mapped to the categories publish ("Proceedings"), prefix ("of The"), enum ("18th"), scope ("International"), eventType ("Conference"), syntax ("on"), field ("Ontologies, DataBases, and Applications of Semantics"), acronym ("(ODBASE 2019)"), city ("Rhodes"), delim (",") and country ("Greece").
Sources of Proceedings Titles
Wikidata
There are some 36 million scholarly articles in Wikidata as of 2020-10. Even trying to count them times out on the official Wikidata Query Service:
Query number of scholarly articles
# scholarly articles
SELECT ?item ?itemLabel
WHERE
{
?item wdt:P31 wd:Q13442814.
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
} LIMIT 10
Query links between proceedings and events
SELECT ?article ?articleLabel ?event ?eventLabel WHERE {
?article wdt:P4745 ?event.
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
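Such queries can also be run programmatically against the Wikidata Query Service SPARQL endpoint. The sketch below adds a LIMIT to keep the request small; the endpoint and JSON result format follow the standard SPARQL protocol, the rest is illustrative.
<source lang='python'>
# Sketch: run the proceedings/event link query against the Wikidata SPARQL endpoint.
import json
import urllib.parse
import urllib.request

query = """
SELECT ?article ?articleLabel ?event ?eventLabel WHERE {
  ?article wdt:P4745 ?event.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
} LIMIT 10
"""
url = "https://query.wikidata.org/sparql?" + urllib.parse.urlencode(
    {"query": query, "format": "json"}
)
request = urllib.request.Request(url, headers={"User-Agent": "ProceedingsTitleParser example"})
with urllib.request.urlopen(request) as response:
    bindings = json.load(response)["results"]["bindings"]

for row in bindings:
    print(row["articleLabel"]["value"], "->", row["eventLabel"]["value"])
</source>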
Dump
To mitigate the timeout it's possible to create an RDF dump of the instances:
- https://wdumps.toolforge.org/dump/822
Links between event and proceedings
- https://m.wikidata.org/wiki/Property:P4745
example:
- https://m.wikidata.org/wiki/Q63278360
Data Analysis
Year
How many event records for a given year?
select year,count(*) from event
group by year
order by 1 desc

year count(*)
19670 1
2109 1
2106 1
2105 1
2091 1
2088 1
2081 1
2026 3
2025 1
2024 2
2022 3
2021 1069
2020 8318
2019 19032
2018 19546
2017 17618
2016 15697
2015 14221
2014 13831
2013 12621
2012 12292
2011 11926
2010 10416
2009 9198
2008 9024
2007 5569
2006 4365
2005 3943
2004 3438
2003 3092
2002 2782
2001 2519
2000 2400
1999 2111
1998 2063
1997 1986
1996 1732
1995 1612
1994 1578
1993 1474
1992 1308
1991 1194
1990 1063
1989 1061
1988 1029
1987 704
1986 715
1985 608
1984 486
1983 460
1982 405
1981 346
1980 314
1979 302
1978 276
1977 197
1976 152
1975 144
1974 113
1973 102
1972 82
1971 78
1970 66
1969 66
1968 54
1967 55
1966 41
1965 32
1964 31
1963 29
1962 28
1961 19
1960 17
1959 10
1958 9
1957 8
1956 8
1955 7
1954 3
1953 5
1952 4
1951 5
1950 4
1949 2
1948 2
1947 1
1941 1
1938 1
1935 1
1932 1
1930 1
1929 1
1926 1
1923 1
1922 1
1920 2
1914 1
1913 1
1912 1
1911 1
1910 1
1908 1
1905 1
1904 1
1901 1
1900 146
1895 1
1894 1
1890 1
1889 1
1880 1
1862 1
35 1
0 14
(null) 24389
year records that need attention
select year,source,count(*) from event
where year>2030 or year<1850 or year=1900
group by year
order by 1 desc
year source count(*)
19670 confref 1
2109 gnd 1
2106 gnd 1
2105 gnd 1
2091 wikicfp 1
2088 gnd 1
2081 wikicfp 1
1900 crossref 146
35 wikicfp 1
0 confref 14
Choice of Database/Storage system
The following candidates for a database/storage system were considered:
- Python native solutions
  - using JSON files as caches
  - ruruki (https://pypi.org/project/ruruki/)
- Graph databases
  - Gremlin python
  - Dgraph
  - Weaviate
- RDF TripleStore
  - Apache Jena
- SQL Database
  - sqlite (https://www.sqlite.org/index.html)
To investigate the feasibility of these approaches the open source project DgraphAndWeaviateTest was created, issues were filed with the different providers, and questions were asked and answers given on https://stackoverflow.com.
Evaluation results
Native JSON Files
Native JSON files perform very well and are simple to handle; e.g. reading the CrossRef data takes approximately 1 second:
read 45601 events in 1.1 s
The disadvantage is that the query capabilities of the JSON-only approach are limited. In the prototype only lookups by eventId and event acronym were implemented via hash tables/dicts.
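A rough sketch of the JSON-only approach: load the cached event records once and build dict indices for the two supported lookups. The file name and field names below are illustrative assumptions, not the prototype's actual EventManager code.
<source lang='python'>
# Sketch of a JSON file cache with dict-based lookup by eventId and acronym.
import json

class EventCache:
    def __init__(self, json_path: str):
        # assumes the file contains a list of event records
        with open(json_path) as json_file:
            self.events = json.load(json_file)
        self.by_id = {e["eventId"]: e for e in self.events if e.get("eventId")}
        self.by_acronym = {e["acronym"]: e for e in self.events if e.get("acronym")}

    def lookup(self, key: str):
        """Find an event either by its id or by its acronym."""
        return self.by_id.get(key) or self.by_acronym.get(key)

if __name__ == "__main__":
    cache = EventCache("sampledata/ceur-ws.json")
    print(cache.lookup("Vol-2345"))
</source>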
Ruruki
Searching for an "in-memory graph database solution for Python" we found ruruki. The performance was disappointing. See testSpeed
creating 1000 vertices took 1.7 s = 598.3 v/s
Gremlin python
Using Gremlin would have been our favorite choice based on the experience with the SimpleGraph project. From our experience, in-memory Gremlin works excellently with fewer than 100,000 vertices when using Java and is still usable for around 1 million vertices; with many more vertices it's necessary to back Gremlin with a proper graph database, and we have not been successful with any of the graph databases yet. The situation gets worse when Python is used as the programming language, see Gremlin python: Python is only supported as a language variant and it's very awkward to work in that environment. From our point of view this is not production ready, so we didn't bother to investigate further.
DGraph
Dgraph looks like a good candidate for a graph database. It's simple to handle, has a Docker-based installation and a Python interface. The tests looked very promising, but in our development environment there has been some instability, as reported in the unreliability issue for Dgraph. It's not clear what causes this problem, but it shows up in a non-deterministic way in the Travis CI tests and is a showstopper at this time. We will certainly reconsider Dgraph later.
Weaviate
Weaviate looks like an excellent fit for our use case since it has NLP support built in and comes with a "contextionary" for e.g. English out of the box. That means the dictionary approach of the Proceedings Title Parser would have GloVe support immediately.
Apache Jena
Apache Jena has an RDF triplestore with SPARQL query capabilities. The question was whether the integration with Python and the "List of Dicts" based approach of the EventManager would be feasible. The GitHub issues
- Add Apache Jena to tests
- add batch support for sparql insert
- Handle double quoted string content
- Handle newlines in string content
- http://iswc2011.semanticweb.org/fileadmin/iswc/Papers/Workshops/SSWS/Emmons-et-all-SSWS2011.pdf
had to be fixed
sqlite
sqlite outperforms all other approaches, especially if it is used as an "in-memory" database. But even reading from disk is not much slower.
sqlite and JSON are similar in read and write performance. With sqlite, SQL queries can be used, which is not directly possible with JSON.
Using SQL instead of JSON has some disadvantages, e.g. a "schemaless" working mode is not possible. We worked around the problem by adding "ListOfDict" support, where the columns are automatically detected and the DDL commands to create a schema need not be maintained manually (see the sketch below).
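The idea behind the "ListOfDict" support can be sketched in a few lines: derive the column names from the dict keys, generate the CREATE TABLE statement, and insert the records with executemany. The snippet below is a simplified illustration; the actual implementation also handles type detection and other corner cases.
<source lang='python'>
# Sketch: derive a table schema from a list of dicts and query it with SQL.
import sqlite3

events = [
    {"acronym": "BIR 2019", "year": 2019, "country": "Germany"},
    {"acronym": "EuroPar 2020", "year": 2020, "country": "Poland"},
]

columns = sorted({key for record in events for key in record})
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE event (%s)" % ", ".join(columns))
connection.executemany(
    "INSERT INTO event (%s) VALUES (%s)" % (", ".join(columns), ", ".join("?" * len(columns))),
    [tuple(record.get(column) for column in columns) for record in events],
)

# the same kind of query as used in the data analysis above
for row in connection.execute("SELECT year, count(*) FROM event GROUP BY year ORDER BY 1 DESC"):
    print(row)
</source>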
Using SQL instead of a graph database or triplestore based approach also has disadvantages, e.g. graph queries are not directly possible but need to be translated to SQL JOIN-based relational queries. Given the low number of entities to be stored we hope this will not be too troublesome.
performance result for the "cities" example
see https://github.com/WolfgangFahl/DgraphAndWeaviateTest/blob/master/tests/testSqlite3.py
adding 128769 City records took 0.300 s => 428910 records/s
selecting 128769 City records took 0.300 s => 428910 records/s
Other Related Services
- https://anystyle.io/
- https://github.com/inukshuk/anystyle/
History of Library Catalogs
Digitization of library catalogs started with punched cards (Dewey1959) and continued to modern online catalogs.