{"id":1946,"date":"2018-11-01T12:05:49","date_gmt":"2018-11-01T12:05:49","guid":{"rendered":"http:\/\/www.hirmeos.eu\/?p=1946"},"modified":"2018-11-01T13:00:40","modified_gmt":"2018-11-01T13:00:40","slug":"report-entity-fishing-for-scholarly-publishing-challenges-and-recommendations","status":"publish","type":"post","link":"https:\/\/www.hirmeos.eu\/2018\/11\/01\/report-entity-fishing-for-scholarly-publishing-challenges-and-recommendations\/","title":{"rendered":"Report – Entity-fishing for Scholarly Publishing: Challenges and Recommendations"},"content":{"rendered":"

*Report*, 1 November 2018

# *Entity-fishing* for Scholarly Publishing: Challenges and Recommendations

**Andrea Bertino, Luca Foppiano, Javier Arias, Aysa Ekanger and Klaus Thoden**

On 4 September 2018 the Göttingen State and University Library, with the support of the Max Weber Stiftung, organised the second HIRMEOS Workshop on *Entity-Fishing for Digital Humanities and Scholarly Publishing*.

*Entity-fishing*, a service developed by Inria with the support of DARIAH-EU and hosted at HUMA-NUM, enables the identification and resolution of entities: named entities such as person names, locations and organisations, as well as more specialist and less commonly classified ones, such as concepts and artifacts. The technical specifications of the service are described in the paper "*entity-fishing*: a service in the DARIAH infrastructure", while for a quick overview of how it has been implemented on the digital platforms involved in the HIRMEOS project you can have a look at the factsheet and at the recording of the webinar organised by the SUB Göttingen.

The workshop aimed to discuss and clarify practical concerns that arise when using the service, as well as possible new use cases presented by Edition Open Access, ScholarLed and Septentrio Academic Publishing.

**This report describes the challenges related to the development of these applications and provides recommendations for the integration and use of *entity-fishing* on digital publishing platforms. The solutions proposed can also be applied in other domains.**

## Beyond the *tradigital* format: the need for TEI XML

The new consortium ScholarLed is developing a common catalogue of all monographs published by several scholarly presses, with a recommendation system that will suggest books similar to those the user is browsing. While the initial proposal expected the similarity links to be created manually, ScholarLed intends to explore whether *entity-fishing* could automate the process by finding common entities within the catalogued books. Furthermore, ScholarLed is exploring other possible use cases aimed at improving the discoverability of their publications.
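As a purely illustrative sketch of how such automation might work, the snippet below ranks catalogue entries by the overlap of the Wikidata entities found in them. It assumes that each book has already been run through *entity-fishing* and that its Wikidata IDs have been collected into a set per book; the ISBNs and QIDs are invented placeholders.

```python
# Minimal sketch: rank catalogue entries by overlap of Wikidata entities.
# Assumes each book has already been processed with entity-fishing and the
# resulting Wikidata QIDs collected into a set per book (illustrative data).

def jaccard(a: set, b: set) -> float:
    """Similarity between two sets of Wikidata QIDs."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def most_similar(target: str, catalogue: dict, top_n: int = 5):
    """Return the top_n books sharing the most entities with `target`."""
    scores = [
        (other, jaccard(catalogue[target], entities))
        for other, entities in catalogue.items()
        if other != target
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]

# Hypothetical catalogue: ISBN -> set of Wikidata QIDs found by entity-fishing.
catalogue = {
    "978-1-78374-000-1": {"Q937", "Q11660", "Q21198"},
    "978-1-78374-000-2": {"Q937", "Q413"},
    "978-1-78374-000-3": {"Q5891", "Q8434"},
}

print(most_similar("978-1-78374-000-1", catalogue))
```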

Indeed, there is great potential in adopting Wikidata entities as a standard metadata export. What still needs to be clarified – observes Javier Arias from Open Book Publishers – is how this data is to be disseminated and promoted to distribution platforms and data miners. To increase interest in the service, the presses involved in ScholarLed consider it essential to find the best way to export the annotations and the associated Wikidata IDs along with the other book metadata, so that these data are not lost during book redistribution or reprocessing. To this end, it would be important to embed the identified Wikidata entities in their TEI XML files – a feature that is not yet available in the *entity-fishing* service.

\"\"<\/a><\/p>\n

*With regard to the realisation of a shared catalogue for ScholarLed, Javier Arias pointed out in his presentation some specific challenges concerning its implementation on the platforms, the accuracy of the service and the dissemination of entities and metadata.*

In a similar way, Klaus Thoden, who has tested how to enrich the content and extend the usability and interoperability of the books published by Edition Open Access, argues that it would be extremely important to add support for TEI XML files, allowing the user to submit an unannotated XML file and get back the file with the entities embedded as TEI annotations. In the case of Edition Open Access, a PDF file is just one output format among others, and only support for TEI XML would make the publications annotated through *entity-fishing* fully reusable.
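A round trip of this kind could in principle be approximated already today by sending the plain text of each TEI paragraph to the disambiguation endpoint and wrapping the returned spans in TEI `<rs>` elements. The sketch below assumes the public *entity-fishing* REST API (a `query` form field posted to `/service/disambiguate`, with `offsetStart`, `offsetEnd` and `wikidataId` in the response); the host URL and field names should be checked against the deployed version before relying on them.

```python
# Minimal sketch of a TEI round trip: send the plain text of a paragraph to the
# entity-fishing disambiguation endpoint and wrap the returned spans in TEI
# <rs> elements pointing to Wikidata. The endpoint URL and the response field
# names (offsetStart, offsetEnd, wikidataId) are assumptions based on the
# public API documentation.
import json
import requests

ENTITY_FISHING_URL = "https://cloud.science-miner.com/nerd/service/disambiguate"  # assumed host

def annotate_paragraph(text: str, lang: str = "en") -> str:
    query = {"text": text, "language": {"lang": lang}}
    resp = requests.post(ENTITY_FISHING_URL, files={"query": (None, json.dumps(query))})
    resp.raise_for_status()
    entities = resp.json().get("entities", [])
    # Splice <rs> tags in from the end so that earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["offsetStart"], reverse=True):
        qid = ent.get("wikidataId")
        if not qid:
            continue
        start, end = ent["offsetStart"], ent["offsetEnd"]
        wrapped = f'<rs ref="http://www.wikidata.org/entity/{qid}">{text[start:end]}</rs>'
        text = text[:start] + wrapped + text[end:]
    return f"<p>{text}</p>"

print(annotate_paragraph("Johann Christoph Sturm taught at the University of Altdorf."))
```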

\"\"<\/a><\/p>\n

*Four usage scenarios for Edition Open Access in Klaus Thoden's presentation: cross-publication discoverability of entities; interconnection of publications; links to taxonomies; search and browse functionality for entities*

There are therefore some inherent difficulties with the service in relation to the PDF format which, although it gives the user an experience of the digital document as close as possible to that of a traditional printed document, somewhat limits the reuse of the annotated publications. See Luca Foppiano's technical assessment of how to process input in TEI XML format.

Luca Foppiano notes that it could be interesting to allow manual annotation of PDFs; however, the resulting feature might not perform as well as expected, and implementing it might not be worth the effort.

The problem lies in how the data is segmented internally. Relying on the GROBID library, the PDF is divided into a list of *LayoutTokens*: every single word corresponds to roughly one token, but this can vary depending on how the PDF has been generated. A *LayoutToken* also contains coordinates, fonts and other useful information. It is important to understand that this process does not always work perfectly: the order of the tokens might not correspond to the actual reading order, and there are other small problems.

As of today, *entity-fishing* is the only tool available that maps entities onto the coordinates extracted from the PDF. The opposite direction, however, is more challenging: the coordinates associated with an entity might correspond to one or more layout tokens, so matching them can give imprecise results.
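The following toy example, which uses deliberately simplified structures rather than the actual GROBID data model, illustrates why this matching is imprecise: an entity bounding box that only partially covers a name will match some tokens and miss others.

```python
# Toy illustration of the coordinate-matching problem: an entity's bounding box
# is matched against LayoutToken-like boxes by geometric overlap. The fields are
# simplified assumptions, not the actual GROBID data model.
from dataclasses import dataclass

@dataclass
class TokenBox:
    text: str
    page: int
    x: float      # left edge in PDF points
    y: float      # top edge in PDF points
    width: float
    height: float

def overlaps(box: TokenBox, page: int, x: float, y: float, w: float, h: float) -> bool:
    """True if the token box intersects the entity box on the same page."""
    if box.page != page:
        return False
    return not (box.x + box.width < x or x + w < box.x or
                box.y + box.height < y or y + h < box.y)

tokens = [
    TokenBox("Johann", 1, 100.0, 200.0, 38.0, 11.0),
    TokenBox("Christoph", 1, 141.0, 200.0, 52.0, 11.0),
    TokenBox("Sturm", 1, 196.0, 200.0, 31.0, 11.0),
]

# An entity box that only partially covers the name: the match is ambiguous,
# which is exactly why mapping annotations back onto tokens can be imprecise.
matched = [t.text for t in tokens if overlaps(t, 1, 150.0, 199.0, 60.0, 13.0)]
print(matched)  # ['Christoph', 'Sturm'] -- "Johann" is missed
```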

A solution to this challenge could be the development of a new graphical user interface that loads a PDF and gives the user the possibility to annotate it manually. The resulting annotations would then be compatible with the coordinates detected for the *LayoutTokens*.


## How much editorial work is required to validate the outputs of *entity-fishing*?

Scholarly publishers willing to adopt *entity-fishing* in order to allow deeper interaction with their content are interested in maintaining a balance between the benefits of such implementations and the amount of editorial work associated with them. It is necessary to develop a workflow that allows smooth curation of the data obtained from the system, but it is difficult to assess how much time the curation of the extracted entities requires before the editors have tested the system. Therefore, the first step in implementing *entity-fishing* must be to clarify carefully the specific kind of service the publisher is looking for, rather than discovering possible applications on the go. Klaus Thoden notes that since *entity-fishing* is a machine-driven approach and can produce results that are clearly wrong or wrongly disambiguated, the user must also be made aware of how the results came about, since the credibility of a source suffers from wrong results. If, however, it is made clear that the displayed enrichments result from automatic annotation, the user will take the information with a pinch of salt.


## How to store the data for the extracted entities and present fully annotated text?

When using *entity-fishing* to enrich the content of digital publications, we need to store the data for the extracted terms. At least two solutions are possible. A first scenario is that the editor modifies the contents of the paragraph fields in the database and adds additional span elements. Alternatively, the data can be stored elsewhere using stand-off annotation per paragraph.

In addition, the viewer used to present the processed document should be adapted appropriately. For example, to separate the authored content from the automatic tagging, the publication viewer should contain a button with which the automated enrichments can be toggled. Thus the user still has the choice of seeing or ignoring these enrichments. Furthermore, a functionality could be implemented that allows incorrect terms to be corrected or flagged.
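The stand-off option can be pictured with the minimal, purely hypothetical sketch below: the authored paragraph stays untouched, the enrichments live in a separate structure keyed by paragraph identifier, and the viewer decides at rendering time whether to show them. The JSON shape and field names are assumptions, not a fixed schema.

```python
# Minimal sketch of stand-off annotation per paragraph: the authored text is
# left untouched and the enrichments are stored alongside it, keyed by
# paragraph identifier. The JSON shape is an assumption, not a fixed schema.

paragraph = {
    "id": "p-0042",
    "text": "Albert Einstein developed the theory of general relativity.",
}

standoff_annotations = {
    "p-0042": [
        # character offsets into the paragraph text + the resolved Wikidata ID
        {"start": 0, "end": 15, "wikidataId": "Q937", "source": "entity-fishing",
         "status": "unreviewed"},   # an editor can later set "accepted"/"rejected"
    ]
}

def render_html(par: dict, annotations: dict, show_enrichments: bool = True) -> str:
    """Produce viewer markup; enrichments can be toggled off by the reader."""
    text = par["text"]
    if not show_enrichments:
        return f'<p id="{par["id"]}">{text}</p>'
    for ann in sorted(annotations.get(par["id"], []), key=lambda a: a["start"], reverse=True):
        span = (f'<span class="entity" data-wikidata="{ann["wikidataId"]}">'
                f'{text[ann["start"]:ann["end"]]}</span>')
        text = text[:ann["start"]] + span + text[ann["end"]:]
    return f'<p id="{par["id"]}">{text}</p>'

print(render_html(paragraph, standoff_annotations, show_enrichments=True))
```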


## Can *entity-fishing* be used to disambiguate digitized old documents?

Aysa Ekanger explored the possibility of using *entity-fishing* to recognise old toponyms and person names and disambiguate them automatically in a series of digitized books, *Aurorae Borealis Studia Classica*, and possibly in some historical maps that are now being digitized at the University Library of Tromsø. Old maps are of interest both to scholars and to the general public. For digitized maps, two hypothetical use cases were discussed: annotations, and using *entity-fishing* to produce an interactive map view with an old map and a modern equivalent side by side.

Another instance of map production could come from Septentrio Reports, where a large number of archaeological excavation reports from Tromsø Museum are about to be published (this is born-digital text). Apart from toponym entries in annotated PDFs, the *entity-fishing* tool could be used to produce a map of excavations. The map could be visualized through an appropriate tool and placed at the level of an issue, as one issue should contain 6–7 reports from different excavation sites.
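One way such a map could be produced, sketched below under the assumption that the toponyms have already been resolved to Wikidata IDs (for instance by *entity-fishing*), is to look up the coordinate property (P625) of each entity through Wikidata's public EntityData endpoint and emit a GeoJSON layer for the issue. The report titles are invented and the QIDs (Berlin and Paris) are placeholders for real excavation sites.

```python
# Sketch: turn toponym entities (already resolved to Wikidata IDs, e.g. by
# entity-fishing) into a GeoJSON layer for an issue-level excavation map.
# Uses Wikidata's public EntityData endpoint and the coordinate property P625.
import json
import requests

def wikidata_coordinates(qid: str):
    """Return (longitude, latitude) from the entity's P625 claim, if any."""
    url = f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"
    entity = requests.get(url, timeout=30).json()["entities"][qid]
    claims = entity.get("claims", {}).get("P625", [])
    if not claims:
        return None
    value = claims[0]["mainsnak"]["datavalue"]["value"]
    return value["longitude"], value["latitude"]

def issue_to_geojson(reports: dict) -> dict:
    """reports: {report title: [Wikidata QIDs of excavation sites]}."""
    features = []
    for title, qids in reports.items():
        for qid in qids:
            coords = wikidata_coordinates(qid)
            if coords is None:
                continue
            features.append({
                "type": "Feature",
                "geometry": {"type": "Point", "coordinates": list(coords)},
                "properties": {"report": title, "wikidata": qid},
            })
    return {"type": "FeatureCollection", "features": features}

# Placeholder issue with two hypothetical reports; Q64 (Berlin) and Q90 (Paris)
# stand in for the Wikidata entities of actual excavation sites.
reports = {"Excavation report 1": ["Q64"], "Excavation report 2": ["Q90"]}
print(json.dumps(issue_to_geojson(reports), indent=2))
```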

Old digitized texts present some specific challenges. For instance, there is a custom of referring to scholars by latinized names: Johann Christoph Sturm is referred to as "Sturmius" in a German text from 1728. The old toponyms to be disambiguated would be of at least two types: names of places that no longer exist and archaic names of existing places (including old orthography). Achieving this kind of disambiguation has two requirements: 1) the terms are contained in a knowledge base, and 2) there is a piece of code which can recognise the relevant tokens in the text.
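Requirement 1 can be probed directly: the sketch below uses Wikidata's public search API (`action=wbsearchentities`) to check whether a latinized name or an archaic toponym resolves to any entity at all. Whether a particular alias such as "Sturmius" is actually recorded depends on the aliases present in Wikidata, so any hits should be treated as candidates for editorial review rather than as finished disambiguations.

```python
# Sketch for requirement 1: check whether an old or latinized name is present
# in the knowledge base at all, here via Wikidata's public search API
# (action=wbsearchentities). Hits are candidates to review, not final results.
import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def search_knowledge_base(term: str, lang: str = "en", limit: int = 5):
    params = {
        "action": "wbsearchentities",
        "search": term,
        "language": lang,
        "format": "json",
        "limit": limit,
    }
    data = requests.get(WIKIDATA_API, params=params, timeout=30).json()
    return [(hit["id"], hit.get("label", ""), hit.get("description", ""))
            for hit in data.get("search", [])]

for name in ["Sturmius", "Christiania"]:   # a latinized name and an archaic toponym
    print(name, "->", search_knowledge_base(name))
```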

\"\"<\/a><\/p>\n

*Aysa Ekanger's presentation with three possible usage scenarios: PDF annotation, word clouds and digitized maps*

A further relevant technical concern, though not directly connected to the *entity-fishing* service, is the inaccuracy of OCR: in digitized historical texts, 1–8% of the OCR output is wrong. Of course, improving the OCR is necessary, but in the meantime the editors of the text in question may have to resort to "manual" labour for a small percentage of the text: any text-mining application would therefore require a preliminary (manual or semi-automatic) correction of the OCR errors in the PDF.


## Best practices on how to use *entity-fishing* in scholarly publishing

During the discussion, it became evident how important it is to have an exact preliminary definition of the **usage scenario** that a content provider such as a publisher intends to develop on the basis of *entity-fishing*. In particular, it is important to ascertain **how much work is required** to manage the specific publishing service based on *entity-fishing*.

From a more technical point of view, these are the key recommendations made by the developers of *entity-fishing*:

1. **Consider the web interface as a prototype and not as a production-ready application.** The demo interface shows "what can be done". There is a step in the middle, namely adapting the service to your own requirements, and this has to be done by a data scientist: someone who can quickly look at the data and get some information out of it. For example, the language R can be used to read, visualise and manipulate a list of JSON objects. In addition, you should avoid evaluating the service on the basis of restricted manual tests. A proper assessment must be carried out with the correct tooling and should converge on a prototype showing some (restricted) results.

2. **Do not perform on-the-fly computation, but store your annotations in a database and retrieve them when needed** (see the sketch after this list). There are plenty of reasons to support this recommendation, the most important are:
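Independently of those reasons, the pattern itself is simple, as the minimal sketch below shows: annotations are computed once, stored in a database keyed by document identifier, and served from there afterwards. SQLite is used only to keep the example self-contained, and `annotate` is a placeholder standing in for a real call to the *entity-fishing* service, not its actual client.

```python
# Minimal sketch of recommendation 2: compute annotations once, store them in a
# database, and serve them from there afterwards. SQLite keeps the sketch
# self-contained; `annotate` is a placeholder for the entity-fishing service.
import json
import sqlite3

def annotate(text: str) -> list:
    """Placeholder for a call to the entity-fishing disambiguation service."""
    return [{"rawName": "Sturmius", "wikidataId": "Q-EXAMPLE",
             "offsetStart": 0, "offsetEnd": 8}]

conn = sqlite3.connect("annotations.db")
conn.execute("""CREATE TABLE IF NOT EXISTS annotations (
                    doc_id TEXT PRIMARY KEY,
                    entities TEXT NOT NULL)""")

def get_annotations(doc_id: str, text: str) -> list:
    """Return stored annotations, computing and storing them only on a miss."""
    row = conn.execute("SELECT entities FROM annotations WHERE doc_id = ?",
                       (doc_id,)).fetchone()
    if row is not None:
        return json.loads(row[0])
    entities = annotate(text)
    conn.execute("INSERT INTO annotations (doc_id, entities) VALUES (?, ?)",
                 (doc_id, json.dumps(entities)))
    conn.commit()
    return entities

print(get_annotations("doc-001", "Sturmius was a professor at Altdorf."))
print(get_annotations("doc-001", ""))   # served from the database, no recomputation
```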