Focused Tutorial on Capturing, Enriching, Disseminating Research Data Objects

Use Cases from Text+, NFDI4Culture and BERD@NFDI

This cross-disciplinary tutorial on tools and methods provides brief introductions to use cases from NFDI4Culture, Text+ and BERD@NFDI.

Presented use cases address the capture, enrichment and dissemination of research data objects. Capture involves the creation of digital surrogates (e.g. with OCR) or the conversion of existing artefacts into a digital representation (e.g. transcription). This is followed by the enrichment (e.g. annotation, tagging and association) of research data objects and concludes with their dissemination, i.e. making them available and sharing them to support collaboration and reuse.

This Focused Tutorial will provide knowledge as well as methodological and technical expertise in the areas of data, metadata, taxonomies and standards with a view to the FAIR principles, and promote cross-disciplinary exchange and networking between the participating consortia.

Recap of the event

The very first hybrid BERD Academy Focused Tutorial took place on November 24 and 25, 2022. This tutorial highlighted use cases from the NFDI consortia Text+, NFDI4Culture and BERD@NFDI, providing an interdisciplinary insight into current research on unstructured data. In three sessions, methodological and technical expertise for the capture, enrichment and dissemination of research data objects was discussed, and exchange and networking between the participating consortia were promoted.

Based on the FAIR principles (https://www.go-fair.org/fair-principles), this Focused Tutorial concentrated on the first step of the FAIRification process: the findability of research data. In order to make research data easily discoverable, the data must be machine readable. To address this process, the Focused Tutorial was divided into three sessions with three presentations each:

Jan Kamlah, Thomas Schmidt and Renat Shigapov opened the first session with their presentation “Extracting research data from historical documents with eScriptorium and Python”. They presented an eScriptorium- and Python-based workflow for extracting and structuring research data from a historical company directory. Using the OCR engine Kraken, text and layout recognition could be optimally adapted to the source material even with a small training data set. This talk was followed by a complementary presentation entitled “Mixing and Matching: Combining HTR and manual transcription for heterogeneous collections” by Melanie Seltmann and Stefan Büdenbender. In their Citizen Science project “Gruß & Kuss” they will determine whether using the HTR software Transkribus is worthwhile for handwritten corpora of different sizes. The findings and related training material will later be integrated into the Text+ registry. The session was concluded by Keli Du with his talk “Derived text formats and their evaluation”. His contribution addressed the usability of derived text formats (DTFs) and the associated challenges for a legally compliant reproducibility of protected texts.

The second session began with the presentation “Generic vocabularies for RDM: Use and reuse of TaDiRAH” by Luise Borek and Canan Hastik. This presentation drew attention to the relevance of controlled multilingual vocabularies, using the Taxonomy of Digital Research Activities in the Humanities as an example. In his talk “Editions and Linked Data: Some modes of application”, Philipp Hegel presented four theoretical and methodological application scenarios: the use of external resources, the integration of external information, the development of domain-specific ontologies, and associations. Building on this, Robert Nasarek presented WissKI, the content management service for Linked Open Data of the Germanisches Nationalmuseum. This CIDOC CRM-based scholarly communication infrastructure offers solutions supporting the FAIR principles throughout the research data lifecycle.

In the third session, presentations by Lozana Rossenova, Renat Shigapov and Stefan Dietze presented use cases of knowledge graphs in the consortia NFDI4Culture and BERD@NFDI as well as at GESIS. With “Wikibase+ Semantic enrichment of media”, Lozana Rossenova presented Wikibase as a primary data management environment for linking media objects with annotations and their cultural context, as well as for the use of authority data. Renat Shigapov provided a general overview of activities in the context of NFDI with his presentation “Knowledge graphs in BERD and NFDI”, starting from challenges in the development of a knowledge graph for German companies. Finally, Stefan Dietze’s talk “Scholarly information extraction for research knowledge graphs” showcased the application of NLP techniques for the identification and disambiguation of software and data sets in subject-specific scientific sources.

The diverse range of topics addressed at the Focused Tutorial attracted a broad audience of well over 140 participants. The key insight delivered by this workshop into the state of the art of capturing and creating digital surrogates by means of OCR was:

The dissemination, provision and sharing of digital representations of existing artifacts (including transcriptions, annotation, tagging and linking) supports collaboration and the reuse of that information and opens up new perspectives for interdisciplinary connectivity.

We thank all contributors and participants for this fruitful exchange and look forward to the next Focused Tutorial.

By Canan Hastik

Program Details and Slides

Capturing Research Data Objects

Hybrid, University of Mannheim, B6, Room 81-83

24 November 2022

10.30 a.m. – 12.45 p.m.

“Extracting research data from historical documents with eScriptorium and Python” by Jan Kamlah, Thomas Schmidt and Renat Shigapov

Download presentation

This talk will present a workflow based on eScriptorium and Python to extract research data from historical documents. eScriptorium is a rather young transcription tool that uses the OCR engine Kraken. The software offers the possibility of optimally adapting not only the text recognition but also the layout recognition to the source material by means of training. Due to the high quality requirements for research data, this step is necessary in many cases. By using existing base models, the training effort can be drastically reduced. The text recognition results can then be exported in PAGE-XML format for further processing. For this purpose, the Python tool “blatt” was developed within the project. It can parse the PAGE-XML exports, sort and extract the contents using algorithms and templates, and convert them into a structured table format such as CSV. The first part of the presentation gives a short introduction to the topic, the source material and the research question. It will then be shown how a training process based on a base model with minimal training data can be performed using the software eScriptorium, and which problems to pay attention to. In the last section, the Python tool “blatt” is presented, as well as the underlying ideas and algorithms.
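To illustrate the kind of post-processing this workflow automates, here is a minimal Python sketch that reads line transcriptions from a PAGE-XML export and writes them to a CSV table. It is not the project’s “blatt” tool; the file names are placeholders, and the PAGE namespace version may differ depending on the eScriptorium export.

```python
# Minimal sketch: extract line transcriptions from a PAGE-XML export and
# write them to CSV. This is NOT the project's "blatt" tool, only an
# illustration of the general parsing step it automates.
import csv
from lxml import etree

# PAGE-XML namespace (the version string may differ depending on the export).
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

def page_xml_to_rows(path):
    """Yield (region_id, line_id, text) tuples from one PAGE-XML file."""
    tree = etree.parse(path)
    for region in tree.iterfind(".//pc:TextRegion", namespaces=NS):
        for line in region.iterfind(".//pc:TextLine", namespaces=NS):
            unicode_el = line.find("./pc:TextEquiv/pc:Unicode", namespaces=NS)
            text = unicode_el.text if unicode_el is not None else ""
            yield region.get("id"), line.get("id"), text or ""

# Hypothetical file names, for illustration only.
with open("directory_1910.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["region", "line", "text"])
    writer.writerows(page_xml_to_rows("directory_1910_page_001.xml"))
```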

“Mixing and matching: Combining HTR and manual transcription for heterogeneous collections” by Melanie Seltmann and Stefan Büdenbender

In the Text+ consortium, tasks include the creation of a registry of resources in the different data domains and the collection of tutorials and workshops on common tools and work processes. For this purpose, various use cases from the participating institutions are being investigated. The results will be incorporated into corresponding collections.

One such use case for the data domain Editions is the Citizen Science project Gruß & Kuss – Briefe digital. Bürger*innen erhalten Liebesbriefe, in which love letters of ordinary people are transcribed and explored with the help of citizen scientists. The paper will present explorative approaches to using HTR for transcribing this heterogeneous corpus. We will investigate for which size of a bundle, i.e. for which number of pages in the same handwriting, the training and use of an HTR model pays off as an alternative or addition to manual transcription.

“Derived text formats and their evaluation” by Keli Du

Text and data mining (TDM) using copyrighted texts faces many limitations in terms of storage, publication and subsequent use of the resulting corpus. To solve this problem, texts can be transformed into Derived Text Formats (DTFs) by removing copyright-related features. To ensure that the text in DTFs can still be used for various TDM tasks in DH, such as authorship attribution or sentiment analysis, it is necessary to evaluate the extent to which the information loss caused by DTFs affects the TDM results. In my presentation, I will provide an overview of the different DTFs and some preliminary evaluation results.
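As an illustration of the idea, the following Python sketch turns a text into one conceivable derived format: a bag-of-words frequency table that discards word order (and with it the copyright-relevant full text). The DTFs evaluated in the talk may be defined quite differently.

```python
# Sketch of one possible derived text format: a bag-of-words table that
# drops word order while keeping token frequencies for TDM tasks such as
# authorship attribution. Illustrative only; actual DTFs may differ.
from collections import Counter
import csv
import re

def to_term_frequencies(text):
    """Lowercase, tokenize and count tokens; the original sequence is lost."""
    tokens = re.findall(r"\w+", text.lower(), flags=re.UNICODE)
    return Counter(tokens)

protected_text = "Example sentence from a copyrighted novel ..."  # placeholder
freqs = to_term_frequencies(protected_text)

with open("dtf_bag_of_words.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["token", "frequency"])
    writer.writerows(sorted(freqs.items()))
```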

Enriching Research Data Objects

Hybrid, University of Mannheim, B6, Room 81-83

24 November 2022

2.15 p.m. – 4.30 p.m.

“Generic vocabularies for RDM: Use and reuse of TaDiRAH” by Luise Borek and Canan Hastik

The sustainable use and reuse of resources from science and research requires systematic documentation and interlinking. By standardizing the language used for documentation, better reusability of resources can be guaranteed. The Taxonomy of Digital Research Activities in the Humanities (TaDiRAH) provides a controlled vocabulary for identifying and describing research activities; it not only enables an active exchange with the communities, but also supports their mapping and machine-readable representation. This makes it possible to develop a semantic framework for the description of scientific concepts that enables cross-disciplinary and cross-cultural crosswalks, for example by interlinking with research objects and through multilingualism. TaDiRAH is not only an offer to the DH community in German-speaking countries, but is now available in 7 languages and is used in DARIAH-EU services, the SSHOC Marketplace, the Course Registry Catalog, and the text tool collection TAPoR to add value for end users and to promote cross-disciplinary collaborative work.
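For readers who want to work with the vocabulary programmatically, the following Python sketch loads a local SKOS copy of TaDiRAH with rdflib and prints the multilingual preferred labels of a few concepts. The file name “tadirah.ttl” is a placeholder for wherever the official distribution has been downloaded to.

```python
# Minimal sketch: read a local SKOS copy of TaDiRAH with rdflib and list
# multilingual preferred labels per concept. "tadirah.ttl" is a placeholder.
from collections import defaultdict
from rdflib import Graph
from rdflib.namespace import SKOS

g = Graph()
g.parse("tadirah.ttl", format="turtle")

labels = defaultdict(list)
for concept, label in g.subject_objects(SKOS.prefLabel):
    labels[concept].append((label.language, str(label)))

# Print the first few concepts with their labels, grouped by language tag.
for concept, entries in list(labels.items())[:5]:
    print(concept)
    for lang, text in sorted(entries, key=lambda e: (e[0] or "", e[1])):
        print(f"  [{lang}] {text}")
```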

“Editions and Linked Data: Some modes of application” by Philipp Hegel

Some preliminary thoughts about a case study on alchemical terminology will be connected with the question of how the practice of philological and historical commenting changes in the digital medium when semantic web technologies are applied. The theoretical and methodological approach of this case study will be illustrated by comparing it with three other modes of application of linked data in digital editions. The planned mode of application by associations differs from the other three in the arrangement of the research process. All four modes will be demonstrated with exemplary commentaries on some loci in Michael Maier's Atalanta fugiens (1617).

“Linked open data management with WissKI” by Mark Fichtner and Robert Nasarek

WissKI is a content management service for linked open data. As a module of the digital experience platform Drupal, it offers solutions for tasks within the entire research data lifecycle. The tutorial provides insights into the basic functionality of ontology-based content management and demonstrates the most important features and properties of the software using online accessible live systems.
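As a rough illustration of what such an ontology-based store makes possible, the sketch below queries CIDOC CRM-modelled data via SPARQL. The endpoint URL is a placeholder, and the exact classes and properties depend on the CRM version and on how the data are modelled in a given WissKI instance.

```python
# Illustrative sketch only: querying CIDOC CRM-modelled data from the triple
# store behind a WissKI instance via SPARQL. The endpoint URL is hypothetical;
# class and property choices depend on the concrete data model.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("https://example.org/wisski/sparql")  # placeholder URL
endpoint.setQuery("""
PREFIX crm:  <http://www.cidoc-crm.org/cidoc-crm/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?object ?label WHERE {
  ?object a crm:E22_Human-Made_Object ;
          rdfs:label ?label .
} LIMIT 10
""")
endpoint.setReturnFormat(JSON)

for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["object"]["value"], "-", row["label"]["value"])
```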

Disseminating Research Data Objects

Hybrid, University of Mannheim, B6, Room 81-83

25 November 2022

10.00 a.m. – 12.30 p.m.

“Wikibase+ Semantic enrichment of media” by Lozana Rossenova

Download presentation

In the context of NFDI4Culture, we encounter heterogeneous 2D and 3D representations of cultural assets, and a range of associated research data artifacts, which pose significant challenges to standardized access and visualisation tools. To bridge the gaps between traditional data management tools and media-rendering environments, at TIB’s Open Science Lab we developed a suite of free and open source tools centred on Wikibase as a primary data management environment for linked open data. This FOSS toolchain facilitates linking media objects and annotations, and their cultural context (including historical people and places, geo-location and capture-technology metadata), to the broader semantic web, including national and international authority records.
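As a small illustration of working with such a Wikibase instance programmatically, the following Python sketch fetches an item via the standard Wikibase API. The base URL and item ID are placeholders, and the property IDs used for media links and annotations differ between instances.

```python
# Sketch of retrieving an item from a custom Wikibase instance via the
# standard MediaWiki/Wikibase API. Base URL and item ID are placeholders.
import requests

API = "https://example-wikibase.org/w/api.php"  # placeholder instance URL

def get_entity(qid):
    """Fetch the full JSON representation of a Wikibase item."""
    params = {"action": "wbgetentities", "ids": qid, "format": "json"}
    response = requests.get(API, params=params, timeout=30)
    response.raise_for_status()
    return response.json()["entities"][qid]

item = get_entity("Q42")  # hypothetical item, e.g. a 3D capture of an artefact
print(item["labels"].get("en", {}).get("value"))
for prop, statements in item.get("claims", {}).items():
    print(prop, len(statements), "statement(s)")
```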

“Knowledge graphs in BERD and in NFDI” by Renat Shigapov

Download presentation

Knowledge graphs are able to capture, enrich and disseminate research data objects so that the FAIR and Linked Data principles are fulfilled. How can knowledge graphs improve domain-specific (BERD) and cross-domain (NFDI) research data infrastructures? The answer is based on the use cases in BERD@NFDI and on the activities of the NFDI working group “Knowledge Graphs”. First, we describe the architecture, knowledge graphs and use cases in BERD@NFDI. Then, we present the NFDI working group “Knowledge Graphs”, its work plan and potential base services.

“Scholarly Information Extraction for research knowledge graphs” by Stefan Dietze

Download presentation

This talk will give an overview of research knowledge graphs in practice at GESIS and related NFDI consortia (e.g. BERD, NFDI4DS), their use and application, and related techniques for knowledge graph construction. With respect to the latter, we will focus on scholarly information extraction techniques using state-of-the-art NLP models that are able to detect and disambiguate software or dataset mentions in scholarly publications or to semantically annotate social web content.
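As a toy illustration of the extraction task (not of the models discussed in the talk), the following Python sketch spots a few software and dataset mentions with a simple rule-based matcher. Real systems rely on trained NLP models and on disambiguation against a knowledge graph rather than a fixed pattern list.

```python
# Toy sketch of scholarly information extraction: spotting software and
# dataset mentions in text with a rule-based spaCy EntityRuler. Real systems
# use trained models and knowledge-graph disambiguation instead.
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "SOFTWARE", "pattern": "SPSS"},
    {"label": "SOFTWARE", "pattern": "Stata"},
    {"label": "DATASET", "pattern": [
        {"LOWER": "european"}, {"LOWER": "social"}, {"LOWER": "survey"},
    ]},
])

doc = nlp("We analysed the European Social Survey with SPSS and Stata.")
for ent in doc.ents:
    print(ent.text, ent.label_)
```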