Focused Tutorial on Capturing, Enriching, Disseminating Research Data Objects

Use Cases from Text+, NFDI4Culture and BERD@NFDI

This cross-disciplinary tutorial on tools and methods provides brief introductions to use cases from NFDI4Culture, Text+ and BERD@NFDI.

The use cases presented address the capture, enrichment and dissemination of research data objects. Capture involves the creation of digital surrogates (e.g. with OCR) or the representation of existing artefacts in digital form (e.g. transcription). This is followed by the enrichment of research data objects (e.g. annotation, tagging and association) and concludes with their dissemination, i.e. making them available and sharing them to support collaboration and reuse.

This Focused Tutorial will provide knowledge as well as methodological and technical expertise in the areas of data, metadata, taxonomies and standards with a view to the FAIR principles, and will promote cross-disciplinary exchange and networking between the participating consortia.

Capturing Research Data Objects

Hybrid, University of Mannheim, B6, Room 81-83

24 November 2022

10.30 a.m. – 12.45 p.m.

“Extracting research data from historical documents with eScriptorium and Python” by Jan Kamlah, Thomas Schmidt and Renat Shigapov

This talk will present a workflow based on eScriptorium and Python to extract research data from historical documents. eScriptorium is a rather young transcription tool and uses the OCR engine Kraken. The software allows both the text recognition and the layout recognition to be adapted to the source material by means of training. Due to the high quality requirements for research data, this step is necessary in many cases. By using existing base models, the training effort can be drastically reduced. The text recognition results can then be exported in PAGE-XML format for further processing. For this purpose, the Python tool “blatt” was developed within the project. It can parse the PAGE-XML exports, sort and extract the contents using algorithms and templates, and convert them into a structured table format such as CSV. The first part of the presentation gives a short introduction to the topic, the source material and the research question. It will then be shown how a training process based on a base model with minimal training data can be performed using eScriptorium, and which problems to watch out for. In the last section, the Python tool “blatt” is presented, along with the underlying ideas and algorithms.
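The step from a PAGE-XML export to a table can be illustrated with a minimal sketch. It is not the “blatt” tool and does not reproduce its algorithms or templates; it only assumes the general PAGE-XML layout (TextLine elements with a Coords polygon and a TextEquiv/Unicode text). The sample document, the function names and the simple top-to-bottom sorting rule are hypothetical choices made for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, hand-written PAGE-XML fragment; a real export contains more metadata.
SAMPLE_PAGE_XML = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="page_001.png" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1">
      <TextLine id="l2">
        <Coords points="100,300 900,300 900,350 100,350"/>
        <TextEquiv><Unicode>Second line</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="l1">
        <Coords points="100,200 900,200 900,250 100,250"/>
        <TextEquiv><Unicode>First line</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""


def top_left(points):
    """Return (top, left) of a polygon given as 'x,y x,y ...'."""
    pairs = [tuple(int(float(v)) for v in p.split(",")) for p in points.split()]
    return min(y for _, y in pairs), min(x for x, _ in pairs)


def page_xml_to_rows(xml_text):
    """Extract (line_id, top, left, text) per TextLine, sorted top to bottom."""
    root = ET.fromstring(xml_text)
    rows = []
    # "{*}" matches any namespace (Python 3.8+), so the schema version does not matter.
    for line in root.findall(".//{*}TextLine"):
        coords = line.find("{*}Coords")
        if coords is None or not coords.get("points"):
            continue  # lines without geometry cannot be sorted; skip them
        top, left = top_left(coords.get("points"))
        unicode_el = line.find("{*}TextEquiv/{*}Unicode")
        text = unicode_el.text or "" if unicode_el is not None else ""
        rows.append((line.get("id", ""), top, left, text))
    # Assumption: a single-column page, so sorting by (top, left) gives reading order.
    return sorted(rows, key=lambda r: (r[1], r[2]))


def rows_to_csv(rows):
    """Serialise the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["line_id", "top", "left", "text"])
    writer.writerows(rows)
    return buffer.getvalue()


print(rows_to_csv(page_xml_to_rows(SAMPLE_PAGE_XML)))
```

Multi-column or tabular sources would need a more careful ordering than the single sort key used here, which is presumably where templates become useful.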

“Mixing and matching: Combining HTR and manual transcription for heterogeneous collections” by Melanie Seltmann and Stefan Büdenbender

In the Text+ consortium, tasks include the creation of a registry of resources in the different data domains and the collection of tutorials and workshops on common tools and work processes. For this purpose, various use cases from the participating institutions are being investigated. The results will be incorporated into the corresponding collections.

One such use case for the data domain Editions is the Citizen Science project Gruß & Kuss – Briefe digital. Bürger*innen erhalten Liebesbriefe, in which love letters written by ordinary people are transcribed and explored with the help of citizen scientists. The paper will present exploratory approaches to using HTR for transcribing this heterogeneous corpus. We will investigate at what bundle size, i.e. from how many pages in the same handwriting, training and using an HTR model pays off as an alternative or addition to manual transcription.

“Derived text formats and their evaluation” by Keli Du

Text and data mining (TDM) using copyrighted texts faces many limitations in terms of storage, publication and subsequent use of the resulting corpus. To solve this problem, texts can be transformed into Derived Text Formats (DTFs) by removing copyright-related features. To ensure that the text in DTFs can still be used for various TDM tasks in DH, such as authorship attribution or sentiment analysis, it is necessary to evaluate the extent to which the information loss caused by DTFs affects the TDM results. In my presentation, I will provide an overview of the different DTFs and some preliminary evaluation results.
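As an illustration of the general idea, the following minimal sketch derives a simple word-frequency representation from a text. It assumes a bag-of-words table per fixed-size segment as the derived format, which discards word order within each segment; the formats and evaluation covered in the talk are not specified here, and the function names and segment size are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def segment_frequencies(text, segment_size=5):
    """Return one token-frequency table per segment of `segment_size` tokens.

    Word order inside a segment is lost, so the running text cannot be
    read back from the result, while frequency information is retained.
    """
    tokens = tokenize(text)
    return [
        Counter(tokens[i:i + segment_size])
        for i in range(0, len(tokens), segment_size)
    ]


sample = "The cat sat on the mat and the dog sat too"
for number, table in enumerate(segment_frequencies(sample), start=1):
    print(number, dict(table))
```

Larger segments remove more sequence information but also hide more local context, which is the kind of trade-off an evaluation of TDM results would need to measure.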

Enriching Research Data Objects

Hybrid, University of Mannheim, B6, Room 81-83

24 November 2022

2.15 p.m. – 4.30 p.m.

“Generic vocabularies for RDM: Use and reuse of TaDiRAH” by Luise Borek and Canan Hastik

The sustainable use and reuse of resources from science and research requires systematic documentation and interlinking. By standardizing the language used for documentation, better reusability of resources can be achieved. The Taxonomy of Digital Research Activities in the Humanities (TaDiRAH) provides a controlled vocabulary for identifying and describing research activities. It not only enables an active exchange with the communities, but also supports the mapping and machine-readable representation of these activities. This makes it possible to develop a semantic framework for describing scientific concepts that enables cross-disciplinary and cross-cultural crosswalks, for example by interlinking with research objects and through multilingualism. TaDiRAH is not only an offer to the DH community in German-speaking countries: it is now available in 7 languages and is used in DARIAH-EU services, the SSHOC Marketplace, the Course Registry catalogue and the text tools collection TAPoR to add value for end users and to promote cross-disciplinary collaborative work.

“Editions and Linked Data: Some modes of application” by Philipp Hegel

Some preliminary thoughts about a case study on alchemical terminology will be connected with the question of how the practice of philological and historical commenting changes in the digital medium when semantic web technologies are applied. The theoretical and methodological approach of this case study will be illustrated by comparing it with three other modes of applying linked data in digital editions. The planned mode of application by associations differs from the other three in the arrangement of the research process. All four modes will be demonstrated with exemplary commentaries on some loci in Michael Maier's Atalanta fugiens (1617).

“Linked open data management with WissKI” by Mark Fichtner and Robert Nasarek

WissKI is a content management service for linked open data. As a module of the digital experience platform Drupal, it offers solutions for tasks within the entire research data lifecycle. The tutorial provides insights into the basic functionality of ontology-based content management and demonstrates the most important features and properties of the software using live systems that are accessible online.

Disseminating Research Data Objects

Hybrid, University of Mannheim, B6, Room 81-83

25 November 2022

10.00 a.m. – 12.30 p.m.

“Wikibase+ Semantic enrichment of media” by Lozana Rossenova

In the context of NFDI4Culture, we encounter heterogeneous 2D and 3D representations of cultural assets, and a range of associated research data artifacts, which pose significant challenges to standardized access and visualisation tools. To bridge the gaps between traditional data management tools and media-rendering environments, we at TIB’s Open Science Lab developed a suite of free and open source tools built around Wikibase as the primary data management environment for linked open data. This FOSS toolchain facilitates linking media objects and annotations, and their cultural context (including historical people and places, geo-location and capture-technology metadata), to the broader semantic web, including national and international authority records.

“Knowledge graphs in BERD and in NFDI” by Renat Shigapov

Knowledge graphs are able to capture, enrich and disseminate research data objects so that the FAIR and Linked Data principles are fulfilled. How can knowledge graphs improve the domain-specific (BERD) and cross-domain (NFDI) research data infrastructures? The answer is based on the use cases in BERD@NFDI and on the activities of the NFDI working group “Knowledge Graphs”. First, we describe the architecture, knowledge graphs and use cases in BERD@NFDI. Then, we present the NFDI working group “Knowledge Graphs”, its work plan and potential base services.

“Scholarly Information Extraction for research knowledge graphs” by Stefan Dietze

This talk will give an overview of research knowledge graphs in practice at GESIS and related NFDI consortia (e.g. BERD, NFDI4DS), their use and application and related techniques for knowledge graph construction. With respect to the latter, we will focus on scholarly information extraction techniques using state-of-the-art NLP models able to detect and disambiguate software or dataset mentions in scholarly publications or to semantically annotate social web data to construct research data knowledge graphs that are easy to interpret and use as part of inter-disciplinary research.