Shaping Data in Digital Humanities

CCC invites you to this seminar on Shaping Data in Digital Humanities. The seminar brings together researchers working in the broad area of digital humanities to elucidate the potential of language processing techniques for research in the humanities, such as literature and culture.

The seminar is open to all, but registration is required.
Registration is now closed.

Programme

See abstracts below

09:30 - 10:30 Network Analysis for Novels
Caroline Sporleder, University of Göttingen
10:30 - 11:00 Automatic Scansion of Poetry: Can empirical methods help? Towards unsupervised scansion of poetry
Manex Agirrezabal, University of Copenhagen
11:00 - 11:30 Coffee
11:30 - 12:30 Towards interoperability in the European poetry community
Helena Bermúdez, Universidad Nacional de Educación a Distancia, Madrid, Spain
12:30 - 13:30 Joint lunch
13:30 - 14:00 Between automatic textual annotation and manual proofreading: Working with highly annotated medieval primary sources
Alex Speed Kjeldsen/Anne Mette Hansen, University of Copenhagen
14:00 - 14:30 Shaping data by linking and contextualising
Tarrin Wills, University of Copenhagen
14:30 - 15:15 Coffee and panel discussion

Abstracts

Network Analysis for Novels
Caroline Sporleder, University of Göttingen

The automatic analysis of works of literature such as novels or poems is an interesting and intricate application of natural language processing. While early approaches to the computational analysis of literature often concentrated on capturing an author's style via shallow surface features (bags of words), more recent studies have looked beyond both stylometry and shallow features. The analysis of automatically computed character networks has received particular attention for modelling both style and narrative structure. In this talk, I will present two studies that investigate whether network analysis can be employed to cluster novels by genre and to identify authors.
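
As a concrete illustration, here is a minimal Python sketch, not taken from the studies presented, of how a character co-occurrence network might be built and summarised with a few graph features; the chapter data and the choice of features are purely illustrative.

```python
# A minimal sketch of building a character co-occurrence network and
# extracting simple graph features; not the pipeline from the studies
# presented in this talk. Assumes character mentions per chapter have
# already been extracted (the data below are invented).
import itertools

import networkx as nx

chapters = [
    ["Elizabeth", "Darcy", "Jane"],
    ["Elizabeth", "Jane", "Bingley"],
    ["Darcy", "Bingley"],
]

G = nx.Graph()
for characters in chapters:
    # Characters appearing in the same chapter are linked; repeated
    # co-occurrence increases the edge weight.
    for a, b in itertools.combinations(sorted(set(characters)), 2):
        weight = G.get_edge_data(a, b, default={"weight": 0})["weight"]
        G.add_edge(a, b, weight=weight + 1)

# Simple network features of the kind one might feed to a genre
# clustering or authorship model.
features = {
    "num_characters": G.number_of_nodes(),
    "density": nx.density(G),
    "avg_clustering": nx.average_clustering(G),
}
print(features)
```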

Automatic Scansion of Poetry: Can empirical methods help? Towards unsupervised scansion of poetry
Manex Agirrezabal, University of Copenhagen

The automatic analysis of poetic rhythm is a challenging task that involves linguistics, literature, and computer science. When the language to be analyzed is known, rule-based systems or data-driven methods can be used. In this talk, I will show how we analyzed poetic rhythm in English and Spanish. I will first show how an English poem can be tagged using some simple rules of thumb. Classification models likewise predict the prominence pattern of a line quite well, but, as expected, structured prediction models outperform them. I will then show that representations learned by character-based neural models are more informative than hand-crafted features, and that a Bi-LSTM+CRF model produces state-of-the-art accuracy on scansion of poetry in two languages. Finally, I will sketch an unsupervised scansion model that should be able to analyze any poem in any language given only raw text in that language.
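
To make the "simple rules of thumb" idea concrete, here is a minimal, hypothetical Python sketch of a crude word-level stress tagger; the function-word list and the stressed/unstressed rule are illustrative assumptions, not the system described in the talk.

```python
# A toy rule-of-thumb scansion baseline: monosyllabic function words are
# treated as unstressed ('-'), everything else as stressed ('+').
# The word list below is an illustrative assumption, not exhaustive.
FUNCTION_WORDS = {"a", "an", "the", "of", "to", "in", "and", "but", "or",
                  "is", "it", "i", "my", "thy", "thee", "shall", "with",
                  "that"}

def scan_line(line: str) -> list[tuple[str, str]]:
    """Tag each word in a line of verse with '+' (stressed) or '-'."""
    tags = []
    for word in line.lower().split():
        word = word.strip(".,;:!?")
        tags.append((word, "-" if word in FUNCTION_WORDS else "+"))
    return tags

print(scan_line("Shall I compare thee to a summer's day?"))
# [('shall', '-'), ('i', '-'), ('compare', '+'), ('thee', '-'),
#  ('to', '-'), ('a', '-'), ("summer's", '+'), ('day', '+')]
```

A baseline this crude ignores polysyllabic stress placement entirely, which is exactly the gap the data-driven and structured prediction models mentioned in the abstract are meant to close.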

Towards interoperability in the European poetry community
Helena Bermúdez, Universidad Nacional de Educación a Distancia, Madrid, Spain

This presentation stems from the Poetry Standardization and Linked Open Data project (POSTDATA, www.postdata.linhd.es). As its name reveals, one of the main aims of POSTDATA is to provide a means to publish European poetry data as Linked Open Data (LOD). One of the milestones of the project is therefore to equip the community of practice of European poetry with the instruments required to exchange knowledge on the Semantic Web, even when languages and theoretical backgrounds differ. While the quest for interoperability is the foundation of POSTDATA, the project is also developing a complex framework to aid researchers in the literary analysis of poetry. Along this line of work, we are designing a set of tools that combine natural language processing systems and quantitative text analysis methods.
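
For readers unfamiliar with LOD, the following minimal Python sketch (using rdflib) shows what publishing poetry metadata as RDF triples can look like; the poem namespace and its properties are invented placeholders, not POSTDATA's actual ontology.

```python
# A minimal sketch of poetry metadata as Linked Open Data with rdflib.
# The poem: namespace and its properties are hypothetical placeholders.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import DC, RDF

POEM = Namespace("http://example.org/poetry/")  # invented namespace

g = Graph()
g.bind("dc", DC)
g.bind("poem", POEM)

sonnet = POEM["sonnet-18"]
g.add((sonnet, RDF.type, POEM.Poem))
g.add((sonnet, DC.creator, Literal("William Shakespeare")))
g.add((sonnet, DC.language, Literal("en")))
g.add((sonnet, POEM.metre, Literal("iambic pentameter")))

# Serialising as Turtle yields triples any SPARQL endpoint can consume.
print(g.serialize(format="turtle"))
```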

Between automatic textual annotation and manual proofreading: Working with highly annotated medieval primary sources
Alex Speed Kjeldsen/Anne Mette Hansen, University of Copenhagen

Within the VELUX-funded project Script and Text in Time and Space we are, among other things, developing an open-source editing tool (MenoTaB) that enables the analysis, publication and dissemination of handwritten medieval texts in both printed and digital form. The primary motivation behind the development of this tool is the many challenges philologists face when working with and producing highly annotated texts, e.g. in relation to the interplay between automatic analysis/annotation and manual correction. In the presentation we will discuss some of these challenges and illustrate how they are tackled in the MenoTaB system.
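
As a purely illustrative sketch of the interplay the abstract mentions, one way to keep automatic annotation and manual proofreading from overwriting each other is to record both layers explicitly; the data model below is a hypothetical example, not MenoTaB's.

```python
# An illustrative data model (hypothetical, not MenoTaB's) in which each
# token keeps both the automatic tag and an optional manual correction,
# so re-running the tagger never silently overwrites a human decision.
from dataclasses import dataclass

@dataclass
class TokenAnnotation:
    form: str                      # transcribed form of the token
    auto_tag: str                  # output of the automatic analyser
    manual_tag: str | None = None  # set only by a human proofreader

    @property
    def tag(self) -> str:
        # A manual correction always takes precedence.
        return self.manual_tag or self.auto_tag

tok = TokenAnnotation(form="konungr", auto_tag="NOUN")
tok.manual_tag = "NOUN-nom-sg"  # the proofreader refines the coarse tag
print(tok.tag)                  # -> NOUN-nom-sg
```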

Shaping data by linking and contextualising
Tarrin Wills, University of Copenhagen

The projects I work with involve linking manuscript, textual, lexical and contextual material. In terms of the data themselves, the most important means of shaping data is using linking to encode the processes of abstraction and analysis that lead the researcher from the physical record of the text to its interpretation and analysis. I will briefly describe the digital methods of linking used (XML and relational data) and the data types and sources involved. The result of this detailed and time-consuming work is that a number of problems presented by the material can be solved, including making difficult material much more accessible. It also allows traditional research questions directed towards understanding texts and pieces of text within various dimensions (time, space, networks, genres, etc.) to be answered, such as dating texts, identifying their origins, and assessing their novelty or uniqueness. This approach also supports building a broad overview of the corpora and their contexts for quantitative diachronic and diatopic observations, for studying the development of language and style, and for mapping networks and other connections.

This presentation will also touch on another, apparently unrelated, process: how researchers shape the data themselves. These projects involve engaging a broad range of scholars to interact directly with the data, by developing interfaces that allow traditional philological methods to be practised digitally. In particular, very senior scholars use the databases directly to create and control the quality of the data, producing highly authoritative and accurate resources.
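
As a hedged illustration of the relational side of such linking, the following Python/SQLite sketch shows a link table connecting text passages to lexical entries, so that the path from the physical record to its interpretation stays queryable; the schema, table names and example data are assumptions for exposition, not the actual project databases.

```python
# A minimal sketch of relational linking between passages and lemmas.
# All names and data below are illustrative assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE passage (id INTEGER PRIMARY KEY, manuscript TEXT, text TEXT);
CREATE TABLE lemma   (id INTEGER PRIMARY KEY, headword TEXT);
CREATE TABLE link    (passage_id INTEGER REFERENCES passage(id),
                      lemma_id   INTEGER REFERENCES lemma(id));
""")
conn.execute("INSERT INTO passage VALUES (1, 'AM 748 I b 4to', 'ok minnask')")
conn.execute("INSERT INTO lemma VALUES (1, 'minnask')")
conn.execute("INSERT INTO link VALUES (1, 1)")

# A traditional research question as a query: where is this lemma attested?
for row in conn.execute("""
    SELECT p.manuscript, p.text FROM passage p
    JOIN link l ON l.passage_id = p.id
    JOIN lemma m ON m.id = l.lemma_id
    WHERE m.headword = 'minnask'
"""):
    print(row)
```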