Enhancing CLARIN-DK Resources While Building the Danish ParlaMint Corpus
Publication: Contribution to book/anthology/report › Conference contribution in proceedings › Research › peer-reviewed
Standard
Enhancing CLARIN-DK Resources While Building the Danish ParlaMint Corpus. / Jongejan, Bart; Hansen, Dorte Haltrup; Navarretta, Costanza.
CLARIN Annual Conference 2021 Proceedings. CLARIN ERIC, 2021. pp. 70-73.
RIS
TY - GEN
T1 - Enhancing CLARIN-DK Resources While Building the Danish ParlaMint Corpus
AU - Jongejan, Bart
AU - Hansen, Dorte Haltrup
AU - Navarretta, Costanza
PY - 2021
Y1 - 2021
N2 - In this paper we describe the Danish CLARIN resources, corpora, tools and workflow, which we used and enhanced in order to build the Danish ParlaMint corpus, as part of the CLARIN-funded ParlaMint project. More specifically, the article accounts for the manual and automatic processes involved in the preparation of the Danish Parliamentary speeches, with a focus on the CLARIN-DK tools and Text Tonsorium workflow management. The tools annotated the speeches with metadata and linguistic information in compliance with the common ParlaMint TEI P5 format. As a spin-off of the project, the CLARIN-DK sentence tokenizer and the CST Named Entity Recognizer were improved. These tools, together with the CST-lemmatiser, Danish UD-Pipe software and several data transformation utilities, produced all the linguistic annotations in the correct format. We conclude the paper with a report of a pilot evaluation of the quality of some of the linguistic annotations in the Danish ParlaMint corpus.
M3 - Article in proceedings
SP - 70
EP - 73
BT - CLARIN Annual Conference 2021 Proceedings
PB - CLARIN ERIC
ER -