The Corpus of American Danish: a language resource of spoken immigrant Danish in North and South America

Department of Nordic Studies and Linguistics (NorS)

Publication: Contribution to journal › Journal article › Research › peer-reviewed

The paper describes a newly established corpus of spoken immigrant Danish in North and South America, the Corpus of American Danish (CoAmDa). In its current state, the CoAmDa amounts to approximately 1.7 million tokens, which makes it one of the largest heritage-language corpora at present. With regard to text type, the CoAmDa can be characterized as non-standard multilingual spoken language, since American English, Canadian English, or Argentine Spanish, respectively, are present in the audio data and transcriptions.
The aim of this paper is to document relevant aspects and specifications of the CoAmDa, viz. the audio data combined with sociodemographic metadata on the speakers, the digitization process of analog data, the transcription procedures, the format and tagging of the speech files, and the internal validation procedures that have been applied. In doing so, we share our experience and best practices for building a high-quality spoken language resource with the interested public, in particular other researchers working on and with multilingual speech corpora.

Original language	English
Journal	Language Resources and Evaluation
Volume	54
Pages (from-to)	831–849
Number of pages	19
ISSN	1574-020X
DOI	https://doi.org/10.1007/s10579-019-09473-5
Status	Published - 2020

Research areas

Faculty of Humanities - spoken language resource, language contact, multilingual spoken language, Danish language, heritage language, Corpus (creation, annotation, etc.)

ID: 198405254