The Danish Gigaword Corpus

Publication: Contribution to book/anthology/report; Conference contribution in proceedings; Research; peer-reviewed

  • Leon Strømberg-Derczynski
  • Manuel Rafael Ciosici
  • Morten H. Christiansen
  • Rebekah Brita Baglini
  • Jacob Aarup Dalsgaard
  • Riccardo Fusaroli
  • Peter Juel Henrichsen
  • Rasmus Hvingelby
  • Andreas Kirkedal
  • Alex Speed Kjeldsen
  • Claus Ladefoged
  • Finn Arup Nielsen
  • Jens Madsen
  • Malte Lau Petersen
  • Jonathan Hvithamar Rystrøm
  • Daniel Varab
Danish language technology has been hindered by a lack of broad-coverage corpora at the scale modern NLP prefers. This paper describes the Danish Gigaword Corpus, the result of a focused effort to provide a diverse and freely available one-billion-word corpus of Danish text. The Danish Gigaword Corpus covers a wide array of time periods, domains, speakers' socio-economic status, and Danish dialects.
Original language: English
Title: Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)
Number of pages: 9
Publisher: Linköping University Electronic Press
Publication date: 2021
Pages: 413-421
Status: Published - 2021

ID: 270555110