The Danish Gigaword Corpus

Research output: Chapter in Book/Report/Conference proceeding; Article in proceedings; Research; peer-review

  • Leon Strømberg-Derczynski
  • Manuel Rafael Ciosici
  • Morten H. Christiansen
  • Rebekah Brita Baglini
  • Jacob Aarup Dalsgaard
  • Riccardo Fusaroli
  • Peter Juel Henrichsen
  • Rasmus Hvingelby
  • Andreas Kirkedal
  • Alex Speed Kjeldsen
  • Claus Ladefoged
  • Finn Årup Nielsen
  • Jens Madsen
  • Malte Lau Petersen
  • Jonathan Hvithamar Rystrøm
  • Daniel Varab
Danish language technology has been hindered by a lack of broad-coverage corpora at the scale modern NLP prefers. This paper describes the Danish Gigaword Corpus, the result of a focused effort to provide a diverse and freely available one-billion-word corpus of Danish text. The Danish Gigaword Corpus covers a wide array of time periods, domains, speakers’ socio-economic status, and Danish dialects.
Original language: English
Title of host publication: Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)
Number of pages: 9
Publisher: Linköping University Electronic Press
Publication date: 2021
Pages: 413-421
Publication status: Published - 2021

ID: 270555110