The Danish Gigaword Corpus

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review


  • Leon Strømberg-Derczynski
  • Manuel Rafael Ciosici
  • Morten H. Christiansen
  • Rebekah Brita Baglini
  • Jacob Aarup Dalsgaard
  • Riccardo Fusaroli
  • Peter Juel Henrichsen
  • Rasmus Hvingelby
  • Andreas Kirkedal
  • Alex Speed Kjeldsen
  • Claus Ladefoged
  • Finn Årup Nielsen
  • Jens Madsen
  • Malte Lau Petersen
  • Jonathan Hvithamar Rystrøm
  • Daniel Varab
Danish language technology has been hindered by a lack of broad-coverage corpora at the scale modern NLP prefers. This paper describes the Danish Gigaword Corpus, the result of a focused effort to provide a diverse and freely available one-billion-word corpus of Danish text. The Danish Gigaword Corpus covers a wide array of time periods, domains, speakers’ socio-economic status, and Danish dialects.
Original language: English
Title of host publication: Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)
Number of pages: 9
Publisher: Linköping University Electronic Press
Publication date: 2021
Publication status: Published - 2021

ID: 270555110