Development and Evaluation of Pre-trained Language Models for Historical Danish and Norwegian Literary Texts

Publication: Contribution to book/anthology/report › Conference contribution in proceedings › Research › peer-reviewed

Documents

  • Fulltext

    Publisher's published version, 268 KB, PDF document

We develop and evaluate the first pre-trained language models specifically tailored for historical Danish and Norwegian texts. Three models are trained on a corpus of 19th-century Danish and Norwegian literature: two directly on the corpus with no prior pre-training, and one with continued pre-training. To evaluate the models, we utilize an existing sentiment classification dataset, and additionally introduce a new annotated word sense disambiguation dataset focusing on the concept of fate. Our assessment reveals that the model employing continued pre-training outperforms the others in two downstream NLP tasks on historical texts. Specifically, we observe substantial improvement in sentiment classification and word sense disambiguation compared to models trained on contemporary texts. These results highlight the effectiveness of continued pre-training for enhancing performance across various NLP tasks in historical text analysis.
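
The continued pre-training approach described in the abstract can be sketched with standard tooling. The snippet below is a minimal, hypothetical illustration using the Hugging Face Transformers library: the base checkpoint, corpus path, and all hyperparameters are placeholder assumptions for illustration, not details taken from the paper.

    # Minimal sketch of continued pre-training on a historical-text corpus.
    # Assumptions (not from the paper): base checkpoint, file path, and
    # hyperparameters below are illustrative placeholders.
    from datasets import load_dataset
    from transformers import (
        AutoTokenizer,
        AutoModelForMaskedLM,
        DataCollatorForLanguageModeling,
        Trainer,
        TrainingArguments,
    )

    base_checkpoint = "bert-base-multilingual-cased"  # hypothetical base model
    tokenizer = AutoTokenizer.from_pretrained(base_checkpoint)
    model = AutoModelForMaskedLM.from_pretrained(base_checkpoint)

    # Plain-text corpus of 19th-century literature, one passage per line (hypothetical path).
    corpus = load_dataset("text", data_files={"train": "historical_corpus.txt"})

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=512)

    tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

    # Masked-language-modelling objective: randomly mask 15% of tokens.
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

    args = TrainingArguments(
        output_dir="historical-mlm",
        per_device_train_batch_size=16,
        num_train_epochs=3,
        learning_rate=5e-5,
    )

    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=tokenized["train"],
        data_collator=collator,
    )
    trainer.train()

The resulting checkpoint could then be fine-tuned on the downstream sentiment classification or word sense disambiguation data in the usual way; the specific models, corpus, and training setup used by the authors are described in the paper itself.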

Original language: English
Title: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Editors: Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Publisher: European Language Resources Association (ELRA)
Publication date: 2024
Pages: 4811-4819
ISBN (electronic): 9782493814104
Status: Published - 2024
Event: Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024 - Hybrid, Torino, Italy
Duration: 20 May 2024 - 25 May 2024

Conference

Conference: Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024
Country: Italy
City: Hybrid, Torino
Period: 20/05/2024 - 25/05/2024
Sponsors: Aequa-Tech, Baidu, Bloomberg, Dataforce (Transperfect), Intesa San Paolo Bank, et al.

Bibliographic note

Publisher Copyright:
© 2024 ELRA Language Resource Association: CC BY-NC 4.0.

ID: 396718582