Development and Evaluation of Pre-trained Language Models for Historical Danish and Norwegian Literary Texts

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review

Standard

Development and Evaluation of Pre-trained Language Models for Historical Danish and Norwegian Literary Texts. / Al-Laith, Ali; Conroy, Alexander; Bjerring-Hansen, Jens; Hershcovich, Daniel.

Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). ed. / Nicoletta Calzolari; Min-Yen Kan; Veronique Hoste; Alessandro Lenci; Sakriani Sakti; Nianwen Xue. European Language Resources Association (ELRA), 2024. p. 4811-4819.

Harvard

Al-Laith, A, Conroy, A, Bjerring-Hansen, J & Hershcovich, D 2024, Development and Evaluation of Pre-trained Language Models for Historical Danish and Norwegian Literary Texts. in N Calzolari, M-Y Kan, V Hoste, A Lenci, S Sakti & N Xue (eds), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). European Language Resources Association (ELRA), pp. 4811-4819, Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024, Hybrid, Torino, Italy, 20/05/2024. <https://aclanthology.org/2024.lrec-main.431>

APA

Al-Laith, A., Conroy, A., Bjerring-Hansen, J., & Hershcovich, D. (2024). Development and Evaluation of Pre-trained Language Models for Historical Danish and Norwegian Literary Texts. In N. Calzolari, M-Y. Kan, V. Hoste, A. Lenci, S. Sakti, & N. Xue (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) (pp. 4811-4819). European Language Resources Association (ELRA). https://aclanthology.org/2024.lrec-main.431

Vancouver

Al-Laith A, Conroy A, Bjerring-Hansen J, Hershcovich D. Development and Evaluation of Pre-trained Language Models for Historical Danish and Norwegian Literary Texts. In Calzolari N, Kan M-Y, Hoste V, Lenci A, Sakti S, Xue N, editors, Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). European Language Resources Association (ELRA). 2024. p. 4811-4819

Author

Al-Laith, Ali ; Conroy, Alexander ; Bjerring-Hansen, Jens ; Hershcovich, Daniel. / Development and Evaluation of Pre-trained Language Models for Historical Danish and Norwegian Literary Texts. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). editor / Nicoletta Calzolari ; Min-Yen Kan ; Veronique Hoste ; Alessandro Lenci ; Sakriani Sakti ; Nianwen Xue. European Language Resources Association (ELRA), 2024. pp. 4811-4819

Bibtex

@inproceedings{1855d62cfdaf44838628d7d0f35020f5,
title = "Development and Evaluation of Pre-trained Language Models for Historical Danish and Norwegian Literary Texts",
abstract = "We develop and evaluate the first pre-trained language models specifically tailored for historical Danish and Norwegian texts. Three models are trained on a corpus of 19th-century Danish and Norwegian literature: two directly on the corpus with no prior pre-training, and one with continued pre-training. To evaluate the models, we utilize an existing sentiment classification dataset, and additionally introduce a new annotated word sense disambiguation dataset focusing on the concept of fate. Our assessment reveals that the model employing continued pre-training outperforms the others in two downstream NLP tasks on historical texts. Specifically, we observe substantial improvement in sentiment classification and word sense disambiguation compared to models trained on contemporary texts. These results highlight the effectiveness of continued pre-training for enhancing performance across various NLP tasks in historical text analysis.",
keywords = "Digital Humanities, Pre-trained Language Models, Sentiment Analysis, Word Sense Disambiguation",
author = "Ali Al-Laith and Alexander Conroy and Jens Bjerring-Hansen and Daniel Hershcovich",
note = "Publisher Copyright: {\textcopyright} 2024 ELRA Language Resource Association: CC BY-NC 4.0.; Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024 ; Conference date: 20-05-2024 Through 25-05-2024",
year = "2024",
language = "English",
pages = "4811--4819",
editor = "Nicoletta Calzolari and Min-Yen Kan and Veronique Hoste and Alessandro Lenci and Sakriani Sakti and Nianwen Xue",
booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
publisher = "European Language Resources Association (ELRA)",
url = "https://aclanthology.org/2024.lrec-main.431",
}

RIS

TY - GEN

T1 - Development and Evaluation of Pre-trained Language Models for Historical Danish and Norwegian Literary Texts

AU - Al-Laith, Ali

AU - Conroy, Alexander

AU - Bjerring-Hansen, Jens

AU - Hershcovich, Daniel

N1 - Publisher Copyright: © 2024 ELRA Language Resource Association: CC BY-NC 4.0.

PY - 2024

Y1 - 2024

N2 - We develop and evaluate the first pre-trained language models specifically tailored for historical Danish and Norwegian texts. Three models are trained on a corpus of 19th-century Danish and Norwegian literature: two directly on the corpus with no prior pre-training, and one with continued pre-training. To evaluate the models, we utilize an existing sentiment classification dataset, and additionally introduce a new annotated word sense disambiguation dataset focusing on the concept of fate. Our assessment reveals that the model employing continued pre-training outperforms the others in two downstream NLP tasks on historical texts. Specifically, we observe substantial improvement in sentiment classification and word sense disambiguation compared to models trained on contemporary texts. These results highlight the effectiveness of continued pre-training for enhancing performance across various NLP tasks in historical text analysis.

AB - We develop and evaluate the first pre-trained language models specifically tailored for historical Danish and Norwegian texts. Three models are trained on a corpus of 19th-century Danish and Norwegian literature: two directly on the corpus with no prior pre-training, and one with continued pre-training. To evaluate the models, we utilize an existing sentiment classification dataset, and additionally introduce a new annotated word sense disambiguation dataset focusing on the concept of fate. Our assessment reveals that the model employing continued pre-training outperforms the others in two downstream NLP tasks on historical texts. Specifically, we observe substantial improvement in sentiment classification and word sense disambiguation compared to models trained on contemporary texts. These results highlight the effectiveness of continued pre-training for enhancing performance across various NLP tasks in historical text analysis.

KW - Digital Humanities

KW - Pre-trained Language Models

KW - Sentiment Analysis

KW - Word Sense Disambiguation

M3 - Article in proceedings

AN - SCOPUS:85195912870

SP - 4811

EP - 4819

BT - Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

A2 - Calzolari, Nicoletta

A2 - Kan, Min-Yen

A2 - Hoste, Veronique

A2 - Lenci, Alessandro

A2 - Sakti, Sakriani

A2 - Xue, Nianwen

PB - European Language Resources Association (ELRA)

T2 - Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024

Y2 - 20 May 2024 through 25 May 2024

ER -
