A bilingual dictionary from a parallel corpus linked at the lexical level

Publication: Conference contribution › Conference abstract for conference › Research › Peer-reviewed

Bilingual and historical dictionaries can be produced from translated parallel corpora either automatically or manually. The automatic methods (e.g. using SketchEngine - Baisa et al. 2015) require alignment of corpora at the level of the paragraph or sentence, the smallest feasible level where a one-to-one correspondence can occur between two corpora in the same order. The manual methods (such as the new Lexicon of the Nordic Medieval Laws - Love et al. 2020) involve dictionary entries largely based on manually excerpting lexical equivalents between the two corpora, showing the usage of the word in the context and as interpreted by the translator, based on their understanding of the whole text.

Although many, if not most, words in a text have a one-to-one correspondence with the translation, the word order is almost inevitably not the same, meaning that they cannot be linked at the lexical level without reordering. Corpus linguistic tools assume that texts come in a fixed order at all levels of their linguistic structure. Linking two corpora at a very detailed level, however, requires a data model that permits, for example, one-to-one linking of words while allowing a different ordering in the translated version.

The problem of linking text and translation at the lexical level can be overcome to a certain extent with appropriate tools and methods. The editing and translation project, ‘Skaldic Poetry of the Scandinavian Middle Ages’ (skaldic.org), uses a data model that allows for linking translations at the lexical level and reordering them appropriately for the target language (English). Visual tools are provided for those working with the data to facilitate this process. The editor-translators are encouraged to include all lexical elements in the translation, within the limits of the idiom of the target language, a common practice anyway for scholarly translations of historical texts. Those entering the data produce the closest feasible match between the text and translation at the lexical level.
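
The kind of data model described above can be illustrated with a minimal Python sketch. It is not the skaldic.org schema: the `LinkedWord` class, its field names and the German-English example sentence are all assumptions invented for illustration. Each source word carries its translation and a separate position in the target-language order, so both word orders can be recovered from the same records.

```python
from dataclasses import dataclass


@dataclass
class LinkedWord:
    """One source word linked to its translation (hypothetical structure)."""
    source_pos: int    # position in the source-text word order
    source_form: str   # word form as it appears in the source text
    translation: str   # equivalent chosen by the translator ("" if none)
    target_pos: int    # position in the target-language word order


def source_order(words):
    """Return the source forms in the order of the original text."""
    return [w.source_form for w in sorted(words, key=lambda w: w.source_pos)]


def target_order(words):
    """Return the translations reordered for the target language."""
    ordered = sorted(words, key=lambda w: w.target_pos)
    return [w.translation for w in ordered if w.translation]


# Invented toy sentence (not from the corpus): the verb comes last in the
# source but third in the English translation.
words = [
    LinkedWord(1, "Ich", "I", 1),
    LinkedWord(2, "habe", "have", 2),
    LinkedWord(3, "den", "the", 4),
    LinkedWord(4, "Mann", "man", 5),
    LinkedWord(5, "gesehen", "seen", 3),
]

print(" ".join(source_order(words)))
print(" ".join(target_order(words)))
```

Storing the target position on the link itself, rather than in a second token list, keeps the correspondence one-to-one while leaving the source order untouched.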

The MSCA-funded Lexicon Poeticum (lexiconpoeticum.org) project has lemmatised the resulting corpus. These processes, when taken together, are sufficient to produce automatic dictionary entries that list the contextual translations for each word. These give an overview of the usage of the word in all contexts as interpreted by the editor-translator. Further information linked to the texts and words from the original project can also be incorporated into the resulting entries, including the source materials, annotations and other semantic analyses. This is sufficient information for most users of the lexicon. The final dictionary will therefore require further editing only in cases where different usages are not encompassed by the translations and/or the usage requires more explanation than the contextual translations provide.
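
How lemmatisation and word-level links combine into automatic entries can be sketched as follows. This is an illustrative assumption, not the Lexicon Poeticum implementation: the function names, the `(lemma, translation, citation)` input format, the toy data and the citation labels are all hypothetical. The sketch groups every linked token under its lemma and lists each contextual translation with the places where it occurs.

```python
from collections import defaultdict


def build_entries(tokens):
    """Group (lemma, translation, citation) triples into entries.

    Returns {lemma: {translation: [citations]}}. The input format is an
    assumption made for this sketch.
    """
    entries = defaultdict(lambda: defaultdict(list))
    for lemma, translation, citation in tokens:
        entries[lemma][translation].append(citation)
    return {lemma: dict(glosses) for lemma, glosses in entries.items()}


def format_entry(lemma, glosses):
    """Render one entry, most frequent translation first (ties alphabetical)."""
    ranked = sorted(glosses.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    body = "; ".join(f"{gloss} ({', '.join(cites)})" for gloss, cites in ranked)
    return f"{lemma}: {body}"


# Invented toy data: one lemma translated differently in different contexts.
tokens = [
    ("sehen", "see", "A.1"),
    ("sehen", "look", "B.3"),
    ("sehen", "see", "C.2"),
    ("Mann", "man", "A.1"),
]

entries = build_entries(tokens)
print(format_entry("sehen", entries["sehen"]))
```

An entry generated this way would need manual editing only where the listed translations do not capture a distinct usage.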

This method requires suitable digital tools; a manageable corpus; source and target languages that are at least remotely related; and compatible practices of editing and translation in the preparation of the corpora. Within these parameters, the Lexicon Poeticum project demonstrates that it is possible to create useful lexicographic resources automatically, based solely on translation and lemmatising.
Status: Published - 2021
Event: eLex 2021: Post-editing Lexicography
Duration: 5 Jul 2021 to 7 Jul 2021
Conference number: 7


Conference: eLex 2021


ID: 286415583