Classifying and predicting inter-speaker differences in heritage language performance. A corpus-statistical approach

Research output: Contribution to conference, Paper, Research, peer-review

In this paper we present a test of a so far unexplored combination of corpus-linguistic and statistical methods
in order to predict language maintenance and attrition in heritage language speakers’ speech production. This is
motivated by the potential shortcoming in those studies of heritage language which have treated developments
in heritage speakers’ linguistic performance as a unidimensional process in which, e.g., vocabulary and speaking
rates are seen as two sides of the same coin: “More proficient speakers seem to have less of a problem with
lexical access and general construction of the clause. This in turn accounts for a faster speech rate. Speakers who
are less proficient are naturally hindered in their lexical access, which slows down their utterance” (Polinsky
2008: 20).
We challenge the assumption that speech performance is a unidimensional phenomenon, and ask whether,
e.g., some speakers may be highly fluent with a limited vocabulary or vice versa. We also test which
socio-biographical background variables, such as age, gender and involvement in a heritage community, best predict the
performance of individual speakers on different dimensions of speech production.
We approach variation in linguistic performance through a corpus study of 337 heritage and immigrant speakers
of Danish in North and South America (the ‘Corpus of American Danish’, containing approx. 1.3 million tokens)
(Kühl, Heegård Petersen & Hansen 2019). Data were collected through sociolinguistic interviews, and the
linguistic performance of each speaker was measured on 13 quantitative measures (e.g., speech rate, the number
of Danish and majority language words, type-token ratio and the ratio of sub-clauses to main clauses).
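As an illustration of what such per-speaker measures can look like, the following is a minimal sketch of three of them. The function names, the tokenisation (whitespace splitting, case-insensitive types) and the unit of speech rate (words per second) are assumptions made for this example; the abstract does not specify how the measures were operationalised.

```python
def type_token_ratio(tokens):
    """Number of distinct word forms divided by the total number of tokens.

    Assumption: types are compared case-insensitively.
    """
    if not tokens:
        raise ValueError("cannot compute a type-token ratio for an empty transcript")
    lowered = [t.lower() for t in tokens]
    return len(set(lowered)) / len(lowered)


def speech_rate(tokens, duration_seconds):
    """Tokens per second of speaking time (the unit is an assumption)."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    return len(tokens) / duration_seconds


def subclause_ratio(n_subclauses, n_main_clauses):
    """Ratio of sub-clauses to main clauses, given pre-annotated clause counts."""
    if n_main_clauses <= 0:
        raise ValueError("need at least one main clause")
    return n_subclauses / n_main_clauses


# Hypothetical toy utterance, not taken from the corpus.
tokens = "jeg er fra Danmark og jeg er glad".split()
ttr = type_token_ratio(tokens)
rate = speech_rate(tokens, duration_seconds=4.0)
ratio = subclause_ratio(n_subclauses=3, n_main_clauses=12)
```

Note that a raw type-token ratio falls as a transcript gets longer, so comparing speakers with interviews of different lengths would normally require some form of length control.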
The novel analysis consists of a two-step statistical approach: We first apply Factor Analysis/PCA to explore
which of the 13 performance variables behave most similarly to one another. Preliminary analyses point towards
four underlying factors, i.e. (a) lexicon, (b) structural complexity, (c) utterance planning and (d) fluency (see
Heegård Petersen et al. 2018, also Kühl, Thøgersen & Hansen’s proposed paper for RUEG 2023). We expand
on this by adding a second step in which we apply Multiple Regression analysis to explore which biographical
and social factors best predict an individual speaker’s performance in the various dimensions identified by the
Factor Analysis. Preliminary analyses indicate that performance in the different dimensions may correlate with
different socio-biographical factors, e.g., age group may be a predictor of fluency, but not of lexicon. Other
biographical factors include majority language, gender, immigrant vs. heritage speaker, involvement in minority
group networks, etc.
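The logic of the two-step approach can be sketched as follows. This is an illustration under stated assumptions, not the authors' analysis: the data are simulated placeholders (six measures and two dimensions instead of the 13 measures and four factors reported above), plain PCA by power iteration stands in for Factor Analysis/PCA, and all function and variable names are hypothetical. In practice a statistics package would be used for both steps.

```python
import math
import random


def standardize(rows):
    """Z-score each column so measures on different scales are comparable."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((x - m) ** 2 for x in c) / (len(c) - 1))
           for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


def correlation_matrix(z):
    """Correlation matrix of already standardized data."""
    n, p = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / (n - 1) for j in range(p)]
            for i in range(p)]


def top_components(matrix, k, iters=200, seed=0):
    """Leading k (eigenvalue, eigenvector) pairs via power iteration with deflation."""
    rng = random.Random(seed)
    p = len(matrix)
    m = [row[:] for row in matrix]
    found = []
    for _component in range(k):
        v = [rng.gauss(0, 1) for _ in range(p)]
        for _step in range(iters):
            w = [sum(m[i][j] * v[j] for j in range(p)) for i in range(p)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(m[i][j] * v[j] for j in range(p)) for i in range(p))
        found.append((lam, v))
        # Remove the component just found so the next pass finds the next one.
        m = [[m[i][j] - lam * v[i] * v[j] for j in range(p)] for i in range(p)]
    return found


def ols(X, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    A = [[1.0] + list(r) for r in X]
    p = len(A[0])
    M = [[sum(a[i] * a[j] for a in A) for j in range(p)]
         + [sum(a[i] * yi for a, yi in zip(A, y))] for i in range(p)]
    for c in range(p):  # Gauss-Jordan elimination with partial pivoting
        piv = max(range(c, p), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(p):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [M[i][p] / M[i][i] for i in range(p)]


# Simulated placeholder data, NOT the Corpus of American Danish.
# Assumption for the demo: "age" drives a fluency-like dimension and
# "involvement" drives a lexicon-like dimension.
rng = random.Random(1)
predictors, measures = [], []
for _speaker in range(300):
    age, involvement = rng.gauss(0, 1), rng.gauss(0, 1)
    fluency = 0.8 * age + rng.gauss(0, 0.6)
    lexicon = 0.8 * involvement + rng.gauss(0, 0.6)
    predictors.append([age, involvement])
    measures.append([fluency + rng.gauss(0, 0.5) for _ in range(4)]
                    + [lexicon + rng.gauss(0, 0.5) for _ in range(2)])

# Step 1: reduce the performance measures to component scores per speaker.
z = standardize(measures)
components = top_components(correlation_matrix(z), k=2)
scores = [[sum(w * x for w, x in zip(vec, row)) for _, vec in components]
          for row in z]

# Step 2: regress each component score on the background variables.
models = [ols(predictors, [s[d] for s in scores]) for d in range(2)]
for d, (intercept, b_age, b_inv) in enumerate(models, start=1):
    print(f"dimension {d}: age={b_age:+.2f}, involvement={b_inv:+.2f}")
```

The sign of a principal component is arbitrary, so only the relative size of the coefficients within a dimension is interpretable here. The sketch also omits standard errors and significance tests, which a real multiple regression analysis would report.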
The study shows that (i) heritage speakers are not simply ‘more’ or ‘less’ proficient, but perform differently on
different parameters: they may be fluent with a limited grammar and vocabulary or vice versa; (ii) it is (to some
extent) possible to predict linguistic performance from socio-biographical information and thus different lived
experiences, i.e. the variation is not random; (iii) speakers belonging to different groups perform differently. This
leads us to conclude that the socio-cultural setting (both the minority language community and the majority
society) plays a fundamental role in the emergence of performance norms.
The proposed paper is a product of the project Danish voices in the Americas,
funded by the A.P. Møller Fonden and the Carlsberg Foundation and hosted by the University of Copenhagen.
Heegård Petersen, Jan, Jacob Thøgersen, Gert Foget Hansen & Karoline Kühl. 2018. Linguistic proficiency in
immigrant and heritage speakers of Danish in Argentina and North America: A quantitative approach. Corpus
Linguistics and Linguistic Theory. DOI: 10.1515/cllt-2017-0088.
Kühl, Karoline, Jan Heegård Petersen & Gert Foget Hansen. 2019. The Corpus of American Danish: A language
resource of spoken immigrant Danish in North and South America. Language Resources and Evaluation 54(3).
831–849. DOI: 10.1007/s10579-019-09473-5.
Polinsky, Maria. 2008. Gender under incomplete acquisition: Heritage speakers’ knowledge of noun
categorization. Heritage Language Journal 6. 40–71.
Original language: English
Publication date: 27 Sep 2023
Publication status: Published - 27 Sep 2023
Event: RUEG 2023: Linguistic Variability in Heritage Language Research - Humboldt Universität, Berlin, Germany
Duration: 26 Sep 2023 to 28 Sep 2023
Conference number: 2023


Conference: RUEG 2023
Location: Humboldt Universität

ID: 368335599