Classifying and predicting inter-speaker differences in heritage language performance. A corpus-statistical approach

In this paper we present a test of a so far unexplored combination of corpus-linguistic and statistical methods
in order to predict language maintenance and attrition in heritage language speakers’ speech production. This is
motivated by the potential shortcoming in those studies of heritage language which have treated developments
in heritage speakers’ linguistic performance as a unidimensional process in which, e.g., vocabulary and speaking
rates are seen as two sides of the same coin: “More proficient speakers seem to have less of a problem with
lexical access and general construction of the clause. This in turn accounts for a faster speech rate. Speakers who
are less proficient are naturally hindered in their lexical access, which slows down their utterance” (Polinsky
2008: 20).
First, we challenge the assumption that speech performance is a unidimensional phenomenon, and ask whether,
e.g., some speakers may be highly fluent with a limited vocabulary or vice versa. Secondly, we test which socio-biographical
background factors, such as age, gender and involvement in a heritage community, best predict the
performance of individual speakers on different dimensions of speech production.
We approach variation in linguistic performance through a corpus study of 337 heritage and immigrant speakers
of Danish in North and South America (the ‘Corpus of American Danish’, containing approx. 1.3 million tokens)
(Kühl, Heegård Petersen & Hansen 2019). Data were collected through sociolinguistic interviews, and the
linguistic performance of each speaker was measured on 13 quantitative measures (e.g., speech rate, the number
of Danish and majority language words, type-token ratio and the ratio of sub-clauses to main clauses).
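Two of the measures named above can be illustrated with a minimal sketch. The operationalisations here are assumptions for illustration only (type-token ratio as distinct lower-cased word forms over total tokens, speech rate as tokens per minute); the abstract does not specify how the project computes them, and the function names and the toy utterance are hypothetical.

```python
def type_token_ratio(tokens):
    """Distinct word forms divided by total tokens (case-insensitive)."""
    if not tokens:
        return 0.0
    lowered = [t.lower() for t in tokens]
    return len(set(lowered)) / len(lowered)


def speech_rate(tokens, duration_seconds):
    """Tokens per minute of speaking time."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return len(tokens) * 60.0 / duration_seconds


# Toy utterance invented for illustration; NOT taken from the corpus.
tokens = "jeg har en hund og en kat og en hest".split()

ttr = type_token_ratio(tokens)                    # 7 distinct forms / 10 tokens
rate = speech_rate(tokens, duration_seconds=5.0)  # 10 tokens in 5 seconds
```

A raw type-token ratio is sensitive to sample length, so comparisons across speakers with very different amounts of speech would in practice call for some form of length control.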
The novel analysis consists of a two-step statistical approach: We first apply Factor Analysis/PCA to explore
which of the 13 performance variables behave similarly. Preliminary analyses point towards
four underlying factors, i.e. (a) lexicon, (b) structural complexity, (c) utterance planning and (d) fluency (see
Heegård Petersen et al. 2018, also Kühl, Thøgersen & Hansen’s proposed paper for RUEG2023). We expand
on this by adding a second step in which we apply Multiple Regression analysis to explore which biographical
and social factors best predict an individual speaker’s performance in the various dimensions identified by the
Factor Analysis. Preliminary analyses indicate that performance in the different dimensions may correlate with
different socio-biographical factors, e.g., age group may be a predictor of fluency, but not of lexicon. Other
biographical factors include majority language, gender, immigrant vs. heritage speaker, involvement in minority
group networks, etc.
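The two-step logic described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' analysis: the data are synthetic, only two latent dimensions and two predictors (here called age and network) are simulated rather than the four factors and fuller predictor set mentioned above, PCA is computed via SVD without rotation, and each component score is regressed on the predictors by ordinary least squares. A rotated factor analysis could yield different loadings.

```python
import numpy as np

rng = np.random.default_rng(0)
n_speakers, n_measures = 200, 13

# Synthetic stand-in data; NOT the Corpus of American Danish.
# Hypothetical predictors: age and involvement in minority networks.
age = rng.normal(size=n_speakers)
network = rng.normal(size=n_speakers)

# Two simulated latent dimensions, each driven by a different predictor.
latent_a = 0.8 * age + rng.normal(scale=0.5, size=n_speakers)
latent_b = 0.8 * network + rng.normal(scale=0.5, size=n_speakers)
latents = np.column_stack([latent_a, latent_b])

# Measures 0-5 load on the first dimension, measures 6-12 on the second.
loadings = np.zeros((2, n_measures))
loadings[0, :6] = 1.0
loadings[1, 6:] = 1.0
X = latents @ loadings + rng.normal(scale=0.5, size=(n_speakers, n_measures))

# Step 1: PCA on the standardised performance measures (via SVD).
Z = (X - X.mean(axis=0)) / X.std(axis=0)
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
explained = S**2 / np.sum(S**2)          # share of variance per component
n_components = 2                         # assumption matching the simulation
scores = U[:, :n_components] * S[:n_components]  # per-speaker component scores

# Step 2: multiple regression of each component score on the predictors.
design = np.column_stack([np.ones(n_speakers), age, network])
coefs, *_ = np.linalg.lstsq(design, scores, rcond=None)  # shape (3, n_components)

# R^2 per component: how much of each dimension the predictors account for.
fitted = design @ coefs
ss_res = ((scores - fitted) ** 2).sum(axis=0)
ss_tot = ((scores - scores.mean(axis=0)) ** 2).sum(axis=0)
r_squared = 1.0 - ss_res / ss_tot
```

In this setup, each column of `coefs` holds the intercept and the two slopes for one component, so comparing columns shows whether different dimensions are associated with different predictors.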
The study shows that (i) heritage speakers are not simply ‘more’ or ‘less’ proficient, but perform differently on
different parameters: they may be fluent with a limited grammar and vocabulary or vice versa; (ii) it is (to some
extent) possible to predict linguistic performance from socio-biographical information and thus from different lived
experiences, i.e., the variation is not random; (iii) speakers belonging to different groups perform differently. This
leads us to conclude that the socio-cultural setting (both the minority language community and the majority
society) plays a fundamental role in the emergence of performance norms.
The proposed paper is a product of the project Danish voices in the Americas,
funded by the A.P. Møller Fonden and the Carlsberg Foundation and hosted by the University of Copenhagen.
