This lesson provides both a short introduction to multivariate exploratory approaches and step-by-step instructions to implement each of them with R. Five methods are presented. These methods are designed to explore and summarize large and complex data tables by means of summary statistics. They help generate hypotheses by providing informative clusters using the variable values that characterize each observation.
For you to fully benefit from this lesson, you need:
Load the R environment as follows:
load("~/pathto/your/workingdirectory/multivariate.RData")
This allows you to access the datasets that come with the environment, a list of which is obtained with ls(), like so:
ls()
## [1] "colprofs" "distm" "inclusion" "inputEFA"
## [5] "inputMCA" "inputPCA" "intensifiers" "leech"
## [9] "nouns" "nouns.2" "prepositions" "prepositions2"
## [13] "rowprofs" "swearwords"
I show how to run the code to perform CA, MCA, and PCA with FactoMineR. The package should therefore be downloaded and installed beforehand.
install.packages("FactoMineR")
library(FactoMineR)
EFA is run with factanal() and MDS with cmdscale(), which are part of base R. No extra package is needed for these.
As a corpus linguist, you collect large numbers of observations. These observations are commonly examined in the light of explanatory variables. You are hoping to find patterns in the data.
When the data set is too large, it becomes impossible to make sense of the table with the naked eye and summary statistics are needed. This is where multivariate exploratory techniques step in.
Consider Tab. 2.1 below, which displays the frequency counts of four types of nouns (rows) across three corpus files from the BNC-XML (columns).
| | A1J | A1K | A1L |
---|---|---|---|
NN0 | 136 | 14 | 8 |
NN1 | 2236 | 254 | 263 |
NN2 | 952 | 87 | 139 |
NP0 | 723 | 117 | 71 |
What preference patterns can you see?
Now, take a look at Tab. 2.2 (same table, but with marginal totals).
| | A1J | A1K | A1L | row.totals |
---|---|---|---|---|
NN0 | 136 | 14 | 8 | 158 |
NN1 | 2236 | 254 | 263 | 2753 |
NN2 | 952 | 87 | 139 | 1178 |
NP0 | 723 | 117 | 71 | 911 |
column totals | 4047 | 472 | 481 | 5000 |
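Tab. 2.2 can be reproduced from the raw counts with base R's addmargins(); a quick sketch, assuming the nouns data frame from the preloaded environment:

addmargins(as.matrix(nouns))   # appends the row and column totals to the contingency table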
How confident are you when you assess your first intuition in the light of the marginal totals?
Now see how the data points are clustered in a correspondence analysis map.
The more familiar you are with corpora and data extraction techniques, the more likely you are to work with large datasets such as the one in Tab. 2.3, adapted from Leech (2002).
| | LOB | FLOB | BROWN | FROWN | BrEng | AmEng | X1961 | X1991 | type |
---|---|---|---|---|---|---|---|---|---|
would | 3028 | 2694 | 3053 | 2868 | 5722 | 5921 | 6081 | 5562 | core_modal |
will | 2798 | 2723 | 2702 | 2042 | 5521 | 4744 | 5500 | 4765 | core_modal |
can | 1997 | 2041 | 2193 | 2160 | 4038 | 4353 | 4190 | 4201 | core_modal |
could | 1740 | 1782 | 1776 | 1655 | 3522 | 3431 | 3516 | 3437 | core_modal |
may | 1333 | 1101 | 1298 | 878 | 2434 | 2176 | 2631 | 1979 | core_modal |
should | 1301 | 1147 | 910 | 787 | 2448 | 1697 | 2211 | 1934 | core_modal |
must | 1147 | 814 | 1018 | 668 | 1961 | 1686 | 2165 | 1482 | core_modal |
might | 777 | 660 | 635 | 635 | 1437 | 1270 | 1412 | 1295 | core_modal |
shall | 355 | 200 | 267 | 150 | 555 | 417 | 622 | 350 | core_modal |
ought | 104 | 58 | 70 | 49 | 162 | 119 | 174 | 107 | core_modal |
need | 78 | 44 | 40 | 35 | 122 | 75 | 118 | 79 | core_modal |
be_going_to | 248 | 245 | 219 | 332 | 493 | 551 | 467 | 577 | semi_modal |
be_to | 454 | 376 | 349 | 209 | 830 | 558 | 803 | 585 | semi_modal |
had_better | 50 | 37 | 41 | 34 | 87 | 75 | 91 | 71 | semi_modal |
have_got_to | 41 | 27 | 45 | 52 | 68 | 97 | 86 | 79 | semi_modal |
have_to | 757 | 825 | 627 | 643 | 1582 | 1270 | 1384 | 1468 | semi_modal |
need_to | 54 | 198 | 69 | 154 | 252 | 223 | 123 | 352 | semi_modal |
be_supposed_to | 22 | 47 | 48 | 51 | 69 | 99 | 70 | 98 | semi_modal |
used_to | 86 | 97 | 51 | 74 | 183 | 125 | 137 | 171 | semi_modal |
want_to | 357 | 2275 | 1772 | 2101 | 2632 | 3873 | 2129 | 4376 | semi_modal |
SEMI_MODALS | 2069 | 4127 | 3221 | 3650 | 6196 | 6871 | 5290 | 7777 | semi_modal |
MODALS | 14658 | 13264 | 13962 | 11927 | 27922 | 25889 | 28620 | 25191 | core_modal |
This data set is all the more tricky to summarize as it is fairly large and contains redundant information. These are aspects that CA accommodates easily (Fig. 2).
Like any other phenomenon, the realization of a language phenomenon is influenced by several factors at the same time. Multivariate exploratory methods are meant for the exploration of such phenomena.
Once operationalized by the researcher, these multiple factors are captured by means of several independent variables.
When observations of a phenomenon are captured by several variables, the analysis is multivariate.
Exploring a data set means separating meaningful trends from random noise (meaningless distributions).
In theory, exploratory data analysis is used to generate hypotheses because the linguist does not yet have any assumption as to what kinds of trends should appear in the data. In practice, however, linguists collect observations in the light of specific variables precisely because they expect that the latter influence the distribution of the former.
I use this term instead of statistics to avoid scaring people off. The truth is that these methods rely on statistics. The good news is that these are not too hard to understand.
The methods I present today are exploratory, as opposed to explanatory or predictive. They help find structure in multivariate data thanks to observation groupings, and no more!
1st consequence: the conclusions made with these methods are therefore valid for the corpus only. For example, middle-class female speakers aged 25 to 59 display a preference for the use of bloody in the British National Corpus (XML). This finding should not be extended to British English in general, or to other middle-class female speakers aged 25 to 59.
2nd consequence: The conclusions made with exploratory methods should never be considered as predictions. Of course, exploratory methods serve as the basis for the design of predictive modeling, which uses the values found in a sample to predict values for another sample.
Expanding on Gries (2006), Glynn (2014) finds that usage features and dictionary senses are correlated with dialect and register thanks to two multivariate exploratory techniques (correspondence analysis and multiple correspondence analysis). To confirm these findings, Glynn (2014) turns to logistic regression. This confirmatory multivariate technique makes it possible to specify which of the usage features and dictionary senses are significantly associated with either dialect or register, and to determine the importance of the associations.
Nowadays, many linguists jump to powerful predictive methods (such as logistic regression or discriminant analysis) without going through the trouble of exploring their data sets first. This is a shame because the point of running a multivariate exploratory analysis is to generate fine research hypotheses, which far more powerful predictive methods can only benefit from.
The challenge that underlies the visualizations obtained with dimensionality-reduction methods is the following: we seek to explore a cloud of points from a data set in the form of a \(rows \times columns\) table with as many dimensions as there are columns.
Like a complex object in real life, a data table has to be rotated so as to be observed from an optimal angle. Although the dimensions of a data table are eventually projected in a two-dimensional plane, they are not spatial dimensions. If the table has \(K\) columns, the data points are initially positioned in a space \(\mathbb{R}\) of \(K\) dimensions. To allow for easier interpretation, dimensionality-reduction methods decompose the cloud into a smaller number of meaningful planes.
All the methods covered in this chapter summarize the table by measuring how much variance there is and decomposing the variance into proportions.
All of these methods offer graphs that facilitate the interpretation of the results. Although convenient, these graphs do not replace a careful interpretation of the numeric results.
The main difference between these methods pertains to the kind of data that one works with.
Correspondence analysis (henceforth CA) is used to summarize a two-dimensional contingency table such as Tab. 2.1. The table is a matrix \(M\) of counts (i.e. integers) that consists of \(i\) individuals or observations (rows) and \(j\) variables (columns).
| | A1J | A1K | A1L |
---|---|---|---|
NN0 | 136 | 14 | 8 |
NN1 | 2236 | 254 | 263 |
NN2 | 952 | 87 | 139 |
NP0 | 723 | 117 | 71 |
Multiple correspondence analysis (henceforth MCA) takes as input a case-by-variable table such as Tab. 3.1, from Desagulier (2015). The table consists of \(i\) individuals or observations (rows) and \(j\) variables (columns).
construction | intensifier | text_mode | text_type | text_info | sem_class |
---|---|---|---|---|---|
PREDETERMINER | QUITE | SPOKEN | OTHERSP | S pub debate | psych_stim_good |
PREADJECTIVAL | QUITE | WRITTEN | FICTION | W fict prose | factual |
PREADJECTIVAL | RATHER | WRITTEN | FICTION | W fict prose | dullness |
PREADJECTIVAL | RATHER | WRITTEN | FICTION | W fict prose | atypicality_odd |
PREADJECTIVAL | QUITE | SPOKEN | OTHERSP | S meeting | importance |
PREADJECTIVAL | RATHER | WRITTEN | NONAC | W religion | difficulty_complexity |
PREADJECTIVAL | RATHER | WRITTEN | NEWS | W newsp other: social | singularity |
PREADJECTIVAL | QUITE | WRITTEN | NONAC | W biography | factual |
PREDETERMINER | QUITE | SPOKEN | OTHERSP | S lect soc science | age_young |
PREADJECTIVAL | RATHER | WRITTEN | NONAC | W nonAc: nat science | simplicity |
Historically, MCA was developed to explore the structure of surveys in which informants are asked to select an answer from a list of suggestions. For example, the question "According to you, which of these disciplines best describes the hard sciences: physics, biology, mathematics, computer science, or statistics?" requires informants to select one category.
Principal component analysis (henceforth PCA) takes as input a table of data of \(i\) individuals or observations (rows) and \(j\) variables (columns) such as Tab. 3.2, from Lacheret-Dujour et al. (2019).
corpus.sample | fPauses | fOverlaps | fFiller | fProm | fPI | fPA | subgenre | interactivity | planning.type |
---|---|---|---|---|---|---|---|---|---|
D0001 | 0.26 | 0.12 | 0.14 | 1.79 | 0.28 | 1.54 | argumentation | interactive | semi-spontaneous |
D0002 | 0.42 | 0.11 | 0.10 | 1.80 | 0.33 | 1.75 | argumentation | interactive | semi-spontaneous |
D0003 | 0.35 | 0.10 | 0.03 | 1.93 | 0.34 | 1.76 | description | semi-interactive | spontaneous |
D0004 | 0.28 | 0.11 | 0.12 | 2.29 | 0.30 | 1.79 | description | interactive | semi-spontaneous |
D0005 | 0.29 | 0.07 | 0.23 | 1.91 | 0.22 | 1.69 | description | semi-interactive | spontaneous |
D0006 | 0.47 | 0.05 | 0.26 | 1.86 | 0.44 | 1.94 | argumentation | interactive | semi-spontaneous |
The method handles continuous and nominal data. The continuous data may consist of means, reaction times, formant frequencies, etc. The categorical/nominal data are used to tag the observations.
Like PCA, exploratory factor analysis (henceforth EFA) takes as input a table of continuous data. However, it does not commonly accommodate nominal data. Typically, Tab. 3.2 minus the three columns of nominal data can serve as input for EFA (Tab. 3.3).
corpus.sample | fPauses | fOverlaps | fFiller | fProm | fPI | fPA |
---|---|---|---|---|---|---|
D0001 | 0.26 | 0.12 | 0.14 | 1.79 | 0.28 | 1.54 |
D0002 | 0.42 | 0.11 | 0.10 | 1.80 | 0.33 | 1.75 |
D0003 | 0.35 | 0.10 | 0.03 | 1.93 | 0.34 | 1.76 |
D0004 | 0.28 | 0.11 | 0.12 | 2.29 | 0.30 | 1.79 |
D0005 | 0.29 | 0.07 | 0.23 | 1.91 | 0.22 | 1.69 |
D0006 | 0.47 | 0.05 | 0.26 | 1.86 | 0.44 | 1.94 |
Multidimensional scaling (henceforth MDS) takes as input a distance matrix (a.k.a. a dissimilarity matrix). A distance matrix is similar to a driving distance table that you find in road atlases (Tab. 3.4).
| | Dallas | Miami | New York | San Francisco | Seattle |
---|---|---|---|---|---|
Dallas | 0.0 | 1312.1 | 1548.5 | 1731.4 | 2109.1 |
Miami | 1312.1 | 0.0 | 1283.3 | 3114.3 | 3296.9 |
New York | 1548.5 | 1283.3 | 0.0 | 2568.6 | 2404.6 |
San Francisco | 1731.4 | 3114.3 | 2568.6 | 0.0 | 807.0 |
Seattle | 2109.1 | 3296.9 | 2404.6 | 807.0 | 0.0 |
MDS can accommodate virtually any data table as long as you can convert it into a distance matrix.
The foundations of CA were laid out by Hirschfeld (1935) and Benzécri (1984). The method gets its name from what it aims to show:
The underlying idea is to group the rows and columns that share identical profiles (see below).
Because you investigate the relationship between your observations (i.e. the rows in your data table) and the variables that describe them (i.e. the columns), you want to determine whether rows and columns are independent. If they are, that’s bad.
To determine whether rows and columns are independent, CA relies on the \(\chi^2\) test. It tests the significance of the overall deviation of the table from the independence model. The test computes the contribution of each cell to \(\chi^2\) and sums up all contributions to obtain the \(\chi^2\) statistic.
Because we are interested in determining whether the two variables are interdependent, we formulate the hypotheses as follows:
- H0 (null hypothesis): the rows and the columns of the table are independent;
- H1 (alternative hypothesis): the rows and the columns are not independent, i.e. they are associated.
One calculates the \(\chi^{2}\) value of a cell in the \(i^{th}\) row and the \(j^{th}\) column as follows:
\[\begin{equation} \chi^2_{i,j} = \dfrac{(O_{i,j}-E_{i,j})^{2}}{E_{i,j}} \end{equation}\]
where \(O_{i,j}\) is the observed frequency for cell \(i,j\) and \(E_{i,j}\) is the expected frequency for cell \(i,j\). The \(\chi^2\) statistic of the whole table is the sum of the \(\chi^2\) values of all cells.
\[\begin{equation} \chi^2 = \sum\limits_{i,j} \dfrac{(O_{i,j}-E_{i,j})^{2}}{E_{i,j}} \end{equation}\]
All the above can be run in R in just a few lines of code.
#nouns <- read.table("https://bit.ly/3oUTJGb", header=TRUE, sep="\t", row.names=1)
chisq <- chisq.test(nouns)
chisq
##
## Pearson's Chi-squared test
##
## data: nouns
## X-squared = 29.856, df = 6, p-value = 4.186e-05
The association between the choice of a noun type and the corpus file is significant.
chisq$observed
## A1J A1K A1L
## NN0 136 14 8
## NN1 2236 254 263
## NN2 952 87 139
## NP0 723 117 71
chisq$expected
## A1J A1K A1L
## NN0 127.8852 14.9152 15.1996
## NN1 2228.2782 259.8832 264.8386
## NN2 953.4732 111.2032 113.3236
## NP0 737.3634 85.9984 87.6382
Because comparing the two matrices manually is tedious, a much quicker way to find this out involves the Pearson residuals. If a Pearson residual is positive, the corresponding observed frequency is greater than its expected frequency. If a Pearson residual is negative, the corresponding observed frequency is smaller than its expected frequency. Finally, the farther a Pearson residual is from 0, the stronger the effect.
chisq$res
## A1J A1K A1L
## NN0 0.71757562 -0.2369744 -1.8466827
## NN1 0.16358137 -0.3649426 -0.1129787
## NN2 -0.04770979 -2.2951662 2.4119814
## NP0 -0.52895225 3.3430196 -1.7772954
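Since a Pearson residual is \((O - E)/\sqrt{E}\), the same table can be recomputed by hand from the observed and expected counts; the sketch below should match chisq$res:

(chisq$observed - chisq$expected) / sqrt(chisq$expected)   # Pearson residuals computed manually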
In R, you can visualize the table of Pearson residuals by means of a Cohen-Friendly association plot with assocplot().
#nouns <- read.table("https://bit.ly/3oUTJGb", header=TRUE, sep="\t", row.names=1)
m <- as.matrix(nouns)
assocplot(t(m))
Compare with the CA plot (Fig. 2.1).
Central to CA is the concept of profile. To obtain the profile of a row, each cell is divided by its row total. Tab. 4.1 displays the row profiles of Table 2.1.
| | A1J.xml | A1K.xml | A1L.xml | row.total |
---|---|---|---|---|
NN0 | 0.8608 | 0.0886 | 0.0506 | 1 |
NN1 | 0.7837 | 0.1241 | 0.0922 | 1 |
NN2 | 0.8081 | 0.0739 | 0.1180 | 1 |
NP0 | 0.7936 | 0.1284 | 0.0779 | 1 |
column average | 0.7935 | 0.1122 | 0.0943 | 1 |
The row profiles add up to 1.
Likewise, one obtains the profile of a column by dividing each column frequency by the column total.
| | A1J.xml | A1K.xml | A1L.xml | row.average |
---|---|---|---|---|
NN0 | 0.0336 | 0.0245 | 0.0166 | 0.0310 |
NN1 | 0.5525 | 0.6189 | 0.5468 | 0.5594 |
NN2 | 0.2352 | 0.1521 | 0.2890 | 0.2310 |
NP0 | 0.1787 | 0.2045 | 0.1476 | 0.1786 |
column total | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
Again, the column profiles add up to 1.
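Both profile tables can be computed in one line each with prop.table(); a minimal sketch based on the nouns table loaded earlier:

m <- as.matrix(nouns)
round(prop.table(m, 1), 4)   # row profiles: each cell divided by its row total
round(prop.table(m, 2), 4)   # column profiles: each cell divided by its column total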
In theory, CA performs an analysis of rows and columns that is symmetric.
A column analysis consists in interpreting the column profiles using the rows as reference points on the basis of a table such as Table 4.2. For example, the value in the A1K.xml column for singular common nouns (NN1) is \(0.6189\). Comparing this value with the average proportion of NN1 in the sample (\(0.5594\)), it appears that these noun types are slightly over-represented in A1K.xml by a ratio of \(\frac{0.6189}{0.5594} \approx 1.1063\).
A row analysis consists in interpreting the row profiles using the columns as reference points on the basis of a table such as Table 4.1, in which the same cell displays a value of \(0.1241\). In other words, of all the singular common nouns that occur in the corpus files, 12.41% occur in A1K.xml. On average, A1K.xml contains 11.22% of the nouns found in the sample. The ratio is the same as above, i.e. \(\frac{0.1241}{0.1122} \approx 1.1063\)!
Distances between profiles are measured with inertia. It is with the total inertia of the table (\(\phi^2\)) that CA measures how much variance there is. \(\phi^2\) is obtained by dividing the \(\chi^2\) statistic by the sample size. CA interprets inertia geometrically to assess how far row/column profiles are from their respective average profiles. The larger \(\phi^2\), the more the data points are spread out on the map.
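For Tab. 2.1, \(\phi^2\) can be computed directly from the \(\chi^2\) test run earlier; a quick sketch:

as.numeric(chisq$statistic) / sum(nouns)   # total inertia: 29.856 / 5000, i.e. about 0.006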
Each column of the table contributes one dimension. The more columns in your table, the larger the number of dimensions. When there are many dimensions, summarizing the table becomes very difficult. To solve this problem, CA decomposes \(\phi^2\) along a few dimensions that concentrate as large a proportion of inertia as possible. These proportions of inertia are known as eigenvalues.
On top of the coordinates of the data points, two descriptors help interpret the dimensions:
- the contribution of a data point to a dimension: if a data point contributes only marginally to a given dimension, its position with respect to this dimension must not be given too much relevance;
- the quality of the projection of a data point onto a dimension (its squared cosine, cos2): the share of the point's inertia captured by that dimension. Usually, projection quality is used to select the dimension in which the individual or the variable is the most faithfully represented.
Individuals and variables can be declared as active or supplementary/illustrative. These supplementary rows and/or columns help interpret the active rows and columns. As opposed to active elements, supplementary elements do not contribute to the construction of the dimensions. Supplementary information is generally redundant. Its main function is to help interpret the results by providing relevant groupings. Whether a group of individuals or variables should be declared as active or illustrative depends on what the linguist considers primary or secondary in the exploration of the phenomenon under study.
See Tab. 2.3.
| | LOB | FLOB | BROWN | FROWN | BrEng | AmEng | X1961 | X1991 | type |
---|---|---|---|---|---|---|---|---|---|
would | 3028 | 2694 | 3053 | 2868 | 5722 | 5921 | 6081 | 5562 | core_modal |
will | 2798 | 2723 | 2702 | 2042 | 5521 | 4744 | 5500 | 4765 | core_modal |
can | 1997 | 2041 | 2193 | 2160 | 4038 | 4353 | 4190 | 4201 | core_modal |
could | 1740 | 1782 | 1776 | 1655 | 3522 | 3431 | 3516 | 3437 | core_modal |
may | 1333 | 1101 | 1298 | 878 | 2434 | 2176 | 2631 | 1979 | core_modal |
The variables LOB, FLOB, BROWN, and FROWN are active. The other variables are redundant and therefore illustrative.
Leitner (1991) reports a study by Hirschmüller (1989) who compares the distribution of complex prepositions in three corpora of English:
The Brown Corpus is a corpus of American English (Francis and Kučera 1964). The LOB Corpus is the British counterpart to the Brown Corpus (Leech, Johansson, and Hofland 1978; Leech et al. 1986). The Kolhapur Corpus is a corpus of Indian English (Shastri, Patilkulkarni, and Shastri 1986).
Complex prepositions are multiword expressions (i.e. expressions that consist of several words): ahead of, along with, apart from, such as, thanks to, together with, on account of, on behalf of, or on top of.
In Hirschmüller’s data, 81 prepositions consist of two words and 154 of three or more, out of a total of 235 complex prepositions. He observes:
Following Leitner (1991), we replicate Hirschmüller’s study based on a two-fold assumption:
Details on how the data were extracted can be found in this blog post.
The data file is stored as a data frame named prepositions in the workshop’s R environment.
str(prepositions)
## 'data.frame': 257 obs. of 19 variables:
## $ brown1 : int 16 1470 207 14 106 246 0 925 523 2 ...
## $ kolh : int 8 1249 179 11 148 111 1 873 560 0 ...
## $ lob : int 12 1492 199 16 87 234 0 770 507 0 ...
## $ adventure_western_fiction: int 3 258 34 0 0 92 0 137 114 1 ...
## $ belles_lettres : int 11 627 67 7 45 64 0 382 271 0 ...
## $ general_fiction : int 3 465 38 4 4 72 0 235 97 0 ...
## $ humour : int 1 99 3 0 4 9 0 59 24 0 ...
## $ learned_scientific : int 1 454 168 8 86 53 0 339 177 0 ...
## $ miscellaneous : int 0 204 48 7 19 11 0 136 64 0 ...
## $ mystery_detective_fiction: int 2 299 18 3 4 75 1 153 74 1 ...
## $ popular_lore : int 7 348 50 5 43 59 0 250 146 0 ...
## $ press_editorial : int 0 188 21 0 17 15 0 116 136 0 ...
## $ press_reportage : int 3 336 27 2 50 30 0 296 227 0 ...
## $ press_reviews : int 0 142 14 0 7 4 0 75 49 0 ...
## $ religion : int 0 120 16 1 27 9 0 65 53 0 ...
## $ romance_love_story : int 3 350 28 3 3 48 0 145 78 0 ...
## $ science_fiction : int 1 48 6 1 3 4 0 14 7 0 ...
## $ skills_trades_hobbies : int 1 273 47 0 29 46 0 166 73 0 ...
## $ prep.length : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 2 1 1 1 1 1 ...
head(prepositions[,c(1:4, 18:19)])
## brown1 kolh lob adventure_western_fiction skills_trades_hobbies
## aboard 16 8 12 3 1
## about 1470 1249 1492 258 273
## above 207 179 199 34 47
## absent 14 11 16 0 0
## according to 106 148 87 0 29
## across 246 111 234 92 46
## prep.length
## aboard 1
## about 1
## above 1
## absent 1
## according to 2
## across 1
CA with FactoMineR
We use the CA() function from the FactoMineR package.1 We need to load the package first.
library(FactoMineR)
At this point:
- the columns corresponding to the 15 text categories are declared as supplementary (col.sup=4:18);
- the variable prep.length is declared as a supplementary categorical variable (quali.sup=19).

ca.object <- CA(prepositions, col.sup=4:18, quali.sup=19, graph=FALSE)
By default, the CA() function produces a graph based on the first two dimensions. For the time being, these plots are not generated yet (graph=FALSE). Each graph will be plotted individually later, with specific parameters.
The output of CA is in ca.object.
ca.object
## **Results of the Correspondence Analysis (CA)**
## The row variable has 257 categories; the column variable has 3 categories
## The chi square of independence between the two variables is equal to 10053.43 (p-value = 0 ).
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$col" "results for the columns"
## 3 "$col$coord" "coord. for the columns"
## 4 "$col$cos2" "cos2 for the columns"
## 5 "$col$contrib" "contributions of the columns"
## 6 "$row" "results for the rows"
## 7 "$row$coord" "coord. for the rows"
## 8 "$row$cos2" "cos2 for the rows"
## 9 "$row$contrib" "contributions of the rows"
## 10 "$col.sup$coord" "coord. for supplementary columns"
## 11 "$col.sup$cos2" "cos2 for supplementary columns"
## 12 "$quali.sup$coord" "coord. for supplementary categorical var."
## 13 "$quali.sup$cos2" "cos2 for supplementary categorical var."
## 14 "$call" "summary called parameters"
## 15 "$call$marge.col" "weights of the columns"
## 16 "$call$marge.row" "weights of the rows"
The first lines of the output give the \(\chi^2\) score and the associated \(p\)-value. The \(\chi^2\) score is very high (10,053.43) and it is associated with the smallest possible \(p\)-value (0). The deviation of the table from independence is beyond doubt.
The eig object lets us see how many dimensions there are to inspect.
ca.object$eig
## eigenvalue percentage of variance cumulative percentage of variance
## dim 1 0.020398336 82.34156 82.34156
## dim 2 0.004374495 17.65844 100.00000
Because the input table is simple and because the number of active variables is low, there are only two dimensions to inspect. Indeed, the first two dimensions represent 100% of the variance of the table. In most other studies, however, we should expect to inspect more than two dimensions. Our decision is based on the cumulative percentage of variance. The inertia (i.e. the sum of eigenvalues) is low (0.0248). This means that there is not much variance in the table and that the tendencies that we are about to observe are subtle.
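As a quick sanity check, the total inertia can be recovered by summing the eigenvalues:

sum(ca.object$eig[,1])   # total inertia (phi^2), approximately 0.0248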
In case there are more than two dimensions to inspect, a scree plot such as Fig. 4.2 is useful.
barplot(ca.object$eig[,2], names=paste("dimension", 1:nrow(ca.object$eig)),
xlab="dimensions",
ylab="percentage of variance")
The CA graph is plotted with the plot.CA() function.
The rows are made invisible to avoid cluttering the graph with prepositions (invisible="row"). The prepositions can be plotted together with the column variables by removing invisible="row". To prevent the labels from being overplotted, autoLab is set to "yes". By setting shadow to TRUE, a background shadow facilitates reading. The font size of the labels is adjusted to 80% of their default size (cex=.8). The active column variables are in magenta (col.col="magenta") whereas the supplementary column variables are in Dodger blue (col.col.sup="dodgerblue"). Finally, a title is included (title=). Its font size is 80% of the default size (cex.main=.8).
plot.CA(ca.object,
invisible="row",
autoLab="yes",
shadow=TRUE,
cex=.8,
col.col="magenta",
col.col.sup="dodgerblue",
title="Distribution of prepositions based on lexical complexity
in three corpora:\n LOB (British English), Brown (US English),
and Kolhapur (Indian English)",
cex.main=.8)
## Warning: ggrepel: 1 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
Hirschmüller observed the following:
The CA plot reflects these tendencies (Fig. 4.3).
The first dimension (along the horizontal axis) accounts for 82.34% of the variance. It shows a clear divide between Brown and LOB (left) and Kolhapur (right). Large complex prepositions (three words or more: prep.length.3 and prep.length.4) are far more likely to occur in Indian English than in British or US English. No such preference is observed for one-word and two-word prepositions (prep.length.1 and prep.length.2). Very formal text categories cluster to the right, along with the Kolhapur corpus: learned_scientific, press_reviews, religion, and miscellaneous (governmental documents, foundation reports, industry reports, college catalogue, industry in-house publications).
The second dimension (along the vertical axis) accounts for 17.66% of the variance. It distinguishes the LOB corpus (upper part of the plot) from the Brown corpus (lower part). All in all, complex prepositions are specific to the Kolhapur Corpus, especially in formal contexts.
Leech (2002) compares the distribution of modals and semi-modals across four corpora: Brown, LOB, Frown and FLOB. The results are given in Tab. 2.3, which is repeated below.
| | LOB | FLOB | BROWN | FROWN | BrEng | AmEng | X1961 | X1991 | type |
---|---|---|---|---|---|---|---|---|---|
would | 3028 | 2694 | 3053 | 2868 | 5722 | 5921 | 6081 | 5562 | core_modal |
will | 2798 | 2723 | 2702 | 2042 | 5521 | 4744 | 5500 | 4765 | core_modal |
can | 1997 | 2041 | 2193 | 2160 | 4038 | 4353 | 4190 | 4201 | core_modal |
could | 1740 | 1782 | 1776 | 1655 | 3522 | 3431 | 3516 | 3437 | core_modal |
may | 1333 | 1101 | 1298 | 878 | 2434 | 2176 | 2631 | 1979 | core_modal |
should | 1301 | 1147 | 910 | 787 | 2448 | 1697 | 2211 | 1934 | core_modal |
must | 1147 | 814 | 1018 | 668 | 1961 | 1686 | 2165 | 1482 | core_modal |
might | 777 | 660 | 635 | 635 | 1437 | 1270 | 1412 | 1295 | core_modal |
shall | 355 | 200 | 267 | 150 | 555 | 417 | 622 | 350 | core_modal |
ought | 104 | 58 | 70 | 49 | 162 | 119 | 174 | 107 | core_modal |
need | 78 | 44 | 40 | 35 | 122 | 75 | 118 | 79 | core_modal |
be_going_to | 248 | 245 | 219 | 332 | 493 | 551 | 467 | 577 | semi_modal |
be_to | 454 | 376 | 349 | 209 | 830 | 558 | 803 | 585 | semi_modal |
had_better | 50 | 37 | 41 | 34 | 87 | 75 | 91 | 71 | semi_modal |
have_got_to | 41 | 27 | 45 | 52 | 68 | 97 | 86 | 79 | semi_modal |
have_to | 757 | 825 | 627 | 643 | 1582 | 1270 | 1384 | 1468 | semi_modal |
need_to | 54 | 198 | 69 | 154 | 252 | 223 | 123 | 352 | semi_modal |
be_supposed_to | 22 | 47 | 48 | 51 | 69 | 99 | 70 | 98 | semi_modal |
used_to | 86 | 97 | 51 | 74 | 183 | 125 | 137 | 171 | semi_modal |
want_to | 357 | 2275 | 1772 | 2101 | 2632 | 3873 | 2129 | 4376 | semi_modal |
All four corpora (LOB, FLOB, Brown, and Frown) belong to the 'Brown Family' of corpora. Here is why:
The Brown Family of corpora, which provides most of the evidence for this study, is named after the Brown Corpus (compiled at Brown University in the 1960s), which was the first of this group of corpora to be created, and was indeed the first electronic corpus of the English language. The Brown Corpus contains just over a million word tokens of published American English (AmE) textual material, and was joined in the 1970s by a matching corpus of British English (BrE), known as the LOB [Lancaster-Oslo/Bergen] Corpus. Both corpora consist of text samples first published in the year 1961. Later, in the 1990s, two new corpora were compiled at the University of Freiburg, matching Brown and LOB in all manageable details, and were therefore called the Freiburg-Brown [Frown] Corpus and the Freiburg-LOB [FLOB] Corpus. These were compiled according to precisely the same corpus design as Brown and LOB, except that they were sampled from texts first published in 1991 and 1992-3 respectively. – (Leech 2013)
These four corpora are comparable because they are equivalently subdivided into 15 genre categories. Furthermore, within each genre, the make-up of the sample is matched in sub-corpus groupings down to one-to-one equivalence of individual text samples of 2000 words.
What trends do you observe with respect to the distribution of modals and semi-modals between 1961 and 1991 in BrE and AmE?
The data set is preloaded in your R environment under the following name: leech.
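If you want to try it yourself, a CA call modeled on the prepositions example might look like the sketch below; it assumes the column order of Tab. 2.3, with the redundant totals (BrEng to X1991) as supplementary columns and type as a supplementary categorical variable:

library(FactoMineR)
# columns 5:8 (BrEng, AmEng, X1961, X1991) merely sum up the four corpora -> supplementary
# column 9 (type) is a supplementary categorical variable
ca.leech <- CA(leech, col.sup=5:8, quali.sup=9, graph=FALSE)
round(ca.leech$eig, 2)
plot.CA(ca.leech, col.col="magenta", col.col.sup="dodgerblue")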
Because MCA is an extension of CA, its inner workings are very similar. For this reason, they are not repeated here.
As pointed out above, MCA takes as input a table of nominal data. For this method to yield manageable results, it is best if the table is of reasonable size (not too many columns), and if each variable does not break down into too many categories. Otherwise, the contribution of each dimension to \(\phi^2\) is small, and a large number of dimensions must be inspected.
There are no hard and fast rules for knowing when there are too many dimensions to inspect. However, when the eigenvalue that corresponds to a dimension is low, we know that the dimension is of little interest (the chances are that the data points will be close to the intersection of the axes in the summary plot).
Schmid (2003) provides an analysis of sex differences in the 10M-word spoken section of the British National Corpus (BNC). Schmid shows that women use certain swear-words more than men, although swear-words which tend to have a perceived 'strong' effect are more frequent in male speech. Schmid's study is based on two subcorpora, which are both sampled from the spoken section of the BNC. The subcorpora amount to 8,173,608 words. The contributions are not equally shared between men and women: for every 100 words spoken by women, 151 are spoken by men. With respect to swear-words, Schmid's conclusion is that both men and women swear, but men tend to use stronger swear-words than women.
Schmid’s study is repeated here in order to explore the distribution of swear-words with respect to gender in the BNC-XML. The goal is to see if:
The data file for this case study is swearwords.2
Unlike Schmid, and following Rayson, Leech, and Hodges (1997), the data are extracted from the demographic component of the BNC-XML, which consists of spontaneous interactive discourse.
The swear-words are: bloody, damn, fuck, fucked, fucker, fucking, gosh, and shit.
Two explanatory variables are included in addition to gender: age and social class.3
The data set contains 293,289 swear-words. These words are described by three categorical variables (nominal data):
Age breaks down into 6 groups:
- Ag0: respondent age between 0 and 14;
- Ag1: respondent age between 15 and 24;
- Ag2: respondent age between 25 and 34;
- Ag3: respondent age between 35 and 44;
- Ag4: respondent age between 45 and 59;
- Ag5: respondent age is 60+.

Social classes are divided into 4 groups:
- AB: higher management: administrative or professional;
- C1: lower management: supervisory or clerical;
- C2: skilled manual;
- DE: semi-skilled or unskilled.

As we inspect the structure of the data frame with str(), it is advisable to keep an eye on the number of levels for each variable and see if any can be kept to a minimum to guarantee that inertia will not drop.
str(swearwords)
## 'data.frame': 293289 obs. of 4 variables:
## $ word : Factor w/ 8 levels "bloody","damn",..: 2 2 7 7 7 2 7 2 7 7 ...
## $ gender : Factor w/ 2 levels "f","m": 2 2 2 2 2 2 2 2 2 2 ...
## $ age : Factor w/ 6 levels "Ag0","Ag1","Ag2",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ soc_class: Factor w/ 4 levels "AB","C1","C2",..: 1 1 1 1 1 1 1 1 1 1 ...
The variable word has eight levels, not all of which are evenly distributed.
table(swearwords$word)
##
## bloody damn fuck fucked fucker fucking gosh shit
## 146203 32294 9219 11 467 23487 60678 20930
We can group fuck, fucking, fucked, and fucker into a single category: f-words. With gsub(), we replace each of these words with the single tag f-words.
swearwords$word <- gsub("fuck|fucking|fucker|fucked", "f-words", swearwords$word, ignore.case=TRUE)
table(swearwords$word)
##
## bloody damn f-words gosh shit
## 146203 32294 33184 60678 20930
The number of levels has been reduced to five. We convert swearwords$word back to a factor.
swearwords$word <- as.factor(swearwords$word)
As in CA, we can declare some variables as active and some other variables as supplementary/illustrative in MCA. We declare the variables corresponding to swear-words and gender as active, and the variables age and social class as supplementary/illustrative.
MCA with FactoMineR
We run the MCA with the MCA() function. We declare age and soc_class as supplementary (quali.sup=c(3,4)). We do not plot the graph yet (graph=FALSE).
mca.object <- MCA(swearwords, quali.sup=c(3,4), graph=FALSE)
Again, the eig object allows us to see how many dimensions there are to inspect.
round(mca.object$eig, 2)
## eigenvalue percentage of variance cumulative percentage of variance
## dim 1 0.56 22.47 22.47
## dim 2 0.50 20.00 42.47
## dim 3 0.50 20.00 62.47
## dim 4 0.50 20.00 82.47
## dim 5 0.44 17.53 100.00
The number of dimensions is rather large and the first two dimensions account for only 42.47% of \(\phi^2\). To inspect a significant share of \(\phi^2\), e.g. 80%, we would have to inspect at least 4 dimensions. This issue is common in MCA.
The eigenvalues can be visualized by means of a scree plot (Fig. 5.1). It is obtained as follows.
barplot(mca.object$eig[,1], names.arg=paste("dim ", 1:nrow(mca.object$eig)), las=2)
Ideally, we would want to see a sharp decrease after the first few dimensions, and we would want these first few dimensions to account for as much share of \(\phi^2\) as possible. Here, no sharp decrease is observed.
The MCA plot is made with the plot.MCA() function. Each category is given the color of its variable (habillage="quali"). The title is removed (title="").
plot.MCA(mca.object,
invisible="ind",
autoLab="yes",
shadowtext=TRUE,
habillage="quali",
title="")
In the MCA biplot (Fig. 5.2), each category is the color of its variable. Let us focus first on the first dimension (the horizontal axis) and ignore the second dimension (the vertical axis). Strikingly, the most explicit swear-words (f-words) cluster in the rightmost part of the plot. These are used mostly by men. Female speakers tend to prefer a softer swear word: bloody.
Next, we focus on the second dimension and ignore the first. Words in the upper part (gosh and shit) are used primarily by upper-class speakers. F-words, bloody, and damn are used by lower social categories. Age groups are positioned close to the intersection of the axes. This is a sign that the first two dimensions bring little or no information about them.
Combining the two dimensions, the plot is divided into four corners in which we observe three distinct clusters:
- gosh and shit, used by male and female upper-class speakers;
- bloody, used by female middle-class speakers;
- f-words and damn, used by male lower-class speakers.

A divide exists between male (m, right) and female (f, left) speakers. However, as the combined eigenvalues indicate, we should be wary of making final conclusions based on the sole inspection of the first two dimensions.
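If you want a numeric handle on what each dimension captures beyond the first plane, FactoMineR's dimdesc() function describes the dimensions of an MCA; a possible sketch:

# which variables and categories are most strongly associated with dimensions 1 to 4
dimdesc(mca.object, axes=1:4)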
The relevance of age groups becomes clearer when dimensions 3 and 4 are inspected together (Fig. 5.3). To do so, the argument axes=c(3,4) is added to the plot.MCA() call.
plot.MCA(mca.object,
axes=c(3,4),
invisible="ind",
autoLab="yes",
shadowtext=TRUE,
habillage="quali",
title="")
## Warning: ggrepel: 2 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
With respect to dimensions 3 and 4, the male/female distinction disappears (both variables overlap where the two axes intersect). A divide is observed between f-words and bloody (left), used mostly by younger and middle-aged speakers, and gosh and damn (right), used mostly by upper-class speakers from age groups 3 and 5. The most striking feature is the outstanding position of shit in the upper-left corner. Although it is favored by male and female upper-class speakers, it is also used, to a lesser degree, by much younger speakers from lower social classes.
As in CA and MCA, the total variance of the table is decomposed into proportions in principal component analysis (henceforth PCA). There is one minor terminological difference: the dimensions are called principal components. For each component, the proportion of variance is obtained by dividing the squared standard deviation by the sum of the squared standard deviations.
As exemplified in this chapter, PCA is based on the inspection of correlations between the variables and the principal components.
Before one runs a PCA, one should consider standardizing (i.e. centering and scaling) the variables. If a table contains measurements in different units, standardizing the variables is compulsory. If a table contains measurements in the same unit, standardizing the variables is optional. However, even in this case, failing to standardize means giving each variable a weight proportional to its variance. Standardizing the variables guarantees that equal weights are attributed to the variables (Husson, Lê, and Pagès 2010).
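In FactoMineR, this choice is controlled by the scale.unit argument of PCA(), which is TRUE by default; a minimal sketch, where df stands for any all-numeric data frame (a placeholder, not one of the preloaded datasets):

library(FactoMineR)
# scale.unit=TRUE (the default) centers and scales the variables before the analysis
res.pca <- PCA(df, scale.unit=TRUE, graph=FALSE)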
A second kind of PCA is based on loadings (Baayen 2008, Sect. 5.1.1). Loadings are correlations between the original variables and the unit-scaled principal components. The two kinds of PCA are similar: both are meant to normalize the coordinates of the data points. The variant exemplified in this chapter is more flexible because it allows for the introduction of supplementary variables.
Gréa (2017) compares five prepositions that denote inclusion in French: au centre de, au coeur de, au milieu de, au sein de, and parmi.
To determine the semantic profile of each preposition, Gréa examines their preferred and dispreferred nominal collocates. He uses an association measure known as calcul des spécificités which is based on the hypergeometric distribution. A positive value indicates that the word is over-represented in the construction. The higher the value, the more the word is over-represented. A negative value indicates that the word is under-represented in the construction. The smaller the value, the more the word is under-represented.
To compare the semantic profiles of the prepositions, the preferred and dispreferred nominal collocates of the prepositions are examined in the FrWaC corpus. The goal is to summarize the table graphically instead of interpreting the data table directly.
The data file for this case study is inclusion.
As we inspect the data frame with str(), we see that 22,397 NPs were found. The rows contain the nominal collocates and the columns the prepositions. The cells contain the association scores. The assumption is that the semantic profiles of the prepositions will emerge from the patterns of attraction/repulsion.
str(inclusion)
## 'data.frame': 22397 obs. of 5 variables:
## $ centre: num -281 -274 -219 -128 -114 ...
## $ coeur : num -651 -913 -933 -368 -330 ...
## $ milieu: num -545 -432 -270 -237 -173 ...
## $ parmi : num -1685 -1129 -1072 -678 -499 ...
## $ sein : num 4226 3830 3490 1913 1522 ...
head(inclusion)
## centre coeur milieu parmi sein
## entreprise -281.14 -651.39 -545.23 -1685.26 4226.02
## équipe -274.36 -913.40 -432.42 -1129.20 3829.95
## groupe -218.55 -932.95 -269.59 -1071.62 3490.06
## établissement -128.13 -367.62 -236.78 -677.87 1913.47
## association -114.08 -330.25 -172.95 -498.83 1522.44
## service -103.45 -279.61 -181.29 -495.38 1443.34
As in CA and MCA, we can declare some variables as active and some other variables as supplementary/illustrative in PCA. Here, however, we decide to declare all variables as active.
PCA with FactoMineR
We load the FactoMineR package and run the PCA with the PCA() function.4 The table contains measurements in the same unit. Standardizing them avoids giving each variable a weight proportional to its variance: perhaps some prepositions attract nouns more strongly than others. The variables are standardized by default.
library(FactoMineR)
pca.object <- PCA(inclusion, graph=F)
We make sure that the first two components are representative.5
round(pca.object$eig, 2)
## eigenvalue percentage of variance cumulative percentage of variance
## comp 1 2.02 40.32 40.32
## comp 2 1.37 27.42 67.74
## comp 3 1.04 20.79 88.52
## comp 4 0.51 10.14 98.67
## comp 5 0.07 1.33 100.00
These eigenvalues are plotted in Fig. 6.1.
barplot(pca.object$eig[,1], names.arg=paste("comp ", 1:nrow(pca.object$eig)), las=2)
In PCA, the variables and the individuals and categories are plotted separately. The graph of variables serves as a guide to interpret the graph of individuals and categories.
In the graph of variables, each variable is represented as an arrow. The circle is known as the circle of correlations. The closer the end of an arrow is to the circle (and the farther it is from where the axes intersect at the center of the graph), the better the corresponding variable is captured by the two components, and the more important the components are with respect to this variable.
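The numbers behind the circle of correlations can also be inspected directly: pca.object$var$cor stores the correlations between the variables and the components. A short sketch:

round(pca.object$var$cor[, 1:2], 2)   # correlations between the prepositions and components 1-2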
We plot the graph of variables (Fig. 6.2) and the graph of individuals (Fig. 6.3).
# graph of variables
plot.PCA(pca.object, choix="var", title="")
# graph of individuals
plot.PCA(pca.object, cex=0.8, autoLab="auto", shadowtext = FALSE, title="")
Three main profiles stand out:
- au sein de (upper left corner);
- au centre de and au coeur de (upper right corner);
- au milieu de and parmi (lower right corner).

The affinities between au centre de and au coeur de on the one hand and au milieu de and parmi on the other are due to similar collocational behaviors. Au sein de is the odd one out.
Admittedly, the graph of individuals is cluttered. This is due to the very large number of NP types that cooccur with the prepositions. Most NPs cluster around where the two axes intersect, a sign that their distribution is of little interest, at least with respect to our understanding of the prepositions. More interesting are those NPs that appear in the margins of the plot.
We can filter out unwanted individuals by selecting only the desired ones with the select argument of the plot.PCA() function.
Here is what the title of each plot means:
- with select="coord 20", only the labels of the twenty individuals that have the most extreme coordinates on the chosen dimensions are plotted;
- with select="contrib 20", only the labels of the twenty individuals that have the highest contributions on the chosen dimensions are plotted;6
- with select="cos2 5", only the labels of the five individuals that have the highest squared-cosine scores on the chosen dimensions are plotted;7
- with select="dist 20", only the labels of the twenty individuals that are the farthest from the center of gravity of the cloud of data points are plotted.

plot.PCA(pca.object, select="coord 20", title="select=\"coord 20\"")
## Warning: ggrepel: 1 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
plot.PCA(pca.object, select="contrib 20", title="select=\"contrib 20\"")
## Warning: ggrepel: 1 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
plot.PCA(pca.object, select="cos2 5", title="select=\"cos2 5\"")
## Warning: ggrepel: 5 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
plot.PCA(pca.object, select="dist 20", title="select=\"dist 20\"")
## Warning: ggrepel: 2 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
Clearer trends emerge:
The graph displaying the first two components does a good job at grouping prepositions based on the nominal collocates that they have in common and revealing consistent semantic trends.
However, it does not show what distinguishes each preposition. For example, au centre du conflit ‘at the center of the conflict’ profiles a participant that is either the instigator of the conflict or what is at stake in the conflict. In contrast, au coeur du conflit ‘at the heart of the conflict’ denotes the peak of the conflict, either spatially or temporally.
This issue has nothing to do with the PCA. It has to do with the kind of collocational approach exemplified in the paper, which does not aim to (and is not geared to) reveal fine-grained semantic differences by itself.
Exploratory factor analysis (henceforth EFA) was made popular in linguistics by Biber’s studies on register variation as part of the multidimensional (MD) approach (Biber 1988, 1995). The goal of the MD approach is to detect register differences across the texts and text varieties of a corpus based on groups of linguistic features that co-occur significantly. Technically, this approach starts with a large number of linguistic variables and relies on factor analysis to reduce this number to a few basic functional dimensions that account for differences between texts.
Although close to PCA, EFA differs in the following respect: the number of relevant components, which are called factors, is not determined automatically. It must be chosen beforehand.
EFA is designed to identify patterns of joint variation in a number of observed variables. It looks for variables that are highly correlated with a group of other variables. These intercorrelated variables are assumed to measure one underlying variable. This variable, which is not directly observed, but inferred, is latent.
It is known as a factor. This is an aspect that PCA is not designed to show. One added value of EFA is that "an error term is added to the model in order to do justice to the possibility that there is noise in the data" (Baayen 2008, 127).
The same data set used for PCA serves as input for EFA, namely inclusion.
EFA with factanal()
EFA is performed with factanal(). Because Figs. 6.2 and 6.3 suggest that three principal components are worth investigating, we are tempted to specify 3 factors. Unfortunately, this is not going to work: 3 factors are too many for 5 variables in the kind of EFA that factanal() performs.8 Therefore, we set the number of required factors to 2.
fa.object <- factanal(inclusion, factors=2)
fa.object
##
## Call:
## factanal(x = inclusion, factors = 2)
##
## Uniquenesses:
## centre coeur milieu parmi sein
## 0.655 0.436 0.849 0.005 0.005
##
## Loadings:
## Factor1 Factor2
## centre 0.587
## coeur 0.750
## milieu 0.389
## parmi -0.147 0.987
## sein -0.740 -0.669
##
## Factor1 Factor2
## SS loadings 1.626 1.424
## Proportion Var 0.325 0.285
## Cumulative Var 0.325 0.610
##
## Test of the hypothesis that 2 factors are sufficient.
## The chi square statistic is 12667.73 on 1 degree of freedom.
## The p-value is 0
A \(\chi^2\) test reports whether the specified number of factors is sufficient. If the \(p\)-value is smaller than 0.05, more factors are needed. If it is greater than \(0.05\), no more factors are needed. The test reports that the \(\chi^2\) statistic is 12,667.73 on 1 degree of freedom and that the \(p\)-value is 0. Although we feel a third factor is required (from our PCA), we have no choice but to stick to 2 factors and trust that factanal() will do a pretty good job at providing a satisfactory factor model. We shall see why below.
The output displays:
Factor loadings are the weights and correlations between the variables and the factors. The higher the loading the more relevant the variable is in explaining the dimensionality of the factor. If the value is negative, it is because the variable has an inverse impact on the factor.
Au milieu de, au centre de, and au coeur de define the first factor. Parmi defines the second factor. It seems that au sein de defines both.
The proportions of variance explained by the factors (i.e. eigenvalues) are listed under the factor loadings. A factor is considered worth keeping if the corresponding SS loading (i.e. the sum of squared loadings) is greater than 1. Two factors are retained because both have eigenvalues over 1.
Factor 1 accounts for 32.5% of the variance. Factor 2 accounts for 28.5% of the variance. Together, the two factors account for 61% of the variance.
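These figures can be recomputed from the loadings themselves: the SS loadings are the column sums of squared loadings, and dividing them by the number of variables (here 5) gives the proportions of variance. A quick check:

ss <- colSums(loadings(fa.object)^2)   # SS loadings (1.626 and 1.424)
round(ss / ncol(inclusion), 3)         # proportions of variance (0.325 and 0.285)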
In EFA, rotation is a procedure meant to clarify the relationship between variables and factors. As its name indicates, it rotates the factors to align them better with the variables. The two most frequent rotation methods are varimax and promax.
With varimax, the factor axes are rotated in such a way that they are still perpendicular to each other. The factors are uncorrelated and the production of 1s and 0s in the factor matrix is maximized.
With promax, the factor axes are rotated in an oblique way. The factors are correlated. The resulting model provides a closer fit to the data than with varimax.
In either case, the goal is to arrive at a few common meaningful factors. Rotation is optional as it does not modify the relationship between the factors and the variables. Figure 7.1 is a plot of the loadings of the prepositions on the two factors with varimax rotation. Figure 7.2 is the same plot of the loadings with promax rotation.
By default, varimax rotation applies.
loadings <- loadings(fa.object)
plot(loadings, type="n", xlim=c(-1,1))
text(loadings, rownames(loadings))
To produce a plot with promax rotation, we run factanal() again but set rotation to "promax".
fa.object2 <- factanal(inclusion, factors=2, rotation="promax")
loadings2 <- loadings(fa.object2)
plot(loadings2, type="n", xlim=c(-1,1))
text(loadings2, rownames(loadings2))
The distinctive profiles we obtain with EFA are similar to those we obtained with PCA. The only major difference is the proximity of au milieu de with au centre de and au coeur de. This may be due to the fact that only two factors are retained in the analysis. As far as this data set is concerned, PCA is, I believe, clearly a better alternative. For a data set with a much larger number of variables, EFA works generally better.
MDS is very popular because it is relatively old, versatile, and easy to understand and implement. It is a multivariate data analysis approach that is used to visualize distances in multidimensional maps (in general: two-dimensional plots).
MDS comes in different flavors:
We focus on classical multidimensional scaling (MDS), which is also known as principal coordinates analysis (Gower 1966). MDS takes as input a matrix of dissimilarities and returns a set of points such that the distances between the points are approximately equal to the dissimilarities. A strong selling point of MDS is that, given the \(n\) dimensions of a table, it returns an optimal solution to represent the data in a space whose dimensions are (much) lower than \(n\).
The data is from Desagulier (2014). The data set was compiled to see how 23 English intensifiers cluster on the basis of their most associated adjectives. For each of the 23 adverbs, I first extracted all adjectival collocates from the Corpus of Contemporary American English (Davies 2008–2012), amounting to 432 adjective types and 316,159 co-occurrence tokens. Then, I conducted a collexeme analysis for each of the 23 degree modifiers. To reduce the data set to manageable proportions, the 35 most attracted adjectives were selected on the basis of their respective collostruction strengths, yielding a 23-by-432 contingency table containing the frequency of adverb-adjective pair types.
The contingency table is preloaded in your R environment under the name intensifiers.
str(intensifiers)
## 'data.frame': 23 obs. of 432 variables:
## $ absent : int 0 0 0 0 0 40 43 0 0 0 ...
## $ absurd : int 0 0 42 0 0 0 0 0 0 0 ...
## $ acceptable : int 0 0 0 0 0 0 0 0 0 0 ...
## $ accurate : int 0 0 0 0 0 0 60 0 83 0 ...
## $ acidic : int 0 0 0 0 0 0 0 0 0 0 ...
## $ afraid : int 0 141 0 117 0 0 0 0 0 0 ...
## $ ajar : int 0 0 0 0 0 0 0 0 0 0 ...
## $ akin : int 0 0 0 0 0 0 0 0 0 0 ...
## $ alien : int 0 0 0 0 0 0 0 0 0 0 ...
## $ alone : int 0 0 0 0 0 74 0 0 0 0 ...
## $ amazing : int 0 0 107 0 0 0 0 0 0 0 ...
## $ ambiguous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ambitious : int 0 0 0 0 0 0 0 0 0 0 ...
## $ american : int 0 0 0 0 0 0 0 0 0 0 ...
## $ amusing : int 0 0 0 0 0 0 0 0 0 1 ...
## $ anxious : int 0 0 0 0 0 0 0 0 0 0 ...
## $ apprehensive : int 21 0 0 0 0 0 0 0 0 0 ...
## $ appropriate : int 0 0 0 0 0 0 121 0 0 0 ...
## $ arbitrary : int 0 0 0 0 0 0 0 0 0 0 ...
## $ askew : int 0 0 0 0 0 0 0 0 0 0 ...
## $ avant-garde : int 0 0 0 0 0 0 0 0 0 1 ...
## $ awesome : int 0 0 0 0 0 0 0 0 0 0 ...
## $ awkward : int 49 112 0 0 0 0 0 0 0 0 ...
## $ bad : int 0 0 0 0 18 0 0 0 0 0 ...
## $ basic : int 0 0 0 0 0 0 0 0 0 0 ...
## $ bearded : int 0 0 0 0 0 0 0 0 0 0 ...
## $ beautiful : int 0 0 145 0 0 0 0 0 0 0 ...
## $ benign : int 0 0 0 0 0 0 20 0 0 0 ...
## $ better : int 0 0 0 0 0 0 0 0 0 0 ...
## $ big : int 0 0 0 0 58 0 0 0 0 0 ...
## $ bigger : int 44 152 0 0 0 0 0 0 0 0 ...
## $ bitter : int 0 0 0 0 0 0 0 0 0 0 ...
## $ bizarre : int 0 0 0 0 0 0 0 0 0 0 ...
## $ bland : int 0 0 0 0 0 0 0 0 0 0 ...
## $ blank : int 0 0 0 0 0 0 0 0 0 0 ...
## $ blind : int 0 0 0 0 0 0 0 0 0 0 ...
## $ boring : int 0 0 0 0 0 0 0 0 0 1 ...
## $ bright : int 0 0 0 0 0 0 0 85 0 0 ...
## $ brilliant : int 0 0 57 0 0 0 0 0 0 0 ...
## $ brown : int 0 0 0 0 0 0 0 0 0 0 ...
## $ busy : int 0 0 0 0 17 0 0 0 0 2 ...
## $ capable : int 0 0 0 0 0 0 0 0 0 0 ...
## $ careful : int 0 0 0 0 0 0 0 111 0 0 ...
## $ certain : int 0 0 322 633 0 0 0 0 201 0 ...
## $ charming : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cheerful : int 0 0 0 0 0 0 0 0 0 1 ...
## $ chewy : int 0 0 0 0 0 0 0 0 0 0 ...
## $ chilly : int 26 0 0 0 0 0 0 0 0 0 ...
## $ clear : int 0 0 204 0 0 77 216 0 139 0 ...
## $ clever : int 0 0 0 0 0 0 0 0 0 1 ...
## $ close : int 0 0 0 0 17 0 0 0 82 2 ...
## $ cold : int 0 0 0 0 11 0 0 0 0 2 ...
## $ comfortable : int 0 0 0 0 0 94 80 0 0 0 ...
## $ comical : int 0 0 0 71 0 0 0 0 0 0 ...
## $ common : int 0 0 0 0 0 0 0 0 278 0 ...
## $ compatible : int 0 0 0 0 0 0 20 0 0 0 ...
## $ competitive : int 0 0 0 0 0 0 0 83 0 0 ...
## $ complete : int 0 0 0 238 0 0 0 0 0 0 ...
## $ complex : int 0 0 0 0 0 0 0 129 0 0 ...
## $ complicated : int 0 0 0 0 0 0 0 73 0 0 ...
## $ concerned : int 0 195 0 0 0 0 0 91 0 0 ...
## $ confident : int 0 0 57 0 0 0 0 0 78 0 ...
## $ confusing : int 0 0 0 0 0 0 0 0 0 0 ...
## $ conservative : int 0 0 0 0 0 0 0 0 65 0 ...
## $ consistent : int 0 0 0 0 0 0 95 0 130 0 ...
## $ constant : int 0 0 0 126 0 0 0 0 117 0 ...
## $ content : int 0 0 0 0 0 0 0 0 0 0 ...
## $ contradictory : int 0 0 0 0 0 0 0 0 0 0 ...
## $ controversial : int 0 0 0 0 0 0 0 0 0 0 ...
## $ convenient : int 0 0 0 0 6 0 0 0 0 0 ...
## $ convinced : int 0 0 203 0 0 0 41 0 0 0 ...
## $ convincing : int 0 0 0 0 0 0 22 0 0 0 ...
## $ cool : int 0 0 0 0 0 0 0 0 0 0 ...
## $ correct : int 0 0 339 0 0 0 33 0 0 0 ...
## $ crazy : int 0 289 75 0 0 62 0 0 0 0 ...
## $ creaky : int 0 0 0 0 0 0 0 0 0 1 ...
## $ critical : int 0 0 171 0 0 0 0 0 0 0 ...
## $ crockery-smashing: int 0 0 0 0 0 0 0 0 0 0 ...
## $ crooked : int 0 0 0 0 0 0 0 0 0 0 ...
## $ crucial : int 0 0 84 0 0 0 0 0 0 0 ...
## $ curved : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cute : int 0 0 0 0 16 0 0 0 0 0 ...
## $ daily : int 0 0 0 358 0 0 0 0 0 0 ...
## $ dainty : int 0 0 0 0 0 0 0 0 0 1 ...
## $ damp : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dangerous : int 0 0 0 0 0 0 0 282 0 0 ...
## $ dark : int 0 0 0 0 0 122 0 0 0 0 ...
## $ darker : int 0 0 0 0 0 0 0 0 0 0 ...
## $ decent : int 0 0 0 0 0 0 0 0 70 0 ...
## $ delicious : int 0 0 60 0 0 0 0 0 0 0 ...
## $ dependent : int 0 0 0 0 0 100 90 0 0 0 ...
## $ desirable : int 0 0 0 0 0 0 0 0 0 0 ...
## $ destructive : int 0 0 0 0 0 0 0 0 0 0 ...
## $ developed : int 0 0 0 0 0 0 0 0 0 0 ...
## $ devoid : int 0 0 0 0 0 43 22 0 0 0 ...
## $ devoted : int 0 0 0 0 0 0 29 0 0 0 ...
## $ different : int 167 930 0 0 0 1844 1394 0 0 0 ...
## $ difficult : int 52 153 0 0 31 0 0 1138 0 4 ...
## $ dirty : int 0 0 0 0 0 0 0 0 0 0 ...
## [list output truncated]
The contingency table must be converted into a distance object. Technically, this distance object is a dissimilarity matrix. Because the matrix is symmetric, it is divided into two triangles on either side of the diagonal, which contains the null distances between each element and itself. Only one triangle is needed.
You obtain the dissimilarity matrix by converting the contingency table into a table of distances with a user-defined distance measure. When the variables are ratio-scaled, you can choose from several distance measures: Euclidean, City-Block/Manhattan, correlation, Pearson, Canberra, etc. In my experience, the Canberra distance metric handles best the relatively large number of zero counts that we typically obtain in linguistic data (i.e. when the matrix is sparse).
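To get an intuition for why the choice of metric matters with sparse count data, here is a minimal, hypothetical illustration (the two toy profiles and their counts are invented for the example): the Euclidean distance is dominated by the few large counts, whereas the Canberra distance weighs each coordinate relative to its magnitude and skips cells that are empty in both rows.
toy <- rbind(adv1 = c(10, 0, 0, 2, 0),
             adv2 = c(0, 8, 0, 0, 1))
dist(toy, method = "euclidean") # driven mostly by the two largest counts
dist(toy, method = "canberra")  # each non-empty coordinate contributes at most 1 to the sum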
We use the dist() function, whose main arguments are the following:
method: the distance measure to be used (here, method="canberra");
diag: lets you decide if you want R to print the diagonal of the distance object;
upper: lets you decide if you want R to print the upper triangle of the distance object.
dist.object <- dist(intensifiers, method="canberra", diag=T, upper=T)
The distance object is quite large. To see a snapshot, enter the following.
dist.matrix <- as.matrix(dist.object)
dist.matrix[1:5, 1:5]
## a_bit a_little absolutely almost awfully
## a_bit 0.0000 372.1788 432.0000 432.0000 418.8591
## a_little 372.1788 0.0000 429.4200 426.3215 422.8858
## absolutely 432.0000 429.4200 0.0000 421.9386 432.0000
## almost 432.0000 426.3215 421.9386 0.0000 432.0000
## awfully 418.8591 422.8858 432.0000 432.0000 0.0000
The diagonal of 0 values separates the upper and lower triangles, as expected from a distance matrix.
cmdscale()
The distance matrix serves as input to the base-R cmdscale() function, which performs a ‘vanilla’ version of MDS. We specify k=2, meaning that the maximum dimension of the space in which the data are to be represented is 2.
mds <- cmdscale(dist.matrix, eig=TRUE, k=2)
mds
## $points
## [,1] [,2]
## a_bit -93.23707 -159.5334193
## a_little -66.06634 -110.6741767
## absolutely 74.93872 19.7177119
## almost -10.48677 2.9320651
## awfully -57.58458 0.9245004
## completely 220.59916 -36.6649541
## entirely 139.16294 20.9142554
## extremely -59.98183 73.9680050
## fairly -30.38867 144.6343031
## frightfully -19.37836 -12.4447681
## highly -10.37726 9.4095014
## jolly -18.25316 -9.2906600
## most -32.58424 37.3181351
## perfectly 25.93391 62.2717594
## pretty -47.59436 121.8525685
## quite -39.16428 133.6428904
## rather -75.83882 14.5469112
## slightly -41.59012 -122.0587482
## somewhat -74.01797 -130.4787887
## terribly -55.30971 -54.0548218
## totally 219.14279 -46.3793237
## utterly 92.51536 -30.9168073
## very -40.43937 70.3638610
##
## $eig
## [1] 1.752381e+05 1.476932e+05 1.251985e+05 1.089659e+05 1.030146e+05
## [6] 9.859947e+04 9.523649e+04 9.326120e+04 9.084104e+04 8.840678e+04
## [11] 8.735805e+04 8.450954e+04 8.322147e+04 7.973600e+04 7.619331e+04
## [16] 7.468265e+04 7.339436e+04 7.062472e+04 6.141132e+04 5.960298e+04
## [21] 5.534600e+04 3.012413e+04 1.455192e-10
##
## $x
## NULL
##
## $ac
## [1] 0
##
## $GOF
## [1] 0.1645376 0.1645376
The result is a matrix with 2 columns and 23 rows (mds$points). The function has done a good job of outputting the coordinates of the intensifiers in the reduced two-dimensional space that we requested. Note that cmdscale() returns the best-fitting \(k\)-dimensional representation, where \(k\) may be smaller than the value supplied to the argument k.
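As a quick adequacy check, you can relate the first two eigenvalues to the sum of all eigenvalues to see how much of the total variation the two retained dimensions capture; this is essentially what the $GOF component of the output reports.
round(sum(mds$eig[1:2]) / sum(abs(mds$eig)), 3) # proportion of variation captured by Dim.1 and Dim.2
mds$GOF
Here the figure is modest (about 16%), which is not unusual when a table with several hundred columns is squeezed into two dimensions.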
To plot the results, we first retrieve the coordinates for the two dimensions (x and y).
x <- mds$points[,1]
y <- mds$points[,2]
Second, we plot the two axes and add information about the intensifiers (Fig. 8.1).
plot(x, y, xlab="Dim.1", ylab="Dim.2", type="n")
text(x, y, labels = row.names(intensifiers), cex=.7)
The question we are addressing is whether these dimensions reflect differences in the semantics of the intensifiers. Existing typologies of intensifiers tend to group them into maximizers, boosters, moderators, and diminishers.
Maximizers and boosters stretch horizontally across the middle of the plot. Moderators are in the upper left corner, and diminishers in the lower left corner. Note the surprising position of almost.
We can improve the MDS plot in Fig. 8.1 by grouping and coloring the individuals by means of \(k\)-means clustering. \(k\)-means clustering partitions the data points into \(k\) classes, each point being assigned to the class with the nearest mean.
We download and load one extra package.
install.packages("ggpubr")
library(ggpubr)
## Loading required package: ggplot2
We convert the coordinates obtained above into a data frame.
mds.df <- as.data.frame(mds$points) # convert the coordinates
colnames(mds.df) <- c("Dim.1", "Dim.2") # assign column names
mds.df # inspect
## Dim.1 Dim.2
## a_bit -93.23707 -159.5334193
## a_little -66.06634 -110.6741767
## absolutely 74.93872 19.7177119
## almost -10.48677 2.9320651
## awfully -57.58458 0.9245004
## completely 220.59916 -36.6649541
## entirely 139.16294 20.9142554
## extremely -59.98183 73.9680050
## fairly -30.38867 144.6343031
## frightfully -19.37836 -12.4447681
## highly -10.37726 9.4095014
## jolly -18.25316 -9.2906600
## most -32.58424 37.3181351
## perfectly 25.93391 62.2717594
## pretty -47.59436 121.8525685
## quite -39.16428 133.6428904
## rather -75.83882 14.5469112
## slightly -41.59012 -122.0587482
## somewhat -74.01797 -130.4787887
## terribly -55.30971 -54.0548218
## totally 219.14279 -46.3793237
## utterly 92.51536 -30.9168073
## very -40.43937 70.3638610
We now perform \(k\)-means clustering on the data frame with the kmeans() function.
kmclusters <- kmeans(mds.df, 5) # k-means clustering with 5 groups
kmclusters <- as.factor(kmclusters$cluster) # convert to a factor
mds.df$groups <- kmclusters # join to the existing data frame
mds.df # inspect
## Dim.1 Dim.2 groups
## a_bit -93.23707 -159.5334193 2
## a_little -66.06634 -110.6741767 2
## absolutely 74.93872 19.7177119 5
## almost -10.48677 2.9320651 1
## awfully -57.58458 0.9245004 1
## completely 220.59916 -36.6649541 3
## entirely 139.16294 20.9142554 4
## extremely -59.98183 73.9680050 1
## fairly -30.38867 144.6343031 1
## frightfully -19.37836 -12.4447681 1
## highly -10.37726 9.4095014 1
## jolly -18.25316 -9.2906600 1
## most -32.58424 37.3181351 1
## perfectly 25.93391 62.2717594 1
## pretty -47.59436 121.8525685 1
## quite -39.16428 133.6428904 1
## rather -75.83882 14.5469112 1
## slightly -41.59012 -122.0587482 2
## somewhat -74.01797 -130.4787887 2
## terribly -55.30971 -54.0548218 2
## totally 219.14279 -46.3793237 3
## utterly 92.51536 -30.9168073 5
## very -40.43937 70.3638610 1
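Note that kmeans() starts from randomly chosen centers, so the cluster numbering, and occasionally the partition itself, may differ from one run to the next. For a reproducible solution, you can fix the random seed and request several random starts, along the lines of the following sketch (the seed value is arbitrary):
set.seed(123)                                                          # arbitrary seed, any fixed value will do
km <- kmeans(mds.df[, c("Dim.1", "Dim.2")], centers = 5, nstart = 25)  # 25 random starts
mds.df$groups <- as.factor(km$cluster)                                 # overwrite the groups column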
We are now ready to generate the plot. Each group is assigned a color (Fig. 8.2).
ggscatter(mds.df, x = "Dim.1", y = "Dim.2",
label = rownames(intensifiers),
color = "groups",
palette = "jco",
size = 1,
ellipse = TRUE,
ellipse.type = "convex",
repel = TRUE)
The result is gratifying, although a bit puzzling. Let us see why.
The distance matrix can also serve as input for another multivariate exploratory method: hierarchical cluster analysis.
We use the hclust() function to apply an amalgamation rule that specifies how the elements in the matrix are clustered. We amalgamate the clusters with Ward’s method, which evaluates the distances between clusters using an analysis of variance. Ward’s method is the most widely used amalgamation rule because it has the advantage of generating clusters of moderate size. We specify method="ward.D".
clusters <- hclust(dist.object, method="ward.D")
We plot the dendrogram as follows (Fig. 8.3).
plot(clusters, sub="(Canberra, Ward)")
Although it is based on the same distance matrix, the dendrogram clusters the intensifiers slightly differently from the \(k\)-means partition above.
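If you want to compare the two solutions more directly, you can cut the dendrogram into the same number of groups as the \(k\)-means partition with cutree() and cross-tabulate the two memberships; a minimal sketch (five groups, mirroring the \(k\)-means step above):
hca.groups <- cutree(clusters, k = 5)           # cluster memberships read off the dendrogram
table(hca = hca.groups, kmeans = mds.df$groups) # cross-tabulate against the k-means groups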
The standard graphic output of CA and MCA is a symmetric biplot in which both row variables and column variables are represented in the same space using their coordinates. In this case, only the distance between row points or the distance between column points can be interpreted accurately (Greenacre 2007, 2:72). Only general observations can be made about the distance between row points and column points, when these points appear in the same part of the plot with respect to the center of the cloud of points (Husson, p.c.).
Assessing the inter-distance between rows and columns accurately is possible in either an asymmetric plot or a scaled symmetric biplot (see this tutorial). In an asymmetric biplot, either the columns are represented in row space or the rows are represented in column space. In a scaled symmetric biplot, neither the row metric nor the column metric is preserved: rows and columns are scaled to have variances equal to the square roots of the eigenvalues, which allows for a direct comparison within the same plot.9
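By way of illustration, here is a minimal sketch of how these alternative maps can be drawn with the ca package (Nenadic and Greenacre 2007), using the intensifiers table from earlier in this lesson as a convenient input; the map argument of ca's plot method selects the scaling.
# install.packages("ca")                      # if the package is not installed yet
library(ca)
ca.intensifiers <- ca(intensifiers)           # correspondence analysis of the contingency table
plot(ca.intensifiers, map = "rowprincipal")   # asymmetric map: rows in principal coordinates
plot(ca.intensifiers, map = "symbiplot")      # symmetric biplot with scaled coordinates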
The dataset for this exercise is it.be.A.to.V.that.rds, available here. It is based on a study of the productivity of the construction <it BE ADJ to V> that in the British National Corpus (Baby edition).
The construction is exemplified below:
Each observation should be described by the following variables:
corpus file: the name of the file from the BNC Baby where the construction is observed;
info: the information tag about the text (e.g. W newsp brdsht nat: sports or W ac:polit law edu);
mode: wtext for written text or stext for spoken text;
type: the text genre (ACPROSE, NEWS, FICTION, etc.);
exact match: the construction in context;
ADJ: the adjectival constituent of the construction;
V_inf: the verbal constituent of the construction.
Run a PCA of it.be.A.to.V.that.rds. Consider the variables num.hapax and V supplementary. Decide how many components (or dimensions) should be inspected, plot the results, and interpret them.
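A possible starting point is sketched below; the file path is an assumption (adjust it to where you saved the dataset), and you may need to adapt the supplementary-variable settings to the actual column names and types (num.hapax as a quantitative supplementary variable, V as a qualitative one).
library(FactoMineR)
df <- readRDS("it.be.A.to.V.that.rds")                           # adjust the path to your setup
res.pca <- PCA(df,
               quanti.sup = which(colnames(df) == "num.hapax"),  # supplementary quantitative variable
               quali.sup  = which(colnames(df) == "V"),          # supplementary qualitative variable
               graph = FALSE)
res.pca$eig                    # eigenvalues: how many components are worth inspecting?
plot(res.pca, choix = "var")   # graph of variables
plot(res.pca, choix = "ind")   # graph of individuals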
The dataset modals.BNC.rds is a contingency table that contains the frequencies of ten English modals across seventy text genres in the British National Corpus.10
Run a CA of modals.BNC.rds, which you will find here.
First, plot only the variables that correspond to written texts (the variable names start with W_). What can you say about the respective distributions of may and might?
Second, plot only the variables that correspond to spoken texts (the variable names start with S_). What profile stands out the most? What is the contextual specificity of will?
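Here is a possible skeleton for the first part of the exercise (the file path is an assumption); the same logic applies to the spoken-text columns by swapping the prefix to S_.
library(FactoMineR)
modals <- readRDS("modals.BNC.rds")                  # adjust the path to your setup
written <- modals[, grepl("^W_", colnames(modals))]  # keep only the written-text genres
res.ca.w <- CA(written, graph = FALSE)
plot(res.ca.w)                                       # CA map of the modals and the written genres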
The data set for this exercise is typocorp.rds. You will find it here.
The dataset is a tentative typology of seventeen of the most widely used corpora in English. Each corpus was described with respect to seven variables.
Your goal is to run an MCA using the MCA() function from the FactoMineR package. The variables corpus, stored_data_format, and size should be considered supplementary. Plot the graph of individuals.
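A possible skeleton is given below; the file path is an assumption, and if size is stored as a numeric variable you should pass it to quanti.sup rather than quali.sup.
library(FactoMineR)
typocorp <- readRDS("typocorp.rds")               # adjust the path to your setup
sup <- which(colnames(typocorp) %in% c("corpus", "stored_data_format", "size"))
res.mca <- MCA(typocorp, quali.sup = sup, graph = FALSE)
plot(res.mca, invisible = c("var", "quali.sup"))  # graph of individuals only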
Using modals.BNC.rds again, run a hierarchical cluster analysis of the ten English modals and plot the results in the form of a dendrogram. The hclust() function from the stats package in base R should do the trick.
Compare the following distance measures:11
Use Ward’s method to amalgamate the clusters.
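One way to set this up is to loop over the distance measures and plot one dendrogram per measure. The sketch below is only a starting point: the file path is an assumption, and the three measures named here are examples to be replaced by the ones you are asked to compare.
modals <- readRDS("modals.BNC.rds")                  # adjust the path to your setup
for (m in c("euclidean", "manhattan", "canberra")) { # example distance measures
  d <- dist(modals, method = m)
  plot(hclust(d, method = "ward.D"), main = m, sub = "(Ward)")
}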
Based on the HCA that you ran above, run an MDS analysis of the same dataset.
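A minimal sketch, reusing the modals table and the Canberra distance from the previous exercise:
d <- dist(modals, method = "canberra")                 # same distance measure as in the HCA
mds.modals <- cmdscale(d, eig = TRUE, k = 2)
plot(mds.modals$points, type = "n", xlab = "Dim.1", ylab = "Dim.2")
text(mds.modals$points, labels = rownames(modals), cex = 0.8)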
Desagulier, Guillaume. To appear. "Multivariate Exploratory Approaches." In Practical Handbook of Corpus Linguistics, edited by Magali Paquot and Stefan Th. Gries. New York: Springer.
Nenadic, Oleg, and Michael Greenacre. 2007. "Correspondence Analysis in R, with Two- and Three-Dimensional Graphics: The ca Package." Journal of Statistical Software 20 (3): 1–13.
On top of FactoMineR, several packages contain a dedicated CA function, e.g. ca (Nenadic and Greenacre 2007) and anacor (de Leeuw and Mair 2009).↩︎
The code for the extraction was partly contributed by Mathilde Léger, a third-year student at Paris 8 University, as part of her end-of-term project.↩︎
Several packages and functions implement PCA in R: e.g. princomp() and prcomp() from the stats package, ggbiplot() from the ggbiplot package (which is itself based on ggplot2), dudi.pca() from the ade4 package, and PCA() from the FactoMineR package. Mind you, princomp() and prcomp() perform PCA based on loadings.↩︎
For this kind of analysis, the first two components should represent a cumulative percentage of variance that is far above 50%. The more dimensions there are in the input data table, the harder it will be to reach this percentage.↩︎
The contribution is a measure of how much an individual contributes to the construction of a component.↩︎
The squared cosine (\(cos^2\)) is a measure of how well an individual is projected onto a component.↩︎
Deciding how many factors are worth keeping involves a choice based on a metric known as SS loadings, as explained below.↩︎
This possibility is not offered in FactoMineR. It is offered in the factoextra (Kassambara and Mundt 2017) and ca (Nenadic and Greenacre 2007) packages.↩︎
Enter ?dist for more information about the distance metrics.↩︎