This lesson provides both a short introduction to multivariate exploratory approaches and step-by-step instructions to implement each of them with R. Five methods are presented. These methods are designed to explore and summarize large and complex data tables by means of summary statistics. They help generate hypotheses by providing informative clusters using the variable values that characterize each observation.
For you to fully benefit from this lesson, you need:
Load the R environment as follows:
load("~/pathto/your/workingdirectory/multivariate.RData")
This allows you to access the datasets that come with the environment, a list of which is obtained with ls(), like so:
ls()
## [1] "colprofs" "distm" "inclusion" "inputEFA"
## [5] "inputMCA" "inputPCA" "intensifiers" "leech"
## [9] "nouns" "nouns.2" "prepositions" "prepositions2"
## [13] "rowprofs" "swearwords"
I show how to run the code to perform CA, MCA, and PCA with FactoMineR. The package should therefore be downloaded and installed beforehand.
install.packages("FactoMineR")
library(FactoMineR)
EFA is run with factanal() and MDS with cmdscale(), which are part of base R. No extra package is needed for these.
As a corpus linguist, you collect large numbers of observations. These observations are commonly examined in the light of explanatory variables. You are hoping to find patterns in the data.
When the data set is too large, it becomes impossible to make sense of the table with the naked eye and summary statistics are needed. This is where multivariate exploratory techniques step in.
Consider Tab. 2.1 below, which displays the frequency counts of four types of nouns (rows) across three corpus files from the BNC-XML (columns).
| | A1J | A1K | A1L |
---|---|---|---|
NN0 | 136 | 14 | 8 |
NN1 | 2236 | 254 | 263 |
NN2 | 952 | 87 | 139 |
NP0 | 723 | 117 | 71 |
What preference patterns can you see?
Now, take a look at Tab. 2.2 (same table, but with marginal totals).
| | A1J | A1K | A1L | row.totals |
---|---|---|---|---|
NN0 | 136 | 14 | 8 | 158 |
NN1 | 2236 | 254 | 263 | 2753 |
NN2 | 952 | 87 | 139 | 1178 |
NP0 | 723 | 117 | 71 | 911 |
column totals | 4047 | 472 | 481 | 5000 |
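Tab. 2.2 can be reproduced from the raw counts with base R's addmargins(); a quick sketch, assuming the nouns data frame from the preloaded environment:

addmargins(as.matrix(nouns))   # appends the row and column totals to the contingency table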
How confident are you when you assess your first intuition in the light of the marginal totals?
Now see how the data points are clustered in a correspondence analysis map.
The more familiar you are with corpora and data extraction techniques, the more likely you are to work with large datasets such as the one in Tab. 2.3, adapted from Leech (2002).
| | LOB | FLOB | BROWN | FROWN | BrEng | AmEng | X1961 | X1991 | type |
---|---|---|---|---|---|---|---|---|---|
would | 3028 | 2694 | 3053 | 2868 | 5722 | 5921 | 6081 | 5562 | core_modal |
will | 2798 | 2723 | 2702 | 2042 | 5521 | 4744 | 5500 | 4765 | core_modal |
can | 1997 | 2041 | 2193 | 2160 | 4038 | 4353 | 4190 | 4201 | core_modal |
could | 1740 | 1782 | 1776 | 1655 | 3522 | 3431 | 3516 | 3437 | core_modal |
may | 1333 | 1101 | 1298 | 878 | 2434 | 2176 | 2631 | 1979 | core_modal |
should | 1301 | 1147 | 910 | 787 | 2448 | 1697 | 2211 | 1934 | core_modal |
must | 1147 | 814 | 1018 | 668 | 1961 | 1686 | 2165 | 1482 | core_modal |
might | 777 | 660 | 635 | 635 | 1437 | 1270 | 1412 | 1295 | core_modal |
shall | 355 | 200 | 267 | 150 | 555 | 417 | 622 | 350 | core_modal |
ought | 104 | 58 | 70 | 49 | 162 | 119 | 174 | 107 | core_modal |
need | 78 | 44 | 40 | 35 | 122 | 75 | 118 | 79 | core_modal |
be_going_to | 248 | 245 | 219 | 332 | 493 | 551 | 467 | 577 | semi_modal |
be_to | 454 | 376 | 349 | 209 | 830 | 558 | 803 | 585 | semi_modal |
had_better | 50 | 37 | 41 | 34 | 87 | 75 | 91 | 71 | semi_modal |
have_got_to | 41 | 27 | 45 | 52 | 68 | 97 | 86 | 79 | semi_modal |
have_to | 757 | 825 | 627 | 643 | 1582 | 1270 | 1384 | 1468 | semi_modal |
need_to | 54 | 198 | 69 | 154 | 252 | 223 | 123 | 352 | semi_modal |
be_supposed_to | 22 | 47 | 48 | 51 | 69 | 99 | 70 | 98 | semi_modal |
used_to | 86 | 97 | 51 | 74 | 183 | 125 | 137 | 171 | semi_modal |
want_to | 357 | 2275 | 1772 | 2101 | 2632 | 3873 | 2129 | 4376 | semi_modal |
SEMI_MODALS | 2069 | 4127 | 3221 | 3650 | 6196 | 6871 | 5290 | 7777 | semi_modal |
MODALS | 14658 | 13264 | 13962 | 11927 | 27922 | 25889 | 28620 | 25191 | core_modal |
This data set is all the more tricky to summarize as it is fairly large and contains redundant information. These are aspects that CA accommodates easily (Fig. 2).
Like any other phenomenon, the realization of a language phenomenon is influenced by several factors at the same time. Multivariate exploratory methods are meant for the exploration of such phenomena.
Once operationalized by the researcher, these multiple factors are captured by means of several independent variables.
When observations of a phenomenon are captured by several variables, the analysis is multivariate.
Exploring a data set means separating meaningful trends from random noise (meaningless distributions).
In theory, exploratory data analysis is used to generate hypotheses because the linguist does not yet have any assumption as to what kinds of trends should appear in the data. In practice, however, linguists collect observations in the light of specific variables precisely because they expect that the latter influence the distribution of the former.
I use this term instead of statistics to avoid scaring people off. The truth is that these methods rely on statistics. The good news is that these are not too hard to understand.
The methods I present today are exploratory, as opposed to explanatory or predictive. They help find structure in multivariate data thanks to observation groupings, and no more!
1st consequence: the conclusions made with these methods are therefore valid for the corpus only. For example, middle-class female speakers aged 25 to 59 display a preference for the use of bloody in the British National Corpus (XML). This finding should not be extended to British English in general, or to other middle-class female speakers aged 25 to 59.
2nd consequence: The conclusions made with exploratory methods should never be considered as predictions. Of course, exploratory methods serve as the basis for the design of predictive modeling, which uses the values found in a sample to predict values for another sample.
Expanding on Gries (2006), Glynn (2014) finds that usage features and dictionary senses are correlated with dialect and register thanks to two multivariate exploratory techniques (correspondence analysis and multiple correspondence analysis). To confirm these findings, Glynn (2014) turns to logistic regression. This confirmatory multivariate technique makes it possible to specify which of the usage features and dictionary senses are significantly associated with either dialect or register, and to determine the importance of the associations.
Nowadays, many linguists jump to powerful predictive methods (such as logistic regression or discriminant analysis) without going through the trouble of exploring their data sets first. This is a shame because the point of running a multivariate exploratory analysis is to generate fine research hypotheses, which far more powerful predictive methods can only benefit from.
The challenge that underlies the visualizations obtained with dimensionality-reduction methods is the following: we seek to explore a cloud of points from a data set in the form of a \(rows \times columns\) table with as many dimensions as there are columns.
Like a complex object in real life, a data table has to be rotated so as to be observed from an optimal angle. Although the dimensions of a data table are eventually projected in a two-dimensional plane, they are not spatial dimensions. If the table has \(K\) columns, the data points are initially positioned in a space \(\mathbb{R}\) of \(K\) dimensions. To allow for easier interpretation, dimensionality-reduction methods decompose the cloud into a smaller number of meaningful planes.
All the methods covered in this chapter summarize the table by measuring how much variance there is and decomposing the variance into proportions.
All of these methods offer graphs that facilitate the interpretation of the results. Although convenient, these graphs do not replace a careful interpretation of the numeric results.
The main difference between these methods pertains to the kind of data that one works with.
Correspondence analysis (henceforth CA) is used to summarize a two-dimensional contingency table such as Tab. 2.1. The table is a matrix \(M\) of counts (i.e. integers) that consists of \(i\) individuals or observations (rows) and \(j\) variables (columns).
| | A1J | A1K | A1L |
---|---|---|---|
NN0 | 136 | 14 | 8 |
NN1 | 2236 | 254 | 263 |
NN2 | 952 | 87 | 139 |
NP0 | 723 | 117 | 71 |
Multiple correspondence analysis (henceforth MCA) takes as input a case-by-variable table such as Tab. 3.1, from Desagulier (2015). The table consists of \(i\) individuals or observations (rows) and \(j\) variables (columns).
construction | intensifier | text_mode | text_type | text_info | sem_class |
---|---|---|---|---|---|
PREDETERMINER | QUITE | SPOKEN | OTHERSP | S pub debate | psych_stim_good |
PREADJECTIVAL | QUITE | WRITTEN | FICTION | W fict prose | factual |
PREADJECTIVAL | RATHER | WRITTEN | FICTION | W fict prose | dullness |
PREADJECTIVAL | RATHER | WRITTEN | FICTION | W fict prose | atypicality_odd |
PREADJECTIVAL | QUITE | SPOKEN | OTHERSP | S meeting | importance |
PREADJECTIVAL | RATHER | WRITTEN | NONAC | W religion | difficulty_complexity |
PREADJECTIVAL | RATHER | WRITTEN | NEWS | W newsp other: social | singularity |
PREADJECTIVAL | QUITE | WRITTEN | NONAC | W biography | factual |
PREDETERMINER | QUITE | SPOKEN | OTHERSP | S lect soc science | age_young |
PREADJECTIVAL | RATHER | WRITTEN | NONAC | W nonAc: nat science | simplicity |
Historically, MCA was developed to explore the structure of surveys in which informants are asked to select an answer from a list of suggestions. For example, the question "According to you, which of these disciplines best describes the hard sciences: physics, biology, mathematics, computer science, or statistics?" requires informants to select one category.
Principal component analysis (henceforth PCA) takes as input a table of data of \(i\) individuals or observations (rows) and \(j\) variables (columns) such as Tab. 3.2, from Lacheret-Dujour et al. (2019).
corpus.sample | fPauses | fOverlaps | fFiller | fProm | fPI | fPA | subgenre | interactivity | planning.type |
---|---|---|---|---|---|---|---|---|---|
D0001 | 0.26 | 0.12 | 0.14 | 1.79 | 0.28 | 1.54 | argumentation | interactive | semi-spontaneous |
D0002 | 0.42 | 0.11 | 0.10 | 1.80 | 0.33 | 1.75 | argumentation | interactive | semi-spontaneous |
D0003 | 0.35 | 0.10 | 0.03 | 1.93 | 0.34 | 1.76 | description | semi-interactive | spontaneous |
D0004 | 0.28 | 0.11 | 0.12 | 2.29 | 0.30 | 1.79 | description | interactive | semi-spontaneous |
D0005 | 0.29 | 0.07 | 0.23 | 1.91 | 0.22 | 1.69 | description | semi-interactive | spontaneous |
D0006 | 0.47 | 0.05 | 0.26 | 1.86 | 0.44 | 1.94 | argumentation | interactive | semi-spontaneous |
The method handles continuous and nominal data. The continuous data may consist of means, reaction times, formant frequencies, etc. The categorical/nominal data are used to tag the observations.
Like PCA, exploratory factor analysis (henceforth EFA) takes as input a table of continuous data. However, it does not commonly accommodate nominal data. Typically, Tab. 3.2 minus the three columns of nominal data can serve as input for EFA (Tab. 3.3).
corpus.sample | fPauses | fOverlaps | fFiller | fProm | fPI | fPA |
---|---|---|---|---|---|---|
D0001 | 0.26 | 0.12 | 0.14 | 1.79 | 0.28 | 1.54 |
D0002 | 0.42 | 0.11 | 0.10 | 1.80 | 0.33 | 1.75 |
D0003 | 0.35 | 0.10 | 0.03 | 1.93 | 0.34 | 1.76 |
D0004 | 0.28 | 0.11 | 0.12 | 2.29 | 0.30 | 1.79 |
D0005 | 0.29 | 0.07 | 0.23 | 1.91 | 0.22 | 1.69 |
D0006 | 0.47 | 0.05 | 0.26 | 1.86 | 0.44 | 1.94 |
Multidimensional scaling (henceforth MDS) takes as input a distance matrix (a.k.a. a dissimilarity matrix). A distance matrix is similar to a driving distance table that you find in road atlases (Tab. 3.4).
| | Dallas | Miami | New York | San Francisco | Seattle |
---|---|---|---|---|---|
Dallas | 0.0 | 1312.1 | 1548.5 | 1731.4 | 2109.1 |
Miami | 1312.1 | 0.0 | 1283.3 | 3114.3 | 3296.9 |
New York | 1548.5 | 1283.3 | 0.0 | 2568.6 | 2404.6 |
San Francisco | 1731.4 | 3114.3 | 2568.6 | 0.0 | 807.0 |
Seattle | 2109.1 | 3296.9 | 2404.6 | 807.0 | 0.0 |
MDS can accommodate virtually any data table as long as you can convert it into a distance matrix.
The foundations of CA were laid out by Hirschfeld (1935) and Benzécri (1984). The method gets its name from what it aims to show:
The underlying idea is to group the rows and columns that share identical profiles (see below).
Because you investigate the relationship between your observations (i.e. the rows in your data table) and the variables that describe them (i.e. the columns), you want to determine whether rows and columns are independent. If they are, that’s bad.
To determine whether rows and columns are independent, CA relies on the \(\chi^2\) test. It tests the significance of the overall deviation of the table from the independence model. The test computes the contribution of each cell to \(\chi^2\) and sums up all contributions to obtain the \(\chi^2\) statistic.
Because we are interested in determining whether the two variables are interdependent, we formulate the hypotheses as follows:
- H0 (null hypothesis): the rows and the columns of the table are independent;
- H1 (alternative hypothesis): the rows and the columns are not independent, i.e. they are associated.
One calculates the \(\chi^{2}\) value of a cell in the \(i^{th}\) row and the \(j^{th}\) column as follows:
\[\begin{equation} \chi^2_{i,j} = \dfrac{(O_{i,j}-E_{i,j})^{2}}{E_{i,j}} \end{equation}\]
where \(O_{i,j}\) is the observed frequency for cell \(i,j\) and \(E_{i,j}\) is the expected frequency for cell \(i,j\). The \(\chi^2\) statistic of the whole table is the sum of the \(\chi^2\) values of all cells.
\[\begin{equation} \chi^2 = \sum\limits_{i,j} \dfrac{(O_{i,j}-E_{i,j})^{2}}{E_{i,j}} \end{equation}\]
All the above can be run in R in just a few lines of code.
#nouns <- read.table("https://bit.ly/3oUTJGb", header=TRUE, sep="\t", row.names=1)
chisq <- chisq.test(nouns)
chisq
##
## Pearson's Chi-squared test
##
## data: nouns
## X-squared = 29.856, df = 6, p-value = 4.186e-05
The association between the choice of a noun type and the corpus file is significant.
chisq$observed
## A1J A1K A1L
## NN0 136 14 8
## NN1 2236 254 263
## NN2 952 87 139
## NP0 723 117 71
chisq$expected
## A1J A1K A1L
## NN0 127.8852 14.9152 15.1996
## NN1 2228.2782 259.8832 264.8386
## NN2 953.4732 111.2032 113.3236
## NP0 737.3634 85.9984 87.6382
Because comparing the two matrices manually is tedious, a much quicker way to find this out involves the Pearson residuals. If a Pearson residual is positive, the corresponding observed frequency is greater than its expected frequency. If a Pearson residual is negative, the corresponding observed frequency is smaller than its expected frequency. Finally, the farther a Pearson residual is from 0, the stronger the effect.
chisq$res
## A1J A1K A1L
## NN0 0.71757562 -0.2369744 -1.8466827
## NN1 0.16358137 -0.3649426 -0.1129787
## NN2 -0.04770979 -2.2951662 2.4119814
## NP0 -0.52895225 3.3430196 -1.7772954
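Since a Pearson residual is \((O - E)/\sqrt{E}\), the same table can be recomputed by hand from the observed and expected counts; the sketch below should match chisq$res:

(chisq$observed - chisq$expected) / sqrt(chisq$expected)   # Pearson residuals computed manually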
In R, you can visualize the table of Pearson residuals by means of a Cohen-Friendly association plot with assocplot().
#nouns <- read.table("https://bit.ly/3oUTJGb", header=TRUE, sep="\t", row.names=1)
m <- as.matrix(nouns)
assocplot(t(m))
Compare with the CA plot (Fig. 2.1).
Central to CA is the concept of profile. To obtain the profile of a row, each cell is divided by its row total. Tab. 4.1 displays the row profiles of Table 2.1.
| | A1J.xml | A1K.xml | A1L.xml | row.total |
---|---|---|---|---|
NN0 | 0.8608 | 0.0886 | 0.0506 | 1 |
NN1 | 0.7837 | 0.1241 | 0.0922 | 1 |
NN2 | 0.8081 | 0.0739 | 0.1180 | 1 |
NP0 | 0.7936 | 0.1284 | 0.0779 | 1 |
column average | 0.7935 | 0.1122 | 0.0943 | 1 |
The row profiles add up to 1.
Likewise, one obtains the profile of a column by dividing each column frequency by the column total.
| | A1J.xml | A1K.xml | A1L.xml | row.average |
---|---|---|---|---|
NN0 | 0.0336 | 0.0245 | 0.0166 | 0.0310 |
NN1 | 0.5525 | 0.6189 | 0.5468 | 0.5594 |
NN2 | 0.2352 | 0.1521 | 0.2890 | 0.2310 |
NP0 | 0.1787 | 0.2045 | 0.1476 | 0.1786 |
column total | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
Again, the column profiles add up to 1.
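Both profile tables can be computed in one line each with prop.table(); a minimal sketch based on the nouns table loaded earlier:

m <- as.matrix(nouns)
round(prop.table(m, 1), 4)   # row profiles: each cell divided by its row total
round(prop.table(m, 2), 4)   # column profiles: each cell divided by its column total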
In theory, CA performs an analysis of rows and columns that is symmetric.
A column analysis consists in interpreting the column profiles using the rows as reference points on the basis of a table such as Table 4.2. For example, the value in the A1K.xml column for singular common nouns (NN1) is \(0.6189\). Comparing this value with the average proportion of NN1 in the sample (\(0.5594\)), it appears that these noun types are slightly over-represented in A1K.xml by a ratio of \(\frac{0.6189}{0.5594} \approx 1.1063\).
A row analysis consists in interpreting the row profiles using the columns as reference points on the basis of a table such as Table 4.1, in which the same cell displays a value of \(0.1241\). In other words, of all the singular common nouns that occur in the corpus files, 12.41% occur in A1K.xml. On average, A1K.xml contains 11.22% of the nouns found in the sample. The ratio is the same as above, i.e. \(\frac{0.1241}{0.1122} \approx 1.1063\)!
Distances between profiles are measured with inertia. It is with the total inertia of the table (\(\phi^2\)) that CA measures how much variance there is. \(\phi^2\) is obtained by dividing the \(\chi^2\) statistic by the sample size. CA interprets inertia geometrically to assess how far row/column profiles are from their respective average profiles. The larger \(\phi^2\), the more the data points are spread out on the map.
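For Tab. 2.1, \(\phi^2\) can be computed directly from the \(\chi^2\) test run earlier; a quick sketch:

as.numeric(chisq$statistic) / sum(nouns)   # total inertia: 29.856 / 5000, i.e. about 0.006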
Each column of the table contributes one dimension. The more columns in your table, the larger the number of dimensions. When there are many dimensions, summarizing the table becomes very difficult. To solve this problem, CA decomposes \(\phi^2\) along a few dimensions that concentrate as large a proportion of inertia as possible. These proportions of inertia are known as eigenvalues.
On top of the coordinates of the data points, two descriptors help interpret the dimensions:
- the contribution of a data point to a dimension: if a data point contributes only marginally to a given dimension, its position with respect to this dimension must not be given too much relevance;
- the quality of the projection of a data point onto a dimension (its squared cosine, cos2): the share of the point's inertia captured by that dimension. Usually, projection quality is used to select the dimension in which the individual or the variable is the most faithfully represented.
Individuals and variables can be declared as active or supplementary/illustrative. These supplementary rows and/or columns help interpret the active rows and columns. As opposed to active elements, supplementary elements do not contribute to the construction of the dimensions. Supplementary information is generally redundant. Its main function is to help interpret the results by providing relevant groupings. Whether a group of individuals or variables should be declared as active or illustrative depends on what the linguist considers primary or secondary in the exploration of the phenomenon under study.
See Tab. 2.3.
| | LOB | FLOB | BROWN | FROWN | BrEng | AmEng | X1961 | X1991 | type |
---|---|---|---|---|---|---|---|---|---|
would | 3028 | 2694 | 3053 | 2868 | 5722 | 5921 | 6081 | 5562 | core_modal |
will | 2798 | 2723 | 2702 | 2042 | 5521 | 4744 | 5500 | 4765 | core_modal |
can | 1997 | 2041 | 2193 | 2160 | 4038 | 4353 | 4190 | 4201 | core_modal |
could | 1740 | 1782 | 1776 | 1655 | 3522 | 3431 | 3516 | 3437 | core_modal |
may | 1333 | 1101 | 1298 | 878 | 2434 | 2176 | 2631 | 1979 | core_modal |
The variables LOB, FLOB, BROWN, and FROWN are active. The other variables are redundant and therefore illustrative.
Leitner (1991) reports a study by Hirschmüller (1989) who compares the distribution of complex prepositions in three corpora of English:
The Brown Corpus is a corpus of American English (Francis and Kučera 1964). The LOB Corpus is the British counterpart to the Brown Corpus (Leech, Johansson, and Hofland 1978; Leech et al. 1986). The Kolhapur Corpus is a corpus of Indian English (Shastri, Patilkulkarni, and Shastri 1986).
Complex prepositions are multiword expressions (i.e. expressions that consist of several words): ahead of, along with, apart from, such as, thanks to, together with, on account of, on behalf of, or on top of.
In Hirschmüller’s data, 81 prepositions consist of two words and 154 of three or more, out of a total of 235 complex prepositions. He observes:
Following Leitner (1991), we replicate Hirschmüller’s study based on a two-fold assumption:
Details on how the data were extracted can be found in this blog post.
The data file is stored as a data frame named prepositions in the workshop’s R environment.
str(prepositions)
## 'data.frame': 257 obs. of 19 variables:
## $ brown1 : int 16 1470 207 14 106 246 0 925 523 2 ...
## $ kolh : int 8 1249 179 11 148 111 1 873 560 0 ...
## $ lob : int 12 1492 199 16 87 234 0 770 507 0 ...
## $ adventure_western_fiction: int 3 258 34 0 0 92 0 137 114 1 ...
## $ belles_lettres : int 11 627 67 7 45 64 0 382 271 0 ...
## $ general_fiction : int 3 465 38 4 4 72 0 235 97 0 ...
## $ humour : int 1 99 3 0 4 9 0 59 24 0 ...
## $ learned_scientific : int 1 454 168 8 86 53 0 339 177 0 ...
## $ miscellaneous : int 0 204 48 7 19 11 0 136 64 0 ...
## $ mystery_detective_fiction: int 2 299 18 3 4 75 1 153 74 1 ...
## $ popular_lore : int 7 348 50 5 43 59 0 250 146 0 ...
## $ press_editorial : int 0 188 21 0 17 15 0 116 136 0 ...
## $ press_reportage : int 3 336 27 2 50 30 0 296 227 0 ...
## $ press_reviews : int 0 142 14 0 7 4 0 75 49 0 ...
## $ religion : int 0 120 16 1 27 9 0 65 53 0 ...
## $ romance_love_story : int 3 350 28 3 3 48 0 145 78 0 ...
## $ science_fiction : int 1 48 6 1 3 4 0 14 7 0 ...
## $ skills_trades_hobbies : int 1 273 47 0 29 46 0 166 73 0 ...
## $ prep.length : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 2 1 1 1 1 1 ...
head(prepositions[,c(1:4, 18:19)])
## brown1 kolh lob adventure_western_fiction skills_trades_hobbies
## aboard 16 8 12 3 1
## about 1470 1249 1492 258 273
## above 207 179 199 34 47
## absent 14 11 16 0 0
## according to 106 148 87 0 29
## across 246 111 234 92 46
## prep.length
## aboard 1
## about 1
## above 1
## absent 1
## according to 2
## across 1
CA with FactoMineR
We use the CA() function from the FactoMineR package.1 We need to load the package first.
library(FactoMineR)
At this point:
- the columns corresponding to the 15 text categories are declared as supplementary (col.sup=4:18);
- the variable prep.length is declared as a supplementary categorical variable (quali.sup=19).

ca.object <- CA(prepositions, col.sup=4:18, quali.sup=19, graph=FALSE)
By default, the CA() function produces a graph based on the first two dimensions. For the time being, these plots are not generated yet (graph=FALSE). Each graph will be plotted individually later, with specific parameters.
The output of CA is in ca.object.
ca.object
## **Results of the Correspondence Analysis (CA)**
## The row variable has 257 categories; the column variable has 3 categories
## The chi square of independence between the two variables is equal to 10053.43 (p-value = 0 ).
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$col" "results for the columns"
## 3 "$col$coord" "coord. for the columns"
## 4 "$col$cos2" "cos2 for the columns"
## 5 "$col$contrib" "contributions of the columns"
## 6 "$row" "results for the rows"
## 7 "$row$coord" "coord. for the rows"
## 8 "$row$cos2" "cos2 for the rows"
## 9 "$row$contrib" "contributions of the rows"
## 10 "$col.sup$coord" "coord. for supplementary columns"
## 11 "$col.sup$cos2" "cos2 for supplementary columns"
## 12 "$quali.sup$coord" "coord. for supplementary categorical var."
## 13 "$quali.sup$cos2" "cos2 for supplementary categorical var."
## 14 "$call" "summary called parameters"
## 15 "$call$marge.col" "weights of the columns"
## 16 "$call$marge.row" "weights of the rows"
The first lines of the output give the \(\chi^2\) score and the associated \(p\)-value. The \(\chi^2\) score is very high (10,053.43) and it is associated with the smallest possible \(p\)-value (0). The deviation of the table from independence is beyond doubt.
The eig object lets us see how many dimensions there are to inspect.
ca.object$eig
## eigenvalue percentage of variance cumulative percentage of variance
## dim 1 0.020398336 82.34156 82.34156
## dim 2 0.004374495 17.65844 100.00000
Because the input table is simple and because the number of active variables is low, there are only two dimensions to inspect. Indeed, the first two dimensions represent 100% of the variance of the table. In most other studies, however, we should expect to inspect more than two dimensions. Our decision is based on the cumulative percentage of variance. The inertia (i.e. the sum of eigenvalues) is low (0.0248). This means that there is not much variance in the table and that the tendencies that we are about to observe are subtle.
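As a quick sanity check, the total inertia can be recovered by summing the eigenvalues:

sum(ca.object$eig[,1])   # total inertia (phi^2), approximately 0.0248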
In case there are more than two dimensions to inspect, a scree plot such as Fig. 4.2 is useful.
barplot(ca.object$eig[,2], names=paste("dimension", 1:nrow(ca.object$eig)),
xlab="dimensions",
ylab="percentage of variance")
The CA graph is plotted with the plot.CA() function.
The rows are made invisible to avoid cluttering the graph with prepositions (invisible="row"). The prepositions can be plotted together with the column variables by removing invisible="row". To prevent the labels from being overplotted, autoLab is set to "yes". By setting shadow to TRUE, a background shadow facilitates reading. The font size of the labels is adjusted to 80% of their default size (cex=.8). The active column variables are in magenta (col.col="magenta") whereas the supplementary column variables are in Dodger blue (col.col.sup="dodgerblue"). Finally, a title is included (title=). Its font size is 80% of the default size (cex.main=.8).
plot.CA(ca.object,
invisible="row",
autoLab="yes",
shadow=TRUE,
cex=.8,
col.col="magenta",
col.col.sup="dodgerblue",
title="Distribution of prepositions based on lexical complexity
in three corpora:\n LOB (British English), Brown (US English),
and Kolhapur (Indian English)",
cex.main=.8)
## Warning: ggrepel: 1 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
Hirschmüller observed the following:
The CA plot reflects these tendencies (Fig. 4.3).
The first dimension (along the horizontal axis) accounts for 82.34% of the variance. It shows a clear divide between Brown and LOB (left) and Kolhapur (right). Large complex prepositions (three words or more: prep.length.3 and prep.length.4) are far more likely to occur in Indian English than in British or US English. No such preference is observed for one-word and two-word prepositions (prep.length.1 and prep.length.2). Very formal text categories cluster to the right, along with the Kolhapur corpus: learned_scientific, press_reviews, religion, and miscellaneous (governmental documents, foundation reports, industry reports, college catalogue, industry in-house publications).
The second dimension (along the vertical axis) accounts for 17.66% of the variance. It distinguishes the LOB corpus (upper part of the plot) from the Brown corpus (lower part). All in all, complex prepositions are specific to the Kolhapur Corpus, especially in formal contexts.
Leech (2002) compares the distribution of modals and semi-modals across four corpora: Brown, LOB, Frown and FLOB. The results are given in Tab. 2.3, which is repeated below.
| | LOB | FLOB | BROWN | FROWN | BrEng | AmEng | X1961 | X1991 | type |
---|---|---|---|---|---|---|---|---|---|
would | 3028 | 2694 | 3053 | 2868 | 5722 | 5921 | 6081 | 5562 | core_modal |
will | 2798 | 2723 | 2702 | 2042 | 5521 | 4744 | 5500 | 4765 | core_modal |
can | 1997 | 2041 | 2193 | 2160 | 4038 | 4353 | 4190 | 4201 | core_modal |
could | 1740 | 1782 | 1776 | 1655 | 3522 | 3431 | 3516 | 3437 | core_modal |
may | 1333 | 1101 | 1298 | 878 | 2434 | 2176 | 2631 | 1979 | core_modal |
should | 1301 | 1147 | 910 | 787 | 2448 | 1697 | 2211 | 1934 | core_modal |
must | 1147 | 814 | 1018 | 668 | 1961 | 1686 | 2165 | 1482 | core_modal |
might | 777 | 660 | 635 | 635 | 1437 | 1270 | 1412 | 1295 | core_modal |
shall | 355 | 200 | 267 | 150 | 555 | 417 | 622 | 350 | core_modal |
ought | 104 | 58 | 70 | 49 | 162 | 119 | 174 | 107 | core_modal |
need | 78 | 44 | 40 | 35 | 122 | 75 | 118 | 79 | core_modal |
be_going_to | 248 | 245 | 219 | 332 | 493 | 551 | 467 | 577 | semi_modal |
be_to | 454 | 376 | 349 | 209 | 830 | 558 | 803 | 585 | semi_modal |
had_better | 50 | 37 | 41 | 34 | 87 | 75 | 91 | 71 | semi_modal |
have_got_to | 41 | 27 | 45 | 52 | 68 | 97 | 86 | 79 | semi_modal |
have_to | 757 | 825 | 627 | 643 | 1582 | 1270 | 1384 | 1468 | semi_modal |
need_to | 54 | 198 | 69 | 154 | 252 | 223 | 123 | 352 | semi_modal |
be_supposed_to | 22 | 47 | 48 | 51 | 69 | 99 | 70 | 98 | semi_modal |
used_to | 86 | 97 | 51 | 74 | 183 | 125 | 137 | 171 | semi_modal |
want_to | 357 | 2275 | 1772 | 2101 | 2632 | 3873 | 2129 | 4376 | semi_modal |
All four corpora (LOB, FLOB, Brown, and Frown) belong to the 'Brown Family' of corpora. Here is why:
The Brown Family of corpora, which provides most of the evidence for this study, is named after the Brown Corpus (compiled at Brown University in the 1960s), which was the first of this group of corpora to be created, and was indeed the first electronic corpus of the English language. The Brown Corpus contains just over a million word tokens of published American English (AmE) textual material, and was joined in the 1970s by a matching corpus of British English (BrE), known as the LOB [Lancaster-Oslo/Bergen] Corpus. Both corpora consist of text samples first published in the year 1961. Later, in the 1990s, two new corpora were compiled at the University of Freiburg, matching Brown and LOB in all manageable details, and were therefore called the Freiburg-Brown [Frown] Corpus and the Freiburg-LOB [FLOB] Corpus. These were compiled according to precisely the same corpus design as Brown and LOB, except that they were sampled from texts first published in 1991 and 1992-3 respectively. – (Leech 2013)
These four corpora are comparable because they are equivalently subdivided into 15 genre categories. Furthermore, within each genre, the make-up of the sample is matched in sub-corpus groupings down to one-to-one equivalence of individual text samples of 2000 words.
What trends do you observe with respect to the distribution of modals and semi-modals between 1961 and 1991 in BrE and AmE?
The data set is preloaded in your R environment under the following name: leech.
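If you want to try it yourself, a CA call modeled on the prepositions example might look like the sketch below; it assumes the column order of Tab. 2.3, with the redundant totals (BrEng to X1991) as supplementary columns and type as a supplementary categorical variable:

library(FactoMineR)
# columns 5:8 (BrEng, AmEng, X1961, X1991) merely sum up the four corpora -> supplementary
# column 9 (type) is a supplementary categorical variable
ca.leech <- CA(leech, col.sup=5:8, quali.sup=9, graph=FALSE)
round(ca.leech$eig, 2)
plot.CA(ca.leech, col.col="magenta", col.col.sup="dodgerblue")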
Because MCA is an extension of CA, its inner workings are very similar. For this reason, they are not repeated here.
As pointed out above, MCA takes as input a table of nominal data. For this method to yield manageable results, it is best if the table is of reasonable size (not too many columns), and if each variable does not break down into too many categories. Otherwise, the contribution of each dimension to \(\phi^2\) is small, and a large number of dimensions must be inspected.
There are no hard and fast rules for knowing when there are too many dimensions to inspect. However, when the eigenvalue that corresponds to a dimension is low, we know that the dimension is of little interest (the chances are that the data points will be close to the intersection of the axes in the summary plot).
Schmid (2003) provides an analysis of sex differences in the 10M-word spoken section of the British National Corpus (BNC). Schmid shows that women use certain swear-words more than men, although swear-words which tend to have a perceived 'strong' effect are more frequent in male speech. Schmid's study is based on two subcorpora, which are both sampled from the spoken section of the BNC. The subcorpora amount to 8,173,608 words. The contributions are not equally shared between men and women: for every 100 words spoken by women, 151 are spoken by men. With respect to swear-words, Schmid's conclusion is that both men and women swear, but men tend to use stronger swear-words than women.
Schmid’s study is repeated here in order to explore the distribution of swear-words with respect to gender in the BNC-XML. The goal is to see if:
The data file for this case study is swearwords.2
Unlike Schmid, and following Rayson, Leech, and Hodges (1997), the data are extracted from the demographic component of the BNC-XML, which consists of spontaneous interactive discourse.
The swear-words are: bloody, damn, fuck, fucked, fucker, fucking, gosh, and shit.
Two explanatory variables are included in addition to gender: age and social class.3
The data set contains 293,289 swear-words. These words are described by three categorical variables (nominal data):
Age breaks down into 6 groups:
- Ag0: respondent age between 0 and 14;
- Ag1: respondent age between 15 and 24;
- Ag2: respondent age between 25 and 34;
- Ag3: respondent age between 35 and 44;
- Ag4: respondent age between 45 and 59;
- Ag5: respondent age is 60+.

Social classes are divided into 4 groups:
- AB: higher management: administrative or professional;
- C1: lower management: supervisory or clerical;
- C2: skilled manual;
- DE: semi-skilled or unskilled.

As we inspect the structure of the data frame with str(), it is advisable to keep an eye on the number of levels for each variable and see if any can be kept to a minimum to guarantee that inertia will not drop.
str(swearwords)
## 'data.frame': 293289 obs. of 4 variables:
## $ word : Factor w/ 8 levels "bloody","damn",..: 2 2 7 7 7 2 7 2 7 7 ...
## $ gender : Factor w/ 2 levels "f","m": 2 2 2 2 2 2 2 2 2 2 ...
## $ age : Factor w/ 6 levels "Ag0","Ag1","Ag2",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ soc_class: Factor w/ 4 levels "AB","C1","C2",..: 1 1 1 1 1 1 1 1 1 1 ...
The variable word has eight levels, not all of which are evenly distributed.
table(swearwords$word)
##
## bloody damn fuck fucked fucker fucking gosh shit
## 146203 32294 9219 11 467 23487 60678 20930
We can group fuck, fucking, fucked, and fucker into a single category: f-words. With gsub(), we replace each of these words with the single tag f-words.
swearwords$word <- gsub("fuck|fucking|fucker|fucked", "f-words", swearwords$word, ignore.case=TRUE)
table(swearwords$word)
##
## bloody damn f-words gosh shit
## 146203 32294 33184 60678 20930
The number of levels has been reduced to five. We convert swearwords$word back to a factor.
swearwords$word <- as.factor(swearwords$word)
As in CA, we can declare some variables as active and some other variables as supplementary/illustrative in MCA. We declare the variables corresponding to swear-words and gender as active, and the variables age and social class as supplementary/illustrative.
MCA with FactoMineR
We run the MCA with the MCA() function. We declare age and soc_class as supplementary (quali.sup=c(3,4)). We do not plot the graph yet (graph=FALSE).
mca.object <- MCA(swearwords, quali.sup=c(3,4), graph=FALSE)
Again, the eig object allows us to see how many dimensions there are to inspect.
round(mca.object$eig, 2)
## eigenvalue percentage of variance cumulative percentage of variance
## dim 1 0.56 22.47 22.47
## dim 2 0.50 20.00 42.47
## dim 3 0.50 20.00 62.47
## dim 4 0.50 20.00 82.47
## dim 5 0.44 17.53 100.00
The number of dimensions is rather large and the first two dimensions account for only 42.47% of \(\phi^2\). To inspect a significant share of \(\phi^2\), e.g. 80%, we would have to inspect at least 4 dimensions. This issue is common in MCA.
The eigenvalues can be visualized by means of a scree plot (Fig. 5.1). It is obtained as follows.
barplot(mca.object$eig[,1], names.arg=paste("dim ", 1:nrow(mca.object$eig)), las=2)
Ideally, we would want to see a sharp decrease after the first few dimensions, and we would want these first few dimensions to account for as much share of \(\phi^2\) as possible. Here, no sharp decrease is observed.
The MCA plot is made with the plot.MCA() function. Each category is given the color of its variable (habillage="quali"). The title is removed (title="").
plot.MCA(mca.object,
invisible="ind",
autoLab="yes",
shadowtext=TRUE,
habillage="quali",
title="")
In the MCA biplot (Fig. 5.2), each category is the color of its variable. Let us focus first on the first dimension (the horizontal axis) and ignore the second dimension (the vertical axis). Strikingly, the most explicit swear-words (f-words) cluster in the rightmost part of the plot. These are used mostly by men. Female speakers tend to prefer a softer swear word: bloody.
Next, we focus on the second dimension and ignore the first. Words in the upper part (gosh and shit) are used primarily by upper-class speakers. F-words, bloody, and damn are used by lower social categories. Age groups are positioned close to the intersection of the axes. This is a sign that the first two dimensions bring little or no information about them.
Combining the two dimensions, the plot is divided into four corners in which we observe three distinct clusters:
- gosh and shit, used by male and female upper-class speakers;
- bloody, used by female middle-class speakers;
- f-words and damn, used by male lower-class speakers.

A divide exists between male (m, right) and female (f, left) speakers. However, as the combined eigenvalues indicate, we should be wary of making final conclusions based on the sole inspection of the first two dimensions.
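If you want a numeric handle on what each dimension captures beyond the first plane, FactoMineR's dimdesc() function describes the dimensions of an MCA; a possible sketch:

# which variables and categories are most strongly associated with dimensions 1 to 4
dimdesc(mca.object, axes=1:4)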
The relevance of age groups becomes clearer when dimensions 3 and 4 are inspected together (Fig. 5.3). To do so, the argument axes=c(3,4) is added to the plot.MCA() call.
plot.MCA(mca.object,
axes=c(3,4),
invisible="ind",
autoLab="yes",
shadowtext=TRUE,
habillage="quali",
title="")
## Warning: ggrepel: 2 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
With respect to dimensions 3 and 4, the male/female distinction disappears (both variables overlap where the two axes intersect). A divide is observed between f-words and bloody (left), used mostly by younger and middle-aged speakers, and gosh and damn (right), used mostly by upper-class speakers from age groups 3 and 5. The most striking feature is the outstanding position of shit in the upper-left corner. Although it is favored by male and female upper-class speakers, it is also used, to a lesser degree, by much younger speakers from lower social classes.
As in CA and MCA, the total variance of the table is decomposed into proportions in principal component analysis (henceforth PCA). There is one minor terminological difference: the dimensions are called principal components. For each component, the proportion of variance is obtained by dividing the squared standard deviation by the sum of the squared standard deviations.
As exemplified in this chapter, PCA is based on the inspection of correlations between the variables and the principal components.
Before one runs a PCA, one should consider standardizing (i.e. centering and scaling) the variables. If a table contains measurements in different units, standardizing the variables is compulsory. If a table contains measurements in the same unit, standardizing the variables is optional. However, even in this case, failing to standardize means giving each variable a weight proportional to its variance. Standardizing the variables guarantees that equal weights are attributed to the variables (Husson, Lê, and Pagès 2010).
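In FactoMineR, this choice is controlled by the scale.unit argument of PCA(), which is TRUE by default; a minimal sketch, where df stands for any all-numeric data frame (a placeholder, not one of the preloaded datasets):

library(FactoMineR)
# scale.unit=TRUE (the default) centers and scales the variables before the analysis
res.pca <- PCA(df, scale.unit=TRUE, graph=FALSE)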
A second kind of PCA is based on loadings (Baayen 2008, Sect. 5.1.1). Loadings are correlations between the original variables and the unit-scaled principal components. The two kinds of PCA are similar: both are meant to normalize the coordinates of the data points. The variant exemplified in this chapter is more flexible because it allows for the introduction of supplementary variables.
Gréa (2017) compares five prepositions that denote inclusion in French: au centre de, au coeur de, au milieu de, au sein de, and parmi.
To determine the semantic profile of each preposition, Gréa examines their preferred and dispreferred nominal collocates. He uses an association measure known as calcul des spécificités which is based on the hypergeometric distribution. A positive value indicates that the word is over-represented in the construction. The higher the value, the more the word is over-represented. A negative value indicates that the word is under-represented in the construction. The smaller the value, the more the word is under-represented.
To compare the semantic profiles of the prepositions, the preferred and dispreferred nominal collocates of the prepositions are examined in the FrWaC corpus. The goal is to summarize the table graphically instead of interpreting the data table directly.
The data file for this case study is inclusion.
As we inspect the data frame with str(), we see that 22,397 NPs were found. The rows contain the nominal collocates and the columns the prepositions. The cells contain the association scores. The assumption is that the semantic profiles of the prepositions will emerge from the patterns of attraction/repulsion.
str(inclusion)
## 'data.frame': 22397 obs. of 5 variables:
## $ centre: num -281 -274 -219 -128 -114 ...
## $ coeur : num -651 -913 -933 -368 -330 ...
## $ milieu: num -545 -432 -270 -237 -173 ...
## $ parmi : num -1685 -1129 -1072 -678 -499 ...
## $ sein : num 4226 3830 3490 1913 1522 ...
head(inclusion)
## centre coeur milieu parmi sein
## entreprise -281.14 -651.39 -545.23 -1685.26 4226.02
## équipe -274.36 -913.40 -432.42 -1129.20 3829.95
## groupe -218.55 -932.95 -269.59 -1071.62 3490.06
## établissement -128.13 -367.62 -236.78 -677.87 1913.47
## association -114.08 -330.25 -172.95 -498.83 1522.44
## service -103.45 -279.61 -181.29 -495.38 1443.34
As in CA and MCA, we can declare some variables as active and some other variables as supplementary/illustrative in PCA. Here, however, we decide to declare all variables as active.
PCA with FactoMineR
We load the FactoMineR package and run the PCA with the PCA() function.4 The table contains measurements in the same unit. Standardizing them avoids giving each variable a weight proportional to its variance: perhaps some prepositions attract nouns more strongly than others. The variables are standardized by default.
library(FactoMineR)
pca.object <- PCA(inclusion, graph=F)
We make sure that the first two components are representative.5
round(pca.object$eig, 2)
## eigenvalue percentage of variance cumulative percentage of variance
## comp 1 2.02 40.32 40.32
## comp 2 1.37 27.42 67.74
## comp 3 1.04 20.79 88.52
## comp 4 0.51 10.14 98.67
## comp 5 0.07 1.33 100.00
These eigenvalues are plotted in Fig. 6.1.
barplot(pca.object$eig[,1], names.arg=paste("comp ", 1:nrow(pca.object$eig)), las=2)
In PCA, the variables and the individuals and categories are plotted separately. The graph of variables serves as a guide to interpret the graph of individuals and categories.
In the graph of variables, each variable is represented as an arrow. The circle is known as the circle of correlations. The closer the end of an arrow is to the circle (and the farther it is from where the axes intersect at the center of the graph), the better the corresponding variable is captured by the two components, and the more important the components are with respect to this variable.
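The numbers behind the circle of correlations can also be inspected directly: pca.object$var$cor stores the correlations between the variables and the components. A short sketch:

round(pca.object$var$cor[, 1:2], 2)   # correlations between the prepositions and components 1-2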
We plot the graph of variables (Fig. 6.2) and the graph of individuals (Fig. 6.3).
# graph of variables
plot.PCA(pca.object, choix="var", title="")
# graph of individuals
plot.PCA(pca.object, cex=0.8, autoLab="auto", shadowtext = FALSE, title="")
Three main profiles stand out:
- au sein de (upper left corner);
- au centre de and au coeur de (upper right corner);
- au milieu de and parmi (lower right corner).

The affinities between au centre de and au coeur de on the one hand and au milieu de and parmi on the other are due to similar collocational behaviors. Au sein de is the odd one out.
Admittedly, the graph of individuals is cluttered. This is due to the very large number of NP types that cooccur with the prepositions. Most NPs cluster around where the two axes intersect, a sign that their distribution is of little interest, at least with respect to our understanding of the prepositions. More interesting are those NPs that appear in the margins of the plot.
We can filter out unwanted individuals by selecting only the desired ones with the select argument of the plot.PCA() function.
Here is what the title of each plot means:
- with select="coord 20", only the labels of the twenty individuals that have the most extreme coordinates on the chosen dimensions are plotted;
- with select="contrib 20", only the labels of the twenty individuals that have the highest contributions on the chosen dimensions are plotted;6
- with select="cos2 5", only the labels of the five individuals that have the highest squared-cosine scores on the chosen dimensions are plotted;7
- with select="dist 20", only the labels of the twenty individuals that are the farthest from the center of gravity of the cloud of data points are plotted.

plot.PCA(pca.object, select="coord 20", title="select=\"coord 20\"")
## Warning: ggrepel: 1 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
plot.PCA(pca.object, select="contrib 20", title="select=\"contrib 20\"")
## Warning: ggrepel: 1 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
plot.PCA(pca.object, select="cos2 5", title="select=\"cos2 5\"")
## Warning: ggrepel: 5 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
plot.PCA(pca.object, select="dist 20", title="select=\"dist 20\"")
## Warning: ggrepel: 2 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
Clearer trends emerge:
The graph displaying the first two components does a good job at grouping prepositions based on the nominal collocates that they have in common and revealing consistent semantic trends.
However, it does not show what distinguishes each preposition. For example, au centre du conflit ‘at the center of the conflict’ profiles a participant that is either the instigator of the conflict or what is at stake in the conflict. In contrast, au coeur du conflit ‘at the heart of the conflict’ denotes the peak of the conflict, either spatially or temporally.
This issue has nothing to do with the PCA. It has to do with the kind of collocational approach exemplified in the paper, which does not aim to (and is not geared to) reveal fine-grained semantic differences by itself.
Exploratory factor analysis (henceforth EFA) was made popular in linguistics by Biber’s studies on register variation as part of the multidimensional (MD) approach (Biber 1988, 1995). The goal of the MD approach is to detect register differences across the texts and text varieties of a corpus based on groups of linguistic features that co-occur significantly. Technically, this approach starts with a large number of linguistic variables and relies on factor analysis to reduce this number to a few basic functional dimensions that account for differences between texts.
Although close to PCA, EFA differs in the following respect: the number of relevant components, which are called factors, is not determined automatically. It must be chosen beforehand.
EFA is designed to identify patterns of joint variation in a number of observed variables. It looks for variables that are highly correlated with a group of other variables. These intercorrelated variables are assumed to measure one underlying variable. This variable, which is not directly observed, but inferred, is latent.
It is known as a factor. This is an aspect that PCA is not designed to show. One added value of EFA is that "an error term is added to the model in order to do justice to the possibility that there is noise in the data" (Baayen 2008, 127).
The same data set used for PCA serves as input for EFA, namely inclusion.
EFA with factanal()
EFA is performed with factanal(). Because Figs. 6.2 and 6.3 suggest that three principal components are worth investigating, we are tempted to specify 3 factors. Unfortunately, this is not going to work: 3 factors are too many for 5 variables in the kind of EFA that factanal() performs.8 Therefore, we set the number of required factors to 2.
fa.object <- factanal(inclusion, factors=2)
fa.object
##
## Call:
## factanal(x = inclusion, factors = 2)
##
## Uniquenesses:
## centre coeur milieu parmi sein
## 0.655 0.436 0.849 0.005 0.005
##
## Loadings:
## Factor1 Factor2
## centre 0.587
## coeur 0.750
## milieu 0.389
## parmi -0.147 0.987
## sein -0.740 -0.669
##
## Factor1 Factor2
## SS loadings 1.626 1.424
## Proportion Var 0.325 0.285
## Cumulative Var 0.325 0.610
##
## Test of the hypothesis that 2 factors are sufficient.
## The chi square statistic is 12667.73 on 1 degree of freedom.
## The p-value is 0
A \(\chi^2\) test reports whether the specified number of factors is sufficient. If the \(p\)-value is smaller than 0.05, more factors are needed. If it is greater than \(0.05\), no more factors are needed. The test reports that the \(\chi^2\) statistic is 12,667.73 on 1 degree of freedom and that the \(p\)-value is 0. Although we feel a third factor is required (from our PCA), we have no choice but to stick to 2 factors and trust that factanal() will do a pretty good job at providing a satisfactory factor model. We shall see why below.
The output displays:
Factor loadings are the weights and correlations between the variables and the factors. The higher the loading the more relevant the variable is in explaining the dimensionality of the factor. If the value is negative, it is because the variable has an inverse impact on the factor.
Au milieu de, au centre de, and au coeur de define the first factor. Parmi defines the second factor. It seems that au sein de defines both.
The proportions of variance explained by the factors (i.e. eigenvalues) are listed under the factor loadings. A factor is considered worth keeping if the corresponding SS loading (i.e. the sum of squared loadings) is greater than 1. Two factors are retained because both have eigenvalues over 1.
Factor 1 accounts for 32.5% of the variance. Factor 2 accounts for 28.5% of the variance. Together, the two factors account for 61% of the variance.
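These figures can be recomputed from the loadings themselves: the SS loadings are the column sums of squared loadings, and dividing them by the number of variables (here 5) gives the proportions of variance. A quick check:

ss <- colSums(loadings(fa.object)^2)   # SS loadings (1.626 and 1.424)
round(ss / ncol(inclusion), 3)         # proportions of variance (0.325 and 0.285)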
In EFA, rotation is a procedure meant to clarify the relationship between variables and factors. As its name indicates, it rotates the factors to align them better with the variables. The two most frequent rotation methods are varimax and promax.
With varimax, the factor axes are rotated in such a way that they are still perpendicular to each other. The factors are uncorrelated and the production of 1s and 0s in the factor matrix is maximized.
With promax, the factor axes are rotated in an oblique way. The factors are correlated. The resulting model provides a closer fit to the data than with varimax.
In either case, the goal is to arrive at a few common meaningful factors. Rotation is optional as it does not modify the relationship between the factors and the variables. Figure 7.1 is a plot of the loadings of the prepositions on the two factors with varimax rotation. Figure 7.2 is the same plot of the loadings with promax rotation.
By default, varimax rotation applies.
loadings <- loadings(fa.object)
plot(loadings, type="n", xlim=c(-1,1))
text(loadings, rownames(loadings))
To produce a plot with promax rotation, we run factanal() again but set rotation to "promax".
fa.object2 <- factanal(inclusion, factors=2, rotation="promax")
loadings2 <- loadings(fa.object2)
plot(loadings2, type="n", xlim=c(-1,1))
text(loadings2, rownames(loadings2))
The distinctive profiles we obtain with EFA are similar to those we obtained with PCA. The only major difference is the proximity of au milieu de with au centre de and au coeur de. This may be due to the fact that only two factors are retained in the analysis. As far as this data set is concerned, PCA is, I believe, clearly a better alternative. For a data set with a much larger number of variables, EFA works generally better.
MDS is very popular because it is relatively old, versatile, and easy to understand and implement. It is a multivariate data analysis approach that is used to visualize distances in multidimensional maps (in general: two-dimensional plots).
MDS comes in different flavors:
We focus on classical multidimensional scaling (MDS), which is also known as principal coordinates analysis (Gower 1966). MDS takes as input a matrix of dissimilarities and returns a set of points such that the distances between the points are approximately equal to the dissimilarities. A strong selling point of MDS is that, given the \(n\) dimensions of a table, it returns an optimal solution to represent the data in a space whose dimensions are (much) lower than \(n\).
The data is from Desagulier (2014). The data set was compiled to see how 23 English intensifiers cluster on the basis of their most associated adjectives. For each of the 23 adverbs, I first extracted all adjectival collocates from the Corpus of Contemporary American English (Davies 2008–2012), amounting to 432 adjective types and 316,159 co-occurrence tokens. Then, I conducted a collexeme analysis for each of the 23 degree modifiers. To reduce the data set to manageable proportions, the 35 most attracted adjectives were selected on the basis of their respective collostruction strengths, yielding a 23-by-432 contingency table containing the frequency of adverb-adjective pair types.
The contingency table is preloaded in your R environment under the name intensifiers.
str(intensifiers)
## 'data.frame': 23 obs. of 432 variables:
## $ absent : int 0 0 0 0 0 40 43 0 0 0 ...
## $ absurd : int 0 0 42 0 0 0 0 0 0 0 ...
## $ acceptable : int 0 0 0 0 0 0 0 0 0 0 ...
## $ accurate : int 0 0 0 0 0 0 60 0 83 0 ...
## $ acidic : int 0 0 0 0 0 0 0 0 0 0 ...
## $ afraid : int 0 141 0 117 0 0 0 0 0 0 ...
## $ ajar : int 0 0 0 0 0 0 0 0 0 0 ...
## $ akin : int 0 0 0 0 0 0 0 0 0 0 ...
## $ alien : int 0 0 0 0 0 0 0 0 0 0 ...
## $ alone : int 0 0 0 0 0 74 0 0 0 0 ...
## $ amazing : int 0 0 107 0 0 0 0 0 0 0 ...
## $ ambiguous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ambitious : int 0 0 0 0 0 0 0 0 0 0 ...
## $ american : int 0 0 0 0 0 0 0 0 0 0 ...
## $ amusing : int 0 0 0 0 0 0 0 0 0 1 ...
## $ anxious : int 0 0 0 0 0 0 0 0 0 0 ...
## $ apprehensive : int 21 0 0 0 0 0 0 0 0 0 ...
## $ appropriate : int 0 0 0 0 0 0 121 0 0 0 ...
## $ arbitrary : int 0 0 0 0 0 0 0 0 0 0 ...
## $ askew : int 0 0 0 0 0 0 0 0 0 0 ...
## $ avant-garde : int 0 0 0 0 0 0 0 0 0 1 ...
## $ awesome : int 0 0 0 0 0 0 0 0 0 0 ...
## $ awkward : int 49 112 0 0 0 0 0 0 0 0 ...
## $ bad : int 0 0 0 0 18 0 0 0 0 0 ...
## $ basic : int 0 0 0 0 0 0 0 0 0 0 ...
## $ bearded : int 0 0 0 0 0 0 0 0 0 0 ...
## $ beautiful : int 0 0 145 0 0 0 0 0 0 0 ...
## $ benign : int 0 0 0 0 0 0 20 0 0 0 ...
## $ better : int 0 0 0 0 0 0 0 0 0 0 ...
## $ big : int 0 0 0 0 58 0 0 0 0 0 ...
## $ bigger : int 44 152 0 0 0 0 0 0 0 0 ...
## $ bitter : int 0 0 0 0 0 0 0 0 0 0 ...
## $ bizarre : int 0 0 0 0 0 0 0 0 0 0 ...
## $ bland : int 0 0 0 0 0 0 0 0 0 0 ...
## $ blank : int 0 0 0 0 0 0 0 0 0 0 ...
## $ blind : int 0 0 0 0 0 0 0 0 0 0 ...
## $ boring : int 0 0 0 0 0 0 0 0 0 1 ...
## $ bright : int 0 0 0 0 0 0 0 85 0 0 ...
## $ brilliant : int 0 0 57 0 0 0 0 0 0 0 ...
## $ brown : int 0 0 0 0 0 0 0 0 0 0 ...
## $ busy : int 0 0 0 0 17 0 0 0 0 2 ...
## $ capable : int 0 0 0 0 0 0 0 0 0 0 ...
## $ careful : int 0 0 0 0 0 0 0 111 0 0 ...
## $ certain : int 0 0 322 633 0 0 0 0 201 0 ...
## $ charming : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cheerful : int 0 0 0 0 0 0 0 0 0 1 ...
## $ chewy : int 0 0 0 0 0 0 0 0 0 0 ...
## $ chilly : int 26 0 0 0 0 0 0 0 0 0 ...
## $ clear : int 0 0 204 0 0 77 216 0 139 0 ...
## $ clever : int 0 0 0 0 0 0 0 0 0 1 ...
## $ close : int 0 0 0 0 17 0 0 0 82 2 ...
## $ cold : int 0 0 0 0 11 0 0 0 0 2 ...
## $ comfortable : int 0 0 0 0 0 94 80 0 0 0 ...
## $ comical : int 0 0 0 71 0 0 0 0 0 0 ...
## $ common : int 0 0 0 0 0 0 0 0 278 0 ...
## $ compatible : int 0 0 0 0 0 0 20 0 0 0 ...
## $ competitive : int 0 0 0 0 0 0 0 83 0 0 ...
## $ complete : int 0 0 0 238 0 0 0 0 0 0 ...
## $ complex : int 0 0 0 0 0 0 0 129 0 0 ...
## $ complicated : int 0 0 0 0 0 0 0 73 0 0 ...
## $ concerned : int 0 195 0 0 0 0 0 91 0 0 ...
## $ confident : int 0 0 57 0 0 0 0 0 78 0 ...
## $ confusing : int 0 0 0 0 0 0 0 0 0 0 ...
## $ conservative : int 0 0 0 0 0 0 0 0 65 0 ...
## $ consistent : int 0 0 0 0 0 0 95 0 130 0 ...
## $ constant : int 0 0 0 126 0 0 0 0 117 0 ...
## $ content : int 0 0 0 0 0 0 0 0 0 0 ...
## $ contradictory : int 0 0 0 0 0 0 0 0 0 0 ...
## $ controversial : int 0 0 0 0 0 0 0 0 0 0 ...
## $ convenient : int 0 0 0 0 6 0 0 0 0 0 ...
## $ convinced : int 0 0 203 0 0 0 41 0 0 0 ...
## $ convincing : int 0 0 0 0 0 0 22 0 0 0 ...
## $ cool : int 0 0 0 0 0 0 0 0 0 0 ...
## $ correct : int 0 0 339 0 0 0 33 0 0 0 ...
## $ crazy : int 0 289 75 0 0 62 0 0 0 0 ...
## $ creaky : int 0 0 0 0 0 0 0 0 0 1 ...
## $ critical : int 0 0 171 0 0 0 0 0 0 0 ...
## $ crockery-smashing: int 0 0 0 0 0 0 0 0 0 0 ...
## $ crooked : int 0 0 0 0 0 0 0 0 0 0 ...
## $ crucial : int 0 0 84 0 0 0 0 0 0 0 ...
## $ curved : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cute : int 0 0 0 0 16 0 0 0 0 0 ...
## $ daily : int 0 0 0 358 0 0 0 0 0 0 ...
## $ dainty : int 0 0 0 0 0 0 0 0 0 1 ...
## $ damp : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dangerous : int 0 0 0 0 0 0 0 282 0 0 ...
## $ dark : int 0 0 0 0 0 122 0 0 0 0 ...
## $ darker : int 0 0 0 0 0 0 0 0 0 0 ...
## $ decent : int 0 0 0 0 0 0 0 0 70 0 ...
## $ delicious : int 0 0 60 0 0 0 0 0 0 0 ...
## $ dependent : int 0 0 0 0 0 100 90 0 0 0 ...
## $ desirable : int 0 0 0 0 0 0 0 0 0 0 ...
## $ destructive : int 0 0 0 0 0 0 0 0 0 0 ...
## $ developed : int 0 0 0 0 0 0 0 0 0 0 ...
## $ devoid : int 0 0 0 0 0 43 22 0 0 0 ...
## $ devoted : int 0 0 0 0 0 0 29 0 0 0 ...
## $ different : int 167 930 0 0 0 1844 1394 0 0 0 ...
## $ difficult : int 52 153 0 0 31 0 0 1138 0 4 ...
## $ dirty : int 0 0 0 0 0 0 0 0 0 0 ...
## [list output truncated]
The contingency table must be converted into a distance object. Technically, this distance object is a dissimilarity matrix. Because the matrix is symmetric, it is divided into two triangles on either side of the diagonal, which contains the null distances between each element and itself. Only one triangle is needed.
You obtain the dissimilarity matrix by converting the contingency table into a table of distances with a user-defined distance measure. When the variables are ratio-scaled, you can choose from several distance measures: Euclidean, City-Block/Manhattan, correlation, Pearson, Canberra, etc. In my experience, the Canberra distance metric handles best the relatively large number of zero counts that we typically obtain in linguistic data (i.e. when the matrix is sparse).
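To get an intuition for why the choice of metric matters with sparse count data, here is a minimal, hypothetical illustration (the two toy profiles and their counts are invented for the example): the Euclidean distance is dominated by the few large counts, whereas the Canberra distance weighs each coordinate relative to its magnitude and skips cells that are empty in both rows.
toy <- rbind(adv1 = c(10, 0, 0, 2, 0),
             adv2 = c(0, 8, 0, 0, 1))
dist(toy, method = "euclidean") # driven mostly by the two largest counts
dist(toy, method = "canberra")  # each non-empty coordinate contributes at most 1 to the sum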
We use the dist() function, whose main arguments are the following:
method: the distance measure to be used (here, method="canberra");
diag: lets you decide if you want R to print the diagonal of the distance object;
upper: lets you decide if you want R to print the upper triangle of the distance object.
dist.object <- dist(intensifiers, method="canberra", diag=T, upper=T)
The distance object is quite large. To see a snapshot, enter the following.
dist.matrix <- as.matrix(dist.object)
dist.matrix[1:5, 1:5]
## a_bit a_little absolutely almost awfully
## a_bit 0.0000 372.1788 432.0000 432.0000 418.8591
## a_little 372.1788 0.0000 429.4200 426.3215 422.8858
## absolutely 432.0000 429.4200 0.0000 421.9386 432.0000
## almost 432.0000 426.3215 421.9386 0.0000 432.0000
## awfully 418.8591 422.8858 432.0000 432.0000 0.0000
The diagonal of 0 values separates the upper and lower triangles, as expected from a distance matrix.
cmdscale()
The distance matrix serves as input to the base-R cmdscale() function, which performs a ‘vanilla’ version of MDS. We specify k=2, meaning that the maximum dimension of the space in which the data are to be represented is 2.
mds <- cmdscale(dist.matrix, eig=TRUE, k=2)
mds
## $points
## [,1] [,2]
## a_bit -93.23707 -159.5334193
## a_little -66.06634 -110.6741767
## absolutely 74.93872 19.7177119
## almost -10.48677 2.9320651
## awfully -57.58458 0.9245004
## completely 220.59916 -36.6649541
## entirely 139.16294 20.9142554
## extremely -59.98183 73.9680050
## fairly -30.38867 144.6343031
## frightfully -19.37836 -12.4447681
## highly -10.37726 9.4095014
## jolly -18.25316 -9.2906600
## most -32.58424 37.3181351
## perfectly 25.93391 62.2717594
## pretty -47.59436 121.8525685
## quite -39.16428 133.6428904
## rather -75.83882 14.5469112
## slightly -41.59012 -122.0587482
## somewhat -74.01797 -130.4787887
## terribly -55.30971 -54.0548218
## totally 219.14279 -46.3793237
## utterly 92.51536 -30.9168073
## very -40.43937 70.3638610
##
## $eig
## [1] 1.752381e+05 1.476932e+05 1.251985e+05 1.089659e+05 1.030146e+05
## [6] 9.859947e+04 9.523649e+04 9.326120e+04 9.084104e+04 8.840678e+04
## [11] 8.735805e+04 8.450954e+04 8.322147e+04 7.973600e+04 7.619331e+04
## [16] 7.468265e+04 7.339436e+04 7.062472e+04 6.141132e+04 5.960298e+04
## [21] 5.534600e+04 3.012413e+04 1.455192e-10
##
## $x
## NULL
##
## $ac
## [1] 0
##
## $GOF
## [1] 0.1645376 0.1645376
The result is a matrix with 2 columns and 23 rows (mds$points). The function has done a good job of outputting the coordinates of the intensifiers in the reduced two-dimensional space that we requested. Note that cmdscale() returns the best-fitting \(k\)-dimensional representation, where \(k\) may be smaller than the value supplied to the argument k.
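As a quick adequacy check, you can relate the first two eigenvalues to the sum of all eigenvalues to see how much of the total variation the two retained dimensions capture; this is essentially what the $GOF component of the output reports.
round(sum(mds$eig[1:2]) / sum(abs(mds$eig)), 3) # proportion of variation captured by Dim.1 and Dim.2
mds$GOF
Here the figure is modest (about 16%), which is not unusual when a table with several hundred columns is squeezed into two dimensions.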
To plot the results, we first retrieve the coordinates for the two dimensions (x and y).
x <- mds$points[,1]
y <- mds$points[,2]
Second, we plot the two axes and add information about the intensifiers (Fig. 8.1).
plot(x, y, xlab="Dim.1", ylab="Dim.2", type="n")
text(x, y, labels = row.names(intensifiers), cex=.7)
The question we are addressing is whether these dimensions reflect differences in the semantics of the intensifiers. Existing typologies of intensifiers tend to group them into maximizers, boosters, moderators, and diminishers.
Maximizers and boosters stretch horizontally across the middle of the plot. Moderators are in the upper left corner, and diminishers in the lower left corner. Note the surprising position of almost.
We can improve the MDS plot in Fig. 8.1 by grouping and coloring the individuals by means of \(k\)-means clustering. \(k\)-means clustering partitions the data points into \(k\) classes, each point being assigned to the class with the nearest mean.
We download and load one extra package.
install.packages("ggpubr")
library(ggpubr)
## Loading required package: ggplot2
We convert the coordinates obtained above into a data frame.
mds.df <- as.data.frame(mds$points) # convert the coordinates
colnames(mds.df) <- c("Dim.1", "Dim.2") # assign column names
mds.df # inspect
## Dim.1 Dim.2
## a_bit -93.23707 -159.5334193
## a_little -66.06634 -110.6741767
## absolutely 74.93872 19.7177119
## almost -10.48677 2.9320651
## awfully -57.58458 0.9245004
## completely 220.59916 -36.6649541
## entirely 139.16294 20.9142554
## extremely -59.98183 73.9680050
## fairly -30.38867 144.6343031
## frightfully -19.37836 -12.4447681
## highly -10.37726 9.4095014
## jolly -18.25316 -9.2906600
## most -32.58424 37.3181351
## perfectly 25.93391 62.2717594
## pretty -47.59436 121.8525685
## quite -39.16428 133.6428904
## rather -75.83882 14.5469112
## slightly -41.59012 -122.0587482
## somewhat -74.01797 -130.4787887
## terribly -55.30971 -54.0548218
## totally 219.14279 -46.3793237
## utterly 92.51536 -30.9168073
## very -40.43937 70.3638610
We now perform \(k\)-means clustering on the data frame with the kmeans() function.
kmclusters <- kmeans(mds.df, 5) # k-means clustering with 5 groups
kmclusters <- as.factor(kmclusters$cluster) # convert to a factor
mds.df$groups <- kmclusters # join to the existing data frame
mds.df # inspect
## Dim.1 Dim.2 groups
## a_bit -93.23707 -159.5334193 2
## a_little -66.06634 -110.6741767 2
## absolutely 74.93872 19.7177119 5
## almost -10.48677 2.9320651 1
## awfully -57.58458 0.9245004 1
## completely 220.59916 -36.6649541 3
## entirely 139.16294 20.9142554 4
## extremely -59.98183 73.9680050 1
## fairly -30.38867 144.6343031 1
## frightfully -19.37836 -12.4447681 1
## highly -10.37726 9.4095014 1
## jolly -18.25316 -9.2906600 1
## most -32.58424 37.3181351 1
## perfectly 25.93391 62.2717594 1
## pretty -47.59436 121.8525685 1
## quite -39.16428 133.6428904 1
## rather -75.83882 14.5469112 1
## slightly -41.59012 -122.0587482 2
## somewhat -74.01797 -130.4787887 2
## terribly -55.30971 -54.0548218 2
## totally 219.14279 -46.3793237 3
## utterly 92.51536 -30.9168073 5
## very -40.43937 70.3638610 1
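Note that kmeans() starts from randomly chosen centers, so the cluster numbering, and occasionally the partition itself, may differ from one run to the next. For a reproducible solution, you can fix the random seed and request several random starts, along the lines of the following sketch (the seed value is arbitrary):
set.seed(123)                                                          # arbitrary seed, any fixed value will do
km <- kmeans(mds.df[, c("Dim.1", "Dim.2")], centers = 5, nstart = 25)  # 25 random starts
mds.df$groups <- as.factor(km$cluster)                                 # overwrite the groups column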
We are now ready to generate the plot. Each group is assigned a color (Fig. 8.2).
ggscatter(mds.df, x = "Dim.1", y = "Dim.2",
label = rownames(intensifiers),
color = "groups",
palette = "jco",
size = 1,
ellipse = TRUE,
ellipse.type = "convex",
repel = TRUE)
The result is gratifying, although a bit puzzling. Let us see why.
The distance matrix can also serve as input for another multivariate exploratory method: hierarchical cluster analysis.
We use the hclust() function to apply an amalgamation rule that specifies how the elements in the matrix are clustered. We amalgamate the clusters with Ward’s method, which evaluates the distances between clusters using an analysis of variance. Ward’s method is the most widely used amalgamation rule because it has the advantage of generating clusters of moderate size. We specify method="ward.D".
clusters <- hclust(dist.object, method="ward.D")
We plot the dendrogram as follows (Fig. 8.3).
plot(clusters, sub="(Canberra, Ward)")
Although it is based on the same distance matrix, the dendrogram clusters the intensifiers slightly differently from the \(k\)-means partition above.
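If you want to compare the two solutions more directly, you can cut the dendrogram into the same number of groups as the \(k\)-means partition with cutree() and cross-tabulate the two memberships; a minimal sketch (five groups, mirroring the \(k\)-means step above):
hca.groups <- cutree(clusters, k = 5)           # cluster memberships read off the dendrogram
table(hca = hca.groups, kmeans = mds.df$groups) # cross-tabulate against the k-means groups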
The standard graphic output of CA and MCA is a symmetric biplot in which both row variables and column variables are represented in the same space using their coordinates. In this case, only the distance between row points or the distance between column points can be interpreted accurately (Greenacre 2007, 2:72). Only general observations can be made about the distance between row points and column points, when these points appear in the same part of the plot with respect to the center of the cloud of points (Husson, p.c.).
Assessing the inter-distance between rows and columns accurately is possible in either an asymmetric plot or a scaled symmetric biplot (see this tutorial). In an asymmetric biplot, either the columns are represented in row space or the rows are represented in column space. In a scaled symmetric biplot, neither the row metric nor the column metric is preserved: rows and columns are scaled to have variances equal to the square roots of the eigenvalues, which allows for a direct comparison within the same plot.9
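By way of illustration, here is a minimal sketch of how these alternative maps can be drawn with the ca package (Nenadic and Greenacre 2007), using the intensifiers table from earlier in this lesson as a convenient input; the map argument of ca's plot method selects the scaling.
# install.packages("ca")                      # if the package is not installed yet
library(ca)
ca.intensifiers <- ca(intensifiers)           # correspondence analysis of the contingency table
plot(ca.intensifiers, map = "rowprincipal")   # asymmetric map: rows in principal coordinates
plot(ca.intensifiers, map = "symbiplot")      # symmetric biplot with scaled coordinates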
The dataset for this exercise is it.be.A.to.V.that.rds, available here. It is based on a study of the productivity of the construction <it BE ADJ to V> that in the British National Corpus (Baby edition).
The construction is exemplified below:
Each observation should be described by the following variables:
corpus file: the name of the file from the BNC Baby where the construction is observed;
info: the information tag about the text (e.g. W newsp brdsht nat: sports or W ac:polit law edu);
mode: wtext for written text or stext for spoken text;
type: the text genre (ACPROSE, NEWS, FICTION, etc.);
exact match: the construction in context;
ADJ: the adjectival constituent of the construction;
V_inf: the verbal constituent of the construction.
Run a PCA of it.be.A.to.V.that.rds. Consider the variables num.hapax and V supplementary. Decide how many components (or dimensions) should be inspected, plot the results, and interpret them.
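A possible starting point is sketched below; the file path is an assumption (adjust it to where you saved the dataset), and you may need to adapt the supplementary-variable settings to the actual column names and types (num.hapax as a quantitative supplementary variable, V as a qualitative one).
library(FactoMineR)
df <- readRDS("it.be.A.to.V.that.rds")                           # adjust the path to your setup
res.pca <- PCA(df,
               quanti.sup = which(colnames(df) == "num.hapax"),  # supplementary quantitative variable
               quali.sup  = which(colnames(df) == "V"),          # supplementary qualitative variable
               graph = FALSE)
res.pca$eig                    # eigenvalues: how many components are worth inspecting?
plot(res.pca, choix = "var")   # graph of variables
plot(res.pca, choix = "ind")   # graph of individuals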
The dataset modals.BNC.rds is a contingency table that contains the frequencies of ten English modals across seventy text genres in the British National Corpus.10
Run a CA of modals.BNC.rds, which you will find here.
First, plot only the variables that correspond to written texts (the variable names start with W_). What can you say about the respective distributions of may and might?
Second, plot only the variables that correspond to spoken texts (the variable names start with S_). What profile stands out the most? What is the contextual specificity of will?
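Here is a possible skeleton for the first part of the exercise (the file path is an assumption); the same logic applies to the spoken-text columns by swapping the prefix to S_.
library(FactoMineR)
modals <- readRDS("modals.BNC.rds")                  # adjust the path to your setup
written <- modals[, grepl("^W_", colnames(modals))]  # keep only the written-text genres
res.ca.w <- CA(written, graph = FALSE)
plot(res.ca.w)                                       # CA map of the modals and the written genres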
The data set for this exercise is typocorp.rds. You will find it here.
The dataset is a tentative typology of seventeen of the most widely used corpora in English. Each corpus was described with respect to seven variables.
Your goal is to run an MCA using the MCA() function from the FactoMineR package. The variables corpus, stored_data_format, and size should be considered supplementary. Plot the graph of individuals.
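A possible skeleton is given below; the file path is an assumption, and if size is stored as a numeric variable you should pass it to quanti.sup rather than quali.sup.
library(FactoMineR)
typocorp <- readRDS("typocorp.rds")               # adjust the path to your setup
sup <- which(colnames(typocorp) %in% c("corpus", "stored_data_format", "size"))
res.mca <- MCA(typocorp, quali.sup = sup, graph = FALSE)
plot(res.mca, invisible = c("var", "quali.sup"))  # graph of individuals only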
Using modals.BNC.rds again, run a hierarchical cluster analysis of the ten English modals and plot the results in the form of a dendrogram. The hclust() function from the stats package in base R should do the trick.
Compare the following distance measures:11
Use Ward’s method to amalgamate the clusters.
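One way to set this up is to loop over the distance measures and plot one dendrogram per measure. The sketch below is only a starting point: the file path is an assumption, and the three measures named here are examples to be replaced by the ones you are asked to compare.
modals <- readRDS("modals.BNC.rds")                  # adjust the path to your setup
for (m in c("euclidean", "manhattan", "canberra")) { # example distance measures
  d <- dist(modals, method = m)
  plot(hclust(d, method = "ward.D"), main = m, sub = "(Ward)")
}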
Based on the HCA that you ran above, run an MDS analysis of the same dataset.
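A minimal sketch, reusing the modals table and the Canberra distance from the previous exercise:
d <- dist(modals, method = "canberra")                 # same distance measure as in the HCA
mds.modals <- cmdscale(d, eig = TRUE, k = 2)
plot(mds.modals$points, type = "n", xlab = "Dim.1", ylab = "Dim.2")
text(mds.modals$points, labels = rownames(modals), cex = 0.8)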
Desagulier, Guillaume. To appear. "Multivariate Exploratory Approaches." In Practical Handbook of Corpus Linguistics, edited by Magali Paquot and Stefan Th. Gries. New York: Springer.
Nenadic, Oleg, and Michael Greenacre. 2007. "Correspondence Analysis in R, with Two- and Three-Dimensional Graphics: The ca Package." Journal of Statistical Software 20 (3): 1–13.
On top of FactoMineR, several packages contain a dedicated CA function, e.g. ca (Nenadic and Greenacre 2007) and anacor (de Leeuw and Mair 2009).↩︎
The code for the extraction was partly contributed by Mathilde Léger, a third-year student at Paris 8 University, as part of her end-of-term project.↩︎
Several packages and functions implement PCA in R: e.g. princomp() and prcomp() from the stats package, ggbiplot() from the ggbiplot package (which is itself based on ggplot2), dudi.pca() from the ade4 package, and PCA() from the FactoMineR package. Mind you, princomp() and prcomp() perform PCA based on loadings.↩︎
For this kind of analysis, the first two components should represent a cumulative percentage of variance that is far above 50%. The more dimensions there are in the input data table, the harder it will be to reach this percentage.↩︎
The contribution is a measure of how much an individual contributes to the construction of a component.↩︎
The squared cosine (\(cos^2\)) is a measure of how well an individual is projected onto a component.↩︎
Deciding how many factors are worth keeping involves a choice based on a metric known as SS loadings, as explained below.↩︎
This possibility is not offered in FactoMineR. It is offered in the factoextra (Kassambara and Mundt 2017) and ca (Nenadic and Greenacre 2007) packages.↩︎
Enter ?dist for more information about the distance metrics.↩︎