Descriptive statistics summarize information. In this lesson, we review two kinds of descriptive statistics:

  • measures of central tendency, and
  • measures of dispersion.

Measures of central tendency are meant to summarize the profile of a variable. Although widespread, these statistics are often misused. I provide guidelines for using them. Measures of dispersion are complementary: they are meant to assess how good a given measure of central tendency is at summarizing the variable.

1 Variables

A variable is a property that varies from one individual to another. An individual may be anything, such as a person (a speaker, an informant) or a linguistic unit or phenomenon (modal auxiliaries, transitivity, etc.). Here is an open-ended list of variables for linguistic phenomena:

  1. the number of modal auxiliaries per sentence,
  2. the number of syllables per word,
  3. vowel lengths in milliseconds,
  4. pitch frequencies in hertz,
  5. ranked acceptability judgments,
  6. the text types represented in a corpus,
  7. the verb types that occur in a construction.

The first five of these variables provide numerical information and are known as quantitative variables. The last two variables provide non-numerical information and are known as qualitative or categorical variables.

Quantitative variables break down into:

  • discrete variables, and
  • continuous variables.

Typically, discrete quantitative variables involve counts (integers). Such is the case of the number of modal auxiliaries per sentence and the number of syllables per word.

Continuous quantitative variables involve a measurement of some kind within an interval of numbers with decimals. Such is the case of vowel lengths in milliseconds and pitch frequencies in hertz.

There is a special type of quantitative variable: the ordinal variable. Ordinal variables are numerical and take the form of rankings. An example of an ordinal variable is the ranking of agentivity criteria: “1” high agentivity, “2” mild agentivity, “3” low agentivity. Acceptability judgments in psycholinguistic experiments may also be ordinal variables. The specificity of ordinal variables is that you cannot do arithmetic with them because the differences between levels are not quantitatively meaningful.1
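In R, an ordinal variable such as the agentivity ranking can be stored as an ordered factor. The sketch below (with made-up ratings) shows that the ranking supports comparisons but not arithmetic.

```r
# Hypothetical agentivity ratings, coded as in the example above:
# "1" = high, "2" = mild, "3" = low agentivity
agentivity <- factor(c(1, 3, 2, 1),
                     levels = c(3, 2, 1),          # from low to high agentivity
                     labels = c("low", "mild", "high"),
                     ordered = TRUE)
agentivity
## [1] high low  mild high
## Levels: low < mild < high
agentivity[1] > agentivity[2]  # TRUE: the ranking supports comparisons...
# ...but not arithmetic: mean(agentivity) returns NA with a warning
```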

2 Summary graphs

The simplest statistics in corpus linguistics and NLP are based on frequency counts. This section reviews the most common plots to summarize frequency counts.

2.1 Line plots and scatterplots

2.1.1 plot() in base-R

plot() is one of the most widely used plotting functions in R. It comes with many options, which can be explored by entering ?plot.

We want to plot a frequency list made from the BNC Baby corpus (freqlist.bnc.baby.txt). After loading the file, we plot the frequencies with plot().

rm(list=ls(all=TRUE)) # clear R's memory

freqlist <- read.table("https://tinyurl.com/freqlistbncbaby", header=TRUE) # load the freqlist

str(freqlist) # inspect
## 'data.frame':    76828 obs. of  2 variables:
##  $ WORD     : chr  "said" "know" "got" "get" ...
##  $ FREQUENCY: int  12704 10202 8825 6756 6427 5961 5827 5551 5448 4285 ...
head(freqlist) # inspect
##    WORD FREQUENCY
## 1  said     12704
## 2  know     10202
## 3   got      8825
## 4   get      6756
## 5    go      6427
## 6 think      5961
# plot
plot(freqlist$FREQUENCY, 
     xlab="index of word types", 
     ylab="frequency", 
     main="plot of a frequency list", 
     cex=0.6)

Here, the xlab and ylab arguments specify the names of the horizontal and vertical axes respectively. The cex argument specifies the size of the circles that signal data points: 0.6 represents 60% of the default size (1). A title can also be added to the plot with main.

Instead of circles, you may want to join the points with a line. If so, add the argument type and specify "l". The argument lwd=2 specifies the width of the line (1 is the default).

plot(freqlist$FREQUENCY, 
     type="l", 
     lwd=2, 
     xlab="index of word types", 
     ylab="frequency",
     cex=0.6)

Another option is to plot the words themselves. It would not make sense to plot all the words in the frequency list, so we subset the 20 most frequent words. First, create a new plot and specify col="white" so that the data points are invisible.

Next, plot the words with text(). Technically, the words are used as labels (hence the argument labels). These labels are found in the first column of the data frame: freqlist$WORD. To keep the labels from overlapping, reduce their size with cex.

plot(freqlist$FREQUENCY[1:20],
     xlab="index of word types",
     ylab="frequency",
     col="white")
text(freqlist$FREQUENCY[1:20], 
     labels = freqlist$WORD[1:20],
     cex=0.7)

The labelling can be combined with the line by specifying type="l" in the plot() call.

plot(freqlist$FREQUENCY[1:20], 
     xlab="index of word types", 
     ylab="frequency", 
     type="l",
     col="lightgrey")
text(freqlist$FREQUENCY[1:20], 
     labels = freqlist$WORD[1:20], 
     cex=0.7)

The distribution at work is known as Zipfian. It is named after Zipf’s law: many rare events coexist with very few large events. The resulting curve decreases continually from its highest point (although, strictly speaking, this is not a peak).

The Zipfian distribution is typical of natural languages. If you plot the frequency list of any corpus of natural language, the curve will invariably look the same, provided the corpus is large enough.
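A quick way to check for a Zipfian profile is a log-log plot: if frequency decays roughly as 1/rank, log-frequency against log-rank is close to a straight line. The sketch below uses simulated 1/rank frequencies (hypothetical values); with real data, you would plot freqlist$FREQUENCY instead.

```r
# Hypothetical frequencies decaying as 1/rank (Zipf's law in its simplest form)
ranks <- 1:1000
freqs <- round(12704 / ranks)  # 12704 = frequency of the top word in the BNC Baby list
plot(log(ranks), log(freqs),
     xlab = "log(rank)", ylab = "log(frequency)",
     main = "a Zipfian profile is linear on log-log axes",
     cex = 0.6)
```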

Your turn!

Repeat the above with the Dracula frequency list: freqlist.dracula.txt.

2.1.2 ggplot2

ggplot2 works as long as your data is tidy (i.e. in compliance with the tidy style advocated in the tidyverse).

library(ggplot2)
ggplot(freqlist[1:20,], aes(x=WORD, y=FREQUENCY, group=1)) +
    geom_line() +
    theme_bw()

By default, ggplot2 does not plot the frequencies and their associated words in decreasing order. We do so with the base-R reorder() function.

Line plot, first 20 words, no labels.

ggplot(freqlist[1:20,], aes(x = reorder(WORD, -FREQUENCY), y = FREQUENCY, group=1)) +
    geom_line() +
    xlab("") + # get rid of the x-axis label
    theme_bw()

Scatter plot, first 20 words, no labels.

ggplot(freqlist[1:20,], aes(x = reorder(WORD, -FREQUENCY), y = FREQUENCY, group=1)) +
    geom_point() + # points instead of a line
    xlab("") + # get rid of the x-axis label
    theme_bw()

Scatter plot with words as data points (top 20 items).

ggplot(freqlist[1:20,], aes(x = reorder(WORD, -FREQUENCY), y = FREQUENCY, group=1, label=WORD)) +
    geom_text(check_overlap = TRUE) + # declare labels + use geom_text()
    xlab("") +
    theme_bw()

2.2 barplots

2.2.1 barplot() in base-R

Another way of plotting the data is by means of a barplot with the barplot() function. The two plots below display the ten and twenty most frequent lexical words in the BNC Baby.

barplot(freqlist$FREQUENCY[1:10],names.arg = freqlist$WORD[1:10], las=2)

barplot(freqlist$FREQUENCY[1:20],names.arg = freqlist$WORD[1:20], las=2)

The heights of the bars in the plot are determined by the values contained in the vector freqlist$FREQUENCY. The las argument allows you to decide whether the labels are parallel (las=0) or perpendicular (las=2) to the x-axis. Each bar represents a word type. The space between the bars indicates that these word types are distinct categories.

2.2.2 ggplot2

In ggplot2, you plot barplots with geom_bar() or geom_col(). By default, ggplot2 does not plot the frequencies and their associated words in decreasing order.

barplot <- ggplot(freqlist[1:20,], aes(WORD, FREQUENCY))

barplot + 
  geom_col() + 
  theme_bw()

We reorder the values in decreasing order and plot the bars with geom_col(). The theme() call rotates the x-axis labels so that they remain legible.

ggplot(freqlist[1:20,], aes(x = reorder(WORD, -FREQUENCY), y = FREQUENCY)) +
  geom_col() +
  xlab("WORD") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

2.3 histograms

Histograms are close to barplots except that the bars do not represent distinct categories (therefore, there is no space between them). They represent specified divisions of the x-axis named “bins.” Their heights are proportional to how many observations fall within them. As a consequence, increasing the number of observations does not necessarily increase the number of bins by the same amount.
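The effect of binning can be inspected without drawing anything: with plot=FALSE, hist() simply returns the bin counts. A small sketch with simulated data (note that the breaks argument is only a suggestion, which hist() may adjust to “pretty” cut points):

```r
set.seed(42)           # simulated data for illustration
x <- rnorm(100)
h.few  <- hist(x, breaks = 5,  plot = FALSE)
h.many <- hist(x, breaks = 20, plot = FALSE)
length(h.few$counts)   # a small number of bins
length(h.many$counts)  # a larger number of bins
sum(h.few$counts) == sum(h.many$counts)  # TRUE: both partition the same 100 observations
```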

2.3.1 hist() in base-R

Here is a histogram of the 10 most frequent lexical words in the BNC Baby.

hist(freqlist$FREQUENCY[1:10], xlab="frequency bins", las=2, main="")

Here is a histogram of the 100 most frequent lexical words in the BNC Baby.

hist(freqlist$FREQUENCY[1:100], xlab="frequency bins", las=2, main="")

2.3.2 ggplot2

With ggplot2, you plot a histogram thanks to the stat_bin() or geom_histogram() functions.

ggplot(freqlist[1:100,], aes(x=FREQUENCY)) + 
  stat_bin() + 
  theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

As you can see, the bin width is not ideal. Setting an appropriate bin width is important, as its value affects the histogram’s appearance. In ggplot2, you change it with the binwidth argument of stat_bin() or geom_histogram().

ggplot(freqlist[1:100,], aes(x=FREQUENCY)) + 
  stat_bin(binwidth = 2000) + 
  theme_bw()

With geom_histogram():

ggplot(freqlist[1:100,], aes(x=FREQUENCY)) + 
  geom_histogram(binwidth = 2000) + 
  theme_bw()

3 Central tendency

Measures of central tendency summarize the profile of a variable. There are three measures of central tendency: the mean, the median, and the mode. Each can summarize a large data set with just one or two numbers.

3.1 the mean

The mean (also known as the arithmetic mean) gives the average value of the data, which must be on a ratio scale. The arithmetic mean (μ) is the sum (∑) of all the scores for a given variable (x) in the data divided by the number of values (N):

\[\mu=\frac{\sum x}{N}\] Load split_unsplit.rds:

rm(list=ls(all=TRUE))
data <- readRDS(url("https://tinyurl.com/splitunsplitrds"))
str(data)
## 'data.frame':    20 obs. of  3 variables:
##  $ decade            : int  1810 1820 1830 1840 1850 1860 1870 1880 1890 1900 ...
##  $ split.infinitive  : int  4 4 10 25 35 64 166 191 145 181 ...
##  $ unsplit.infinitive: int  42 503 946 1091 1142 1114 1380 1409 1396 1468 ...

R has a built-in function to compute the mean of a numeric vector: mean(). To know the mean number of split and unsplit infinitives across the whole period, we apply the function to each vector.

mean(data$split.infinitive)
## [1] 174.4
mean(data$unsplit.infinitive)
## [1] 1209.9

On average, there are 174.4 occurrences of the split infinitive and 1209.9 occurrences of the unsplit infinitive per decade.

These values represent only a summary of the data. No decade displays either figure.
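As a sanity check, mean() is nothing more than the sum of the values divided by their number. Here, on the counts of the first five decades taken from the str() output above:

```r
x <- c(4, 4, 10, 25, 35)  # split infinitives, 1810s-1850s
sum(x) / length(x)
## [1] 15.6
mean(x)
## [1] 15.6
```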

To visualize where the mean stands in your data, plot the numeric vectors and position the mean as a horizontal line with abline(). The h argument specifies where the horizontal line that corresponds to the mean should be.

par(mfrow=c(1,2))

# first plot
plot(data$split.infinitive, xlab="decades", ylab="frequency counts", main="split infinitive")
abline(h = mean(data$split.infinitive), col="blue")
text(5, mean(data$split.infinitive)+15, "mean", col="blue")

# second plot
plot(data$unsplit.infinitive, xlab="decades", ylab="frequency counts", main="unsplit infinitive")
abline(h = mean(data$unsplit.infinitive), col="green")
text(5, mean(data$unsplit.infinitive)+20, "mean", col="green")

With par(), you can set up graphical parameters. A layout with two plots side by side is specified using mfrow(). The line par(mfrow=c(1,2)) means “multiframe, row-wise, 1 line × 2 columns layout.” As a result, the two plots are organized in 1 row and 2 columns.
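One caveat: parameters set with par() persist for all subsequent plots in the session. Restore the default single-panel layout once you are done:

```r
par(mfrow = c(1, 1))  # back to one plot per window
```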

Although very popular among linguists, the mean is far from reliable!

Consider the two vectors below:

b <- c(10, 30, 50, 70, 80)
c <- c(10, 30, 50, 70, 110)

The vectors b and c are identical except for one value. In corpus linguistics, such a value might correspond to a word whose frequency is abnormally high. This single difference translates into a sizeable difference in the means because we have so few data points. The smaller the data set, the more sensitive the mean is to extreme values.

mean(b)
## [1] 48
mean(c)
## [1] 54

To address this problem, the mean() function comes with an optional argument (trim), which lets you specify the proportion of outlying values to remove from the computation of the mean.

Each vector contains five values. If you want to remove the top and bottom values, you need to set trim to 0.2 (i.e. 20%). Because \(5 \times 0.2 = 1\), setting trim to 0.2 removes one value at each end of the sorted vector: the highest value and the lowest value.

mean(b, trim = 0.2) # = mean(c(30, 50, 70))
## [1] 50
mean(c, trim = 0.2) # = mean(c(30, 50, 70))
## [1] 50

The resulting trimmed means are equal. Trimming means makes sense if the data set is large. If it is not, you should reconsider calculating the mean, whether trimmed or not.

3.2 the median

The median is the value that you obtain when you divide the values in your data set into two “equal” parts. When your data set consists of an odd number of values, the median is the value in the middle. In the vector b, this middle value is 50.

b
## [1] 10 30 50 70 80

We can verify this with the median() function. There is an equal number of values on either part of the median.

median(b)
## [1] 50

When the vector consists of an even number of values, the value in the middle does not necessarily correspond to a value found in the vector.

median(c(b, 100))
## [1] 60

As opposed to the mean, the median is not affected by extreme values. What the median does not tell you is the behavior of the values on either side of it.
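The vectors b and c from above illustrate this robustness: they differ only in their largest value, yet their medians are identical.

```r
b <- c(10, 30, 50, 70, 80)
c <- c(10, 30, 50, 70, 110)
median(b)
## [1] 50
median(c)
## [1] 50
```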

Interestingly, the median corresponds to the mean of the two middle values if the data consists of an even number of values.

a <- 1:12; a
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12
median(a)
## [1] 6.5
mean(c(6,7))
## [1] 6.5

The median does not necessarily correspond to the mean, as evidenced below.

par(mfrow=c(1,2))

# first plot
plot(data$split.infinitive, xlab="decades", ylab="frequency counts", main="split infinitive")
abline(h = mean(data$split.infinitive), col="blue")
abline(h = median(data$split.infinitive), col="blue", lty=3)
text(5, mean(data$split.infinitive)+15, "mean", col="blue")
text(5, median(data$split.infinitive)+15, "median", col="blue")

# second plot
plot(data$unsplit.infinitive, xlab="decades", ylab="frequency counts", main="unsplit infinitive")
abline(h = mean(data$unsplit.infinitive), col="green")
abline(h = median(data$unsplit.infinitive), col="green", lty=3)
text(5, mean(data$unsplit.infinitive)+20, "mean", col="green")
text(5, median(data$unsplit.infinitive)+20, "median", col="green")

3.3 the mode

The mode is the nominal value that occurs the most frequently in a tabulated data set. You obtain the mode with which.max().

Let us load df_each_every_bnc_baby.txt:

df.each.every <- read.delim("https://tinyurl.com/dfeachevery", header=TRUE)
head(df.each.every)
##   corpus.file                         info  mode type    exact.match determiner
## 1     A1E.xml W newsp brdsht nat: commerce wtext NEWS    each nation       each
## 2     A1E.xml W newsp brdsht nat: commerce wtext NEWS     each other       each
## 3     A1E.xml W newsp brdsht nat: commerce wtext NEWS     each other       each
## 4     A1E.xml W newsp brdsht nat: commerce wtext NEWS  each country        each
## 5     A1E.xml W newsp brdsht nat: commerce wtext NEWS     each type        each
## 6     A1E.xml W newsp brdsht nat: commerce wtext NEWS every problem       every
##        NP NP_tag
## 1  nation    NN1
## 2   other    NN1
## 3   other    NN1
## 4 country    NN1
## 5    type    NN1
## 6 problem    NN1
str(df.each.every)
## 'data.frame':    2339 obs. of  8 variables:
##  $ corpus.file: chr  "A1E.xml" "A1E.xml" "A1E.xml" "A1E.xml" ...
##  $ info       : chr  "W newsp brdsht nat: commerce" "W newsp brdsht nat: commerce" "W newsp brdsht nat: commerce" "W newsp brdsht nat: commerce" ...
##  $ mode       : chr  "wtext" "wtext" "wtext" "wtext" ...
##  $ type       : chr  "NEWS" "NEWS" "NEWS" "NEWS" ...
##  $ exact.match: chr  "each nation" "each other" "each other" "each country " ...
##  $ determiner : chr  "each" "each" "each" "each" ...
##  $ NP         : chr  "nation" "other" "other" "country" ...
##  $ NP_tag     : chr  "NN1" "NN1" "NN1" "NN1" ...

We want to know the mode of the variable NP_tag, so that we know which value is observed most often among the six possible tags (NN0, NN1, NN1-AJ0, NN1-VVB, NN2, and NP0). We isolate and tabulate the variable of interest with table().

tab.NP.tags <- table(df.each.every$NP_tag)
tab.NP.tags
## 
##     NN0     NN1 NN1-AJ0 NN1-VVB     NN2     NP0 
##      13    2242      15       6       6      57

We run which.max() on the tabulated data. The function returns the mode and its position.

which.max(tab.NP.tags)
## NN1 
##   2

The mode of NP_tag is NN1. If you want the frequency of the mode rather than its label, use max():

max(tab.NP.tags)
## [1] 2242
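If, conversely, you want only the label of the mode (without its position or frequency), wrap which.max() in names(). The sketch below uses a small stand-in table; on the data above, names(which.max(tab.NP.tags)) returns "NN1".

```r
tags <- table(c("NN1", "NN1", "NN1", "NP0", "NN0"))  # stand-in table
names(which.max(tags))
## [1] "NN1"
```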

The mode is the tallest bar of a barplot.

barplot(tab.NP.tags)

Same as above, but in tidy style:

library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
df <- df.each.every %>% 
  count(NP_tag) %>%
  rename(count = n)
  
ggplot(df, aes(x=NP_tag, y=count)) + 
  geom_bar(stat = "identity")

4 Dispersion

Dispersion is the spread of a set of observations. If many data points are scattered far from the value of a centrality measure, the dispersion is large.

4.1 quantiles

4.1.1 quantile()

By default, the quantile() function divides the frequency distribution into four equal ordered subgroups known as quartiles. The first quartile ranges from 0% to 25%, the second quartile from 25% to 50%, the third quartile from 50% to 75%, and the fourth quartile from 75% to 100%.

quantile(data$split.infinitive, type=1)
##   0%  25%  50%  75% 100% 
##    4   35  102  181  873
quantile(data$unsplit.infinitive, type=1)
##   0%  25%  50%  75% 100% 
##   42 1114 1309 1396 1537

The type argument allows the user to choose from nine quantile algorithms, the detail of which may be accessed by entering ?quantile. By default, R uses the seventh type.
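quantile() is not limited to quartiles: the probs argument accepts arbitrary cut points. For instance, the deciles of the split-infinitive counts for the first ten decades (values taken from the str() output above):

```r
x <- c(4, 4, 10, 25, 35, 64, 166, 191, 145, 181)
quantile(x, probs = seq(0, 1, 0.1))  # deciles, default type 7
```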

4.1.2 IQR()

The interquartile range (IQR) is the difference between the third and the first quartiles, i.e. the 75th and the 25th percentiles of the data. It may be used as an alternative to the standard deviation to assess the spread of the data.

IQR(data$split.infinitive, type=7) 
## [1] 129.75
IQR(data$unsplit.infinitive, type=7)
## [1] 264.25
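Since the IQR is by definition the difference between the third and first quartiles, IQR() can be reproduced with quantile(). A check on a small vector:

```r
b <- c(10, 30, 50, 70, 80)
q <- quantile(b, probs = c(0.25, 0.75))
unname(q[2] - q[1])
## [1] 40
IQR(b)
## [1] 40
```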

In base R, the summary() function combines centrality measures and quantiles (more precisely quartiles).

summary(data$split.infinitive)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.00   53.75  116.00  174.40  183.50  873.00
summary(data$unsplit.infinitive)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      42    1135    1318    1210    1399    1537

The IQR values confirm what we already know: the frequency distribution of the split infinitive is less dispersed than the frequency distribution of the unsplit infinitive.

The same function can be applied to the whole data frame.

summary(data)
##      decade     split.infinitive unsplit.infinitive
##  Min.   :1810   Min.   :  4.00   Min.   :  42      
##  1st Qu.:1858   1st Qu.: 53.75   1st Qu.:1135      
##  Median :1905   Median :116.00   Median :1318      
##  Mean   :1905   Mean   :174.40   Mean   :1210      
##  3rd Qu.:1952   3rd Qu.:183.50   3rd Qu.:1399      
##  Max.   :2000   Max.   :873.00   Max.   :1537

4.2 boxplots

4.2.1 boxplot() in base-R

A boxplot provides a graphic representation of the spread of the values around a central point. A boxplot is the graphic equivalent of summary().

Figure 4.1: A generic boxplot

You obtain a boxplot with the boxplot() function.

boxplot(rnorm(1000), col="grey")

Note that the boxplot will look different each time you run the code because rnorm(1000) generates 1000 random values from the normal distribution2. The center of the plot consists of a box that corresponds to the middle half of the data. The height of the box is determined by the interquartile range. The thick horizontal line that splits the box in two is the median.

If the median is centered within the box, it is because the central part of the data is roughly symmetric. If not, it is because the frequency distribution is skewed.

Whiskers are found on either side of the box. They extend from the limits of the box to the most extreme data points that lie within 1.5 interquartile ranges of the box (above the third quartile or below the first quartile). If the whiskers have different lengths, this is another sign that the frequency distribution is skewed.

The data points beyond the whiskers are known as outliers and are displayed individually as circles. Although the term carries negative connotations, outliers can be interesting and should not be systematically ignored.
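The numbers behind the picture are available through boxplot.stats(). A sketch on a small vector with one deliberately extreme value:

```r
x <- c(10, 30, 50, 70, 80, 300)  # 300 lies far beyond the upper whisker
bs <- boxplot.stats(x)
bs$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out    # values flagged as outliers
## [1] 300
```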

Regarding the split_unsplit.rds data set, we use boxplots to compare the dispersion of the frequency distributions of split and unsplit infinitives. There are two ways of doing it. The first way consists in selecting the desired columns of the data frame. Here, this is done via subsetting.

boxplot(data[,c(2,3)])

The second way consists in plotting each variable side by side as two separate vectors. In this case, the variables are not labeled unless we override the default settings with the names argument.

boxplot(data$unsplit.infinitive, data$split.infinitive,
        names = c("unsplit.infinitive", "split.infinitive"))

If we were to superimpose these two boxplots, they would not overlap. This shows that their distributions are radically different.

The boxplot for split.infinitive is shorter than the boxplot for unsplit.infinitive because the frequency distribution of the former variable is less dispersed than that of the latter. The boxplot for unsplit.infinitive shows that the frequency distribution of this variable is skewed: the whiskers do not have the same length, and the median sits close to the top of the box.

4.2.2 ggplot2

To make a boxplot with ggplot2, use geom_boxplot(). Before we can use this function, though, we need to ‘tidy’ the data.

library(tidyr)

data.tidy <- data %>%
  pivot_longer(
    !(decade), # all the columns except 'decade' are pivoted
    names_to = "construction", # new column
    values_to = "count", # where the counts will appear
    values_drop_na = TRUE # do not include NA values (providing NA values appear)
  )

head(data.tidy) # inspect
## # A tibble: 6 × 3
##   decade construction       count
##    <int> <chr>              <int>
## 1   1810 split.infinitive       4
## 2   1810 unsplit.infinitive    42
## 3   1820 split.infinitive       4
## 4   1820 unsplit.infinitive   503
## 5   1830 split.infinitive      10
## 6   1830 unsplit.infinitive   946
library(ggplot2)

p <- ggplot(data.tidy, aes(construction, count))
p + geom_boxplot()

Same with notches

p + geom_boxplot(notch = TRUE)
## notch went outside hinges. Try setting notch=FALSE.

Same with colored outliers

p + geom_boxplot(outlier.colour = "red", outlier.shape = 1)

Same with one color per variable level (adds a legend).

p + geom_boxplot(aes(colour = construction))

Same without outliers but with original data points and no jitter.

p + geom_boxplot(aes(colour = construction), outlier.shape = NA) +
  geom_jitter(width = 0)

Same with jitter, to facilitate the interpretation of data points.

p + geom_boxplot(aes(colour = construction), outlier.shape = NA) + geom_jitter(width = 0.1)

4.3 variance and standard deviation

The variance (\(\sigma^2\)) and the standard deviation (\(\sigma\)) use the mean as their central point.

4.3.1 variance

The variance (\(\sigma^2\)) measures how much a data set is spread out. It is calculated by:

  • subtracting the mean (\(\bar{x}\)) from each data point (\(x\)),
  • squaring the difference,
  • summing up all squared differences, and
  • dividing the sum by the sample size (\(N\)) minus 1.

\[\sigma^2 = \frac{\sum(x-\bar{x})^2}{N-1}\] Fortunately, R has a built-in function for the variance: var().

var(data$split.infinitive)
## [1] 47190.88
var(data$unsplit.infinitive)
## [1] 130852.6
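var() implements the formula above; a manual check on the vector b used earlier:

```r
b <- c(10, 30, 50, 70, 80)
sum((b - mean(b))^2) / (length(b) - 1)
## [1] 820
var(b)
## [1] 820
```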

4.3.2 standard deviation

The standard deviation (\(\sigma\)) is the most widely used measure of dispersion. It is the square root of the variance.

\[ \sigma = \sqrt{\frac{\sum(x-\bar{x})^2}{N-1}} \] In R, you obtain the standard deviation of a frequency distribution either by first calculating the variance of the vector and then its square root

sqrt(var(data$split.infinitive))
## [1] 217.2346
sqrt(var(data$unsplit.infinitive))
## [1] 361.7356

or by applying the dedicated function: sd().

sd(data$split.infinitive)
## [1] 217.2346
sd(data$unsplit.infinitive)
## [1] 361.7356

As expected, the variance and the standard deviation of unsplit.infinitive are larger than the variance and the standard deviation of split.infinitive.

5 Exercises

5.1 central tendency

The data file for this exercise is modals.by.genre.BNC.rds, which you load as follows:

modals <- readRDS(url("https://tinyurl.com/modalsbygenrebnc"))
modals
##        ACPROSE CONVRSN FICTION  NEWS NONAC OTHERPUB OTHERSP UNPUB
## can      44793   23161   32293 16269 53297    53392   26262 11816
## could    17379    7955   49826 14045 32923    19684   11976  5733
## may      35224     628    5302  6134 32934    20923    4267  6853
## might    11293    3524   13917  3634 13110     7215    4710  1465
## must     13511    2989   15043  4306 15522    11176    3045  4064
## ought      976     451    1649   221  1115      477     820   110
## shall     4097    1639    4855   408  3746     2306    1233  1701
## should   19420    4344   13791  8900 25622    22014    7647  7104
## will     32805    9032   24285 37476 53246    66980   15934 19258
## would    29903    9895   56934 23375 59211    33132   23778  8741

Write a for-loop that prints (with cat()):

  • the mean number of modals per text genre;
  • the median number of modals per text genre;
  • the frequency of the mode, i.e. the modal that occurs the most in the given text genre.

The for-loop should output something like the following:

## ACPROSE 20940.1 18399.5 44793 
## CONVRSN 6361.8 3934 23161 
## FICTION 21789.5 14480 56934 
## NEWS 11476.8 7517 37476 
## NONAC 29072.6 29272.5 59211 
## OTHERPUB 23729.9 20303.5 66980 
## OTHERSP 9967.2 6178.5 26262 
## UNPUB 6684.5 6293 19258

5.2 Valley girls

A sociolinguist studies the use of the gap-filler ‘whatever’ in Valleyspeak. She records all conversations involving two Valley girls over a week and counts the number of times each girl says ‘whatever’ to fill a gap in the conversation. The (fictitious) data is summarized in Tab. 5.1.

Table 5.1: Valleyspeak data

                Monday  Tuesday  Wednesday  Thursday  Friday  Saturday  Sunday
Valley girl 1      314      299        401       375     510       660     202
Valley girl 2      304      359        357       342     320       402     285

For each girl:

  1. summarize the data in a plot such as 5.1;
  2. compute the mean and the standard deviation;
  3. summarize the data in a boxplot;
  4. interpret the results.

Figure 5.1: Number of times two Valley girls say ‘whatever’ over a week


6 References


  1. It would make little sense to say that mild agentivity (“2”) involves half as much agentivity as high agentivity (“1”).↩︎

  2. enter ?Normal for further information↩︎



This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.