Descriptive statistics summarize information. In this lesson, we review two kinds of descriptive statistics:
Measures of central tendency are meant to summarize the profile of a variable. Although widespread, these statistics are often misused. I provide guidelines for using them. Measures of dispersion are complementary: they are meant to assess how good a given measure of central tendency is at summarizing the variable.
A variable is a property that varies from one individual to another. An individual may be just anything, such as a person (a speaker, an informant) or a linguistic unit or phenomenon (modal auxiliaries, transitivity, etc.). Here is an open-ended list of variables for linguistic phenomena:
The first five of these variables provide numerical information and are known as quantitative variables. The last two variables provide non-numerical information and are known as qualitative or categorical variables.
Quantitative variables break down into:
Typically, discrete quantitative variables involve counts (integers). Such is the case of the number of modal auxiliaries per sentence and the number of syllables per word.
Continuous quantitative variables involve a measurement of some kind within an interval of numbers with decimals. Such is the case of vowel lengths in milliseconds and pitch frequencies in hertz.
There is a special type of quantitative variables: ordinal variables. Ordinal variables are numerical and take the form of rankings. An example of ordinal variable is the ranking of agentivity criteria: “1” high agentivity, “2” mild agentivity, “3” low agentivity. Acceptability judgments in psycholinguistic experiments may also be ordinal variables. The specificity of ordinal variables is that you cannot do arithmetic with them because the difference between each level is not quantitatively relevant.1
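As a minimal sketch (with made-up agentivity ratings), ordinal data is naturally stored in R as an ordered factor, which preserves the ranking while blocking arithmetic:

```r
# Hypothetical agentivity ratings: 1 = high, 2 = mild, 3 = low
ratings <- c(1, 3, 2, 2, 1)
agentivity <- factor(ratings, levels = c(1, 2, 3),
                     labels = c("high", "mild", "low"), ordered = TRUE)
agentivity[1] < agentivity[2]  # order comparisons are defined: TRUE
# mean(agentivity) yields NA with a warning: arithmetic is undefined
```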
The simplest statistics in corpus linguistics and NLP are based on frequency counts. This section reviews the most common plots to summarize frequency counts.
plot() in base-R
plot() is one of the most widely used functions in R. It comes with many options, which can be explored by entering ?plot.
We want to plot a frequency list made from the BNC Baby corpus (freqlist.bnc.baby.txt). After loading the file, we plot the frequencies with plot().
rm(list=ls(all=TRUE)) # clear R's memory
freqlist <- read.table("https://tinyurl.com/freqlistbncbaby", header=TRUE) # load the freqlist
str(freqlist) # inspect
## 'data.frame': 76828 obs. of 2 variables:
## $ WORD : chr "said" "know" "got" "get" ...
## $ FREQUENCY: int 12704 10202 8825 6756 6427 5961 5827 5551 5448 4285 ...
head(freqlist) # inspect
## WORD FREQUENCY
## 1 said 12704
## 2 know 10202
## 3 got 8825
## 4 get 6756
## 5 go 6427
## 6 think 5961
# plot
plot(freqlist$FREQUENCY,
xlab="index of word types",
ylab="frequency",
main="plot of a frequency list",
cex=0.6)
Here, the xlab and ylab arguments specify the names of the horizontal and vertical axes respectively. The cex argument specifies the size of the circles that signal data points: 0.6 represents 60% of the default size (1). A title can also be added to the plot with main.
Instead of circles, you may want to join the points with a line. If so, add the argument type and specify "l". The argument lwd=2 specifies the width of the line (1 is the default).
plot(freqlist$FREQUENCY,
type="l",
lwd=2,
xlab="index of word types",
ylab="frequency",
cex=0.6)
Another option is to plot the words themselves. It would not make sense to plot all the words in the frequency list, so we subset the 20 most frequent words. First, create a new plot and specify col="white" so that the data points are invisible.
Next, plot the words with text(). Technically, the words are used as labels (hence the argument labels). These labels are found in the first column of the data frame: freqlist$WORD. To keep the labels from overlapping, reduce their size with cex.
plot(freqlist$FREQUENCY[1:20],
xlab="index of word types",
ylab="frequency",
col="white")
text(freqlist$FREQUENCY[1:20],
labels = freqlist$WORD[1:20],
cex=0.7)
The labelling can be combined with the line by specifying type="l" in the plot() call.
plot(freqlist$FREQUENCY[1:20],
xlab="index of word types",
ylab="frequency",
type="l",
col="lightgrey")
text(freqlist$FREQUENCY[1:20],
labels = freqlist$WORD[1:20],
cex=0.7)
The distribution at work is known as Zipfian. It is named after Zipf’s law: many rare events coexist with very few large events. The resulting curve continually decreases from its peak (although, strictly speaking, this is not a peak).
The Zipfian distribution is typical of natural languages. If you plot the frequency list of any corpus of natural language, the curve will invariably look the same, provided the corpus is large enough.
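One way to visualize this property (a sketch using synthetic, Zipf-like frequencies rather than real corpus counts) is to plot rank against frequency on logarithmic axes; a Zipfian curve then straightens into a roughly linear, downward slope:

```r
# Synthetic Zipf-like frequencies: frequency proportional to 1/rank
# (the constant 12704 mimics the top frequency in the BNC Baby list)
rank <- 1:1000
freq <- round(12704 / rank)
plot(rank, freq, log = "xy", type = "l",
     xlab = "rank (log scale)", ylab = "frequency (log scale)",
     main = "a Zipfian curve straightens out on log-log axes")
```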
Your turn!
Repeat the above with the Dracula frequency list: freqlist.dracula.txt.
ggplot2
ggplot2 works as long as your data is tidy (i.e. in compliance with the tidy style advocated in the tidyverse).
library(ggplot2)
ggplot(freqlist[1:20,], aes(x=WORD, y=FREQUENCY, group=1)) +
geom_line() +
theme_bw()
By default, ggplot2 does not plot the frequencies and their associated words in decreasing order. We do so with the base-R reorder() function.
Line plot, first 20 words, no labels.
ggplot(freqlist[1:20,], aes(x = reorder(WORD, -FREQUENCY), y = FREQUENCY, group=1, label=WORD)) +
geom_line() +
xlab("") + # we get rid of the x-axis label
theme_bw()
Scatter plot, first 20 words, no labels.
ggplot(freqlist[1:20,], aes(x = reorder(WORD, -FREQUENCY), y = FREQUENCY, group=1, label=WORD)) +
geom_point() +
xlab("") + # we get rid of the x-axis label
theme_bw()
Scatter plot with words as data points (top 20 items).
ggplot(freqlist[1:20,], aes(x = reorder(WORD, -FREQUENCY), y = FREQUENCY, group=1, label=WORD)) +
geom_text(check_overlap = TRUE) + # declare labels + use geom_text()
xlab("") +
theme_bw()
barplot() in base-R
Another way of plotting the data is by means of a barplot with the barplot() function. The two plots below display the ten and twenty most frequent lexical words in the BNC Baby.
barplot(freqlist$FREQUENCY[1:10],names.arg = freqlist$WORD[1:10], las=2)
barplot(freqlist$FREQUENCY[1:20],names.arg = freqlist$WORD[1:20], las=2)
The heights of the bars in the plot are determined by the values contained in the vector freqlist$FREQUENCY. The las argument allows you to decide if the labels are parallel (las=0) or perpendicular (las=2) to the x-axis. Each bar represents a word type. The space between each bar indicates that these word types are distinct categories.
ggplot2
In ggplot2, you plot barplots with geom_bar() or geom_col(). By default, ggplot2 does not plot the frequencies and their associated words in decreasing order.
barplot <- ggplot(freqlist[1:20,], aes(WORD, FREQUENCY))
barplot +
geom_col() +
theme()
We reorder the values in decreasing order and plot the words on the x-axis with geom_col().
ggplot(freqlist[1:20,], aes(x = reorder(WORD, -FREQUENCY), y = FREQUENCY)) +
geom_col() +
xlab("WORD") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
Histograms are close to barplots except that the bars do not represent distinct categories (therefore, there is no space between them). They represent specified divisions of the x-axis named “bins.” Their heights are proportional to how many observations fall within them. As a consequence, increasing the number of observations does not necessarily increase the number of bins by the same number.
hist() in base-R
Here is a histogram of the 10 most frequent lexical words in the BNC Baby.
hist(freqlist$FREQUENCY[1:10], xlab="frequency bins", las=2, main="")
Here is a histogram of the 100 most frequent lexical words in the BNC Baby.
hist(freqlist$FREQUENCY[1:100], xlab="frequency bins", las=2, main="")
ggplot2
With ggplot2, you plot a histogram with the stat_bin() or geom_histogram() functions.
ggplot(freqlist[1:100,], aes(x=FREQUENCY)) +
stat_bin() +
theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
As you can see, the bin width is not ideal. Setting an appropriate bin width is important as its value has an impact on the histogram’s appearance. In ggplot2, you can change the bin width with the binwidth argument of geom_histogram().
ggplot(freqlist[1:100,], aes(x=FREQUENCY)) +
stat_bin(binwidth = 2000) +
theme_bw()
With geom_histogram():
ggplot(freqlist[1:100,], aes(x=FREQUENCY)) +
geom_histogram(binwidth = 2000) +
theme_bw()
Measures of central tendency summarize the profile of a variable. There are three measures of central tendency: the mean, the median, and the mode. All three measures can summarize a large data set with just a couple of numbers.
The mean (also known as the arithmetic mean) gives the average value of the data. The data must be on a ratio scale. The arithmetic mean (μ) is the sum (∑) of all the scores for a given variable (x) in the data divided by the number of values (N):
\[\mu=\frac{\sum x}{N}\]
Load split_unsplit.rds:
rm(list=ls(all=TRUE))
data <- readRDS(url("https://tinyurl.com/splitunsplitrds"))
str(data)
## 'data.frame': 20 obs. of 3 variables:
## $ decade : int 1810 1820 1830 1840 1850 1860 1870 1880 1890 1900 ...
## $ split.infinitive : int 4 4 10 25 35 64 166 191 145 181 ...
## $ unsplit.infinitive: int 42 503 946 1091 1142 1114 1380 1409 1396 1468 ...
R has a built-in function to compute the mean of a numeric vector: mean(). To know the mean number of split and unsplit infinitives across the whole period, we apply the function to each vector.
mean(data$split.infinitive)
## [1] 174.4
mean(data$unsplit.infinitive)
## [1] 1209.9
On average, there are 174.4 occurrences of the split infinitive and 1209.9 occurrences of the unsplit infinitive per decade.
These values represent only a summary of the data. No decade displays either figure.
To visualize where the mean stands in your data, plot the numeric vectors and position the mean as a horizontal line with abline(). The h argument specifies where the horizontal line that corresponds to the mean should be drawn.
par(mfrow=c(1,2))
# first plot
plot(data$split.infinitive, xlab="decades", ylab="frequency counts", main="split infinitive")
abline(h = mean(data$split.infinitive), col="blue")
text(5, mean(data$split.infinitive)+15, "mean", col="blue")
# second plot
plot(data$unsplit.infinitive, xlab="decades", ylab="frequency counts", main="unsplit infinitive")
abline(h = mean(data$unsplit.infinitive), col="green")
text(5, mean(data$unsplit.infinitive)+20, "mean", col="green")
With par(), you can set up graphical parameters. A layout with two plots side by side is specified with the mfrow argument. The line par(mfrow=c(1,2)) means “multiframe, row-wise, 1 row × 2 columns layout.” As a result, the two plots are organized in 1 row and 2 columns.
Although very popular among linguists, the mean is far from reliable!
Consider the two vectors below:
b <- c(10, 30, 50, 70, 80)
c <- c(10, 30, 50, 70, 110)
The vectors b
and c
are identical except for one value. In corpus linguistics, this might be caused by a word whose frequency is abnormally high. This minor difference translates into a large difference in the mean because of the few data that we have. The smaller the data set, the more sensitive it is towards extreme values.
mean(b)
## [1] 48
mean(c)
## [1] 54
To address this problem, the mean() function comes with an optional argument (trim) which allows you to specify a proportion of outlying values that are removed from the computation of the mean.
Each vector contains five values. If you want to remove the top and bottom values, you need to set trim to 0.2 (i.e. 20%). Because \(5 \times 0.2 = 1\), setting trim to 0.2 removes one value from each end of the sorted vector, i.e. two values in total: the highest and the lowest.
mean(b, trim = 0.2) # = mean(c(30, 50, 70))
## [1] 50
mean(c, trim = 0.2) # = mean(c(30, 50, 70))
## [1] 50
The resulting trimmed means are equal. Trimming means makes sense if the data set is large. If it is not, you should reconsider calculating the mean, whether trimmed or not.
The median is the value that you obtain when you divide the values in your data set into two “equal” parts. When your data set consists of an odd number of values, the median is the value in the middle. In the vector b, this value is 50.
b
## [1] 10 30 50 70 80
We can verify this with the median()
function. There is an equal number of values on either part of the median.
median(b)
## [1] 50
When the vector consists of an even number of values, the value in the middle does not necessarily correspond to a value found in the vector.
median(c(b, 100))
## [1] 60
As opposed to the mean, the median is not affected by extreme values. What the median does not tell you is the behavior of the values on either side of it.
Interestingly, the median corresponds to the mean of the two middle values if the data consists of an even number of values.
a <- 1:12; a
## [1] 1 2 3 4 5 6 7 8 9 10 11 12
median(a)
## [1] 6.5
mean(c(6,7))
## [1] 6.5
The median does not necessarily correspond to the mean, as evidenced below.
par(mfrow=c(1,2))
# first plot
plot(data$split.infinitive, xlab="decades", ylab="frequency counts", main="split infinitive")
abline(h = mean(data$split.infinitive), col="blue")
abline(h = median(data$split.infinitive), col="blue", lty=3)
text(5, mean(data$split.infinitive)+15, "mean", col="blue")
text(5, median(data$split.infinitive)+15, "median", col="blue")
# second plot
plot(data$unsplit.infinitive, xlab="decades", ylab="frequency counts", main="unsplit infinitive")
abline(h = mean(data$unsplit.infinitive), col="green")
abline(h = median(data$unsplit.infinitive), col="green", lty=3)
text(5, mean(data$unsplit.infinitive)+20, "mean", col="green")
text(5, median(data$unsplit.infinitive)+20, "median", col="green")
The mode is the nominal value that occurs the most frequently in a tabulated data set. You obtain the mode with which.max().
Let us load df_each_every_bnc_baby.txt:
df.each.every <- read.delim("https://tinyurl.com/dfeachevery", header=TRUE)
head(df.each.every)
## corpus.file info mode type exact.match determiner
## 1 A1E.xml W newsp brdsht nat: commerce wtext NEWS each nation each
## 2 A1E.xml W newsp brdsht nat: commerce wtext NEWS each other each
## 3 A1E.xml W newsp brdsht nat: commerce wtext NEWS each other each
## 4 A1E.xml W newsp brdsht nat: commerce wtext NEWS each country each
## 5 A1E.xml W newsp brdsht nat: commerce wtext NEWS each type each
## 6 A1E.xml W newsp brdsht nat: commerce wtext NEWS every problem every
## NP NP_tag
## 1 nation NN1
## 2 other NN1
## 3 other NN1
## 4 country NN1
## 5 type NN1
## 6 problem NN1
str(df.each.every)
## 'data.frame': 2339 obs. of 8 variables:
## $ corpus.file: chr "A1E.xml" "A1E.xml" "A1E.xml" "A1E.xml" ...
## $ info : chr "W newsp brdsht nat: commerce" "W newsp brdsht nat: commerce" "W newsp brdsht nat: commerce" "W newsp brdsht nat: commerce" ...
## $ mode : chr "wtext" "wtext" "wtext" "wtext" ...
## $ type : chr "NEWS" "NEWS" "NEWS" "NEWS" ...
## $ exact.match: chr "each nation" "each other" "each other" "each country " ...
## $ determiner : chr "each" "each" "each" "each" ...
## $ NP : chr "nation" "other" "other" "country" ...
## $ NP_tag : chr "NN1" "NN1" "NN1" "NN1" ...
We want to know the mode of the variable NP_tag, i.e. which tag is observed most often among the six attested levels (NN0, NN1, NN1-AJ0, NN1-VVB, NN2, and NP0). We isolate and tabulate the variable of interest with table().
tab.NP.tags <- table(df.each.every$NP_tag)
tab.NP.tags
##
## NN0 NN1 NN1-AJ0 NN1-VVB NN2 NP0
## 13 2242 15 6 6 57
We run which.max() on the tabulated data. The function returns the mode and its position.
which.max(tab.NP.tags)
## NN1
## 2
The mode of NP_tag is NN1. If you want the corresponding frequency rather than the level, use max():
max(tab.NP.tags)
## [1] 2242
The mode is the tallest bar of a barplot.
barplot(tab.NP.tags)
Same as above, but in tidy style:
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
df <- df.each.every %>%
count(NP_tag) %>%
rename(count = n)
ggplot(df, aes(x=NP_tag, y=count)) +
geom_bar(stat = "identity")
Dispersion is the spread of a set of observations. If many data points are scattered far from the value of a centrality measure, the dispersion is large.
quantile()
By default, the quantile() function divides the frequency distribution into four equal ordered subgroups known as quartiles. The first quartile ranges from 0% to 25%, the second quartile from 25% to 50%, the third quartile from 50% to 75%, and the fourth quartile from 75% to 100%.
quantile(data$split.infinitive, type=1)
## 0% 25% 50% 75% 100%
## 4 35 102 181 873
quantile(data$unsplit.infinitive, type=1)
## 0% 25% 50% 75% 100%
## 42 1114 1309 1396 1537
The type argument allows the user to choose from nine quantile algorithms, the details of which may be accessed by entering ?quantile. By default, R uses the seventh type.
IQR()
The interquartile range (IQR) is the difference between the third and the first quartiles, i.e. the 75th and the 25th percentiles of the data. It may be used as an alternative to the standard deviation to assess the spread of the data.
IQR(data$split.infinitive, type=7)
## [1] 129.75
IQR(data$unsplit.infinitive, type=7)
## [1] 264.25
In base R, the summary() function combines centrality measures and quantiles (more precisely, quartiles).
summary(data$split.infinitive)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.00 53.75 116.00 174.40 183.50 873.00
summary(data$unsplit.infinitive)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 42 1135 1318 1210 1399 1537
The IQR values confirm what we already know: the frequency distribution of the split infinitive is less dispersed than the frequency distribution of the unsplit infinitive.
The same function can be applied to the whole data frame.
summary(data)
## decade split.infinitive unsplit.infinitive
## Min. :1810 Min. : 4.00 Min. : 42
## 1st Qu.:1858 1st Qu.: 53.75 1st Qu.:1135
## Median :1905 Median :116.00 Median :1318
## Mean :1905 Mean :174.40 Mean :1210
## 3rd Qu.:1952 3rd Qu.:183.50 3rd Qu.:1399
## Max. :2000 Max. :873.00 Max. :1537
boxplot() in base-R
A boxplot provides a graphic representation of the spread of the values around a central point. A boxplot is the graphic equivalent of summary().
.
You obtain a boxplot with the boxplot() function.
boxplot(rnorm(1000), col="grey")
Note that the boxplot will look different each time you enter the code because rnorm(1000) generates 1000 random values from the normal distribution.
The center of the plot consists of a box that corresponds to the middle half of the data. The height of the box is determined by the interquartile range. The thick horizontal line that splits the box in two is the median.
If the median is centered between the lower limit of the second quartile and the upper limit of the third quartile, this is because the central part of the data is roughly symmetric. If not, this is because the frequency distribution is skewed.
Whiskers are found on either side of the box. They extend from the upper and lower limits of the box to the horizontal lines at their ends. These lines are drawn at the most extreme data points that lie within 1.5 interquartile ranges above the third quartile or below the first quartile. If the whiskers have different lengths, it is also a sign that the frequency distribution is skewed.
The data values beyond the whiskers are known as outliers. They are displayed individually as circle dots. Although connoted negatively, outliers can be interesting and should not be systematically ignored.
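These landmarks can be recomputed by hand on a small made-up vector. The sketch below uses quantile(); note that boxplot() itself relies on hinges, so its limits may differ slightly on small samples:

```r
x <- c(10, 30, 50, 70, 80, 300)      # 300 is an extreme value
q <- quantile(x, c(0.25, 0.75))      # lower and upper limits of the box
iqr <- q[2] - q[1]                   # height of the box
lower.fence <- q[1] - 1.5 * iqr
upper.fence <- q[2] + 1.5 * iqr
x[x < lower.fence | x > upper.fence] # flagged as an outlier: 300
```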
Regarding the split_unsplit.rds data set, we use boxplots to compare the dispersion of the frequency distributions of split and unsplit infinitives. There are two ways of doing it. The first way consists in selecting the desired columns of the data frame. Here, this is done via subsetting.
boxplot(data[,c(2,3)])
The second way consists in plotting each variable side by side as two separate vectors. In that case, the variables are not labeled unless we override the default settings with the names argument.
boxplot(data$unsplit.infinitive, data$split.infinitive)
If we were to superimpose these two boxplots, they would not overlap. This shows that their distributions are radically different.
The boxplot for split.infinitive is shorter than the boxplot for unsplit.infinitive because the frequency distribution of the former variable is less dispersed than that of the latter. The boxplot for unsplit.infinitive shows that the frequency distribution of this variable is skewed: the whiskers do not have the same length, and the median is close to the upper limit of the box.
ggplot2
To make a boxplot with ggplot2, use geom_boxplot(). Before we can use this function, though, we need to ‘tidy’ the data.
library(tidyr)
data.tidy <- data %>%
pivot_longer(
!(decade), # all the columns except 'decade' are concerned
names_to = "construction", # new column
values_to = "count", # where the counts will appear
values_drop_na = TRUE # do not include NA values (providing NA values appear)
)
head(data.tidy) # inspect
## # A tibble: 6 × 3
## decade construction count
## <int> <chr> <int>
## 1 1810 split.infinitive 4
## 2 1810 unsplit.infinitive 42
## 3 1820 split.infinitive 4
## 4 1820 unsplit.infinitive 503
## 5 1830 split.infinitive 10
## 6 1830 unsplit.infinitive 946
library(ggplot2)
p <- ggplot(data.tidy, aes(construction, count))
p + geom_boxplot()
Same with notches.
p + geom_boxplot(notch = TRUE)
## notch went outside hinges. Try setting notch=FALSE.
Same with colored outliers
p + geom_boxplot(outlier.colour = "red", outlier.shape = 1)
Same with one color per variable level (adds a legend).
p + geom_boxplot(aes(colour = construction))
Same without outliers but with original data points and no jitter.
p + geom_boxplot(aes(colour = construction), outlier.shape = NA) +
geom_jitter(width = 0)
Same with jitter, to facilitate the interpretation of data points.
p + geom_boxplot(aes(colour = construction), outlier.shape = NA) + geom_jitter(width = 0.1)
The variance (\(\sigma^2\)) and the standard deviation (\(\sigma\)) use the mean as their central point.
The variance (\(\sigma^2\)) measures how much a data set is spread out. It is calculated by:
\[\sigma^2 = \frac{\sum(x-\bar{x})^2}{N-1}\]
Fortunately, R has a built-in function for the variance: var().
var(data$split.infinitive)
## [1] 47190.88
var(data$unsplit.infinitive)
## [1] 130852.6
The standard deviation (\(\sigma\)) is the most widely used measure of dispersion. It is the square root of the variance.
\[ \sigma = \sqrt{\frac{\sum(x-\bar{x})^2}{N-1}} \]
In R, you obtain the standard deviation of a frequency distribution either by calculating the variance of the vector and then taking its square root,
sqrt(var(data$split.infinitive))
## [1] 217.2346
sqrt(var(data$unsplit.infinitive))
## [1] 361.7356
or by applying the dedicated function: sd().
sd(data$split.infinitive)
## [1] 217.2346
sd(data$unsplit.infinitive)
## [1] 361.7356
As expected, the variance and the standard deviation of unsplit.infinitive are larger than those of split.infinitive.
The data file for this exercise is modals.by.genre.BNC.rds, which you load as follows:
modals <- readRDS(url("https://tinyurl.com/modalsbygenrebnc"))
modals
## ACPROSE CONVRSN FICTION NEWS NONAC OTHERPUB OTHERSP UNPUB
## can 44793 23161 32293 16269 53297 53392 26262 11816
## could 17379 7955 49826 14045 32923 19684 11976 5733
## may 35224 628 5302 6134 32934 20923 4267 6853
## might 11293 3524 13917 3634 13110 7215 4710 1465
## must 13511 2989 15043 4306 15522 11176 3045 4064
## ought 976 451 1649 221 1115 477 820 110
## shall 4097 1639 4855 408 3746 2306 1233 1701
## should 19420 4344 13791 8900 25622 22014 7647 7104
## will 32805 9032 24285 37476 53246 66980 15934 19258
## would 29903 9895 56934 23375 59211 33132 23778 8741
Write a for-loop that prints (with cat()) the mean, the median, and the maximum of the modal frequencies for each text genre.
The for-loop should output something like the following:
## ACPROSE 20940.1 18399.5 44793
## CONVRSN 6361.8 3934 23161
## FICTION 21789.5 14480 56934
## NEWS 11476.8 7517 37476
## NONAC 29072.6 29272.5 59211
## OTHERPUB 23729.9 20303.5 66980
## OTHERSP 9967.2 6178.5 26262
## UNPUB 6684.5 6293 19258
A sociolinguist studies the use of the gap-filler ‘whatever’ in Valleyspeak. She records all conversations involving two Valley girls over a week and counts the number of times each girl says ‘whatever’ to fill a gap in the conversation. The (fictitious) data is summarized in Tab. 5.1.
| | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday | Sunday |
|---|---|---|---|---|---|---|---|
| Valley girl 1 | 314 | 299 | 401 | 375 | 510 | 660 | 202 |
| Valley girl 2 | 304 | 359 | 357 | 342 | 320 | 402 | 285 |
For each girl: