Chapter 3 Statistical Analyses of a Single Variable

This chapter uses data from Google Analytics’ channels report, which shows how much traffic comes from specific paid versus unpaid channels, to illustrate how to analyze a single variable.

Let’s import the Google Analytics channels report into RStudio.

#You will need to install the readxl package

library(readxl) 

my_data <- read_excel("google analytics sample data.xlsx") #Import the dataset and name it my_data

Table 3.1 presents part of a public demo dataset of the channels report from Google Analytics for the Google Merchandise Store, an e-commerce site operated by Google that markets and sells Google-, YouTube-, and Android-branded merchandise such as stationery and bags, among other products.

#You will need to install the kableExtra package
library(kableExtra)

knitr::kable(head(my_data[,1:12], 20), digits = 5, booktabs = TRUE,
  caption = 'The First 20 Rows of the Google Analytics Sample Data'
) %>%
  kable_styling(font_size = 6.5)
Table 3.1: The First 20 Rows of the Google Analytics Sample Data
Channel Age Users New Users Sessions Bounce Rate Pages / Session Avg. Session Duration Ecommerce Conversion Rate Transactions Revenue Year
Direct 25-34 81822 76251 118241 0.4219 5.825 227.7 0.03649 4315 332772 2021
Direct 18-24 65506 61565 88898 0.4718 4.982 199.8 0.02642 2349 198848 2021
Direct 35-44 41947 38535 59313 0.4219 5.915 229.8 0.03842 2279 189793 2021
Direct 45-54 26800 24967 37477 0.4466 5.477 226.9 0.03421 1282 109532 2021
Direct 55-64 12981 12217 17703 0.4644 5.269 213.1 0.03474 615 54809 2021
Direct 65+ 9432 8849 12548 0.4679 5.239 210.6 0.02981 374 35610 2021
Paid Search 18-24 2015 1766 2797 0.5570 4.715 175.1 0.02252 63 8424 2021
Paid Search 35-44 1394 1216 1980 0.4753 5.209 157.3 0.03535 70 8175 2021
Paid Search 25-34 2175 1829 3198 0.4381 5.881 195.8 0.03502 112 7577 2021
Direct 25-34 30772 29628 40539 0.3780 5.784 233.6 0.00306 124 7240 2020
Direct 18-24 17545 17024 21888 0.4271 5.078 210.2 0.00315 69 5325 2020
Direct 45-54 8491 8126 11115 0.3857 5.731 230.9 0.00495 55 4174 2020
Paid Search 45-54 1083 960 1387 0.5667 4.397 138.7 0.02307 32 4086 2021
Direct 35-44 14088 13560 18461 0.3825 5.536 226.4 0.00444 82 3994 2020
Organic Search 25-34 35978 32853 47132 0.5139 3.971 164.3 0.00068 32 1738 2020
(Other) 35-44 284 174 523 0.2447 7.625 278.3 0.03824 20 1657 2021
Direct 55-64 4466 4287 5862 0.4024 5.466 232.1 0.00409 24 1530 2020
(Other) 25-34 3440 2370 4741 0.3153 5.907 232.5 0.00337 16 1472 2020
(Other) 18-24 500 302 849 0.3498 5.616 244.3 0.03651 31 1395 2021
Paid Search 55-64 804 748 1009 0.6254 3.591 111.1 0.01388 14 1316 2021

See the bibliography at the end of the chapter for a reference to the complete dataset, which you can access and work with.

3.1 Variables & Levels of Measurement

In a dataset, the columns identify the variables that represent certain qualities of individuals. Any quality that can vary can be considered a variable. Variables can take on different values or attributes. An attribute is a specific value on a variable. For example, gender might be a variable that we are interested in, and the attributes of gender could be male, female, transgender, and non-binary.

We use levels of measurement to describe how precisely variables are recorded. There are four levels of measurement: nominal, ordinal, interval, and ratio, representing the transition from analyzing qualitative to quantitative data.

3.1.1 Nominal Data

Nominal data is used to label variables. Gender, nationality, race, and ethnicity are a few of the most common examples of nominal data. It can only be categorized based on qualitative differences, and there are no orderings or directions among the categories. We only know that males and females differ qualitatively in terms of gender, but we cannot tell how much they differ, nor can we sort individuals based on their gender. We can still assign numbers to different attributes of nominal variables (e.g., we can assign 1 to females and 2 to males, or vice versa), but the numbers only serve as labels and lack numerical value.

For the Google Analytics dataset from the Google Merchandise Store (Table 3.1), the variable Channel is measured at the nominal level, and it indicates the channel through which traffic arrived at the Google Merchandise Store. Organic Search indicates visits from unpaid search results. Direct indicates visits where users navigated directly to the URL, or where the source of the visit is unknown. Referral indicates traffic where users clicked a link from another site, excluding major search engines. Display indicates traffic from display advertising, such as Google AdWords remarketing campaigns. Social indicates visits from social networks (Facebook, Twitter, etc.). Paid Search indicates traffic from PPC (Pay Per Click) campaigns run in search results.

Since each variable should be exhaustive and include all possibilities, the variable Channel is given the attribute “(Other)” to refer to traffic from an unidentified channel. All these different channels through which a user reaches our website are different attributes the Channel variable can take.
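
In R, nominal variables are commonly stored as factors, whose levels are unordered labels. Below is a minimal sketch (the column name Channel follows the dataset imported above; the object channel_factor is just an illustrative name):

#Store the nominal Channel variable as an (unordered) factor
channel_factor <- factor(my_data$Channel)

#The levels are the attributes of the nominal variable
levels(channel_factor)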

3.1.2 Ordinal Data

Ordinal data also can be categorized, but there is a ranking or ordering among the categories. Ordinal measurement has been widely used to analyze variables such as education level (e.g., “high school,” “BS,” “MS,” “Ph.D.”), income level (e.g., “less than 50K,” “50-100K,” “over 100K”), and socioeconomic status (e.g., “lower class,” “middle class,” and “upper class”).

For the Google Analytics dataset from the Google Merchandise Store (Table 3.1), the variable Age is also ordinal, and it refers to the age group of the users. We can treat such an ordinal variable as quantitative by assigning numerical values to its categories.

For example, Age can be defined as having six attributes: 1 = 18-24; 2 = 25-34; 3 = 35-44; 4 = 45-54; 5 = 55-64; 6 = 65+. There is a logical ordering among the different age groups, such that whoever answers “6” is older than an individual who answers “5.” However, we cannot precisely tell the difference between two groups because each attribute covers a wide range, and the ranges of different attributes are of different sizes.
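
In R, such an ordinal variable can be encoded as an ordered factor so that comparisons respect the logical ordering; a minimal sketch (age_ordered is an illustrative name):

#Encode Age as an ordered factor with its six attributes in logical order
age_ordered <- factor(my_data$Age,
                      levels = c("18-24", "25-34", "35-44", "45-54", "55-64", "65+"),
                      ordered = TRUE)

#Ordered factors support order comparisons between categories
age_ordered[1] > age_ordered[2]  #TRUE here: row 1 is "25-34", a higher category than row 2's "18-24"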

3.1.3 Interval Data

Interval data refers not only to the classification and ordering of data but also specifies that the distances between any two adjacent attributes on the scale are equal. For interval data, the difference between two values is meaningful, but there is no true zero point or fixed beginning, that is, no point on the scale where 0 indicates the absence of the quantity.

The most popular example is temperature in degrees Fahrenheit or Celsius. The difference between 90 degrees F and 80 degrees F is the same as the difference between 70 degrees F and 60 degrees F; however, at 0 degrees F we cannot say that temperature no longer exists. For interval measures, the value 0 does not mean the absence of the variable; rather, it is just an arbitrary setting that we have agreed upon.

Another example of interval data is year. For year, 0 does not represent a missing value but indicates the year when Christ was born, and the time before it is given the prefix BC. If you were born in 2005 and I was born in 2000, we know that I am five years older than you.

For the Google Analytics dataset from the Google Merchandise Store (Table 3.1), the only interval variable is Year, indicating the year of each observation.

3.1.4 Ratio Data

For ratio data, the difference between two values is also meaningful, and there is a true zero, indicating the absence of the quantity. Media consumption is a great example of ratio data. Media consumption can be measured in the days, hours, or minutes a person is exposed to media content. If a person has zero media consumption, that means this person does not consume media at all; in other words, for this person, media consumption is absent.

For ratio data, zero is the starting point, and the distance between two adjacent values is the same. In addition, it makes sense to calculate the ratio of different values: If you have six hours of media consumption and I only have three hours, then we can say that you consume twice as much media as I do.

Table 3.1 has six ratio variables. Sessions is the total number of sessions within the date range; a session is the period of time a user is actively engaged with the Google Merchandise Store website. Users is the number of users who initiated at least one session during the date range, with a value of zero meaning no such user exists. New Users is the number of first-time users during the selected date range. Transactions is the number of completed purchases on the Google Merchandise Store. Ecommerce Conversion Rate is the percentage of sessions that resulted in an e-commerce transaction. Revenue is the revenue from completed purchases. If the revenue from the display channel is 0, that simply means no income was produced by display ads.

What you can do to analyze your data may be limited by the level of measurement of the variable. The complexity and precision of the levels of measurement increase from nominal to ratio. At the ratio level of measurement, we can not only add and subtract numbers but also calculate the ratio of two numbers. For example, if the revenue from the organic search channel is 8k and the revenue from paid search is 4k, then the revenue generated from organic search is twice as much as the revenue from paid search.

The rows in a spreadsheet are closely associated with the unit of analysis, one of the most important ideas in a research project: the major entity that you are analyzing in your study. You will need to determine the unit of analysis since it determines the analysis you conduct for your research.

For instance, if you are comparing the e-commerce performances (e.g., Users, Sessions, Transactions, Revenues, ECR) of different channels, the unit of analysis is each individual channel. If you are comparing the e-commerce performance of different age groups, the unit of analysis is each age group. If you are comparing the e-commerce performance of different channels and age groups, the unit of analysis is each individual channel within each age group.

After defining your research questions, the related unit(s) of analysis, and the levels of measurement, we can start the data analysis journey by conducting a statistical analysis of a single variable.

3.2 Frequency Distribution

The easiest thing we can do is to count the number of times an event or a value occurs for the variable(s) of our choice; this is called frequency analysis. We can aggregate individual observations of a variable into groups so that a frequency distribution of these groups can serve as a convenient means of summarizing or describing our data.

We are going to use the tidyverse package (which includes dplyr) to count the aggregated transactions by channel and by age group.

#Frequency and grouped frequency analysis
#You will need to install the tidyverse package
library(tidyverse)

Transaction_Frequency_Channel <- my_data %>% 
  group_by(Channel) %>% 
  summarise(Frequency = sum(Transactions))

Transaction_Frequency_Age <- my_data %>% 
  group_by(Age) %>% 
  summarise(Frequency = sum(Transactions))

Table 3.2 demonstrates the grouped frequency of transactions for each channel and age group. From the frequency analysis, we can tell that the direct channel (n = 11581) and the age group of 25 to 34 (n = 4654) had the greatest number of transactions in the Google Merchandise Store.

knitr::kable(
  list(
    head(arrange(Transaction_Frequency_Channel, desc(Frequency))), 
    head(arrange(Transaction_Frequency_Age, desc(Frequency)))), 
  booktabs = TRUE,
  caption = 'Grouped Transaction Frequency for Channel and Age'
) %>%
  kable_styling(font_size = 14)
Table 3.2: Grouped Transaction Frequency for Channel and Age
Channel Frequency
Direct 11581
Paid Search 330
(Other) 138
Organic Search 98
Referral 32
Display 6
Age Frequency
25-34 4654
18-24 2558
35-44 2496
45-54 1394
55-64 673
65+ 414

3.3 Percentage Change

Here, we can also use grouped frequency analysis to compare the revenue from each channel between 2020 and 2021 and assess the dynamic changes of each channel.

#Create subsets for year 2020 and year 2021

my_data_2020 <- my_data %>% filter(Year == 2020)
my_data_2021 <- my_data %>% filter(Year == 2021)

Revenue_Frequency_2020 <- my_data_2020 %>% 
  group_by(Channel) %>% 
  summarise(Revenue = sum(Revenue))

Revenue_Frequency_2021 <- my_data_2021 %>% 
  group_by(Channel) %>% 
  summarise(Revenue = sum(Revenue))

As illustrated in Table 3.3, the direct channel created the most revenue in both 2020 and 2021. When you have such data for two points in time, you can calculate how much change there has been over the period. The result is expressed as a percentage and is called the percentage change or the rate of change. Here is the formula: Percentage Change = [(Number at later time / Number at earlier time) − 1] × 100%.

knitr::kable(
  list(
    head(arrange(Revenue_Frequency_2020, desc(Revenue))), 
    head(arrange(Revenue_Frequency_2021, desc(Revenue)))), 
  booktabs = TRUE,
  caption = 'Grouped Revenue in 2020 and 2021'
) %>%
  kable_styling(font_size = 14)
Table 3.3: Grouped Revenue in 2020 and 2021
Channel Revenue
Direct 23565.7
Organic Search 4628.4
(Other) 3154.8
Paid Search 1975.9
Referral 864.7
Social 164.8
Channel Revenue
Direct 921363.0
Paid Search 29908.2
(Other) 5193.1
Referral 852.1
Display 354.3
Organic Search 188.6

For example, from Table 3.3, in 2020 the revenue generated from the paid search channel was $1,975.93, and in 2021 it was $29,908.18. The percentage change in revenue from the paid search channel from 2020 to 2021 is then calculated as follows: [(29908.18 / 1975.93) − 1] × 100% = 1413.62%, meaning that from 2020 to 2021, the revenue generated from the paid search channel increased by 1413.62%.
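
As a sketch, the same calculation can be reproduced in R from the grouped revenue tables created above:

#Revenue of the paid search channel in each year
rev_2020 <- Revenue_Frequency_2020$Revenue[Revenue_Frequency_2020$Channel == "Paid Search"]
rev_2021 <- Revenue_Frequency_2021$Revenue[Revenue_Frequency_2021$Channel == "Paid Search"]

#Percentage change = [(later / earlier) - 1] x 100
(rev_2021 / rev_2020 - 1) * 100  #about 1414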

3.4 Relative Frequency

In frequency analysis, we also can count how often something happens divided by all outcomes, which is called relative frequency analysis. For example, from Table 3.3, the revenue from the organic search channel in 2020 is $4,628.44, and the total revenue in 2020 is the sum of revenue from all channels, $34,422.26. The relative revenue from the organic search channel is 4,628.44 / 34,422.26 = 13.45%, indicating that 13.45% of the revenue in 2020 came from the organic search channel. We can use the code below to get the relative revenue for each channel in 2020 and 2021 (Table 3.4).

Revenue_Relative_Frequency_2020 <- my_data_2020 %>% 
  group_by(Channel) %>% 
  summarise('Revenue_Percentage' = sum(Revenue)/sum(my_data_2020$Revenue))

Revenue_Relative_Frequency_2021 <- my_data_2021 %>% 
  group_by(Channel) %>% 
  summarise('Revenue_Percentage' = sum(Revenue)/sum(my_data_2021$Revenue))
knitr::kable(
  list(
    head(arrange(Revenue_Relative_Frequency_2020, desc(Revenue_Percentage))), 
    head(arrange(Revenue_Relative_Frequency_2021, desc(Revenue_Percentage)))), 
  booktabs = TRUE,
  caption = 'Relative Revenue by Channel in 2020 and 2021'
) %>%
  kable_styling(font_size = 14)
Table 3.4: Relative Revenue by Channel in 2020 and 2021
Channel Revenue_Percentage
Direct 0.6846
Organic Search 0.1345
(Other) 0.0917
Paid Search 0.0574
Referral 0.0251
Social 0.0048
Channel Revenue_Percentage
Direct 0.9619
Paid Search 0.0312
(Other) 0.0054
Referral 0.0009
Display 0.0004
Organic Search 0.0002

3.5 Percentage Points

We use the term “percentage points” to compare two different percentages at two points in time. The best way to explain this is through an example: from Table 3.4, in 2020 the share of revenue from the Paid Search channel was 5.74%; in 2021 it was 3.12%. The difference between these two percentages is expressed in percentage points: 3.12% − 5.74% = −2.62 percentage points. This means that the share of revenue from the Paid Search channel in 2021 was almost three percentage points lower than in 2020.
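
A minimal sketch of the same comparison in R, reusing the relative frequency tables from Section 3.4:

#Share of revenue from Paid Search in each year
share_2020 <- Revenue_Relative_Frequency_2020$Revenue_Percentage[
  Revenue_Relative_Frequency_2020$Channel == "Paid Search"]
share_2021 <- Revenue_Relative_Frequency_2021$Revenue_Percentage[
  Revenue_Relative_Frequency_2021$Channel == "Paid Search"]

#The difference of two percentages, expressed in percentage points
(share_2021 - share_2020) * 100  #about -2.62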

3.6 Rate

In data analysis, some variables, such as ECR (the e-commerce conversion rate from Table 3.1), are too small to visualize. Under such circumstances, we can express the percentage as a rate by converting the small proportion to a whole number with one, two, or three digits on a base of some multiple of 1,000, 10,000, 100,000, or 1 million.

For example, from Table 3.1, in 2020, the Ecommerce Conversion Rate of the organic search channel for users aged 25 to 34 is 0.00068, or 0.068%. Instead of reporting the percentage, we can move the decimal point far enough to the right to produce a number greater than 1, which allows us to better visualize the data. In the case of the Ecommerce Conversion Rate, moving the decimal point four places to the right is the same as multiplying the value by 10,000. Consequently, the Ecommerce Conversion Rate of the organic search channel was 6.8 per 10,000 sessions.

We can use the R code below to add a column representing the Ecommerce Conversion Rate per 10,000 sessions to my_data.

my_data$ECR <- my_data$`Ecommerce Conversion Rate`*10000

Now, the new column has been added to my_data as illustrated in Table 3.5.

knitr::kable(head(my_data[,1:13], 20), digits = 5, booktabs = TRUE,
  caption = 'The First 20 Rows of the Google Analytics Sample Data'
) %>%
  kable_styling(font_size = 6.5)
Table 3.5: The First 20 Rows of the Google Analytics Sample Data
Channel Age Users New Users Sessions Bounce Rate Pages / Session Avg. Session Duration Ecommerce Conversion Rate Transactions Revenue Year ECR
Direct 25-34 81822 76251 118241 0.4219 5.825 227.7 0.03649 4315 332772 2021 364.933
Direct 18-24 65506 61565 88898 0.4718 4.982 199.8 0.02642 2349 198848 2021 264.235
Direct 35-44 41947 38535 59313 0.4219 5.915 229.8 0.03842 2279 189793 2021 384.233
Direct 45-54 26800 24967 37477 0.4466 5.477 226.9 0.03421 1282 109532 2021 342.076
Direct 55-64 12981 12217 17703 0.4644 5.269 213.1 0.03474 615 54809 2021 347.399
Direct 65+ 9432 8849 12548 0.4679 5.239 210.6 0.02981 374 35610 2021 298.055
Paid Search 18-24 2015 1766 2797 0.5570 4.715 175.1 0.02252 63 8424 2021 225.241
Paid Search 35-44 1394 1216 1980 0.4753 5.209 157.3 0.03535 70 8175 2021 353.535
Paid Search 25-34 2175 1829 3198 0.4381 5.881 195.8 0.03502 112 7577 2021 350.219
Direct 25-34 30772 29628 40539 0.3780 5.784 233.6 0.00306 124 7240 2020 30.588
Direct 18-24 17545 17024 21888 0.4271 5.078 210.2 0.00315 69 5325 2020 31.524
Direct 45-54 8491 8126 11115 0.3857 5.731 230.9 0.00495 55 4174 2020 49.483
Paid Search 45-54 1083 960 1387 0.5667 4.397 138.7 0.02307 32 4086 2021 230.714
Direct 35-44 14088 13560 18461 0.3825 5.536 226.4 0.00444 82 3994 2020 44.418
Organic Search 25-34 35978 32853 47132 0.5139 3.971 164.3 0.00068 32 1738 2020 6.789
(Other) 35-44 284 174 523 0.2447 7.625 278.3 0.03824 20 1657 2021 382.409
Direct 55-64 4466 4287 5862 0.4024 5.466 232.1 0.00409 24 1530 2020 40.942
(Other) 25-34 3440 2370 4741 0.3153 5.907 232.5 0.00337 16 1472 2020 33.748
(Other) 18-24 500 302 849 0.3498 5.616 244.3 0.03651 31 1395 2021 365.135
Paid Search 55-64 804 748 1009 0.6254 3.591 111.1 0.01388 14 1316 2021 138.751

3.7 Ratio

A ratio compares the size of two numbers, indicating how many times one number contains the other. For example, let’s calculate the ratio of returning users to new users using the formula: Number of Old Users / Number of New Users. We first use the R code below to create a new column of old users (Users − New Users); then we use the group_by function to calculate the ratio of returning users to new users for each channel.

#Create a new column of old (returning) users
my_data$OldUsers <- my_data$Users - my_data$`New Users`

userratio <- my_data %>% 
  group_by(Channel) %>% 
  summarise(OldUsers = sum(OldUsers),
            NewUsers = sum(`New Users`),
            Ratio = OldUsers / NewUsers)

From Table 3.6, we can see that, aside from the unidentified “(Other)” channel, the referral channel has a relatively higher ratio of returning (old) users to new users, indicating a better retention rate.

knitr::kable(
    head(arrange(userratio, desc(Ratio))), 
  booktabs = TRUE,
  caption = 'The Ratio of Old Users and New Users for Each Channel'
) %>%
  kable_styling(font_size = 14)
Table 3.6: The Ratio of Old Users and New Users for Each Channel
Channel OldUsers NewUsers Ratio
(Other) 3383 7268 0.4655
Referral 8101 23027 0.3518
Paid Search 2116 11593 0.1825
Affiliates 1002 8071 0.1241
Organic Search 8428 81809 0.1030
Display 524 7624 0.0687

3.8 Central Tendency

Besides using frequency analysis to summarize the distribution of a variable, we can also use central tendency analysis to understand which attribute of a variable is most typical or representative of the variable’s distribution. Specifically, central tendency analysis describes what the center of the data for a variable is. What do we mean by the center of the data? Most commonly, the mean, median, and mode. The mean is the sum of all data points divided by the total number of observations. The median is the midpoint of the data, with an equal number of data points above and below; if the number of data points is even, the median is the mean of the two middle points. The mode is the data point that is most prevalent in the data set; it represents the most likely outcome.

Let’s first create a small dataset of test scores from 20 students.

StudentID <- c(1:20)
Score <- c(40,50,50,60,60,60,60,60,70,70,70,80,80,80,80,90,90,90,90,100)
TestData <- data.frame(StudentID, Score)
ScoreFrequency <- as.data.frame(table(TestData$Score)) # Calculate the frequency of each score
colnames(ScoreFrequency) <- c("Score","Frequency")

Table 3.7 demonstrates the frequency of each score.

knitr::kable(
    arrange(ScoreFrequency, Score), 
  booktabs = TRUE,
  caption = 'Frequency of Test Scores from 20 Students', align = "l"
) %>%
  kable_styling(font_size = 16)
Table 3.7: Frequency of Test Scores from 20 Students
Score Frequency
40 1
50 2
60 5
70 3
80 4
90 4
100 1

The mean score is the sum of all scores divided by 20: (40×1 + 50×2 + 60×5 + 70×3 + 80×4 + 90×4 + 100×1)/20 = 71.5, meaning that the average score among these 20 students is 71.5.

When it comes to identifying the median value, we sort all individuals from the lowest value to the highest, and whoever falls in the middle gives us the median. But we have 20 students in this sample, an even number, meaning that there are two individuals in the middle (Table 3.8: student number 10 and number 11). For situations like this, we take the mean of these two individuals’ values to compute the median: (70 + 70)/2 = 70.

knitr::kable(head(TestData, 20), booktabs = TRUE,
  caption = 'A Small Dataset of Test Scores from 20 Students', align = "l"
) %>%
  kable_styling(font_size = 16)
Table 3.8: A Small Dataset of Test Scores from 20 Students
StudentID Score
1 40
2 50
3 50
4 60
5 60
6 60
7 60
8 60
9 70
10 70
11 70
12 80
13 80
14 80
15 80
16 90
17 90
18 90
19 90
20 100
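
We can verify both values directly with R’s built-in functions:

#Verify the mean and median of the 20 test scores
mean(TestData$Score)    #71.5
median(TestData$Score)  #70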

We can easily identify the most frequent test score by looking into the score distribution visualization.

# You will need to install the ggplot2 package
library(ggplot2)

# Barplot
ggplot(ScoreFrequency, aes(x=Score, y=Frequency)) + 
  geom_bar(stat = "identity")

Figure 3.1: Distribution of Test Scores from 20 Students

Since 60 is the most frequent data point, it is the mode of the dataset. When we conduct central tendency analysis, we also need to consider the data type. While interval and ratio data have a median, mode, and mean, nominal data has only a mode. Furthermore, when the data distribution is symmetrical, such as when your data are normally distributed, the mean is the best measure of central tendency. But when the dataset has a strongly skewed distribution, the best indicator of central tendency is the median or mode, with the median usually preferred.

Now we can use the code below to get the mean, median, and mode for the number of Users and New Users, the number of Sessions, the E-commerce Conversion Rates, Transactions, and Revenues for the Google Analytics data of the Google Merchandise Store. Note that when a vector does not contain a repeating value (i.e., has no mode), the getmode function below returns the first value in that vector. For the table below, there is no mode for Sessions, and the code automatically returned the first value, 118241, in the report.

#Calculate the mean for each numeric column
mean_report <- my_data %>% 
  summarise_if(is.numeric, mean)

#Calculate median for each col
median_report <- my_data %>% 
  summarise_if(is.numeric, median)

#Create the mode function
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

#Calculate mode for each col
mode_report <- my_data %>% 
  summarise_if(is.numeric, getmode)


#Combine the reports
centerreport <- as.data.frame(rbind(mean_report, median_report, mode_report))
rownames(centerreport) <- c("Mean","Median","Mode")
centerreport <- centerreport[, -10] #Drop the Year column
knitr::kable(
    centerreport, 
    digits = 0,
    booktabs = TRUE,
    caption = 'Descriptive Statistics-mean,median,mode', align = "l"
) %>%
  kable_styling(font_size = 9)
Table 3.9: Descriptive Statistics-mean,median,mode
Users New Users Sessions Bounce Rate Pages / Session Avg. Session Duration Ecommerce Conversion Rate Transactions Revenue ECR OldUsers
Mean 2 2 2 2 2 2 2 2 2 2 2
Median 940 768 1162 0 5 194 0 3 131 22 116
Mode 22 0 118241 0 6 228 0 0 0 0 3412

It is interesting to find huge discrepancies between the different central tendency measures. This also indicates that reporting the results of different central tendency analyses can provide a more comprehensive understanding of the data distribution. For example, the mean value of Users (5661) is much higher than the median (940) and the mode (22), indicating a skewed distribution due to some large-value outliers.

The weighted mean is a special mean where some data points contribute more than others. To calculate the weighted mean, we need to assign a weight to each data point. For example, we can calculate the weighted mean of the E-commerce Conversion Rate using the number of users as the weight: the more users a data point has, the larger the weight of its E-commerce Conversion Rate (ECR). In R, we can use the weighted.mean function as below to get the weighted mean of the ECR of the Google Store:

#Calculated the weighted mean for ECR
weighted.mean(my_data$`Ecommerce Conversion Rate`, my_data$Users)
## [1] 0.01784

When taking the number of users into consideration, the weighted mean of the E-commerce Conversion Rate (0.01784) is almost 18 per 1,000 sessions, which is a bit higher than the unweighted mean of 11 per 1,000 sessions (table above). This also indicates that the data points with high E-commerce Conversion Rates had more users than those with low rates.
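
Equivalently, the weighted mean is the sum of each value times its weight, divided by the sum of the weights; a quick sketch to verify:

#Manual weighted mean: sum(x * w) / sum(w)
x <- my_data$`Ecommerce Conversion Rate`
w <- my_data$Users
sum(x * w) / sum(w)  #matches weighted.mean(x, w)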

3.9 Data Variation

If two data sets have the same center points, does that mean they have the same data distribution? Let’s look at two small datasets:

Score1 <- data.frame(c(110, 90,70,50,30))
Score2 <- data.frame(c(75, 70,70,70,65))
colnames(Score1) <- c("Score1")
colnames(Score2) <- c("Score2")
knitr::kable(
  list(
    head(arrange(Score1, desc(Score1))), 
    head(arrange(Score2, desc(Score2)))), 
  booktabs = TRUE,
  caption = 'Two Small Data Sets with the Same Mean and Median'
) %>%
  kable_styling(font_size = 14)
Table 3.10: Two Small Data Sets with the Same Mean and Median
Score1
110
90
70
50
30
Score2
75
70
70
70
65

Although they have the same mean (70) and median (70), the detailed distributions vary significantly: the data in data set 1 are more spread out. This shows that central tendency measures alone cannot fully represent a distribution. While central tendency is the key to data description and prediction, variation is used to evaluate the quality of central tendency measures, and it plays an essential role in describing the statistical distribution.

The variation ratio reflects the proportion of cases that are not in the modal category. For example, the mode for Sessions is 3787, but this mode contains only two cases. Since the number of observations is 96, the variation ratio for Sessions is (96 − 2)/96 = .98, or 98%, indicating that the mode of 3787 is not a very representative indicator of Sessions. On the contrary, for the variable Transactions, the mode is 0, with 53 data points taking this value. The variation ratio of Transactions is (96 − 53)/96 = .45, or 45%, indicating that “0” is a relatively representative value for Transactions. This variation ratio also reflects the need to improve conversions across a variety of channels for the Google Merchandise Store.
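
A sketch of the variation ratio in R, reusing the getmode function defined in Section 3.8 (variation_ratio is an illustrative helper name):

#Variation ratio = proportion of cases not in the modal category
variation_ratio <- function(v) {
  1 - sum(v == getmode(v)) / length(v)
}

variation_ratio(my_data$Transactions)  #about .45: the mode 0 covers 53 of the 96 cases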

Range is the difference between the highest and lowest scores of a distribution. In our dataset, the range of Users is 81822 − 22 = 81800 (see Table 3.11).

Inter-ranges drop a specified highest and lowest percentage of data points and then calculate the range of the remaining data points. The most popular inter-ranges are the inter-decile range, the inter-quintile range, and the inter-quartile range, which drop the highest and lowest 10%, 20%, and 25% of the data points, respectively, and reflect the range of the remaining data points.
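
These inter-ranges can be computed with R’s quantile function; a minimal sketch for the Users variable:

#Inter-quartile range: drop the lowest and highest 25% of data points
quantile(my_data$Users, 0.75) - quantile(my_data$Users, 0.25)  #equivalent to IQR(my_data$Users)

#Inter-decile range: drop the lowest and highest 10% of data points
quantile(my_data$Users, 0.90) - quantile(my_data$Users, 0.10)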

Table 3.11 lists the range and inter-quartile range (IQR) for the number of users and new users, the number of sessions, the e-commerce conversion rate, transactions, and revenue for the Google Analytics dataset.

#Calculate range for each col
range_report <- my_data %>% 
  summarise_if(is.numeric, range)

#Calculate Inter quartile range (IQR) for each col
range_report_interquartile <- my_data %>% 
  summarise_if(is.numeric, IQR)

#Aggregate the center report with the range and IQR reports
stats_report <- bind_rows(centerreport, range_report, range_report_interquartile)
rownames(stats_report) <- c("mean","median","mode","min","max","IQR")
stats_report <- stats_report[, -12] #Drop the Year column re-added by bind_rows
knitr::kable(
    stats_report, 
    digits = 0,
    booktabs = TRUE,
    caption = 'Descriptive Statistics', align = "l"
) %>%
  kable_styling(font_size = 9)
Table 3.11: Descriptive Statistics
Users New Users Sessions Bounce Rate Pages / Session Avg. Session Duration Ecommerce Conversion Rate Transactions Revenue ECR OldUsers
mean 2 2 2 2 2 2 2 2 2 2 2
median 940 768 1162 0 5 194 0 3 131 22 116
mode 22 0 118241 0 6 228 0 0 0 0 3412
min 22 0 31 0 2 13 0 0 0 0 2
max 81822 76251 118241 1 13 583 0 4315 332772 1250 5571
IQR 2729 2262 3787 0 2 130 0 18 1149 150 364

The IQR is important for finding outliers, values at an abnormal distance from the other values in a data set. We use quartiles to divide the observations into four intervals with an equal number of data points: the first quartile includes the lowest 25% of data points; the second quartile covers the data points between 25.1% and 50%, with its upper bound corresponding to the median; the third quartile includes the data points above the median, between 50.1% and 75%; and the fourth quartile has the highest 25%.

All data points above q0.75 + 1.5 × IQR or below q0.25 − 1.5 × IQR are considered potential outliers. Here, q0.25 and q0.75 refer to the first and third quartiles, respectively.

For data from digital and social media, outliers represent important opportunities for improving audience conversion and engagement. Let’s take a look at how to find outliers using code.

# Histogram: a simple way to detect outliers is to draw a histogram of the data.

#Using ggplot2; pick hex colors at https://www.colorbook.io/hexcolors/view/0C4C8A

ggplot(my_data) +
  aes(x = Transactions) +
  geom_histogram(binwidth = 2000, fill = "#8A1537") +
  theme_minimal()

#TRY MORE THEME:https://ggplot2.tidyverse.org/reference/ggtheme.html
#Boxplot: in addition to histograms, boxplots are also useful to detect potential outliers.

boxplot(my_data$Transactions,
          ylab = "Transactions")

#Define Outlier
find_outlier <- function(x) {
  return(x < quantile(x, .25) - 1.5*IQR(x) | x > quantile(x, .75) + 1.5*IQR(x))
}

#add new column to data frame that indicates if each observation is an outlier
my_data_outlier <- my_data %>%
  filter(!Channel %in% c('Direct', '(Other)')) %>% 
  group_by(Age) %>%
  mutate(outlier = ifelse(find_outlier(Transactions), Transactions, NA))

#create box plot of points by age and label outliers
ggplot(my_data_outlier, aes(x=Age, y=Transactions)) +
  geom_boxplot() +
  geom_text(aes(label=outlier), na.rm=TRUE, hjust=-.5)

#Try to use a different group - Channel
my_data_outlier <- my_data %>%
  filter(!Channel %in% c('Direct', '(Other)')) %>% 
  group_by(Channel) %>%
  mutate(outlier = ifelse(find_outlier(Transactions), Transactions, NA))

#create box plot of points by Channel and label outliers
ggplot(my_data_outlier, aes(x=Channel, y=Transactions)) +
  geom_boxplot() +
  geom_text(aes(label=outlier), na.rm=TRUE, hjust=-.5)

For the figures above, data points from the “Direct” and “(Other)” channels were removed from the dataset since they might be misclassified due to failed channel detection in Google Analytics. For the paid search and referral channels, the audience in the 25-to-34 age group had far more transactions than the other age groups. This indicates the importance of using paid search and referral activities to drive more of the 25-to-34 audience to visit and shop at the store.

Standard deviation is another measure of data variation. It measures the amount of variation in the data, or simply put, the extent to which the data are spread out. It is the square root of the average squared distance from the mean.

Suppose we have two variables: the first one contains five units whose values are 2, 3, 3, 3, 4; and the second one contains five units as well, with the values being 1, 2, 3, 4, 5. Although these two variables have the same mean value (Mean = 3), you can clearly see that the latter one is more spread-out than the first one, indicating a bigger standard deviation.
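
We can confirm the difference with R’s sd function:

#The second variable is more spread out, so its standard deviation is larger
sd(c(2, 3, 3, 3, 4))  #0.71
sd(c(1, 2, 3, 4, 5))  #1.58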

The figure below illustrates the calculation of the standard deviation for the Google Analytics transactions data.

Calculations of Standard Deviation

The mean of transactions is 7. We first calculate the distance of each observation’s transactions from the mean, then square those distances, because some of the distances are positive and others negative: without squaring first, the distances might cancel each other out when added, failing to reflect the real variation in the data set. Once we get the sum of all squared distances, we divide it by the number of observations (n = 96) minus 1, and then take the square root to bring the measure back to the original scale, which gives us the standard deviation. Usually, a low standard deviation indicates that most of the data points are close to the average, while a high standard deviation means the data points are spread out.
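
A sketch of the same steps in R, which should match the built-in sd function:

#Standard deviation computed step by step
x <- my_data$Transactions
squared_distances <- (x - mean(x))^2            #squared distance of each point from the mean
sqrt(sum(squared_distances) / (length(x) - 1))  #divide by n - 1, then take the square root

#Built-in equivalent
sd(x)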

The code for calculating the standard deviation is simple.

#Calculate the standard deviation for each numeric column
sd_report <- my_data %>% 
  summarise_if(is.numeric, sd)

#Append the sd report to the stats_report
Descriptive_Statistics <- bind_rows(stats_report, sd_report)
rownames(Descriptive_Statistics) <- c("mean","median","mode","min","max","IQR", "sd")
Descriptive_Statistics <- Descriptive_Statistics[,-12] #Drop the Year column re-added by bind_rows
knitr::kable(
    Descriptive_Statistics, 
    digits = 0,
    booktabs = TRUE,
    caption = 'Descriptive Statistics', align = "l"
) %>%
  kable_styling(font_size = 9)
Table 3.12: Descriptive Statistics
Users New Users Sessions Bounce Rate Pages / Session Avg. Session Duration Ecommerce Conversion Rate Transactions Revenue ECR OldUsers
mean 2 2 2 2 2 2 2 2 2 2 2
median 940 768 1162 0 5 194 0 3 131 22 116
mode 22 0 118241 0 6 228 0 0 0 0 3412
min 22 0 31 0 2 13 0 0 0 0 2
max 81822 76251 118241 1 13 583 0 4315 332772 1250 5571
IQR 2729 2262 3787 0 2 130 0 18 1149 150 364
sd 4 4 4 4 4 4 4 4 4 4 4

The high standard deviations of all variables indicate high variation among values and skewed, non-normal distributions.

3.10 Assignment: Analyzing a Single Variable

Learning Objectives: Apply grouped frequency, relative frequency, percentage change, percentage point, rate, ratio, mean, median, mode, weighted mean, variation ratio, range, outlier, and standard deviation analyses to the analysis of a single variable.

Step 1 Download the Google Analytics data from the UA-Google Merchandise Store (Acquisition – All Traffic – Channels). Add a secondary dimension of your choice (e.g., Age). Select the data for two whole years (e.g., 2020 & 2021). Download each year’s full data set from Google Analytics as a .xlsx file.

Step 2 Open the two .xlsx data sets and clean the data sets by removing the extra sheets.

Step 3 Import the two data sets into RStudio.

Step 4 Using the code from Case Study 2 (importing and cleaning data), import, clean, and re-organize your data; then develop your own code for analyzing a single variable (or several variables) of your choice using the following statistics:

a. Grouped frequency
b. Relative frequency
c. Percentage change
d. Percentage points
e. Rate
f. Ratio
g. Mean
h. Median
i. Mode
j. Weighted mean
k. Variation ratio
l. Range
m. Outlier
n. Standard deviation

Step 5 Discuss with your group members to interpret the results. (Storytelling)

Step 6 Orally present your research in class. Your grade will be evaluated on two factors: a) covering the analysis of all metrics, and b) presenting meaningful insights from the results of the data analysis.