In this report, we model game attendance so that management can act to maximize attendance and profits. We are presented with a dataset covering a single season of Dodgers home games. There are four promotions that management can run: caps, shirts, fireworks, and bobble heads. We explore how these promotional items influence attendance. We first construct a model using all available information, then refine it and find that attendance can be explained well by a simple linear regression model using only the month, day of the week, bobble head, and fireworks information. Our results show that we can expect an increase of about 17,028 people for games with fireworks and an increase of about 10,995 people when bobble heads are sold. To refine this model further, we use a polynomial regression model in which attendance is modeled by month, day of week, fireworks, bobble heads, and temperature. Our conclusion from this data is that to increase attendance, we should have fireworks and bobble heads. However, due to the imbalance of the shirt and cap information in our data, we cannot draw decisive conclusions about their effect on attendance.
In this part, we explore our dataset. This includes identifying the type of each feature (numerical or categorical), the number of features, and the distribution of these features. This will give us more insight into how we should construct the models.
library(RSQLite)    ## if a package is not on the computer, install it only once using Tools > Install packages...
library(tidyverse)  ## dplyr, tidyr and ggplot2 are used throughout (dbplyr is also needed for tbl() on a database connection)
library(data.table) ## used with mltools below for one-hot encoding
library(mltools)    ## one_hot()
library(pander)     ## formatted tables
con <- dbConnect(SQLite(), "../data/dodgers.sqlite")
tbl(con, "events") %>%
collect() %>%
mutate(day_of_week = factor(day_of_week, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")),
month = factor(month, levels = c("APR","MAY","JUN","JUL","AUG","SEP","OCT"))) %>%
mutate_if(is.character, factor) %>%
mutate(temp = round((temp - 32) * 5 / 9)) -> events # convert the temperature from Fahrenheit to Celsius
head(events)
# A tibble: 6 × 12
month day attend day_of_week opponent temp skies day_night cap shirt
<fct> <dbl> <dbl> <fct> <fct> <dbl> <fct> <fct> <fct> <fct>
1 APR 10 56000 Tuesday Pirates 19 Clear Day NO NO
2 APR 11 29729 Wednesday Pirates 14 Cloudy Night NO NO
3 APR 12 28328 Thursday Pirates 14 Cloudy Night NO NO
4 APR 13 31601 Friday Padres 12 Cloudy Night NO NO
5 APR 14 46549 Saturday Padres 14 Cloudy Night NO NO
6 APR 15 38359 Sunday Padres 18 Clear Day NO NO
# … with 2 more variables: fireworks <fct>, bobblehead <fct>
str(events)
tibble [81 × 12] (S3: tbl_df/tbl/data.frame)
$ month : Factor w/ 7 levels "APR","MAY","JUN",..: 1 1 1 1 1 1 1 1 1 1 ...
$ day : num [1:81] 10 11 12 13 14 15 23 24 25 27 ...
$ attend : num [1:81] 56000 29729 28328 31601 46549 ...
$ day_of_week: Factor w/ 7 levels "Monday","Tuesday",..: 2 3 4 5 6 7 1 2 3 5 ...
$ opponent : Factor w/ 17 levels "Angels","Astros",..: 13 13 13 11 11 11 3 3 3 10 ...
$ temp : num [1:81] 19 14 14 12 14 18 16 17 18 19 ...
$ skies : Factor w/ 2 levels "Clear","Cloudy": 1 2 2 2 2 1 2 2 2 1 ...
$ day_night : Factor w/ 2 levels "Day","Night": 1 2 2 2 2 1 2 2 2 2 ...
$ cap : Factor w/ 2 levels "NO","YES": 1 1 1 1 1 1 1 1 1 1 ...
$ shirt : Factor w/ 2 levels "NO","YES": 1 1 1 1 1 1 1 1 1 1 ...
$ fireworks : Factor w/ 2 levels "NO","YES": 1 1 1 2 1 1 1 1 1 2 ...
$ bobblehead : Factor w/ 2 levels "NO","YES": 1 1 1 1 1 1 1 1 1 1 ...
From the above summary of the events table, we see that we have a tibble of size 81x12, i.e. data about all 81 home games played by the Dodgers. For each of these games, we have the following features (columns):
These are the features indicating the month that the game was played in, the day, the attendance, the weekday, the opponent, and the temperature.
barplot(summary(events$month), main = "Games per month", xlab = "month",ylab = "Games")
hist(events$day, breaks = dim(events %>% count(day))[1], main = "Games per day", xlab = "day", ylab = "Games")
hist(events$attend, breaks = 10, main = "Games per attendance", xlab = "attendance", ylab = "Games")
barplot(summary(events$day_of_week), main = "Games per day of week", xlab = "Day of week",ylab = "Games", las=2, cex.names=0.6)
barplot(summary(events$opponent), main = "Games per opponent", xlab = "Opponent", ylab = "Games", cex.names=0.6, las=2)
hist(events$temp, breaks = 10, main = "Games vs. Temperature", xlab = "temperature", ylab = "Games")
# Reference used for histograms: https://stackoverflow.com/questions/38994579/label-the-x-axis-correct-in-a-histogram-in-r
# and bar plots: https://stackoverflow.com/questions/21639392/make-frequency-histogram-for-factor-variables
From the histograms of the month and the day of the week, we can see that there are fewer games played in October and fewer games played on Thursdays compared to the other months and days respectively. Furthermore, we can see that the team plays each opponent at least three times. The attendance and temperature data seem to follow roughly Gaussian distributions.
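One quick, informal way to check this impression of normality is a Shapiro-Wilk test on the two variables (this check is an illustrative addition, not part of the original analysis); a large p-value is consistent with an approximately Gaussian shape.
shapiro.test(events$attend)  # informal normality check for attendance (illustrative only)
shapiro.test(events$temp)    # informal normality check for temperature (illustrative only)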
This feature indicates whether the skies were clear or cloudy at the time of the game. Counting the frequency of each level, we find 62 games played under clear skies and 19 under cloudy skies.
summary(events$skies)
Clear Cloudy
62 19
This feature indicates the time of the day. We can see that there are 15 games played during the day and 66 games played at night.
summary(events$day_night)
Day Night
15 66
This feature indicates whether caps were sold during the game. We have 79 games without and 2 games with caps sold. Due to the imbalance of this feature, i.e. only two instances where caps were sold, we cannot expect to draw meaningful conclusions.
summary(events$cap)
NO YES
79 2
This feature indicates whether shirts were sold at the match. We see that there are 78 games without and 3 games with shirts sold. Again, as with the caps, there is a large imbalance in this feature, which means it will not provide us with sufficient information about its impact on attendance.
summary(events$shirt)
NO YES
78 3
This feature indicates if there were fireworks at the match.
summary(events$fireworks)
NO YES
67 14
This feature indicates if there were bobbleheads sold at the match.
summary(events$bobblehead)
NO YES
70 11
From this analysis, we can see that, apart from caps and shirts which are severely imbalanced, the remaining features can be useful for modeling attendance since they have more samples for each level. The levels are not evenly spread, which would be preferable, but relative to caps and shirts the balance is much better.
Next, we preprocess the data (one-hot encoding) and find the correlation coefficient between each feature and the attendance. Correlation tells us how attendance changes with a change in each feature. It can take any value between -1 and 1: negative values mean that attendance decreases as that feature increases, while positive values indicate that attendance increases with the feature. A large magnitude represents a strong relationship, while values near zero indicate that the feature has little association with attendance.
events_table <- one_hot(as.data.table(events)) # reference used: https://datatricks.co.uk/one-hot-encoding-in-r-three-simple-methods
correlation <- cor(events_table[, get("attend")],events_table[,-"attend"])
barplot(correlation, cex.names=0.7, las=2, main = "Correlation of features with attendance",
ylab = "Correlation Coefficient")
We can see that the bobble head promotion is strongly and positively correlated with attendance, while fireworks are essentially uncorrelated. The two other promotions are also slightly positively correlated, but as previously mentioned, their imbalance means these values may not reflect reality.
In this section, we wish to further investigate the underlying relationships in our dataset. For this, we analyse the frequency of different features with respect to others. This includes the frequency of each promotion (shirts, caps, fireworks, and bobble head) with respect to the day and the month.
Table 1.1 and Figure 1.1 show us the number of games played each month and day of the week.
events %>%
count(day_of_week, month) %>%
pivot_wider(names_from = day_of_week, values_from = n, values_fill = 0) %>%
pander(caption = "(\\#tab:monthweekday) Number of games played in each weekday and month")
month | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday | Sunday |
---|---|---|---|---|---|---|---|
APR | 1 | 2 | 2 | 1 | 2 | 2 | 2 |
MAY | 3 | 3 | 2 | 1 | 3 | 3 | 3 |
JUN | 1 | 1 | 1 | 1 | 2 | 2 | 1 |
JUL | 3 | 3 | 2 | 0 | 1 | 1 | 2 |
AUG | 2 | 2 | 3 | 1 | 3 | 2 | 2 |
SEP | 1 | 1 | 1 | 1 | 2 | 3 | 3 |
OCT | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
events %>%
ggplot(aes(day_of_week)) +
geom_bar(aes(fill=month))
Figure 1.1: Barplot of counts of games for each weekday and month
Table 1.2 shows the number of games played during the day/night and under cloudy/clear skies for each month. Table 1.3 shows the same counts for each day of the week. From this perspective, we see that most games are played at night and under clear skies.
month_dn <- events %>%
count(day_night, month) %>%
pivot_wider(names_from = day_night, values_from = n, values_fill = 0)
month_skies <- events %>%
count(skies, month) %>%
pivot_wider(names_from = skies, values_from = n, values_fill = 0)
cbind(month_dn, month_skies[,2:3])%>%
pander(caption = "(\\#tab:monthdnskies) Frequency of day_night and skies with respect to month.")
month | Day | Night | Clear | Cloudy |
---|---|---|---|---|
APR | 3 | 9 | 5 | 7 |
MAY | 2 | 16 | 16 | 2 |
JUN | 2 | 7 | 8 | 1 |
JUL | 2 | 10 | 8 | 4 |
AUG | 3 | 12 | 14 | 1 |
SEP | 3 | 9 | 9 | 3 |
OCT | 0 | 3 | 2 | 1 |
day_dn <- events %>%
count(day_night, day_of_week) %>%
pivot_wider(names_from = day_night, values_from = n, values_fill = 0)
day_skies <- events %>%
count(skies, day_of_week) %>%
pivot_wider(names_from = skies, values_from = n, values_fill = 0)
cbind(day_dn, day_skies[,2:3])%>%
pander(caption = "(\\#tab:daydnskies) Frequency of day_night and skies with respect to the day of the week.")
day_of_week | Day | Night | Clear | Cloudy |
---|---|---|---|---|
Tuesday | 1 | 12 | 10 | 2 |
Wednesday | 2 | 10 | 9 | 4 |
Saturday | 1 | 12 | 7 | 5 |
Sunday | 11 | 2 | 4 | 1 |
Monday | 0 | 12 | 10 | 3 |
Thursday | 0 | 5 | 9 | 4 |
Friday | 0 | 13 | 13 | 0 |
Figure 1.2 shows the attendance of the games with respect to the month and the day of the week using a heatmap where the red color shows higher attendance and the yellow (lighter) color shows less attendance.
# Taken from the tutorial in class.
xtabs(attend ~ month + day_of_week, events) %>%
heatmap()
Figure 1.2: Heatmap of attendance versus weekday and month.
Bobble heads, shirts, caps, and fireworks are the four promotional features that management uses to boost attendance. Tables 1.4 and 1.5 show the frequency of these promotions on each day of the week and in each month respectively. As mentioned before, caps and shirts appear only twice and three times respectively, so they are not very informative. Fireworks are mostly part of Friday games but are relatively evenly spread over the months. One more observation is that there are no promotions during the month of October.
table_bobblehead <- events %>%
count(day_of_week, bobblehead) %>%
pivot_wider(names_from = bobblehead, values_from = n, values_fill = 0) %>%
mutate(Total = YES + NO) %>%
select(-NO) %>%
rename(bobblehead = YES)
table_cap <- events %>%
count(day_of_week, cap) %>%
pivot_wider(names_from = cap, values_from = n, values_fill = 0) %>%
mutate(Total = YES + NO) %>%
select(-NO) %>%
rename(cap = YES)
table_shirt <- events %>%
count(day_of_week, shirt) %>%
pivot_wider(names_from = shirt, values_from = n, values_fill = 0) %>%
mutate(Total = YES + NO) %>%
select(-NO) %>%
rename(shirt = YES)
table_fireworks <- events %>%
count(day_of_week, fireworks) %>%
pivot_wider(names_from = fireworks, values_from = n, values_fill = 0) %>%
mutate(Total = YES + NO) %>%
select(-NO) %>%
rename(fireworks = YES)
cbind(table_bobblehead[1:2],table_cap[2],table_shirt[2],table_fireworks[2]) %>%
pander(caption = "(\\#tab:frequencyday) Frequency of promotions on each day of the week.")
day_of_week | bobblehead | cap | shirt | fireworks |
---|---|---|---|---|
Monday | 0 | 0 | 1 | 0 |
Tuesday | 6 | 1 | 1 | 0 |
Wednesday | 0 | 0 | 0 | 1 |
Thursday | 2 | 0 | 0 | 0 |
Friday | 0 | 0 | 0 | 13 |
Saturday | 2 | 0 | 0 | 0 |
Sunday | 1 | 1 | 1 | 0 |
table_bobblehead1 <- events %>%
count(month, bobblehead) %>%
pivot_wider(names_from = bobblehead, values_from = n, values_fill = 0) %>%
mutate(Total = YES + NO) %>%
select(-NO) %>%
rename(bobblehead = YES)
table_cap1 <- events %>%
count(month, cap) %>%
pivot_wider(names_from = cap, values_from = n, values_fill = 0) %>%
mutate(Total = YES + NO) %>%
select(-NO) %>%
rename(cap = YES)
table_shirt1 <- events %>%
count(month, shirt) %>%
pivot_wider(names_from = shirt, values_from = n, values_fill = 0) %>%
mutate(Total = YES + NO) %>%
select(-NO) %>%
rename(shirt = YES)
table_fireworks1 <- events %>%
count(month, fireworks) %>%
pivot_wider(names_from = fireworks, values_from = n, values_fill = 0) %>%
mutate(Total = YES + NO) %>%
select(-NO) %>%
rename(fireworks = YES)
cbind(table_bobblehead1[1:2],table_cap1[2],table_shirt1[2],table_fireworks1[2])%>%
pander(caption = "(\\#tab:frequencymonth) Frequency of promotions each month.")
month | bobblehead | cap | shirt | fireworks |
---|---|---|---|---|
APR | 1 | 0 | 1 | 2 |
MAY | 2 | 0 | 0 | 3 |
JUN | 2 | 0 | 1 | 2 |
JUL | 3 | 1 | 0 | 2 |
AUG | 3 | 1 | 0 | 3 |
SEP | 0 | 0 | 1 | 2 |
OCT | 0 | 0 | 0 | 0 |
Next, we plot boxplots of some of the features against attendance. Boxplots show the median, the lower and upper quartiles, the minimum, the maximum, and any outliers. Statistically, outliers are defined as points lying below Q1 - 1.5*IQR or above Q3 + 1.5*IQR, where the inter-quartile range IQR = Q3 - Q1 (upper quartile minus lower quartile).
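As a quick illustration of this rule (an illustrative sketch, not part of the original analysis), the outlier fences for the attendance variable can be computed directly:
# Outlier fences for attendance: Q1 - 1.5*IQR and Q3 + 1.5*IQR (illustrative sketch)
q <- quantile(events$attend, c(0.25, 0.75))
iqr <- IQR(events$attend)
c(lower = q[[1]] - 1.5 * iqr, upper = q[[2]] + 1.5 * iqr)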
The features we plot are: month, day of the week, day/night, skies, bobble head, and fireworks.
# Adapted from the tutorial in class.
events %>%
ggplot(aes(month, attend)) +
geom_boxplot() +
labs(title = "Attendance vs. Month")
events %>%
ggplot(aes(day_of_week, attend)) +
geom_boxplot() +
labs(title = "Attendance vs. Day of Week")
events %>%
ggplot(aes(day_night, attend)) +
geom_boxplot()+
labs(title = "Attendance vs. Skies (Day/Night)")
events %>%
ggplot(aes(skies, attend)) +
geom_boxplot()+
labs(title = "Attendance vs. Clear/Cloudy")
events %>%
ggplot(aes(bobblehead, attend)) +
geom_boxplot()+
labs(title = "Attendance vs. Bobblehead")
events %>%
ggplot(aes(fireworks, attend)) +
geom_boxplot()+
labs(title = "Attendance vs. Fireworks")
According to the boxplot of attendance vs. day/night, the medians are nearly the same for both levels, and the same holds for fireworks. This is consistent with the correlation barplot, where the correlation for fireworks is extremely small and the correlation for skies is also small. Furthermore, the strong correlation between bobblehead and attendance is reflected in the noticeably higher boxplot for games where bobbleheads were sold.
Hypothesis testing can be used to draw conclusions about the statistical relationships between features. Here, we test whether the time of day has an effect on attendance. The null hypothesis states that the true means of the two distributions are the same. We compute the p-value; if it is less than 5%, we reject the null hypothesis, otherwise we do not. Rejecting the null hypothesis indicates a statistically significant difference between the means, while failing to reject it means we do not have sufficient evidence of a difference.
# Taken from class tutorial
t.test(x=events$attend[events$day_night=="Day"],
y=events$attend[events$day_night=="Night"])
Welch Two Sample t-test
data: events$attend[events$day_night == "Day"] and events$attend[events$day_night == "Night"]
t = 0.42851, df = 23.62, p-value = 0.6722
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-3531.652 5380.397
sample estimates:
mean of x mean of y
41793.27 40868.89
Performing this t-test on the day and night attendance distributions, we see that the p-value is 0.6722. Since this is larger than 5%, we do not reject the null hypothesis, i.e. we find no statistically significant difference in attendance between day and night games.
We do the same test using the skies feature to determine if there is a statistical difference between the attendance when the sky was clear or cloudy.
t.test(x=events$attend[events$skies=="Clear"],
y=events$attend[events$skies=="Cloudy"])
Welch Two Sample t-test
data: events$attend[events$skies == "Clear"] and events$attend[events$skies == "Cloudy"]
t = 1.2868, df = 27.664, p-value = 0.2088
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1741.315 7617.103
sample estimates:
mean of x mean of y
41729.21 38791.32
Again, the p-value (0.2088) is greater than 5%. Therefore, we do not reject the null hypothesis: we find no statistically significant difference in mean attendance between clear and cloudy games.
For the numerical features (day and temperature), we can create a scatter plot of the attendance and the features. Then using the ‘geom_smooth()’ function we can create a line showing the relationship between the two variables. Using a linear model, we can also find the approximate linear equation of the relationship.
# In class, we modeled the relationship between attendance and temperature as follows (we use a knot at 24 instead of 23 since it fits better):
events %>%
ggplot(aes(temp, attend)) +
geom_jitter() +
geom_smooth(se = FALSE) +
geom_smooth(se = FALSE, method = "lm",
formula = y ~ x + pmax(0, x - 24) , col = "red")
`geom_smooth()` using method = 'loess' and formula 'y ~ x'
The figure above shows attendance plotted against temperature. The smoothing line drawn by the ‘geom_smooth’ function is plotted in blue. Using a piecewise-linear relationship of the following type:
\[ attend = \beta_0 + \beta_1 temp + \beta_2 (temp - 24)^+ + \varepsilon_i \]
we create the red line which seems to be very close to the actual relationship. In order to determine the coefficients of this line, we can use a linear regression model which gives us the following results:
lm(attend ~ temp + pmax(0, temp - 24), data = events) %>% summary()
Call:
lm(formula = attend ~ temp + pmax(0, temp - 24), data = events)
Residuals:
Min 1Q Median 3Q Max
-16488.8 -5799.8 -220.5 5079.2 16250.8
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 18387.9 6645.9 2.767 0.007065 **
temp 1124.3 317.0 3.547 0.000664 ***
pmax(0, temp - 24) -2269.4 614.8 -3.691 0.000412 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7717 on 78 degrees of freedom
Multiple R-squared: 0.1567, Adjusted R-squared: 0.1351
F-statistic: 7.249 on 2 and 78 DF, p-value: 0.001295
From this result, we can interpret the coefficients. The intercept is the predicted attendance when the temperature is 0, approximately 18388 people. The coefficient of the temperature variable is positive, so up to 24 degrees each additional degree adds about 1124.3 people. The hinge term pmax(0, temp - 24) has a negative coefficient, so attendance peaks at about 24 degrees and then decreases by about 1145.1 people (1124.3 - 2269.4) for each additional degree.
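As a worked example of these coefficients (a sketch using the rounded estimates above; the hinge_pred helper is ours, not part of the fitted model), the predicted attendance at a few temperatures is:
# Predicted attendance from the piecewise-linear fit, using the rounded coefficients above (illustrative sketch)
hinge_pred <- function(temp) 18387.9 + 1124.3 * temp - 2269.4 * pmax(0, temp - 24)
hinge_pred(c(15, 24, 30))
# Above 24 degrees the net slope is 1124.3 - 2269.4 = -1145.1 people per degree.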
Following this analysis, we do the same for the remaining numerical feature, i.e. day. We plot the attendance vs. day scatter plot and try to fit the data.
# Adapted from class
events %>%
ggplot(aes(day, attend)) +
geom_jitter() +
geom_smooth(se = FALSE) +
geom_smooth(se = FALSE, method = "lm",
formula = y ~ x , col = "red")
`geom_smooth()` using method = 'loess' and formula 'y ~ x'
Looking at the blue line, we see that the relationship between the variables is approximately linear. Thus, we use a linear model of the form:
\[ attend = \beta_0 + \beta_1 day + \varepsilon_i \]
The result of this is plotted in the red line. Next, using the linear regression model, we can find the parameter values:
lm(attend ~ day, data = events) %>% summary()
Call:
lm(formula = attend ~ day, data = events)
Residuals:
Min 1Q Median 3Q Max
-16678.1 -6216.2 -823.1 5831.9 15103.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 40662.44 1821.41 22.325 <0.0000000000000002 ***
day 23.40 97.15 0.241 0.81
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8347 on 79 degrees of freedom
Multiple R-squared: 0.000734, Adjusted R-squared: -0.01191
F-statistic: 0.05803 on 1 and 79 DF, p-value: 0.8103
From the results, the baseline attendance is about 40662, increasing by only 23.4 people for each additional day of the month, and this coefficient is not statistically significant (p = 0.81). From the coefficient and the line in the graph, we conclude that the day of the month has very little effect on the number of people who come to the games.
In this section, we aim to create a model to predict the attendance. Our aim here is to not only create an accurate model but also keep it as simple as possible.
We start our analysis with a model that takes into account all of the available features. Any predictor with a p-value smaller than 5% is considered important and significant for modeling attendance. From the result below, a model fitted on all predictors tells us that bobble head, fireworks, and the day of the week (especially Tuesday) are important features for attendance. However, this is a big model which includes unnecessary features; they complicate the model while providing little information about attendance. Therefore, we start with a simpler model and gradually add features while checking their significance. We will compare the remaining models against this one based on the residual standard error (5979) and the adjusted R-squared (0.4809). For a better model, we expect the residual standard error to decrease and the adjusted R-squared to increase (adjusted R-squared is at most 1 and can be negative for very poor fits).
full_mod <- lm(attend ~ month + day_of_week + bobblehead + cap + shirt + fireworks + day + temp + day_night + skies + opponent, data = events)
full_mod%>%summary()
Call:
lm(formula = attend ~ month + day_of_week + bobblehead + cap +
shirt + fireworks + day + temp + day_night + skies + opponent,
data = events)
Residuals:
Min 1Q Median 3Q Max
-9071.5 -3105.3 -83.3 1398.1 12467.6
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 49144.43 14120.70 3.480 0.00114 **
monthMAY 4562.82 6161.08 0.741 0.46288
monthJUN -2780.38 12052.35 -0.231 0.81862
monthJUL 4725.36 6220.19 0.760 0.45150
monthAUG 8631.58 7717.22 1.118 0.26943
monthSEP 4296.83 7739.53 0.555 0.58158
monthOCT 6217.52 9532.41 0.652 0.51763
day_of_weekTuesday 8670.34 2943.86 2.945 0.00514 **
day_of_weekWednesday 6.94 2764.44 0.003 0.99801
day_of_weekThursday 460.46 3897.62 0.118 0.90650
day_of_weekFriday -18358.17 9210.84 -1.993 0.05247 .
day_of_weekSaturday 3674.93 3341.26 1.100 0.27737
day_of_weekSunday 892.73 4810.77 0.186 0.85364
bobbleheadYES 9395.63 3203.05 2.933 0.00531 **
capYES -6503.20 5866.45 -1.109 0.27365
shirtYES 949.27 4594.60 0.207 0.83727
fireworksYES 20200.13 8352.77 2.418 0.01980 *
day 134.00 141.29 0.948 0.34811
temp -43.20 436.72 -0.099 0.92166
day_nightNight -3662.30 3590.06 -1.020 0.31325
skiesCloudy -114.01 2385.81 -0.048 0.96210
opponentAstros -20678.17 14211.96 -1.455 0.15277
opponentBraves -18497.20 13741.48 -1.346 0.18517
opponentBrewers -22322.21 15155.70 -1.473 0.14791
opponentCardinals -12532.09 13146.23 -0.953 0.34565
opponentCubs -10529.08 12640.84 -0.833 0.40938
opponentGiants -16880.09 13263.72 -1.273 0.20983
opponentMarlins -19147.75 13897.97 -1.378 0.17525
opponentMets -4234.12 6444.47 -0.657 0.51459
opponentNationals -5850.86 13928.61 -0.420 0.67649
opponentPadres -11169.54 11310.93 -0.988 0.32880
opponentPhillies -13698.94 12586.97 -1.088 0.28237
opponentPirates -12451.05 13301.72 -0.936 0.35436
opponentReds -16473.41 11915.48 -1.383 0.17379
opponentRockies -16964.63 12946.55 -1.310 0.19687
opponentSnakes -19919.33 12719.45 -1.566 0.12450
opponentWhite Sox -928.32 5735.06 -0.162 0.87215
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5979 on 44 degrees of freedom
Multiple R-squared: 0.7145, Adjusted R-squared: 0.4809
F-statistic: 3.058 on 36 and 44 DF, p-value: 0.0002485
We can use this model to predict attendance on the data and calculate the root mean square error, which is 4406.356. The plot below shows the predicted (blue) and the actual (red) attendance. We can see that this model performs relatively well, but as stated before, we wish to simplify it.
predictions_full <- predict(full_mod, newdata = events)
error_full <- (sum((predictions_full - events$attend)**2) / nrow(events))**0.5
diff_full <- (predictions_full - events$attend)
plot(predictions_full, type = 'o', col = 'blue', main = 'Attendance for the full model', ylab = 'attendance')
lines(events$attend, type = 'o', col = 'red')
In order to keep only the features that significantly improve performance, we use the anova() function to compare nested models as we add features. We start with bobble head and the day of the week since we already concluded that they are important features.
lm1 <- lm(attend ~ bobblehead + day_of_week, data=events)
lm2 <- update(lm1, . ~ . + month)
anova(lm1,lm2)
Analysis of Variance Table
Model 1: attend ~ bobblehead + day_of_week
Model 2: attend ~ bobblehead + day_of_week + month
Res.Df RSS Df Sum of Sq F Pr(>F)
1 73 3129721926
2 67 2509574563 6 620147363 2.7594 0.01858 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the p-value for the second model is less than 5%, we conclude that the month predictor is significant and therefore add it to our model. Now we add the other features one by one and test against this model to determine whether each addition significantly improves performance. Next, we add the day feature:
lm3 <- update(lm2, . ~ . + day)
anova(lm2,lm3)
Analysis of Variance Table
Model 1: attend ~ bobblehead + day_of_week + month
Model 2: attend ~ bobblehead + day_of_week + month + day
Res.Df RSS Df Sum of Sq F Pr(>F)
1 67 2509574563
2 66 2491793316 1 17781247 0.471 0.4949
We see that the p-value is greater than 5%. Referring to the earlier scatter plot and linear fit, the coefficient for day was very small (about 23), so a change in day corresponds to only a small change in attendance; this agrees with the current finding. We therefore decide not to include the day predictor in the model.
Next, we try with the temperature feature:
lm3 <- update(lm2, . ~ . + temp)
anova(lm2,lm3)
Analysis of Variance Table
Model 1: attend ~ bobblehead + day_of_week + month
Model 2: attend ~ bobblehead + day_of_week + month + temp
Res.Df RSS Df Sum of Sq F Pr(>F)
1 67 2509574563
2 66 2500949923 1 8624640 0.2276 0.6349
Again, we observe that the p-value (0.6349) is more than 5%, which means we do not reject the null hypothesis and conclude that the temperature feature is not significant for predicting attendance. The next feature we test is skies:
lm3 <- update(lm2, . ~ . + skies)
anova(lm2,lm3)
Analysis of Variance Table
Model 1: attend ~ bobblehead + day_of_week + month
Model 2: attend ~ bobblehead + day_of_week + month + skies
Res.Df RSS Df Sum of Sq F Pr(>F)
1 67 2509574563
2 66 2430749943 1 78824620 2.1403 0.1482
We see that the skies predictor is also not significant, as the p-value (0.1482) is greater than 5%. Next we try the day_night predictor:
lm3 <- update(lm2, . ~ . + day_night)
anova(lm2,lm3)
Analysis of Variance Table
Model 1: attend ~ bobblehead + day_of_week + month
Model 2: attend ~ bobblehead + day_of_week + month + day_night
Res.Df RSS Df Sum of Sq F Pr(>F)
1 67 2509574563
2 66 2508075740 1 1498824 0.0394 0.8432
We see that the day_night predictor is also not significant. Next we try the opponent predictor:
lm3 <- update(lm2, . ~ . + opponent)
anova(lm2,lm3)
Analysis of Variance Table
Model 1: attend ~ bobblehead + day_of_week + month
Model 2: attend ~ bobblehead + day_of_week + month + opponent
Res.Df RSS Df Sum of Sq F Pr(>F)
1 67 2509574563
2 51 2017425853 16 492148710 0.7776 0.7022
We see that the opponent predictor is also not significant. Next we try the cap predictor:
lm3 <- update(lm2, . ~ . + cap)
anova(lm2,lm3)
Analysis of Variance Table
Model 1: attend ~ bobblehead + day_of_week + month
Model 2: attend ~ bobblehead + day_of_week + month + cap
Res.Df RSS Df Sum of Sq F Pr(>F)
1 67 2509574563
2 66 2432756707 1 76817856 2.084 0.1536
We see that the cap predictor is also not significant. Next we try the shirt predictor:
lm3 <- update(lm2, . ~ . + shirt)
anova(lm2,lm3)
Analysis of Variance Table
Model 1: attend ~ bobblehead + day_of_week + month
Model 2: attend ~ bobblehead + day_of_week + month + shirt
Res.Df RSS Df Sum of Sq F Pr(>F)
1 67 2509574563
2 66 2402756980 1 106817583 2.9341 0.09142 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We see that the shirt predictor is also not significant. Next we try the fireworks predictor:
lm3 <- update(lm2, . ~ . + fireworks)
anova(lm2,lm3)
Analysis of Variance Table
Model 1: attend ~ bobblehead + day_of_week + month
Model 2: attend ~ bobblehead + day_of_week + month + fireworks
Res.Df RSS Df Sum of Sq F Pr(>F)
1 67 2509574563
2 66 2265194779 1 244379784 7.1204 0.009581 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We see that the p-value here is less than 5%, and thus adding fireworks to our model significantly improves the fit.
After testing all these predictors, we see that the optimal model with the fewest predictors is:
\[ attend = \beta_0 + \beta_1 bobblehead + \beta_2 day of week + \beta_3 month + \beta_4 firework + \varepsilon_i \]
lm3%>%summary()
Call:
lm(formula = attend ~ bobblehead + day_of_week + month + fireworks,
data = events)
Residuals:
Min 1Q Median 3Q Max
-9504 -3683 -709 2569 15390
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 34321.68 2418.90 14.189 < 0.0000000000000002 ***
bobbleheadYES 10995.05 2318.43 4.742 0.0000117 ***
day_of_weekTuesday 7750.46 2587.35 2.996 0.00386 **
day_of_weekWednesday 904.16 2476.14 0.365 0.71617
day_of_weekThursday 309.24 3341.63 0.093 0.92655
day_of_weekFriday -12386.23 6901.86 -1.795 0.07729 .
day_of_weekSaturday 6094.17 2445.16 2.492 0.01521 *
day_of_weekSunday 6577.96 2400.14 2.741 0.00788 **
monthMAY -2492.79 2193.60 -1.136 0.25990
monthJUN 7062.62 2616.13 2.700 0.00881 **
monthJUL 1315.38 2534.42 0.519 0.60549
monthAUG 2377.88 2300.15 1.034 0.30501
monthSEP -55.37 2413.63 -0.023 0.98177
monthOCT -502.88 3873.86 -0.130 0.89711
fireworksYES 17028.78 6381.63 2.668 0.00958 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5858 on 66 degrees of freedom
Multiple R-squared: 0.5887, Adjusted R-squared: 0.5015
F-statistic: 6.749 on 14 and 66 DF, p-value: 0.00000002848
We can observe here that the residual standard error has decreased from 5979 to 5858, whereas the adjusted R-squared has increased from 0.4809 to 0.5015.
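These two quantities can also be pulled directly from the fitted objects (a small convenience sketch, not part of the original analysis):
# Residual standard error and adjusted R-squared for the full and reduced models
c(full = summary(full_mod)$sigma, reduced = summary(lm3)$sigma)
c(full = summary(full_mod)$adj.r.squared, reduced = summary(lm3)$adj.r.squared)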
Again, we can predict attendance using this model and calculate the root mean square error, which comes out as 5288.229. This is higher than that of the full model (4406.356), but that is to be expected since we lose some information when we do not use all of the predictors. We plot the attendance predicted by the model (blue) and the actual attendance (red).
predictions <- predict(lm3, newdata = events)
error <- (sum((predictions - events$attend)**2) / nrow(events))**0.5
diff <- (predictions-events$attend)
plot(events$attend, type = 'o', col = 'red', main = 'Attendance of smaller model', ylab = 'attendance')
lines(predictions, type = 'o', col = 'blue')
The figure below shows the predictions of the full model (red) and the small model (blue).
plot(predictions, type = 'o', col = 'blue', main = 'Attendance predictions of the two models', ylab = 'attendance')
lines(predictions_full, type = 'o', col = 'red')
The root mean square difference between the predictions of the two linear models (full vs. small) is 2923.934.
error_both <- (sum((predictions-predictions_full)**2)/81)**0.5
The models explored above are linear in their predictors. In this section we explore models with polynomial terms for the ‘day’ and ‘temp’ features. The first is an extension of the small linear model determined above, where we add a quadratic term in temperature:
\[ attend = \beta_0 + \beta_1 bobblehead+ \beta_2 day of week + \beta_3 month+ \beta_4 firework + \beta_5 temp + \beta_6 temp^2 + \varepsilon_i \]
Fitting this model, we see that the residual standard error is 5660, which is lower than that of the linear model, and the adjusted R-squared is 0.5347. Both values indicate that this model performs better than the linear model.
poly_lm <- lm(attend ~ month + day_of_week + poly(temp, 2) + bobblehead +fireworks , data = events)
poly_lm %>%summary()
Call:
lm(formula = attend ~ month + day_of_week + poly(temp, 2) + bobblehead +
fireworks, data = events)
Residuals:
Min 1Q Median 3Q Max
-10082.1 -3272.2 -218.6 2439.8 13783.0
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 36976.3 2917.4 12.674 < 0.0000000000000002 ***
monthMAY -5002.8 2404.1 -2.081 0.04144 *
monthJUN 3918.3 2931.5 1.337 0.18607
monthJUL -2155.1 3022.5 -0.713 0.47844
monthAUG -1394.7 3260.2 -0.428 0.67024
monthSEP -2437.4 4039.6 -0.603 0.54839
monthOCT -3445.5 4978.5 -0.692 0.49139
day_of_weekTuesday 7360.7 2510.2 2.932 0.00466 **
day_of_weekWednesday 669.8 2395.0 0.280 0.78065
day_of_weekThursday 474.8 3231.1 0.147 0.88363
day_of_weekFriday -12216.2 6668.5 -1.832 0.07162 .
day_of_weekSaturday 6588.7 2370.3 2.780 0.00713 **
day_of_weekSunday 6268.3 2436.1 2.573 0.01241 *
poly(temp, 2)1 7623.8 11646.7 0.655 0.51508
poly(temp, 2)2 -17226.7 6923.6 -2.488 0.01546 *
bobbleheadYES 10279.8 2300.5 4.468 0.0000328 ***
fireworksYES 17343.7 6181.6 2.806 0.00664 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5660 on 64 degrees of freedom
Multiple R-squared: 0.6278, Adjusted R-squared: 0.5347
F-statistic: 6.746 on 16 and 64 DF, p-value: 0.00000001244
In order to check whether we should include this term in the model, we perform an ANOVA comparison.
anova(lm3,poly_lm)
Analysis of Variance Table
Model 1: attend ~ bobblehead + day_of_week + month + fireworks
Model 2: attend ~ month + day_of_week + poly(temp, 2) + bobblehead + fireworks
Res.Df RSS Df Sum of Sq F Pr(>F)
1 66 2265194779
2 64 2050242023 2 214952755 3.355 0.04115 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Looking at the ANOVA result, we can see that the p-value is 0.04, which is less than 5%, and thus we can conclude that adding the temperature terms significantly improves the model. Next, we add a polynomial term for the day feature and get the following model:
\[ attend = \beta_0 + \beta_1 bobblehead + \beta_2 day of week + \beta_3 month + \beta_4 firework + \beta_5 temp + \beta_6 temp^2 + \beta_7 day^7 + \varepsilon_i \]
poly_lm1 <- lm(attend ~ month + day_of_week + poly(temp, 2) + I(day^7) + bobblehead +fireworks , data = events)
poly_lm1 %>% summary()
Call:
lm(formula = attend ~ month + day_of_week + poly(temp, 2) + I(day^7) +
bobblehead + fireworks, data = events)
Residuals:
Min 1Q Median 3Q Max
-9813.1 -3046.2 -542.6 2428.4 13924.8
Coefficients:
Estimate Std. Error t value
(Intercept) 38550.24764072680 3042.44303917733 12.671
monthMAY -5447.95922479578 2390.56426556345 -2.279
monthJUN 3078.16973838830 2941.97366130247 1.046
monthJUL -3211.65072460583 3056.47100142234 -1.051
monthAUG -2935.09865908399 3358.89193369438 -0.874
monthSEP -4479.95258883456 4186.21143395065 -1.070
monthOCT -5872.16848184368 5142.50826507972 -1.142
day_of_weekTuesday 7204.96355455526 2481.29102709043 2.904
day_of_weekWednesday 547.41884727917 2366.80572272789 0.231
day_of_weekThursday 1620.39389202626 3269.56259734322 0.496
day_of_weekFriday -11404.80626336155 6605.89217725946 -1.726
day_of_weekSaturday 6824.26064247097 2345.72208702429 2.909
day_of_weekSunday 5938.59277091868 2414.85768358441 2.459
poly(temp, 2)1 14705.20663611353 12313.45656746572 1.194
poly(temp, 2)2 -18453.62947920880 6880.77822417911 -2.682
I(day^7) -0.00000014386 0.00000008921 -1.613
bobbleheadYES 10568.89620959513 2279.34031365261 4.637
fireworksYES 17183.57060841969 6106.56094999663 2.814
Pr(>|t|)
(Intercept) < 0.0000000000000002 ***
monthMAY 0.02607 *
monthJUN 0.29942
monthJUL 0.29738
monthAUG 0.38553
monthSEP 0.28863
monthOCT 0.25782
day_of_weekTuesday 0.00508 **
day_of_weekWednesday 0.81784
day_of_weekThursday 0.62190
day_of_weekFriday 0.08917 .
day_of_weekSaturday 0.00500 **
day_of_weekSunday 0.01668 *
poly(temp, 2)1 0.23686
poly(temp, 2)2 0.00934 **
I(day^7) 0.11183
bobbleheadYES 0.0000183 ***
fireworksYES 0.00652 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5590 on 63 degrees of freedom
Multiple R-squared: 0.6425, Adjusted R-squared: 0.5461
F-statistic: 6.661 on 17 and 63 DF, p-value: 0.000000011
Looking at the residual standard error we see that it has decreased from 5660 to 5590 and the value of adjusted R-squared increases from 0.5347 to 0.5461. Both indicate that the model performs better on the data. However, in order to determine if we should add this parameter to the final model, we perform the anova analysis again.
anova(lm3,poly_lm1)
Analysis of Variance Table
Model 1: attend ~ bobblehead + day_of_week + month + fireworks
Model 2: attend ~ month + day_of_week + poly(temp, 2) + I(day^7) + bobblehead +
fireworks
Res.Df RSS Df Sum of Sq F Pr(>F)
1 66 2265194779
2 63 1968970217 3 296224561 3.1594 0.03067 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(poly_lm,poly_lm1)
Analysis of Variance Table
Model 1: attend ~ month + day_of_week + poly(temp, 2) + bobblehead + fireworks
Model 2: attend ~ month + day_of_week + poly(temp, 2) + I(day^7) + bobblehead +
fireworks
Res.Df RSS Df Sum of Sq F Pr(>F)
1 64 2050242023
2 63 1968970217 1 81271806 2.6004 0.1118
Looking at the ANOVA results, we see that the p-value (0.03067) when comparing with the linear model is less than 5%, but the p-value for the comparison between the two polynomial models is 0.1118, which is more than 5%. Thus adding the day^7 term does not significantly improve the model, and we conclude that the first polynomial model is our final regression model, i.e.
Final Model:
\[ attend = \beta_0 + \beta_1 bobblehead+ \beta_2 day of week + \beta_3 month+ \beta_4 firework + \beta_5 temp + \beta_6 temp^2 + \varepsilon_i \]
Predicting the attendance using the polynomial model on the data, we get a root mean square error of 5031.066 (less than that of the linear model). The plot below shows the actual attendance (red) and the predicted attendance (blue).
predictions_poly <- predict(poly_lm, newdata = events)
error_poly <- (sum((predictions_poly - events$attend)**2) / nrow(events))**0.5
error_poly
[1] 5031.066
diff_poly <- (predictions_poly-events$attend)
plot(events$attend, type = 'o', col = 'red', main = 'Attendance of polynomial regression model', ylab = 'attendance')
lines(predictions_poly, type = 'o', col = 'blue')
We can also calculate the root mean square error between our best non-linear model (best overall) and the best linear model to see how much their predictions differ; this value is 1629.03.
error_models <- (sum((predictions_poly-predictions)**2)/81)**0.5
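Finally, as an illustration of how the final polynomial model could be used in practice, the sketch below scores a hypothetical upcoming game; the game characteristics are invented for illustration and are not taken from the dataset.
# Hypothetical example: a Saturday game in June at 24 degrees C with both a bobblehead promotion and fireworks (values invented for illustration)
new_game <- data.frame(month = "JUN", day_of_week = "Saturday", temp = 24, bobblehead = "YES", fireworks = "YES")
predict(poly_lm, newdata = new_game, interval = "prediction")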