In this report, we model game attendance so that management can act to maximize attendance and profits. We are presented with a dataset covering a single season of Dodgers home games. There are four promotions that management can run: caps, shirts, fireworks, and bobble heads. We explore how these promotional items influence attendance. We first construct a model using all available information, then refine it and find that attendance can be explained well by a simple linear regression model using only the month, day of the week, bobble head, and fireworks information. Our results show that we can expect an increase of about 17,028 people for games with fireworks and an increase of about 10,995 people when bobble heads are sold. To refine this model further, we use a polynomial regression model in which attendance is modeled by month, day of week, fireworks, bobble heads, and temperature. Our conclusion from this data is that to increase attendance, we should have fireworks and bobble heads. However, due to the imbalance of the shirt and cap information in our data, we cannot draw decisive conclusions about their effect on attendance.
In this part, we explore our dataset. This includes identifying the type of each feature (numerical or categorical), the number of features, and the distribution of these features. This will give us more insight into how we should construct the models.
library(RSQLite)    ## if a package is not on the computer, install it only once using Tools > Install packages...
library(tidyverse)  ## dplyr, tidyr and ggplot2 are used throughout (dbplyr is also needed for tbl() on a database connection)
library(data.table) ## used with mltools below for one-hot encoding
library(mltools)    ## one_hot()
library(pander)     ## formatted tables
con <- dbConnect(SQLite(), "../data/dodgers.sqlite")
tbl(con, "events") %>%
collect() %>%
mutate(day_of_week = factor(day_of_week, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")),
month = factor(month, levels = c("APR","MAY","JUN","JUL","AUG","SEP","OCT"))) %>%
mutate_if(is.character, factor) %>%
mutate(temp = round((temp - 32) * 5 / 9)) -> events # convert the temperature from Fahrenheit to Celsius
head(events)
# A tibble: 6 × 12
month day attend day_of_week opponent temp skies day_night cap shirt
<fct> <dbl> <dbl> <fct> <fct> <dbl> <fct> <fct> <fct> <fct>
1 APR 10 56000 Tuesday Pirates 19 Clear Day NO NO
2 APR 11 29729 Wednesday Pirates 14 Cloudy Night NO NO
3 APR 12 28328 Thursday Pirates 14 Cloudy Night NO NO
4 APR 13 31601 Friday Padres 12 Cloudy Night NO NO
5 APR 14 46549 Saturday Padres 14 Cloudy Night NO NO
6 APR 15 38359 Sunday Padres 18 Clear Day NO NO
# … with 2 more variables: fireworks <fct>, bobblehead <fct>
str(events)
tibble [81 × 12] (S3: tbl_df/tbl/data.frame)
$ month : Factor w/ 7 levels "APR","MAY","JUN",..: 1 1 1 1 1 1 1 1 1 1 ...
$ day : num [1:81] 10 11 12 13 14 15 23 24 25 27 ...
$ attend : num [1:81] 56000 29729 28328 31601 46549 ...
$ day_of_week: Factor w/ 7 levels "Monday","Tuesday",..: 2 3 4 5 6 7 1 2 3 5 ...
$ opponent : Factor w/ 17 levels "Angels","Astros",..: 13 13 13 11 11 11 3 3 3 10 ...
$ temp : num [1:81] 19 14 14 12 14 18 16 17 18 19 ...
$ skies : Factor w/ 2 levels "Clear","Cloudy": 1 2 2 2 2 1 2 2 2 1 ...
$ day_night : Factor w/ 2 levels "Day","Night": 1 2 2 2 2 1 2 2 2 2 ...
$ cap : Factor w/ 2 levels "NO","YES": 1 1 1 1 1 1 1 1 1 1 ...
$ shirt : Factor w/ 2 levels "NO","YES": 1 1 1 1 1 1 1 1 1 1 ...
$ fireworks : Factor w/ 2 levels "NO","YES": 1 1 1 2 1 1 1 1 1 2 ...
$ bobblehead : Factor w/ 2 levels "NO","YES": 1 1 1 1 1 1 1 1 1 1 ...
From the above summary of the events table, we see that we have a tibble of size 81x12, i.e. data about all 81 home games played by the Dodgers. For each of these games, we have the following features (columns):
These are the features indicating the month that the game was played in, the day, the attendance, the weekday, the opponent, and the temperature.
barplot(summary(events$month), main = "Games per month", xlab = "month",ylab = "Games")
hist(events$day, breaks = dim(events %>% count(day))[1], main = "Games per day", xlab = "day", ylab = "Games")
hist(events$attend, breaks = 10, main = "Games per attendance", xlab = "attendance", ylab = "Games")
barplot(summary(events$day_of_week), main = "Games per day of week", xlab = "Day of week",ylab = "Games", las=2, cex.names=0.6)
barplot(summary(events$opponent), main = "Games per opponent", xlab = "Opponent", ylab = "Games", cex.names=0.6, las=2)
hist(events$temp, breaks = 10, main = "Games vs. Temperature", xlab = "temperature", ylab = "Games")
# Reference used for histograms: https://stackoverflow.com/questions/38994579/label-the-x-axis-correct-in-a-histogram-in-r
# and bar plots: https://stackoverflow.com/questions/21639392/make-frequency-histogram-for-factor-variables
From the histograms of the month and the day of the week, we can see that there are fewer games played in October and fewer games played on Thursdays compared to the other months and days respectively. Furthermore, we can see that the team plays each opponent at least three times. The attendance and temperature data seem to follow roughly Gaussian distributions.
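One quick, informal way to check this impression of normality is a Shapiro-Wilk test on the two variables (this check is an illustrative addition, not part of the original analysis); a large p-value is consistent with an approximately Gaussian shape.
shapiro.test(events$attend)  # informal normality check for attendance (illustrative only)
shapiro.test(events$temp)    # informal normality check for temperature (illustrative only)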
This feature indicates whether the skies were clear or cloudy at the time of the game. Counting the frequency of each level, we find 62 games played under clear skies and 19 under cloudy skies.
summary(events$skies)
Clear Cloudy
62 19
This feature indicates the time of the day. We can see that there are 15 games played during the day and 66 games played at night.
summary(events$day_night)
Day Night
15 66
This feature indicates whether caps were sold during the game. We have 79 games without and 2 games with caps sold. Due to the imbalance of this feature, i.e. only two instances where caps were sold, we cannot expect to draw meaningful conclusions.
summary(events$cap)
NO YES
79 2
This feature indicates whether shirts were sold at the match. We see that there are 78 games without and 3 games with shirts sold. Again, as with the caps, there is a large imbalance in this feature, which means it will not provide us with sufficient information about its impact on attendance.
summary(events$shirt)
NO YES
78 3
This feature indicates if there were fireworks at the match.
summary(events$fireworks)
NO YES
67 14
This feature indicates if there were bobbleheads sold at the match.
summary(events$bobblehead)
NO YES
70 11
From this analysis, we can see that, apart from caps and shirts which are severely imbalanced, the remaining features can be useful for modeling attendance since they have more samples for each level. The levels are not evenly spread, which would be preferable, but relative to caps and shirts the balance is much better.
Next, we preprocess the data (one-hot encoding) and find the correlation coefficient between each feature and the attendance. Correlation tells us how attendance changes with a change in each feature. It can take any value between -1 and 1: negative values mean that attendance decreases as that feature increases, while positive values indicate that attendance increases with the feature. A large magnitude represents a strong relationship, while values near zero indicate that the feature has little association with attendance.
events_table <- one_hot(as.data.table(events)) # reference used: https://datatricks.co.uk/one-hot-encoding-in-r-three-simple-methods
correlation <- cor(events_table[, get("attend")],events_table[,-"attend"])
barplot(correlation, cex.names=0.7, las=2, main = "Correlation of features with attendance",
ylab = "Correlation Coefficient")
We can see that the bobble head promotion is strongly and positively correlated with attendance, while fireworks are essentially uncorrelated. The two other promotions are also slightly positively correlated, but as previously mentioned, their imbalance means these values may not reflect reality.
In this section, we wish to further investigate the underlying relationships in our dataset. For this, we analyse the frequency of different features with respect to others. This includes the frequency of each promotion (shirts, caps, fireworks, and bobble head) with respect to the day and the month.
Table 1.1 and Figure 1.1 show us the number of games played each month and day of the week.
events %>%
count(day_of_week, month) %>%
pivot_wider(names_from = day_of_week, values_from = n, values_fill = 0) %>%
pander(caption = "(\\#tab:monthweekday) Number of games played in each weekday and month")
month | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday | Sunday |
---|---|---|---|---|---|---|---|
APR | 1 | 2 | 2 | 1 | 2 | 2 | 2 |
MAY | 3 | 3 | 2 | 1 | 3 | 3 | 3 |
JUN | 1 | 1 | 1 | 1 | 2 | 2 | 1 |
JUL | 3 | 3 | 2 | 0 | 1 | 1 | 2 |
AUG | 2 | 2 | 3 | 1 | 3 | 2 | 2 |
SEP | 1 | 1 | 1 | 1 | 2 | 3 | 3 |
OCT | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
events %>%
ggplot(aes(day_of_week)) +
geom_bar(aes(fill=month))
Figure 1.1: Barplot of counts of games for each weekday and month
Table 1.2 shows the number of games played during the day/night and under cloudy/clear skies for each month. Table 1.3 shows the same counts for each day of the week. From this perspective, we see that most games are played at night and under clear skies.
month_dn <- events %>%
count(day_night, month) %>%
pivot_wider(names_from = day_night, values_from = n, values_fill = 0)
month_skies <- events %>%
count(skies, month) %>%
pivot_wider(names_from = skies, values_from = n, values_fill = 0)
cbind(month_dn, month_skies[,2:3])%>%
pander(caption = "(\\#tab:monthdnskies) Frequency of day_night and skies with respect to month.")
month | Day | Night | Clear | Cloudy |
---|---|---|---|---|
APR | 3 | 9 | 5 | 7 |
MAY | 2 | 16 | 16 | 2 |
JUN | 2 | 7 | 8 | 1 |
JUL | 2 | 10 | 8 | 4 |
AUG | 3 | 12 | 14 | 1 |
SEP | 3 | 9 | 9 | 3 |
OCT | 0 | 3 | 2 | 1 |
day_dn <- events %>%
count(day_night, day_of_week) %>%
pivot_wider(names_from = day_night, values_from = n, values_fill = 0)
day_skies <- events %>%
count(skies, day_of_week) %>%
pivot_wider(names_from = skies, values_from = n, values_fill = 0)
cbind(day_dn, day_skies[,2:3])%>%
pander(caption = "(\\#tab:daydnskies) Frequency of day_night and skies with respect to the day of the week.")
day_of_week | Day | Night | Clear | Cloudy |
---|---|---|---|---|
Tuesday | 1 | 12 | 10 | 2 |
Wednesday | 2 | 10 | 9 | 4 |
Saturday | 1 | 12 | 7 | 5 |
Sunday | 11 | 2 | 4 | 1 |
Monday | 0 | 12 | 10 | 3 |
Thursday | 0 | 5 | 9 | 4 |
Friday | 0 | 13 | 13 | 0 |
Figure 1.2 shows the attendance of the games with respect to the month and the day of the week using a heatmap where the red color shows higher attendance and the yellow (lighter) color shows less attendance.
# Taken from the tutorial in class.
xtabs(attend ~ month + day_of_week, events) %>%
heatmap()
Figure 1.2: Heatmap of attendance versus weekday and month.
Bobble heads, shirts, caps, and fireworks are the four promotional features that management uses to boost attendance. Tables 1.4 and 1.5 show the frequency of these promotions on each day of the week and in each month respectively. As mentioned before, caps and shirts appear only twice and three times respectively, so they are not very informative. Fireworks are mostly part of Friday games but are relatively evenly spread over the months. One more observation is that there are no promotions during the month of October.
table_bobblehead <- events %>%
count(day_of_week, bobblehead) %>%
pivot_wider(names_from = bobblehead, values_from = n, values_fill = 0) %>%
mutate(Total = YES + NO) %>%
select(-NO) %>%
rename(bobblehead = YES)
table_cap <- events %>%
count(day_of_week, cap) %>%
pivot_wider(names_from = cap, values_from = n, values_fill = 0) %>%
mutate(Total = YES + NO) %>%
select(-NO) %>%
rename(cap = YES)
table_shirt <- events %>%
count(day_of_week, shirt) %>%
pivot_wider(names_from = shirt, values_from = n, values_fill = 0) %>%
mutate(Total = YES + NO) %>%
select(-NO) %>%
rename(shirt = YES)
table_fireworks <- events %>%
count(day_of_week, fireworks) %>%
pivot_wider(names_from = fireworks, values_from = n, values_fill = 0) %>%
mutate(Total = YES + NO) %>%
select(-NO) %>%
rename(fireworks = YES)
cbind(table_bobblehead[1:2],table_cap[2],table_shirt[2],table_fireworks[2]) %>%
pander(caption = "(\\#tab:frequencyday) Frequency of promotions on each day of the week.")
day_of_week | bobblehead | cap | shirt | fireworks |
---|---|---|---|---|
Monday | 0 | 0 | 1 | 0 |
Tuesday | 6 | 1 | 1 | 0 |
Wednesday | 0 | 0 | 0 | 1 |
Thursday | 2 | 0 | 0 | 0 |
Friday | 0 | 0 | 0 | 13 |
Saturday | 2 | 0 | 0 | 0 |
Sunday | 1 | 1 | 1 | 0 |
table_bobblehead1 <- events %>%
count(month, bobblehead) %>%
pivot_wider(names_from = bobblehead, values_from = n, values_fill = 0) %>%
mutate(Total = YES + NO) %>%
select(-NO) %>%
rename(bobblehead = YES)
table_cap1 <- events %>%
count(month, cap) %>%
pivot_wider(names_from = cap, values_from = n, values_fill = 0) %>%
mutate(Total = YES + NO) %>%
select(-NO) %>%
rename(cap = YES)
table_shirt1 <- events %>%
count(month, shirt) %>%
pivot_wider(names_from = shirt, values_from = n, values_fill = 0) %>%
mutate(Total = YES + NO) %>%
select(-NO) %>%
rename(shirt = YES)
table_fireworks1 <- events %>%
count(month, fireworks) %>%
pivot_wider(names_from = fireworks, values_from = n, values_fill = 0) %>%
mutate(Total = YES + NO) %>%
select(-NO) %>%
rename(fireworks = YES)
cbind(table_bobblehead1[1:2],table_cap1[2],table_shirt1[2],table_fireworks1[2])%>%
pander(caption = "(\\#tab:frequencymonth) Frequency of promotions each month.")
month | bobblehead | cap | shirt | fireworks |
---|---|---|---|---|
APR | 1 | 0 | 1 | 2 |
MAY | 2 | 0 | 0 | 3 |
JUN | 2 | 0 | 1 | 2 |
JUL | 3 | 1 | 0 | 2 |
AUG | 3 | 1 | 0 | 3 |
SEP | 0 | 0 | 1 | 2 |
OCT | 0 | 0 | 0 | 0 |
Next, we plot boxplots of some of the features against attendance. Boxplots show the median, the lower and upper quartiles, the minimum, the maximum, and any outliers. Statistically, outliers are defined as points lying below Q1 - 1.5*IQR or above Q3 + 1.5*IQR, where the inter-quartile range IQR = Q3 - Q1 (upper quartile minus lower quartile).
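As a quick illustration of this rule (an illustrative sketch, not part of the original analysis), the outlier fences for the attendance variable can be computed directly:
# Outlier fences for attendance: Q1 - 1.5*IQR and Q3 + 1.5*IQR (illustrative sketch)
q <- quantile(events$attend, c(0.25, 0.75))
iqr <- IQR(events$attend)
c(lower = q[[1]] - 1.5 * iqr, upper = q[[2]] + 1.5 * iqr)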
The features we plot are: month, day of the week, day/night, skies, bobble head, and fireworks.
# Adapted from the tutorial in class.
events %>%
ggplot(aes(month, attend)) +
geom_boxplot() +
labs(title = "Attendance vs. Month")
events %>%
ggplot(aes(day_of_week, attend)) +
geom_boxplot() +
labs(title = "Attendance vs. Day of Week")
events %>%
ggplot(aes(day_night, attend)) +
geom_boxplot()+
labs(title = "Attendance vs. Skies (Day/Night)")
events %>%
ggplot(aes(skies, attend)) +
geom_boxplot()+
labs(title = "Attendance vs. Clear/Cloudy")
events %>%
ggplot(aes(bobblehead, attend)) +
geom_boxplot()+
labs(title = "Attendance vs. Bobblehead")
events %>%
ggplot(aes(fireworks, attend)) +
geom_boxplot()+
labs(title = "Attendance vs. Fireworks")
According to the boxplot of attendance vs. day/night, the medians are nearly the same for both levels, and the same holds for fireworks. This is consistent with the correlation barplot, where the correlation for fireworks is extremely small and the correlation for skies is also small. Furthermore, the strong correlation between bobblehead and attendance is reflected in the noticeably higher boxplot for games where bobbleheads were sold.
Hypothesis testing can be used to draw conclusions about the statistical relationships between features. Here, we test whether the time of day has an effect on attendance. The null hypothesis states that the true means of the two distributions are the same. We compute the p-value; if it is less than 5%, we reject the null hypothesis, otherwise we do not. Rejecting the null hypothesis indicates a statistically significant difference between the means, while failing to reject it means we do not have sufficient evidence of a difference.
# Taken from class tutorial
t.test(x=events$attend[events$day_night=="Day"],
y=events$attend[events$day_night=="Night"])
Welch Two Sample t-test
data: events$attend[events$day_night == "Day"] and events$attend[events$day_night == "Night"]
t = 0.42851, df = 23.62, p-value = 0.6722
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-3531.652 5380.397
sample estimates:
mean of x mean of y
41793.27 40868.89
Performing this t-test on the day and night attendance distributions, we see that the p-value is 0.6722. Since this is larger than 5%, we do not reject the null hypothesis, i.e. we find no statistically significant difference in attendance between day and night games.
We do the same test using the skies feature to determine if there is a statistical difference between the attendance when the sky was clear or cloudy.
t.test(x=events$attend[events$skies=="Clear"],
y=events$attend[events$skies=="Cloudy"])
Welch Two Sample t-test
data: events$attend[events$skies == "Clear"] and events$attend[events$skies == "Cloudy"]
t = 1.2868, df = 27.664, p-value = 0.2088
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1741.315 7617.103
sample estimates:
mean of x mean of y
41729.21 38791.32
Again, the p-value (0.2088) is greater than 5%. Therefore, we do not reject the null hypothesis: we find no statistically significant difference in mean attendance between clear and cloudy games.
For the numerical features (day and temperature), we can create a scatter plot of the attendance and the features. Then using the ‘geom_smooth()’ function we can create a line showing the relationship between the two variables. Using a linear model, we can also find the approximate linear equation of the relationship.
# In class, we modeled the relationship between attendance and temperature as follows (we use a knot at 24 instead of 23 since it fits better):
events %>%
ggplot(aes(temp, attend)) +
geom_jitter() +
geom_smooth(se = FALSE) +
geom_smooth(se = FALSE, method = "lm",
formula = y ~ x + pmax(0, x - 24) , col = "red")
`geom_smooth()` using method = 'loess' and formula 'y ~ x'
The figure above shows attendance plotted against temperature. The smoothing line drawn by the ‘geom_smooth’ function is plotted in blue. Using a piecewise-linear relationship of the following type:
\[ attend = \beta_0 + \beta_1 temp + \beta_2 (temp - 24)^+ + \varepsilon_i \]
we create the red line which seems to be very close to the actual relationship. In order to determine the coefficients of this line, we can use a linear regression model which gives us the following results:
lm(attend ~ temp + pmax(0, temp - 24), data = events) %>% summary()
Call:
lm(formula = attend ~ temp + pmax(0, temp - 24), data = events)
Residuals:
Min 1Q Median 3Q Max
-16488.8 -5799.8 -220.5 5079.2 16250.8
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 18387.9 6645.9 2.767 0.007065 **
temp 1124.3 317.0 3.547 0.000664 ***
pmax(0, temp - 24) -2269.4 614.8 -3.691 0.000412 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7717 on 78 degrees of freedom
Multiple R-squared: 0.1567, Adjusted R-squared: 0.1351
F-statistic: 7.249 on 2 and 78 DF, p-value: 0.001295
From this result, we can interpret the coefficients. The intercept is the predicted attendance when the temperature is 0, approximately 18388 people. The coefficient of the temperature variable is positive, so up to 24 degrees each additional degree adds about 1124.3 people. The hinge term pmax(0, temp - 24) has a negative coefficient, so attendance peaks at about 24 degrees and then decreases by about 1145.1 people (1124.3 - 2269.4) for each additional degree.
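As a worked example of these coefficients (a sketch using the rounded estimates above; the hinge_pred helper is ours, not part of the fitted model), the predicted attendance at a few temperatures is:
# Predicted attendance from the piecewise-linear fit, using the rounded coefficients above (illustrative sketch)
hinge_pred <- function(temp) 18387.9 + 1124.3 * temp - 2269.4 * pmax(0, temp - 24)
hinge_pred(c(15, 24, 30))
# Above 24 degrees the net slope is 1124.3 - 2269.4 = -1145.1 people per degree.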
Following this analysis, we do the same for the remaining numerical feature, i.e. day. We plot the attendance vs. day scatter plot and try to fit the data.
# Adapted from class
events %>%
ggplot(aes(day, attend)) +
geom_jitter() +
geom_smooth(se = FALSE) +
geom_smooth(se = FALSE, method = "lm",
formula = y ~ x , col = "red")
`geom_smooth()` using method = 'loess' and formula 'y ~ x'
Looking at the blue line, we see that the relationship between the variables is approximately linear. Thus, we use a linear model of the form:
\[ attend = \beta_0 + \beta_1 day + \varepsilon_i \]
The result of this is plotted in the red line. Next, using the linear regression model, we can find the parameter values:
lm(attend ~ day, data = events) %>% summary()
Call:
lm(formula = attend ~ day, data = events)
Residuals:
Min 1Q Median 3Q Max
-16678.1 -6216.2 -823.1 5831.9 15103.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 40662.44 1821.41 22.325 <0.0000000000000002 ***
day 23.40 97.15 0.241 0.81
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8347 on 79 degrees of freedom
Multiple R-squared: 0.000734, Adjusted R-squared: -0.01191
F-statistic: 0.05803 on 1 and 79 DF, p-value: 0.8103
From the results, the baseline attendance is about 40662, increasing by only 23.4 people for each additional day of the month, and this coefficient is not statistically significant (p = 0.81). From the coefficient and the line in the graph, we conclude that the day of the month has very little effect on the number of people who come to the games.
In this section, we aim to create a model to predict the attendance. Our aim here is to not only create an accurate model but also keep it as simple as possible.
We start our analysis with a model that takes into account all of the available features. Any predictor with a p-value smaller than 5% is considered important and significant for modeling attendance. From the result below, a model fitted on all predictors tells us that bobble head, fireworks, and the day of the week (especially Tuesday) are important features for attendance. However, this is a big model which includes unnecessary features; they complicate the model while providing little information about attendance. Therefore, we start with a simpler model and gradually add features while checking their significance. We will compare the remaining models against this one based on the residual standard error (5979) and the adjusted R-squared (0.4809). For a better model, we expect the residual standard error to decrease and the adjusted R-squared to increase (adjusted R-squared is at most 1 and can be negative for very poor fits).
full_mod <- lm(attend ~ month + day_of_week + bobblehead + cap + shirt + fireworks + day + temp + day_night + skies + opponent, data = events)
full_mod%>%summary()
Call:
lm(formula = attend ~ month + day_of_week + bobblehead + cap +
shirt + fireworks + day + temp + day_night + skies + opponent,
data = events)
Residuals:
Min 1Q Median 3Q Max
-9071.5 -3105.3 -83.3 1398.1 12467.6
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 49144.43 14120.70 3.480 0.00114 **
monthMAY 4562.82 6161.08 0.741 0.46288
monthJUN -2780.38 12052.35 -0.231 0.81862
monthJUL 4725.36 6220.19 0.760 0.45150
monthAUG 8631.58 7717.22 1.118 0.26943
monthSEP 4296.83 7739.53 0.555 0.58158
monthOCT 6217.52 9532.41 0.652 0.51763
day_of_weekTuesday 8670.34 2943.86 2.945 0.00514 **
day_of_weekWednesday 6.94 2764.44 0.003 0.99801
day_of_weekThursday 460.46 3897.62 0.118 0.90650
day_of_weekFriday -18358.17 9210.84 -1.993 0.05247 .
day_of_weekSaturday 3674.93 3341.26 1.100 0.27737
day_of_weekSunday 892.73 4810.77 0.186 0.85364
bobbleheadYES 9395.63 3203.05 2.933 0.00531 **
capYES -6503.20 5866.45 -1.109 0.27365
shirtYES 949.27 4594.60 0.207 0.83727
fireworksYES 20200.13 8352.77 2.418 0.01980 *
day 134.00 141.29 0.948 0.34811
temp -43.20 436.72 -0.099 0.92166
day_nightNight -3662.30 3590.06 -1.020 0.31325
skiesCloudy -114.01 2385.81 -0.048 0.96210
opponentAstros -20678.17 14211.96 -1.455 0.15277
opponentBraves -18497.20 13741.48 -1.346 0.18517
opponentBrewers -22322.21 15155.70 -1.473 0.14791
opponentCardinals -12532.09 13146.23 -0.953 0.34565
opponentCubs -10529.08 12640.84 -0.833 0.40938
opponentGiants -16880.09 13263.72 -1.273 0.20983
opponentMarlins -19147.75 13897.97 -1.378 0.17525
opponentMets -4234.12 6444.47 -0.657 0.51459
opponentNationals -5850.86 13928.61 -0.420 0.67649
opponentPadres -11169.54 11310.93 -0.988 0.32880
opponentPhillies -13698.94 12586.97 -1.088 0.28237
opponentPirates -12451.05 13301.72 -0.936 0.35436
opponentReds -16473.41 11915.48 -1.383 0.17379
opponentRockies -16964.63 12946.55 -1.310 0.19687
opponentSnakes -19919.33 12719.45 -1.566 0.12450
opponentWhite Sox -928.32 5735.06 -0.162 0.87215
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5979 on 44 degrees of freedom
Multiple R-squared: 0.7145, Adjusted R-squared: 0.4809
F-statistic: 3.058 on 36 and 44 DF, p-value: 0.0002485
We can use this model to predict attendance on the data and calculate the root mean square error, which is 4406.356. The plot below shows the predicted (blue) and the actual (red) attendance. We can see that this model performs relatively well, but as stated before, we wish to simplify it.
predictions_full <- predict(full_mod, newdata = events)
error_full <- (sum((predictions_full - events$attend)**2) / nrow(events))**0.5
diff_full <- (predictions_full - events$attend)
plot(predictions_full, type = 'o', col = 'blue', main = 'Attendance for the full model', ylab = 'attendance')
lines(events$attend, type = 'o', col = 'red')
In order to keep only the features that significantly improve performance, we use the anova() function to compare nested models as we add features. We start with bobble head and the day of the week since we already concluded that they are important features.
lm1 <- lm(attend ~ bobblehead + day_of_week, data=events)
lm2 <- update(lm1, . ~ . + month)
anova(lm1,lm2)
Analysis of Variance Table
Model 1: attend ~ bobblehead + day_of_week
Model 2: attend ~ bobblehead + day_of_week + month
Res.Df RSS Df Sum of Sq F Pr(>F)
1 73 3129721926
2 67 2509574563 6 620147363 2.7594 0.01858 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the p-value for the second model is less than 5%, we conclude that the month predictor is significant and therefore add it to our model. Now we add the other features one by one and test against this model to determine whether each addition significantly improves performance. Next, we add the day feature:
lm3 <- update(lm2, . ~ . + day)
anova(lm2,lm3)
Analysis of Variance Table
Model 1: attend ~ bobblehead + day_of_week + month
Model 2: attend ~ bobblehead + day_of_week + month + day
Res.Df RSS Df Sum of Sq F Pr(>F)
1 67 2509574563
2 66 2491793316 1 17781247 0.471 0.4949
We see that the p-value is greater than 5%. Referring to the earlier scatter plot and linear fit, the coefficient for day was very small (about 23), so a change in day corresponds to only a small change in attendance; this agrees with the current finding. We therefore decide not to include the day predictor in the model.
Next, we try with the temperature feature:
lm3 <- update(lm2, . ~ . + temp)
anova(lm2,lm3)
Analysis of Variance Table
Model 1: attend ~ bobblehead + day_of_week + month
Model 2: attend ~ bobblehead + day_of_week + month + temp
Res.Df RSS Df Sum of Sq F Pr(>F)
1 67 2509574563
2 66 2500949923 1 8624640 0.2276 0.6349
Again, we observe that the p-value (0.6349) is more than 5%, which means we do not reject the null hypothesis and conclude that the temperature feature is not significant for predicting attendance. The next feature we test is skies:
lm3 <- update(lm2, . ~ . + skies)
anova(lm2,lm3)
Analysis of Variance Table
Model 1: attend ~ bobblehead + day_of_week + month
Model 2: attend ~ bobblehead + day_of_week + month + skies
Res.Df RSS Df Sum of Sq F Pr(>F)
1 67 2509574563
2 66 2430749943 1 78824620 2.1403 0.1482
We see that the skies predictor is also not significant, as the p-value (0.1482) is greater than 5%. Next we try the day_night predictor:
lm3 <- update(lm2, . ~ . + day_night)
anova(lm2,lm3)
Analysis of Variance Table
Model 1: attend ~ bobblehead + day_of_week + month
Model 2: attend ~ bobblehead + day_of_week + month + day_night
Res.Df RSS Df Sum of Sq F Pr(>F)
1 67 2509574563
2 66 2508075740 1 1498824 0.0394 0.8432
We see that the day_night predictor is also not significant. Next we try the opponent predictor:
lm3 <- update(lm2, . ~ . + opponent)
anova(lm2,lm3)
Analysis of Variance Table
Model 1: attend ~ bobblehead + day_of_week + month
Model 2: attend ~ bobblehead + day_of_week + month + opponent
Res.Df RSS Df Sum of Sq F Pr(>F)
1 67 2509574563
2 51 2017425853 16 492148710 0.7776 0.7022
We see that the opponent predictor is also not significant. Next we try the cap predictor:
lm3 <- update(lm2, . ~ . + cap)
anova(lm2,lm3)
Analysis of Variance Table
Model 1: attend ~ bobblehead + day_of_week + month
Model 2: attend ~ bobblehead + day_of_week + month + cap
Res.Df RSS Df Sum of Sq F Pr(>F)
1 67 2509574563
2 66 2432756707 1 76817856 2.084 0.1536
We see that the cap predictor is also not significant. Next we try the shirt predictor:
lm3 <- update(lm2, . ~ . + shirt)
anova(lm2,lm3)
Analysis of Variance Table
Model 1: attend ~ bobblehead + day_of_week + month
Model 2: attend ~ bobblehead + day_of_week + month + shirt
Res.Df RSS Df Sum of Sq F Pr(>F)
1 67 2509574563
2 66 2402756980 1 106817583 2.9341 0.09142 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We see that the shirt predictor is also not significant. Next we try the fireworks predictor:
lm3 <- update(lm2, . ~ . + fireworks)
anova(lm2,lm3)
Analysis of Variance Table
Model 1: attend ~ bobblehead + day_of_week + month
Model 2: attend ~ bobblehead + day_of_week + month + fireworks
Res.Df RSS Df Sum of Sq F Pr(>F)
1 67 2509574563
2 66 2265194779 1 244379784 7.1204 0.009581 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We see that the p-value here is less than 5%, and thus adding fireworks to our model significantly improves the fit.
After testing all these predictors, we see that the optimal model with the fewest predictors is:
\[ attend = \beta_0 + \beta_1 bobblehead + \beta_2 day of week + \beta_3 month + \beta_4 firework + \varepsilon_i \]
lm3%>%summary()
Call:
lm(formula = attend ~ bobblehead + day_of_week + month + fireworks,
data = events)
Residuals:
Min 1Q Median 3Q Max
-9504 -3683 -709 2569 15390
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 34321.68 2418.90 14.189 < 0.0000000000000002 ***
bobbleheadYES 10995.05 2318.43 4.742 0.0000117 ***
day_of_weekTuesday 7750.46 2587.35 2.996 0.00386 **
day_of_weekWednesday 904.16 2476.14 0.365 0.71617
day_of_weekThursday 309.24 3341.63 0.093 0.92655
day_of_weekFriday -12386.23 6901.86 -1.795 0.07729 .
day_of_weekSaturday 6094.17 2445.16 2.492 0.01521 *
day_of_weekSunday 6577.96 2400.14 2.741 0.00788 **
monthMAY -2492.79 2193.60 -1.136 0.25990
monthJUN 7062.62 2616.13 2.700 0.00881 **
monthJUL 1315.38 2534.42 0.519 0.60549
monthAUG 2377.88 2300.15 1.034 0.30501
monthSEP -55.37 2413.63 -0.023 0.98177
monthOCT -502.88 3873.86 -0.130 0.89711
fireworksYES 17028.78 6381.63 2.668 0.00958 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5858 on 66 degrees of freedom
Multiple R-squared: 0.5887, Adjusted R-squared: 0.5015
F-statistic: 6.749 on 14 and 66 DF, p-value: 0.00000002848
We can observe here that the residual standard error has decreased from 5979 to 5858, whereas the adjusted R-squared has increased from 0.4809 to 0.5015.
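These two quantities can also be pulled directly from the fitted objects (a small convenience sketch, not part of the original analysis):
# Residual standard error and adjusted R-squared for the full and reduced models
c(full = summary(full_mod)$sigma, reduced = summary(lm3)$sigma)
c(full = summary(full_mod)$adj.r.squared, reduced = summary(lm3)$adj.r.squared)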
Again, we can predict attendance using this model and calculate the root mean square error, which comes out as 5288.229. This is higher than that of the full model (4406.356), but that is to be expected since we lose some information when we do not use all of the predictors. We plot the attendance predicted by the model (blue) and the actual attendance (red).
predictions <- predict(lm3, newdata = events)
error <- (sum((predictions - events$attend)**2) / nrow(events))**0.5
diff <- (predictions-events$attend)
plot(events$attend, type = 'o', col = 'red', main = 'Attendance of smaller model', ylab = 'attendance')
lines(predictions, type = 'o', col = 'blue')
The figure below shows the predictions of the full model (red) and the small model (blue).
plot(predictions, type = 'o', col = 'blue', main = 'Attendance predictions of the two models', ylab = 'attendance')
lines(predictions_full, type = 'o', col = 'red')
The root mean square difference between the predictions of the two linear models (full vs. small) is 2923.934.
error_both <- (sum((predictions-predictions_full)**2)/81)**0.5
The models explored above are linear in their predictors. In this section we explore models with polynomial terms for the ‘day’ and ‘temp’ features. The first is an extension of the small linear model determined above, where we add a quadratic term in temperature:
\[ attend = \beta_0 + \beta_1 bobblehead+ \beta_2 day of week + \beta_3 month+ \beta_4 firework + \beta_5 temp + \beta_6 temp^2 + \varepsilon_i \]
Fitting this model, we see that the residual standard error is 5660, which is lower than that of the linear model, and the adjusted R-squared is 0.5347. Both values indicate that this model performs better than the linear model.
poly_lm <- lm(attend ~ month + day_of_week + poly(temp, 2) + bobblehead +fireworks , data = events)
poly_lm %>%summary()
Call:
lm(formula = attend ~ month + day_of_week + poly(temp, 2) + bobblehead +
fireworks, data = events)
Residuals:
Min 1Q Median 3Q Max
-10082.1 -3272.2 -218.6 2439.8 13783.0
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 36976.3 2917.4 12.674 < 0.0000000000000002 ***
monthMAY -5002.8 2404.1 -2.081 0.04144 *
monthJUN 3918.3 2931.5 1.337 0.18607
monthJUL -2155.1 3022.5 -0.713 0.47844
monthAUG -1394.7 3260.2 -0.428 0.67024
monthSEP -2437.4 4039.6 -0.603 0.54839
monthOCT -3445.5 4978.5 -0.692 0.49139
day_of_weekTuesday 7360.7 2510.2 2.932 0.00466 **
day_of_weekWednesday 669.8 2395.0 0.280 0.78065
day_of_weekThursday 474.8 3231.1 0.147 0.88363
day_of_weekFriday -12216.2 6668.5 -1.832 0.07162 .
day_of_weekSaturday 6588.7 2370.3 2.780 0.00713 **
day_of_weekSunday 6268.3 2436.1 2.573 0.01241 *
poly(temp, 2)1 7623.8 11646.7 0.655 0.51508
poly(temp, 2)2 -17226.7 6923.6 -2.488 0.01546 *
bobbleheadYES 10279.8 2300.5 4.468 0.0000328 ***
fireworksYES 17343.7 6181.6 2.806 0.00664 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5660 on 64 degrees of freedom
Multiple R-squared: 0.6278, Adjusted R-squared: 0.5347
F-statistic: 6.746 on 16 and 64 DF, p-value: 0.00000001244
In order to check whether we should include this term in the model, we perform an ANOVA comparison.
anova(lm3,poly_lm)
Analysis of Variance Table
Model 1: attend ~ bobblehead + day_of_week + month + fireworks
Model 2: attend ~ month + day_of_week + poly(temp, 2) + bobblehead + fireworks
Res.Df RSS Df Sum of Sq F Pr(>F)
1 66 2265194779
2 64 2050242023 2 214952755 3.355 0.04115 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Looking at the ANOVA result, we can see that the p-value is 0.04, which is less than 5%, and thus we can conclude that adding the temperature terms significantly improves the model. Next, we add a polynomial term for the day feature and get the following model:
\[ attend = \beta_0 + \beta_1 bobblehead + \beta_2 day of week + \beta_3 month + \beta_4 firework + \beta_5 temp + \beta_6 temp^2 + \beta_7 day^7 + \varepsilon_i \]
poly_lm1 <- lm(attend ~ month + day_of_week + poly(temp, 2) + I(day^7) + bobblehead +fireworks , data = events)
poly_lm1 %>% summary()
Call:
lm(formula = attend ~ month + day_of_week + poly(temp, 2) + I(day^7) +
bobblehead + fireworks, data = events)
Residuals:
Min 1Q Median 3Q Max
-9813.1 -3046.2 -542.6 2428.4 13924.8
Coefficients:
Estimate Std. Error t value
(Intercept) 38550.24764072680 3042.44303917733 12.671
monthMAY -5447.95922479578 2390.56426556345 -2.279
monthJUN 3078.16973838830 2941.97366130247 1.046
monthJUL -3211.65072460583 3056.47100142234 -1.051
monthAUG -2935.09865908399 3358.89193369438 -0.874
monthSEP -4479.95258883456 4186.21143395065 -1.070
monthOCT -5872.16848184368 5142.50826507972 -1.142
day_of_weekTuesday 7204.96355455526 2481.29102709043 2.904
day_of_weekWednesday 547.41884727917 2366.80572272789 0.231
day_of_weekThursday 1620.39389202626 3269.56259734322 0.496
day_of_weekFriday -11404.80626336155 6605.89217725946 -1.726
day_of_weekSaturday 6824.26064247097 2345.72208702429 2.909
day_of_weekSunday 5938.59277091868 2414.85768358441 2.459
poly(temp, 2)1 14705.20663611353 12313.45656746572 1.194
poly(temp, 2)2 -18453.62947920880 6880.77822417911 -2.682
I(day^7) -0.00000014386 0.00000008921 -1.613
bobbleheadYES 10568.89620959513 2279.34031365261 4.637
fireworksYES 17183.57060841969 6106.56094999663 2.814
Pr(>|t|)
(Intercept) < 0.0000000000000002 ***
monthMAY 0.02607 *
monthJUN 0.29942
monthJUL 0.29738
monthAUG 0.38553
monthSEP 0.28863
monthOCT 0.25782
day_of_weekTuesday 0.00508 **
day_of_weekWednesday 0.81784
day_of_weekThursday 0.62190
day_of_weekFriday 0.08917 .
day_of_weekSaturday 0.00500 **
day_of_weekSunday 0.01668 *
poly(temp, 2)1 0.23686
poly(temp, 2)2 0.00934 **
I(day^7) 0.11183
bobbleheadYES 0.0000183 ***
fireworksYES 0.00652 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5590 on 63 degrees of freedom
Multiple R-squared: 0.6425, Adjusted R-squared: 0.5461
F-statistic: 6.661 on 17 and 63 DF, p-value: 0.000000011
Looking at the residual standard error we see that it has decreased from 5660 to 5590 and the value of adjusted R-squared increases from 0.5347 to 0.5461. Both indicate that the model performs better on the data. However, in order to determine if we should add this parameter to the final model, we perform the anova analysis again.
anova(lm3,poly_lm1)
Analysis of Variance Table
Model 1: attend ~ bobblehead + day_of_week + month + fireworks
Model 2: attend ~ month + day_of_week + poly(temp, 2) + I(day^7) + bobblehead +
fireworks
Res.Df RSS Df Sum of Sq F Pr(>F)
1 66 2265194779
2 63 1968970217 3 296224561 3.1594 0.03067 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(poly_lm,poly_lm1)
Analysis of Variance Table
Model 1: attend ~ month + day_of_week + poly(temp, 2) + bobblehead + fireworks
Model 2: attend ~ month + day_of_week + poly(temp, 2) + I(day^7) + bobblehead +
fireworks
Res.Df RSS Df Sum of Sq F Pr(>F)
1 64 2050242023
2 63 1968970217 1 81271806 2.6004 0.1118
Looking at the ANOVA results, we see that the p-value (0.03067) when comparing with the linear model is less than 5%, but the p-value for the comparison between the two polynomial models is 0.1118, which is more than 5%. Thus adding the day^7 term does not significantly improve the model, and we conclude that the first polynomial model is our final regression model, i.e.
Final Model:
\[ attend = \beta_0 + \beta_1 bobblehead+ \beta_2 day of week + \beta_3 month+ \beta_4 firework + \beta_5 temp + \beta_6 temp^2 + \varepsilon_i \]
Predicting the attendance using the polynomial model on the data, we get a root mean square error of 5031.066 (less than that of the linear model). The plot below shows the actual attendance (red) and the predicted attendance (blue).
predictions_poly <- predict(poly_lm, newdata = events)
error_poly <- (sum((predictions_poly - events$attend)**2) / nrow(events))**0.5
error_poly
[1] 5031.066
diff_poly <- (predictions_poly-events$attend)
plot(events$attend, type = 'o', col = 'red', main = 'Attendance of polynomial regression model', ylab = 'attendance')
lines(predictions_poly, type = 'o', col = 'blue')
We can also calculate the root mean square error between our best non-linear model (best overall) and the best linear model to see how much their predictions differ; this value is 1629.03.
error_models <- (sum((predictions_poly-predictions)**2)/81)**0.5
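Finally, as an illustration of how the final polynomial model could be used in practice, the sketch below scores a hypothetical upcoming game; the game characteristics are invented for illustration and are not taken from the dataset.
# Hypothetical example: a Saturday game in June at 24 degrees C with both a bobblehead promotion and fireworks (values invented for illustration)
new_game <- data.frame(month = "JUN", day_of_week = "Saturday", temp = 24, bobblehead = "YES", fireworks = "YES")
predict(poly_lm, newdata = new_game, interval = "prediction")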