Problem Set 8
Logistic Regression
We will now continue modeling data with logistic regression, using nfl_wp.csv to analyze win probability as a
function of multiple variables. The dataset includes game_id; season; label_win, a binary (0 or 1) indicator of
whether the possessing team won the game; score_differential, the current score differential between the team
in possession and the team on defense; posteam_spread, the point spread (positive or negative) for the
possessing team before the game starts; yardline_100; down; and ydstogo.
- First, load in the data. Previously, separate training and test datasets were provided. We can make this split ourselves by running the following code, which may also be helpful for your projects. Note that it splits by game rather than by row, so every row from a given game lands in the same set. For the first half of the problem set, use the training data.
library(dplyr)  # provides %>% and filter()
nfl_wp <- read.csv('data/nfl_wp.csv')
set.seed(42)
# get unique games
unique_game_ids <- unique(nfl_wp$game_id)
# randomly sample 80% of the game ids for training, rounding down to nearest integer
train_game_ids <- sample(unique_game_ids, size = floor(0.8 * length(unique_game_ids)))
# filter for train and test
nfl_train <- nfl_wp %>% filter(game_id %in% train_game_ids)
nfl_test <- nfl_wp %>% filter(!(game_id %in% train_game_ids))
- Next, do some exploratory data analysis (EDA) to understand the data. First, plot a histogram of score_differential to show the distribution of the variable. Then, make a separate data table grouped by game and select the first instance of posteam_spread so that we have only one value per game. Plot the distribution of this variable.
- Then fit a logistic model called diff_model using glm to predict label_win based on score_differential. Print the model after you fit it to see the coefficients, and then add its predicted probabilities to the dataset.
- Repeat step 3, but predict with posteam_spread instead.
- Now, plot the predicted probabilities of winning based on the two models you created in steps 3 and 4. Put the predictor variable on the x-axis and the predicted win probability on the y-axis. Color the points by the label_win variable using as.factor(). Which variable seems to predict better?
- Finally, let’s make a model to account for both variables. Name your model spread_diff_model and follow the previous steps to obtain predictions and plot the data as a function of score differential (since this variable is more continuous than spread, as we saw with our histograms). How does this model compare to the previous two?
- Let’s test all of our models. Use predict() on the test data with the previously created models, setting type = "response" so that the predictions are probabilities rather than log-odds. Add them to the test dataset as diff_pred_prob, spread_pred_prob, and spread_diff_pred_prob, the column names the code below expects. Then, run the code below to convert from win probability to a predicted win or loss and calculate the accuracy. Print your results. Which model ended up being the best?
# set win = win probability >= 0.5, loss otherwise
nfl_test <- nfl_test %>%
  mutate(
    diff_pred_label = ifelse(diff_pred_prob >= 0.5, 1, 0),
    spread_pred_label = ifelse(spread_pred_prob >= 0.5, 1, 0),
    spread_diff_pred_label = ifelse(spread_diff_pred_prob >= 0.5, 1, 0)
  )
# calculate accuracy as the proportion of correct predictions
accuracy <- function(actual, predicted) {
  mean(actual == predicted)
}
# apply to all three models
acc_diff <- accuracy(nfl_test$label_win, nfl_test$diff_pred_label)
acc_spread <- accuracy(nfl_test$label_win, nfl_test$spread_pred_label)
acc_both <- accuracy(nfl_test$label_win, nfl_test$spread_diff_pred_label)
- Challenge: Can you make an even better model with the other variables in the dataset or with numerical transformations to variables? Additionally, feel free to explore other metrics for out-of-sample performance, such as log-loss or AUC!
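The challenge mentions log-loss and AUC as alternative out-of-sample metrics. As a rough sketch only, the base-R code below fits a logistic regression on simulated data (not nfl_wp.csv) and computes both metrics on held-out rows. The helper names log_loss and auc, the simulated columns x and y, and the simple row-based split are illustrative assumptions, not part of the assignment.

```r
# Simulated stand-in data (NOT nfl_wp.csv): one numeric predictor, one binary outcome.
set.seed(1)
n <- 1000
x <- rnorm(n, mean = 0, sd = 10)
y <- rbinom(n, size = 1, prob = plogis(0.15 * x))
sim <- data.frame(x = x, y = y)

# Simple row-based hold-out, for illustration only
# (the problem set splits by game_id instead).
train <- sim[1:800, ]
test <- sim[801:1000, ]

# family = binomial makes glm() fit a logistic regression.
fit <- glm(y ~ x, data = train, family = binomial)

# type = "response" returns probabilities; the default returns log-odds.
test$pred_prob <- predict(fit, newdata = test, type = "response")

# Log-loss: mean negative log-likelihood of the observed labels.
# Probabilities are clipped away from 0 and 1 so that log() stays finite.
log_loss <- function(actual, prob, eps = 1e-15) {
  prob <- pmin(pmax(prob, eps), 1 - eps)
  -mean(actual * log(prob) + (1 - actual) * log(1 - prob))
}

# AUC via the rank-sum (Mann-Whitney) identity: the probability that a
# randomly chosen positive gets a higher score than a randomly chosen negative.
auc <- function(actual, prob) {
  r <- rank(prob)  # average ranks, so ties count as one half
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

test_log_loss <- log_loss(test$y, test$pred_prob)
test_auc <- auc(test$y, test$pred_prob)
print(c(log_loss = test_log_loss, auc = test_auc))
```

Lower log-loss is better, while higher AUC is better (0.5 corresponds to random ranking). Unlike accuracy at a 0.5 cutoff, both metrics use the predicted probabilities directly, so they can separate models that produce the same win/loss labels.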