Problem Set 7
Testing Regression Models
We will now continue modeling data with regression, but we can now test and compare different models more efficiently. We will use this dataset on mlb payrolls: mlb_payrolls_train.csv to analyze general manager performance based on team payrolls and winning percentages.
- First, load in the data to R. Let’s see the range of seasons that are included in this data
## Rows: 603 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): teamID, name
## dbl (7): yearID, G, W, L, wins_162, payroll_norm, log_payroll_norm
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## [1] 1998 1999 2000 2002 2003 2004 2005 2006 2007 2008 2010 2011 2013 2014 2017
## [16] 2018 2020 2021 2022 2001 2009 2012 2015 2019 2023 2016
We can see that the data ranges from 1998 to 2023. Since the 2020 season was shortened due to Covid, filter it out. Then use
mutate()to create a column namedWPbased on wins (W) and games (G).Then, explore the relationship between payrolls and winning percentage by generating a scatterplot. Use
ifelse()with the%in%operator to mark the Oakland A’s and New York Yankees’ points on the plot. Be sure to mark them with different colors by usingscale_color_manual()and to change the alpha values or colors if needed so that overlapping point’s don’t completely obscure each other.Now, fit a regression line to the data as a function of
payroll_norm, then add it to the plot.Next, try fitting a model based on
log_payroll_normand adding that to the plot.
Just by looking at the plots from 3 and 4, which seems to fit the data better?
- We saw that in lecture 6, one way to potentially look at model
performance is with the residuals. Using
add_residualsfrommodelr, add a column of residuals from each model to your payroll table. Then, generate a scatter plot of the residuals for each model.
An indication of a poor model fit is a non-uniform residual plot. This would indicate that perhaps a linear model is being used when the underlying trend is instead logarithmic, or polynomial. Do you observe this?
- Finally, we can directly compare our models with out-of-sample
performance. Load in mlb_payrolls_test.csv and call it
payrolls_test. Do not refit your model but instead usepredictwith the models you have already created to generate predictions for the test data set calledtest_predandtest_log_pred, similar to its usage in Lecture 6. Then, use the line below to compute the RMSE for each model.
rmse <- sqrt(mean((payrolls_test$WP - payrolls_test$test_pred)^2))
rmse_log <- sqrt(mean((payrolls_test$WP - payrolls_test$test_log_pred)^2))Based on the RMSE values, which model is better?
Challenge: Find the difference between the actual WP values
and predictions of the log model in the test data and add them
to payrolls_test. Group by team and find the mean
difference across all years. Multiply their percentages by 162 to obtain
number of wins. Make a bar plot showing teams’ mean performance compared
to predictions.