Problem Set 2
Review Lecture 2 Notes
Please spend a few minutes reading through the notes from Lecture 2. Like in Problem Set 1, you should go through each code block with someone in your group and see if you can both explain to each other what all of the code does.
Payroll and Winning Percentage in the MLB
In lecture, Professor Wyner discussed the relationship between a team’s payroll and its winning percentage. In particular, for each season, he computed the “relative payroll” of each team by taking its payroll and dividing it by the median of payrolls of all teams in that season. We will replicate his analysis in the following problems using the dataset “mlb_relative_payrolls.csv”, which we saved to the “data” folder of our working directory back in Lecture 0. You should save all of the code for this analysis in an R script called “ps2_mlb_payroll.R”.
Read the data in from “mlb_relative_payrolls.csv” and save it as a tbl called “relative_payroll”
Make a histogram of team winning percentages. Play around with different binwidths.
Make a histogram of the relative payrolls.
Make a scatterplot with relative payroll on the horizontal axis and winning percentage on the vertical axis.
Without executing the code below, discuss with your group and see if you can figure out what it is doing.
- Execute the code above. What can you say about how team payrolls have evolved over time? Make a similar plot that visualizes how relative payrolls have evolved over time.
MLB Batting Statistics
In this problem set, we will gain more experience using the dplyr verbs we learned in Lecture 2 to analyze batting statistics of MLB players with at least 502.2 plate appearances. We will be using the dataset hitting_qualified.csv (click on that link to download it, then move it into your “data” folder). You should write save all of the code for this analyses in an R script called “ps2_mlb_batting.R”.
Load the data into a tibble called
hitting_qualifiedusingread_csv().The columns of this dataset include:
playerID: the player’s ID codeyearID: Yearstint: the player’s stint (order of appearances within a season)teamID: the player’s teamlgID: the player’s leagueG: the number of Games the player played in that yearAB: number of At Bats of that player in that yearPA: number of plate appearances by the player that yearR: number of Runs the player made in that yearH: number of Hits the player had in that yearX2B: number of Doubles (hits on which the batter reached second base safely)X3B: number of Triples (hits on which the batter reached third base safely)HR: number of Homeruns the player made that yearRBI: number of Runs Batted In the player made that yearSB: number of Bases Stolen by the player in that yearCS: number of times a player was Caught Stealing that yearBB: Base on BallsSO: number of Strikeouts the player had that yearIBBIntentional walksHBP: Hit by pitchSH: Sacrifice hitsSFSacrifice fliesGIDPGrounded into double plays
Use
arrange()to find out the first and last season for which we have data. Hint: you may need to usedesc()as well.Use
reframe()to find out the first and last season for which we have data. Hint, you only need one line of code to do thisWhen you print out
hitting_qualifiedyou’ll notice that some columns were read in as characters and not integers or numerics. This can happen sometimes whenever the original csv file has missing values. In this case, the columns IBB, HBP, SH, SF, and GIDP were read in as characters. We want to convert these to integers. We can do this usingmutate()and the functionas.integer().
hitting_qualified <- mutate(hitting_qualified,
IBB = as.integer(IBB),
HBP = as.integer(HBP),
# finish on your own- Let’s take a look at some of the columns we just converted:
## # A tibble: 12,043 × 8
## playerID yearID AB IBB HBP SH SF GIDP
## <chr> <dbl> <dbl> <int> <int> <int> <int> <int>
## 1 ansonca01 1884 475 NA NA NA NA NA
## 2 bradyst01 1884 485 NA 0 NA NA NA
## 3 connoro01 1884 477 NA NA NA NA NA
## 4 dalryab01 1884 521 NA NA NA NA NA
## 5 farreja02 1884 469 NA NA NA NA NA
## 6 gleasbi01 1884 472 NA 12 NA NA NA
## 7 hinespa01 1884 490 NA NA NA NA NA
## 8 hornujo01 1884 518 NA NA NA NA NA
## 9 jonesch01 1884 472 NA 10 NA NA NA
## 10 nelsoca01 1884 432 NA 9 NA NA NA
## # ℹ 12,033 more rows
- You’ll notice that a lot of these columns contain
NAvalues, which indicates that some of these values are missing. This makes sense, since a lot of these statistics were not recorded in the early years of baseball. A popular convention for dealing with these missing statistics is to impute the missing values with 0. That is, for instance, every place we see anNAwe need to replace it with a 0. We can do that withmutate()andreplace_na()function as follows.
hitting_qualified <- replace_na(hitting_qualified,
list(IBB = 0, HBP = 0, SH = 0, SF = 0, GIDP = 0))- We will discuss the syntax for
replace_na()later in lecture. Now, rerun theselect()function from above to check that the NAs were replaced with zeros.
Use
mutate()to add a column for the number of singles, which can be computed as \(\text{X1B} = \text{H} - \text{X2B} - \text{X3B} - \text{HR}\).The variable BB includes as a subset all intentional walks (IBB). Use
mutate()to add a column tohitting_qualifiedthat counts the number of un-intentional walks (uBB). Be sure to save the resulting tibble ashitting_qualified.Use
mutate()to add columns for the following offensive statistics, whose formulae are given below. We have also included links to pages on Fangraphs that define and discuss each of these statistics.- Walk Percentage (BBP): \[ \text{BBP} = \frac{\text{BB}}{\text{PA}} \]
- Strike-out Percentage (KP): \[\text{KP} = \frac{\text{SO}}{\text{PA}}\]
- On-Base Percentage (OBP): \[\text{OBP} = \frac{\text{H} + \text{BB} + \text{HBP}}{\text{AB} + \text{BB} + \text{HBP} + \text{SF}}\]
- Slugging (SLG): \[ \text{SLG} = \frac{\text{X1B} + 2 \times \text{X2B} + 3\times \text{X3B} + 4\times \text{HR}}{\text{AB}} \]
- On-Base Plus Slugging (OPS): \[\text{OPS} = \text{OBP} + \text{SLG}\]
- weighted On-Base Average (wOBA): We will use the 2013 weights which can be found here \[ \text{wOBA} = \frac{0.687 \times \text{uBB} + 0.718 \times \text{HBP} + 0.881 \times \text{X1B} + 1.256 \times \text{X2B} + 1.594 \times \text{X3B} + 2.065 \times \text{HR}}{\text{AB} + \text{uBB} + \text{SF} + \text{HBP}} \]
- For most of the statistics in the previous question, Fangraphs has
defined rating scales (to see these ratings, click on the linked page
for each statistic in Question 6 and scroll down to the “Context”
section of the page). Use
mutate()andcase_when()to add the ratings for walk percentage (BBP), strike-out percentage (KP), on-base percentage (OBP), on-base plus slugging (OPS), and wOBA. Call the columns “BBP_rating”, “KP_rating”, “OBP_rating”, “OPS_rating”, and “wOBA_rating.”
hitting_qualified <- mutate(hitting_qualified,
BBP_rating = case_when(BBP >= .15 ~ "Excellent",
BBP < .15 & BBP >= .125 ~ "Great",
# finish on your ownUse
filter()to subset the players who played between 2000 and 2015. Call the new tbltmp_batting.Use
select()ontmp_battingto create a tibble calledbatting_recentcontaining all players who played between 2000 and 2015 with the following columns: playerID, yearID, teamID, lgID, and all of the statistics and rankings created in Problems 8 and 9.Explore the distribution of some of the batting statistics introduced in problem 8 using the tbl
batting_recentusing histograms. Then explore the relationship between some of these statistics with scatterplots.Explore the distribution of the ratings for the statistics introduced in Problem 9 using bar plots. Hint: You may need to mutate the rating columns with
factor(BBP_rating, levels = c(Awful, Poor...))to ensure the ratings are plotted in the correct order.