{"id":11554,"date":"2022-01-27T15:43:31","date_gmt":"2022-01-27T23:43:31","guid":{"rendered":"https:\/\/threecloud.wpengine.com\/?p=11554"},"modified":"2023-12-11T13:48:06","modified_gmt":"2023-12-11T21:48:06","slug":"using-linear-regression-pitchers-performance","status":"publish","type":"post","link":"https:\/\/3cloudsolutions.com\/resources\/using-linear-regression-pitchers-performance\/","title":{"rendered":"Using Linear Regression to Predict a Pitcher&#8217;s Performance"},"content":{"rendered":"<p style=\"font-size: 13px; text-align: center;\"><em>Editor\u2019s Note: The post was originally published in [November, 2017] and has been updated for freshness, accuracy and comprehensiveness.<\/em><\/p>\n<p>For better or worse, the game of baseball has changed drastically in just the last few years. \u201cSmall ball\u201d (an approach to baseball that involves base hits, sacrifice flies, and bunts) is dying. Players are swinging for the fences every chance they get and, as a result, are striking out at a higher rate than ever before. Balls are being put in play less and less.<\/p>\n<p><!--more--><\/p>\n<p>Why is this happening? The use of real-time data in sports is challenging players to push the boundaries of athletic performance. In 2007, MLB introduced StatCast, which gives teams access to data never imagined before. Over the last 10 years, we\u2019ve seen baseball transform from a competitive team sport to a more individualized performance sport driven by the desire to improve one\u2019s own statistics. 
A great show of sportsmanship is now more about the performance of the individual than the W or L of the team.<\/p>\n<p>This \u201canalytics revolution\u201d in baseball is just beginning.<\/p>\n<ul>\n<li>Analysts are working to predict injuries to pitchers just by looking at minute differences in pitch velocity and spin rate.<\/li>\n<li>Batters finish their \u201cat bat\u201d, go back to the dugout, and check what their \u201claunch angle\u201d and \u201cexit velocity\u201d were.<\/li>\n<li>Outfielders check the piece of paper in their back pocket to see exactly where they need to stand for the current batter based on that batter\u2019s spray chart.<\/li>\n<li>\u201cThe shift\u201d is a new concept that baseball traditionalists scoff at. Infielders leave their normal positions and move to the other side of the infield because the data says that is where the ball is most likely to go.<\/li>\n<li>Pitchers are being trained to throw the ball harder than ever before.<\/li>\n<\/ul>\n<p>Don\u2019t believe me? Let\u2019s let the data speak for itself. Does the data support this new trend of pitchers throwing harder than ever before? We will use simple linear regression and compare the ERA of pitchers in 2016 to their average fastball speed.<\/p>\n<p>For those who do not know, ERA is a commonly used statistic for pitchers that stands for <strong>Earned Run Average<\/strong>. It is the number of earned runs scored against that pitcher per nine innings pitched. 
The lower the ERA, the better.<\/p>\n<p><strong>What is Linear Regression?<\/strong><\/p>\n<table style=\"border-collapse: collapse; width: 100%;\" border=\"0\">\n<tbody>\n<tr>\n<td style=\"width: 50%;\"><strong>X<\/strong><\/td>\n<td style=\"width: 50%;\"><strong>Y<\/strong><\/td>\n<\/tr>\n<tr>\n<td style=\"width: 50%;\">1<\/td>\n<td style=\"width: 50%;\">5<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 50%;\">2<\/td>\n<td style=\"width: 50%;\">7.5<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 50%;\">2.7<\/td>\n<td style=\"width: 50%;\">10.3<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 50%;\">3.5<\/td>\n<td style=\"width: 50%;\">12.9<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 50%;\">3.7<\/td>\n<td style=\"width: 50%;\">13.9<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 50%;\">4.9<\/td>\n<td style=\"width: 50%;\">16.6<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 50%;\">5.8<\/td>\n<td style=\"width: 50%;\">19.5<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 50%;\">6.2<\/td>\n<td style=\"width: 50%;\">20.6<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><img decoding=\"async\" style=\"width: 500px;\" src=\"https:\/\/cdn2.hubspot.net\/hubfs\/5670923\/USING%20LINEAR%20REGRESSION%20TO%20PREDICT%20A%20PITCHERS%20PERFORMANCE_image.png\" alt=\"USING LINEAR REGRESSION TO PREDICT A PITCHERS PERFORMANCE_image\" width=\"500\" \/><\/p>\n<p>A <strong>linear regression<\/strong> involves plotting a line that best represents a scatter-plot of points, like the one above. The line of best fit is the line that minimizes the total squared vertical distance from each point to the line. You can use the equation of the line to predict future point values. For example, if X were 7.3 we can use the equation to predict the value of Y: Y = 3.0273*(7.3) + 2.0106 = 24.10989.<\/p>\n<p>Another important value to note in a linear regression model is <strong>correlation<\/strong>. Correlation is a value between -1 and 1 that describes how well the scatter-plot fits to a line. Positive correlation means as X increases, so does Y. 
Negative correlation means as X increases, Y decreases. The closer the correlation is to 1 or -1, the better the data fits a line. A correlation of 0 means there is no linear relationship. The correlation for this example is 0.997688.<\/p>\n<p><strong>Gathering the data<\/strong><\/p>\n<p>We will use two different data sources for this pitch speed analysis. The first is called the Lahman Database. Run by a man named Sean Lahman, the Lahman database has complete end-of-season stats going all the way back to 1871. At <a href=\"http:\/\/www.seanlahman.com\/baseball-archive\/statistics\/\" rel=\" noopener\">http:\/\/www.seanlahman.com\/baseball-archive\/statistics\/<\/a> you will find a Microsoft Access version, a CSV version, and a SQL version. For this demo we will be using the CSV \u201cPitching\u201d table from the database. You\u2019ll find a description of all the columns here: <a href=\"http:\/\/seanlahman.com\/files\/database\/readme2016.txt\" rel=\" noopener\">http:\/\/seanlahman.com\/files\/database\/readme2016.txt<\/a>.<\/p>\n<h6><img decoding=\"async\" style=\"width: 600px;\" src=\"https:\/\/cdn2.hubspot.net\/hubfs\/5670923\/USING%20LINEAR%20REGRESSION%20TO%20PREDICT%20A%20PITCHERS%20PERFORMANCE_image2.png\" alt=\"USING LINEAR REGRESSION TO PREDICT A PITCHERS PERFORMANCE_image2\" width=\"600\" \/><\/h6>\n<h6>Small portion of the \u201cPitching\u201d table from the Lahman Database<\/h6>\n<p>The Lahman database covers a lot of different stats but it gives you no way to explain \u201cwhy\u201d players are or are not having success. This is where MLB.com\u2019s <a href=\"https:\/\/fastballs.wordpress.com\/category\/pitchfx-glossary\/\" rel=\" noopener\">PitchF\/x data<\/a> comes into play. As of the 2008 season, we have data for every pitch thrown including the type of pitch, where it crossed the plate, the speed, how much break it had and in what direction, and much more. 
<a href=\"https:\/\/fastballs.wordpress.com\/category\/pitchfx-glossary\/\" rel=\" noopener\">https:\/\/fastballs.wordpress.com\/category\/pitchfx-glossary\/<\/a> is a great resource describing what all the columns from the PitchF\/x data mean.<\/p>\n<p><img decoding=\"async\" style=\"width: 1015px;\" src=\"https:\/\/cdn2.hubspot.net\/hubfs\/5670923\/USING%20LINEAR%20REGRESSION%20TO%20PREDICT%20A%20PITCHERS%20PERFORMANCE_image3.png\" alt=\"USING LINEAR REGRESSION TO PREDICT A PITCHERS PERFORMANCE_image3\" width=\"1015\" \/><\/p>\n<p>Example of what the raw data looks like from MLB\u2019s website.<\/p>\n<p>http:\/\/gd2.mlb.com\/components\/game\/mlb\/<\/p>\n<p>This data powers the MLB At Bat application. The picture above shows all the data from an at bat by Evan Longoria of the Rays with Danny Barnes of the Blue Jays pitching. At the top, you\u2019ll see the result of the at bat (\u201cdes\u201d) which was \u201cEvan Longoria walks. Brad Miller to 2nd.\u201d I\u2019ve highlighted a few key stats from the penultimate pitch thrown to Longoria. You can see there that type = \u201cB\u201d (ball), pitch_type = \u201cCH\u201d (changeup), and start_speed = 80.2 (about 80 mph when released).<\/p>\n<p><img decoding=\"async\" style=\"width: 500px;\" src=\"https:\/\/cdn2.hubspot.net\/hubfs\/5670923\/USING%20LINEAR%20REGRESSION%20TO%20PREDICT%20A%20PITCHERS%20PERFORMANCE_image4.png\" alt=\"USING LINEAR REGRESSION TO PREDICT A PITCHERS PERFORMANCE_image4\" width=\"500\" \/><\/p>\n<h6>Barnes throws an 80 MPH Changeup for a ball to Evan Longoria. Screenshot taken from MLB\u2019s At Bat application.<\/h6>\n<p>The next step will be getting the PitchF\/x data, which is an XML file, into a form that we can digest in R. Then we will join these data sets together so we can compare each pitcher\u2019s fastball speed to his ERA, both for the 2016 season. 
As a reminder, the lower the ERA, the better.<\/p>\n<p><strong>Massaging the data<\/strong><\/p>\n<p>Finally, time to start using some R code.<\/p>\n<p>Using the \u201cpitchRx\u201d package and the \u201cscrape\u201d command we can pull pitch data from the MLB website for specific games or for a date range into a nice data frame that is easy to use. You\u2019ll find all documentation for the \u201cpitchRx\u201d package here: https:\/\/cran.r-project.org\/web\/packages\/pitchRx\/pitchRx.pdf.<\/p>\n<p style=\"font-size: 11px; padding-left: 40px;\"><strong>install.packages(&quot;pitchRx&quot;)<\/strong><\/p>\n<p style=\"font-size: 11px; padding-left: 40px;\"><strong>library(pitchRx)<\/strong><\/p>\n<p style=\"font-size: 11px; padding-left: 40px;\"><strong>game &lt;- scrape(game.ids = &quot;gid_2017_07_19_tbamlb_oakmlb_1&quot;)<\/strong><\/p>\n<p style=\"font-size: 11px; padding-left: 40px;\"><strong>games_april &lt;- scrape(&quot;2016-04-03&quot;, &quot;2016-04-15&quot;)<\/strong><\/p>\n<p>This gives you five tables. The best way to use the results is to combine the pitch table with the atbat table. This way you have every pitch that was thrown as well as the results of the at-bat.<\/p>\n<p style=\"font-size: 11px; padding-left: 40px;\"><strong>pitches_april &lt;- plyr::join(games_april$atbat, games_april$pitch, by = c(&quot;num&quot;, &quot;url&quot;), type = &quot;inner&quot;)<\/strong><\/p>\n<p>Unfortunately, the scrape command has a limit of 200 games per use. 
To get around this we\u2019ll simply use the \u201cscrape\u201d command multiple times and combine the results into one table.<\/p>\n<p style=\"font-size: 11px; padding-left: 40px;\"><strong>games_april_2 &lt;- scrape(&quot;2016-04-16&quot;, &quot;2016-04-28&quot;)<\/strong><\/p>\n<p style=\"font-size: 11px; padding-left: 40px;\"><strong>pitches_april_2 &lt;- plyr::join(games_april_2$atbat, games_april_2$pitch, by = c(&quot;num&quot;, &quot;url&quot;), type = &quot;inner&quot;)<\/strong><\/p>\n<p style=\"font-size: 11px; padding-left: 40px;\"><strong>pitches_all &lt;- plyr::join(pitches_april, pitches_april_2, type = &quot;full&quot;)<\/strong><\/p>\n<p>Rinse and repeat.<\/p>\n<p>Now we need to join this table with the pitching table from the Lahman database. Unfortunately, there is no direct way to join these tables together. MLB uses unique numbers to identify pitchers and Lahman uses a playerID column. The Lahman database has a \u201cmaster\u201d table for this purpose but the master table does not have the MLBcode. Baseballprospectus.com has a table with both playerID and MLBcode but this table is incomplete.<\/p>\n<p>To join the tables, we used Lahman\u2019s \u201cMaster\u201d table, made a new \u201cpitcher_name\u201d column, and joined the tables using the pitchers\u2019 full names.<\/p>\n<p style=\"font-size: 11px; padding-left: 40px;\"><strong>setwd(&quot;#MY FOLDER#&quot;)<\/strong><\/p>\n<p style=\"font-size: 11px; padding-left: 40px;\"><strong>pitching_stats &lt;- read.csv(&quot;Pitching.csv&quot;)<\/strong><\/p>\n<p style=\"font-size: 11px; padding-left: 40px;\"><strong>pitching_stats &lt;- subset(pitching_stats, pitching_stats$yearID == 2016)<\/strong><\/p>\n<p style=\"font-size: 11px; padding-left: 40px;\"><strong>master &lt;- read.csv(&quot;Master.csv&quot;)<\/strong><\/p>\n<p style=\"font-size: 11px; padding-left: 40px;\"><strong>master &lt;- subset(master, select = 
c(&quot;playerID&quot;, &quot;nameFirst&quot;, &quot;nameLast&quot;))<\/strong><\/p>\n<p style=\"font-size: 11px; padding-left: 40px;\"><strong>master$pitcher_name &lt;- paste(master$nameFirst, master$nameLast, sep = &quot; &quot;)<\/strong><\/p>\n<p style=\"font-size: 11px; padding-left: 40px;\"><strong>pitching_stats_with_names &lt;- plyr::join(pitching_stats, master, by = &quot;playerID&quot;, type = &quot;inner&quot;)<\/strong><\/p>\n<p style=\"font-size: 11px; padding-left: 40px;\"><strong>pitches_all_and_stats &lt;- plyr::join(pitching_stats_with_names, pitches_all, by = &quot;pitcher_name&quot;, type = &quot;inner&quot;)<\/strong><\/p>\n<p>This is a very large table, with which a lot of analysis can be done. If you choose to go a different direction from here please let me know what insights you find!<\/p>\n<p>We, however, are going to start trimming the fat to look at only the columns we need for this analysis.<\/p>\n<p style=\"padding-left: 40px;\"><strong><span style=\"font-size: 11px;\">pitches_all_and_stats &lt;- subset(pitches_all_and_stats, stint == 1)<\/span><\/strong><\/p>\n<p style=\"padding-left: 40px;\"><strong><span style=\"font-size: 11px;\">pitches_all_and_stats &lt;- subset(pitches_all_and_stats, IPouts &gt;= 150)<\/span><\/strong><\/p>\n<p style=\"padding-left: 40px;\"><strong><span style=\"font-size: 11px;\">pitches_fastballs_and_stats &lt;- subset(pitches_all_and_stats, pitch_type %in% c(&quot;FA&quot;, &quot;FF&quot;, &quot;FT&quot;))<\/span><\/strong><\/p>\n<p style=\"padding-left: 40px;\"><strong><span style=\"font-size: 11px;\">pitches_fastballs_and_stats &lt;- subset(pitches_fastballs_and_stats, select = c(&quot;pitcher_name&quot;, &quot;start_speed&quot;, &quot;ERA&quot;))<\/span><\/strong><\/p>\n<p>The \u201cstint == 1\u201d is to ensure there is only one row for each pitcher. Pitchers who pitched for multiple teams in one year will have one row for each team. 
\u201cIPouts &gt;= 150\u201d filters the results to only include pitchers who have thrown at least 50 innings (50 innings = 150 outs) to avoid pitchers with a small sample size.<\/p>\n<p>Now we have every fastball thrown by pitchers who threw at least 50 innings in 2016 along with the name of the pitcher and that pitcher\u2019s 2016 ERA. Now the only thing left to do is to average the fastball speed so we have one row for each pitcher and then rename the grouping column back to \u201cpitcher_name.\u201d<\/p>\n<p style=\"font-size: 11px; padding-left: 40px;\"><strong>pitchers_fastballs_era &lt;- aggregate(pitches_fastballs_and_stats[, 2:3], list(pitches_fastballs_and_stats$pitcher_name), mean)<\/strong><\/p>\n<p style=\"font-size: 11px; padding-left: 40px;\"><strong>colnames(pitchers_fastballs_era)[colnames(pitchers_fastballs_era) == &quot;Group.1&quot;] &lt;- &quot;pitcher_name&quot;<\/strong><\/p>\n<p><strong><img decoding=\"async\" style=\"width: 309px;\" src=\"https:\/\/cdn2.hubspot.net\/hubfs\/5670923\/USING%20LINEAR%20REGRESSION%20TO%20PREDICT%20A%20PITCHERS%20PERFORMANCE_image5.png\" alt=\"USING LINEAR REGRESSION TO PREDICT A PITCHERS PERFORMANCE_image5\" width=\"309\" \/><\/strong><\/p>\n<p><strong>Creating the Linear Model<\/strong><\/p>\n<p>We\u2019ll start by looking at the correlation between start_speed and ERA.<\/p>\n<p style=\"padding-left: 40px;\"><strong><span style=\"font-size: 11px;\">cor(pitchers_fastballs_era$ERA, pitchers_fastballs_era$start_speed)<\/span><\/strong><\/p>\n<p>The correlation we get is -0.234668, which is not a particularly strong correlation, but it shows that there is some sort of relationship between speed and ERA. 
As speed increases, ERA tends to decrease.<\/p>\n<p>Now, let\u2019s create a linear model.<\/p>\n<p style=\"font-size: 11px; padding-left: 40px;\"><strong>lm_era &lt;- lm(formula = ERA ~ start_speed, data = pitchers_fastballs_era)<\/strong><\/p>\n<p style=\"font-size: 11px; padding-left: 40px;\"><strong>summary(lm_era)<\/strong><\/p>\n<p><strong>Residuals:<\/strong><\/p>\n<p><strong>Min 1Q Median 3Q Max<\/strong><\/p>\n<p><strong>-2.2760 -0.8130 -0.0811 0.6954 5.3207<\/strong><\/p>\n<p><strong>Coefficients:<\/strong><\/p>\n<p><strong>Estimate Std. Error t value Pr(&gt;|t|)<\/strong><\/p>\n<p><strong>(Intercept) 14.90057 2.78855 5.343 1.99e-07 ***<\/strong><\/p>\n<p><strong>start_speed -0.11759 0.03021 -3.893 0.000126 ***<\/strong><\/p>\n<p><strong>&#8212;<\/strong><\/p>\n<p><strong>Signif. codes: 0 \u2018***\u2019 0.001 \u2018**\u2019 0.01 \u2018*\u2019 0.05 \u2018.\u2019 0.1 \u2018 \u2019 1<\/strong><\/p>\n<p><strong>Residual standard error: 1.147 on 260 degrees of freedom<\/strong><\/p>\n<p><strong>Multiple R-squared: 0.05507, Adjusted R-squared: 0.05143<\/strong><\/p>\n<p><strong>F-statistic: 15.15 on 1 and 260 DF, p-value: 0.0001261<\/strong><\/p>\n<p>Our line of best fit has the equation:<\/p>\n<p>ERA = 14.90057 \u2013 0.11759*(start_speed)<\/p>\n<p>Using this equation we can predict ERA using start_speed, but how accurate would this prediction be?<\/p>\n<p>Also worth noting are the p-value for start_speed and the adjusted R-squared value. The p-value for start_speed (represented by Pr(&gt;|t|)) is 0.000126. What this means is that, if fastball speed had no real relationship with ERA, there would be only about a 0.0126% chance of seeing a slope this far from zero. This tells me that fastball speed <strong>does<\/strong> have a statistically significant relationship with ERA.<\/p>\n<p>The adjusted R-squared value is 0.05143. What this means is that roughly 5% of the variance in ERA is explained by our model. Not a large amount, unfortunately. 
This tells me that, while fastball speed does influence performance, it is only one part of it.<\/p>\n<p>Using the ggplot2 package (http:\/\/ggplot2.org\/) we can create a scatterplot of our data with the line of best fit.<\/p>\n<p>&nbsp;<\/p>\n<p style=\"padding-left: 40px;\"><strong><span style=\"font-size: 11px;\">install.packages(&quot;ggplot2&quot;)<\/span><\/strong><\/p>\n<p style=\"padding-left: 40px;\"><strong><span style=\"font-size: 11px;\">library(ggplot2)<\/span><\/strong><\/p>\n<p style=\"padding-left: 40px;\"><strong><span style=\"font-size: 11px;\">ggplot(pitchers_fastballs_era,<\/span><\/strong><\/p>\n<p style=\"padding-left: 40px;\"><strong><span style=\"font-size: 11px;\"> aes(x = `start_speed`, y = `ERA`)) + geom_point() +<\/span><\/strong><\/p>\n<p style=\"padding-left: 40px;\"><strong><span style=\"font-size: 11px;\"> theme(panel.border = element_rect(color = &quot;black&quot;, fill = NA, size = 1),<\/span><\/strong><\/p>\n<p style=\"padding-left: 40px;\"><strong><span style=\"font-size: 11px;\"> panel.background = element_rect(fill = &quot;white&quot;),<\/span><\/strong><\/p>\n<p style=\"padding-left: 40px;\"><strong><span style=\"font-size: 11px;\"> panel.grid.major = element_line(color = &quot;grey&quot;, linetype = &quot;dashed&quot;)) +<\/span><\/strong><\/p>\n<p style=\"padding-left: 40px;\"><strong><span style=\"font-size: 11px;\"> stat_smooth(method = &quot;lm&quot;, color = &quot;orange&quot;, size = 1, level = 0.95)<\/span><\/strong><\/p>\n<p style=\"padding-left: 40px;\"><strong><span style=\"font-size: 11px;\"><img decoding=\"async\" style=\"width: 500px;\" src=\"https:\/\/cdn2.hubspot.net\/hubfs\/5670923\/USING%20LINEAR%20REGRESSION%20TO%20PREDICT%20A%20PITCHERS%20PERFORMANCE_image6.png\" alt=\"USING LINEAR REGRESSION TO PREDICT A PITCHERS PERFORMANCE_image6\" width=\"500\" \/><\/span><\/strong><\/p>\n<p>The shaded gray area represents the confidence interval. 
You\u2019ll see in the code that we set the confidence level in the \u201cstat_smooth\u201d call to 95%.<\/p>\n<p><strong>Testing Assumptions and Improving the Model<\/strong><\/p>\n<p>One of the first things you should check when making a linear regression model is that the residuals are approximately normally distributed (bell curve). Residuals are the vertical distances from each point to the line of best fit. If the residuals roughly follow a normal distribution centered on zero, this supports using a linear model rather than, say, a polynomial one.<\/p>\n<p style=\"font-size: 11px; padding-left: 40px;\"><strong>lm_era_resid &lt;- as.data.frame(lm_era$residuals)<\/strong><\/p>\n<p style=\"font-size: 11px; padding-left: 40px;\"><strong>plot(density(lm_era$residuals), xlab = &quot;Residuals&quot;, main = &quot;Distribution of Residuals&quot;)<\/strong><\/p>\n<p style=\"font-size: 11px; padding-left: 40px;\"><strong><img decoding=\"async\" style=\"width: 500px;\" src=\"https:\/\/cdn2.hubspot.net\/hubfs\/5670923\/USING%20LINEAR%20REGRESSION%20TO%20PREDICT%20A%20PITCHERS%20PERFORMANCE_image7.png\" alt=\"USING LINEAR REGRESSION TO PREDICT A PITCHERS PERFORMANCE_image7\" width=\"500\" \/><\/strong><\/p>\n<p>This does look normally distributed with a little skew to the right. To check this more carefully, let\u2019s create a Q-Q plot. A <strong>Q-Q plot<\/strong> compares the quantiles of our data to those of a theoretical normal distribution. 
If the points fall along a straight line then our residuals are approximately normally distributed.<\/p>\n<p style=\"font-size: 11px; padding-left: 40px;\"><strong>qqnorm(lm_era$residuals, main = &quot;Q-Q Plot to Test Normality of the Residuals&quot;)<\/strong><\/p>\n<p style=\"font-size: 11px; padding-left: 40px;\"><strong>qqline(lm_era$residuals, lwd = 3, col = &quot;red&quot;)<\/strong><\/p>\n<p style=\"font-size: 11px; padding-left: 40px;\"><strong><img decoding=\"async\" style=\"width: 500px;\" src=\"https:\/\/cdn2.hubspot.net\/hubfs\/5670923\/USING%20LINEAR%20REGRESSION%20TO%20PREDICT%20A%20PITCHERS%20PERFORMANCE_image8-1.png\" alt=\"USING LINEAR REGRESSION TO PREDICT A PITCHERS PERFORMANCE_image8-1\" width=\"500\" \/><\/strong><\/p>\n<p>Our data fits the line well, but it appears there may be some outliers. We can use a box plot to check for outliers and then remove them.<\/p>\n<p style=\"font-size: 11px; padding-left: 40px;\"><strong>boxplot &lt;- boxplot(pitchers_fastballs_era$start_speed, pitchers_fastballs_era$ERA, main = &quot;Fastballs and ERA Boxplot&quot;)<\/strong><\/p>\n<p style=\"font-size: 11px; padding-left: 40px;\"><strong>era_boxplot &lt;- boxplot(pitchers_fastballs_era$ERA, main = &quot;ERA Boxplot&quot;)<\/strong><\/p>\n<p style=\"font-size: 11px; padding-left: 40px;\"><strong>speed_boxplot &lt;- boxplot(pitchers_fastballs_era$start_speed, main = &quot;Speed Boxplot&quot;)<\/strong><\/p>\n<p style=\"font-size: 11px; padding-left: 40px;\"><strong><img decoding=\"async\" style=\"width: 490px;\" src=\"https:\/\/cdn2.hubspot.net\/hubfs\/5670923\/USING%20LINEAR%20REGRESSION%20TO%20PREDICT%20A%20PITCHERS%20PERFORMANCE_image10.png\" alt=\"USING LINEAR REGRESSION TO PREDICT A PITCHERS PERFORMANCE_image10\" width=\"490\" \/><\/strong><\/p>\n<p style=\"font-size: 11px; padding-left: 40px;\"><strong><img decoding=\"async\" style=\"width: 490px;\" 
src=\"https:\/\/cdn2.hubspot.net\/hubfs\/5670923\/USING%20LINEAR%20REGRESSION%20TO%20PREDICT%20A%20PITCHERS%20PERFORMANCE_image11.png\" alt=\"USING LINEAR REGRESSION TO PREDICT A PITCHERS PERFORMANCE_image11\" width=\"490\" \/><\/strong><\/p>\n<p>You can read more about how a boxplot is calculated here: <a href=\"http:\/\/stattrek.com\/statistics\/charts\/boxplot.aspx\" rel=\" noopener\">http:\/\/stattrek.com\/statistics\/charts\/boxplot.aspx<\/a>.<\/p>\n<p style=\"font-size: 11px; padding-left: 40px;\"><strong>boxplot_outliers &lt;- data.frame(boxplot$out, boxplot$group)<\/strong><\/p>\n<table style=\"border-collapse: collapse; width: 100%;\" border=\"0\">\n<tbody>\n<tr>\n<td style=\"width: 237px;\"><\/td>\n<td style=\"width: 235.5px;\"><strong>boxplot.out<\/strong><\/td>\n<td style=\"width: 236.25px;\"><strong>boxplot.group<\/strong><\/td>\n<\/tr>\n<tr>\n<td style=\"width: 237px;\"><strong>1<\/strong><\/td>\n<td style=\"width: 235.5px;\"><strong>83.10969<\/strong><\/td>\n<td style=\"width: 236.25px;\"><strong>1<\/strong><\/td>\n<\/tr>\n<tr>\n<td style=\"width: 237px;\"><strong>2<\/strong><\/td>\n<td style=\"width: 235.5px;\"><strong>83.23261<\/strong><\/td>\n<td style=\"width: 236.25px;\"><strong>1<\/strong><\/td>\n<\/tr>\n<tr>\n<td style=\"width: 237px;\"><strong>3<\/strong><\/td>\n<td style=\"width: 235.5px;\"><strong>7.59000<\/strong><\/td>\n<td style=\"width: 236.25px;\"><strong>2<\/strong><\/td>\n<\/tr>\n<tr>\n<td style=\"width: 237px;\"><strong>4<\/strong><\/td>\n<td style=\"width: 235.5px;\"><strong>9.36000<\/strong><\/td>\n<td style=\"width: 236.25px;\"><strong>2<\/strong><\/td>\n<\/tr>\n<tr>\n<td style=\"width: 237px;\"><strong>5<\/strong><\/td>\n<td style=\"width: 235.5px;\"><strong>8.02000<\/strong><\/td>\n<td style=\"width: 236.25px;\"><strong>2<\/strong><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Group 1 is start_speed and group 2 is ERA. Now let\u2019s mark which points in our data are outliers and graph it again.<\/p>\n<p style=\"font-size: 11px; padding-left: 40px;\"><strong>boxplot_outliers_era &lt;- pitchers_fastballs_era[pitchers_fastballs_era$ERA %in% 
boxplot_outliers[3:5, 1], ]<\/strong><\/p>\n<p style=\"font-size: 11px; padding-left: 40px;\"><strong>boxplot_outliers_speed &lt;- pitchers_fastballs_era[pitchers_fastballs_era$start_speed %in% boxplot_outliers[1:2, 1], ]<\/strong><\/p>\n<p style=\"font-size: 11px; padding-left: 40px;\"><strong>boxplot_outliers_era_names &lt;- rownames(boxplot_outliers_era)<\/strong><\/p>\n<p style=\"font-size: 11px; padding-left: 40px;\"><strong>boxplot_outliers_speed_names &lt;- rownames(boxplot_outliers_speed)<\/strong><\/p>\n<p style=\"font-size: 11px; padding-left: 40px;\"><strong>boxplot_outliers_names &lt;- c(boxplot_outliers_era_names, boxplot_outliers_speed_names)<\/strong><\/p>\n<p style=\"font-size: 11px; padding-left: 40px;\"><strong>ggplot(pitchers_fastballs_era, aes(x = `start_speed`, y = `ERA`)) + geom_point() + theme(panel.border = element_rect(color = &quot;black&quot;, fill = NA, size = 1),<\/strong><\/p>\n<p style=\"font-size: 11px; padding-left: 40px;\"><strong> panel.background = element_rect(fill = &quot;white&quot;), panel.grid.major = element_line(color = &quot;grey&quot;, linetype = &quot;dashed&quot;)) + <\/strong><strong>stat_smooth(method = &quot;lm&quot;, color = &quot;orange&quot;, size = 1, level = 0.95) +<\/strong><br \/>\n<strong>geom_point(data = pitchers_fastballs_era[boxplot_outliers_names, ],<\/strong><br \/>\n<strong>aes(x = start_speed, y = ERA), color = &quot;red&quot;,<\/strong><br \/>\n<strong>size = 3)<\/strong><\/p>\n<p style=\"font-size: 11px; padding-left: 40px;\"><strong><img decoding=\"async\" style=\"width: 500px;\" src=\"https:\/\/cdn2.hubspot.net\/hubfs\/5670923\/USING%20LINEAR%20REGRESSION%20TO%20PREDICT%20A%20PITCHERS%20PERFORMANCE_image13.png\" alt=\"USING LINEAR REGRESSION TO PREDICT A PITCHERS PERFORMANCE_image13\" width=\"500\" \/><\/strong><\/p>\n<p>The red points represent the outliers. Now 
let\u2019s take these points out and recreate the model.<\/p>\n<p style=\"font-size: 11px; padding-left: 40px;\"><strong>pitchers_fastballs_era_no_outliers &lt;- subset(pitchers_fastballs_era, !rownames(pitchers_fastballs_era) %in% boxplot_outliers_names)<\/strong><\/p>\n<p style=\"font-size: 11px; padding-left: 40px;\"><strong>ggplot(pitchers_fastballs_era_no_outliers, aes(x = `start_speed`, y = `ERA`)) +<\/strong><\/p>\n<p style=\"font-size: 11px; padding-left: 40px;\"><strong>geom_point() + theme(panel.border = element_rect(color = &quot;black&quot;, fill = NA, size = 1), panel.background = element_rect(fill = &quot;white&quot;), panel.grid.major = element_line(color = &quot;grey&quot;, linetype = &quot;dashed&quot;)) + stat_smooth(method = &quot;lm&quot;, color = &quot;orange&quot;, size = 1, level = 0.95)<\/strong><\/p>\n<p style=\"font-size: 11px; padding-left: 40px;\"><strong><img decoding=\"async\" style=\"width: 500px;\" src=\"https:\/\/cdn2.hubspot.net\/hubfs\/5670923\/USING%20LINEAR%20REGRESSION%20TO%20PREDICT%20A%20PITCHERS%20PERFORMANCE_image14.png\" alt=\"USING LINEAR REGRESSION TO PREDICT A PITCHERS PERFORMANCE_image14\" width=\"500\" \/><\/strong><\/p>\n<p style=\"font-size: 11px; padding-left: 40px;\"><strong>cor(pitchers_fastballs_era_no_outliers$ERA, pitchers_fastballs_era_no_outliers$start_speed)<\/strong><\/p>\n<p style=\"font-size: 11px; padding-left: 40px;\">The correlation has strengthened from -0.235 to -0.265.<\/p>\n<p style=\"padding-left: 40px; font-size: 11px;\"><strong>lm_era_no_outliers &lt;- lm(formula = ERA ~ start_speed, data = pitchers_fastballs_era_no_outliers)<\/strong><\/p>\n<p style=\"padding-left: 40px; font-size: 11px;\"><strong>summary(lm_era_no_outliers)<\/strong><\/p>\n<p style=\"font-size: 11px;\"><strong>Residuals:<\/strong><\/p>\n<p style=\"font-size: 11px;\"><strong> Min 1Q Median 3Q Max<\/strong><\/p>\n<p style=\"font-size: 11px;\"><strong>-2.2256 -0.7723 -0.0313 0.6986 3.2559<\/strong><\/p>\n<p style=\"font-size: 
11px;\"><strong>Coefficients:<\/strong><\/p>\n<p style=\"font-size: 11px;\"><strong>Estimate Std. Error t value Pr(&gt;|t|)<\/strong><\/p>\n<p style=\"font-size: 11px;\"><strong>(Intercept) 16.00280 2.72378 5.875 1.31e-08 ***<\/strong><\/p>\n<p style=\"font-size: 11px;\"><strong>start_speed -0.12998 0.02948 -4.409 1.53e-05 ***<\/strong><\/p>\n<p style=\"font-size: 11px;\"><strong>&#8212;<\/strong><\/p>\n<p style=\"font-size: 11px;\"><strong>Signif. codes: 0 \u2018***\u2019 0.001 \u2018**\u2019 0.01 \u2018*\u2019 0.05 \u2018.\u2019 0.1 \u2018 \u2019 1<\/strong><\/p>\n<p style=\"font-size: 11px;\"><strong>Residual standard error: 1.052 on 255 degrees of freedom<\/strong><\/p>\n<p style=\"font-size: 11px;\"><strong>Multiple R-squared: 0.07082, Adjusted R-squared: 0.06718<\/strong><\/p>\n<p style=\"font-size: 11px;\"><strong>F-statistic: 19.44 on 1 and 255 DF, p-value: 1.535e-05<\/strong><\/p>\n<p>Our new formula is:<\/p>\n<p style=\"padding-left: 40px;\">ERA = 16.00280 &#8211; 0.12998*(start_speed)<\/p>\n<p>What we see here is that, with such a low p-value, fastball speed has a statistically significant relationship with ERA. However, even though it has improved, the adjusted R-squared is still only 0.06718, which means that our model explains less than 7% of the variation in ERA.<\/p>\n<p><strong>Multivariate Linear Model<\/strong><\/p>\n<p>To help improve the model, we can add more factors. Let\u2019s use my new favorite visualization to look at the correlation between multiple variables: ERA, fastball speed, spin rate on fastballs, spin rate on curveballs and sliders, and salary.<\/p>\n<p>The ggpairs command in the \u201cGGally\u201d package displays a correlation matrix with a scatterplot comparing all the columns to each other, a histogram showing the distribution of each column, and each correlation value.<\/p>\n<p style=\"font-size: 11px; padding-left: 40px;\"><strong>install.packages(&quot;GGally&quot;)<\/strong><\/p>\n<p style=\"font-size: 11px; padding-left: 
40px;\"><strong>library(GGally)<\/strong><\/p>\n<p style=\"font-size: 11px; padding-left: 40px;\"><strong>ggpairs(pitchers_no_outliers[, c(&quot;ERA&quot;, &quot;start_speed&quot;, &quot;spin_rate&quot;, &quot;spin_rate_cu&quot;, &quot;salary&quot;)],<\/strong><\/p>\n<p style=\"font-size: 11px; padding-left: 40px;\"><strong>lower = list(continuous = &quot;smooth&quot;), diag = list(continuous = &quot;barDiag&quot;))<\/strong><\/p>\n<p style=\"font-size: 11px; padding-left: 40px;\"><strong><img decoding=\"async\" style=\"width: 500px;\" src=\"https:\/\/cdn2.hubspot.net\/hubfs\/5670923\/USING%20LINEAR%20REGRESSION%20TO%20PREDICT%20A%20PITCHERS%20PERFORMANCE_image15.png\" alt=\"USING LINEAR REGRESSION TO PREDICT A PITCHERS PERFORMANCE_image15\" width=\"500\" \/><\/strong><\/p>\n<p>Looking at this, it appears spin rate does influence ERA, although the correlation is not as strong as it is for fastball speed. Salary does not have an impact and actually decreases the adjusted R-squared value of the model. Salary is often a representation of how long a player has been in the league, less so how well they perform.<\/p>\n<p style=\"font-size: 11px; padding-left: 40px;\"><strong>multivariate_lm_era &lt;- lm(formula = ERA ~ start_speed + spin_rate + spin_rate_cu, data = pitchers_no_outliers)<\/strong><\/p>\n<p style=\"font-size: 11px; padding-left: 40px;\"><strong>summary(multivariate_lm_era)<\/strong><\/p>\n<p style=\"font-size: 11px;\"><strong>Residuals:<\/strong><\/p>\n<p style=\"font-size: 11px; padding-left: 40px;\"><strong>Min 1Q Median 3Q Max<\/strong><\/p>\n<p style=\"font-size: 11px; padding-left: 40px;\"><strong>-2.23046 -0.77149 -0.07979 0.70964 3.08181<\/strong><\/p>\n<p style=\"font-size: 11px;\"><strong>Coefficients:<\/strong><\/p>\n<p><strong><span style=\"font-size: 11px;\"> Estimate Std. 
Error t value Pr(&gt;|t|)<\/span><\/strong><\/p>\n<p><strong><span style=\"font-size: 11px;\">(Intercept) 17.8763102 2.7985197 6.388 7.93e-10 ***<\/span><\/strong><\/p>\n<p><strong><span style=\"font-size: 11px;\">start_speed -0.1431117 0.0304353 -4.702 4.22e-06 ***<\/span><\/strong><\/p>\n<p><strong><span style=\"font-size: 11px;\">spin_rate -0.0001266 0.0002769 -0.457 0.6480<\/span><\/strong><\/p>\n<p><strong><span style=\"font-size: 11px;\">spin_rate_cu -0.0003309 0.0001638 -2.021 0.0444 *<\/span><\/strong><\/p>\n<p>---<\/p>\n<p style=\"font-size: 11px;\"><strong>Signif. codes: 0 \u2018***\u2019 0.001 \u2018**\u2019 0.01 \u2018*\u2019 0.05 \u2018.\u2019 0.1 \u2018 \u2019 1<\/strong><\/p>\n<p style=\"font-size: 11px;\"><strong>Residual standard error: 1.07 on 255 degrees of freedom<\/strong><\/p>\n<p style=\"font-size: 11px;\"><strong>(6 observations deleted due to missingness)<\/strong><\/p>\n<p style=\"font-size: 11px;\"><strong>Multiple R-squared: 0.09316, Adjusted R-squared: 0.08249<\/strong><\/p>\n<p style=\"font-size: 11px;\"><strong>F-statistic: 8.732 on 3 and 255 DF, p-value: 1.556e-05<\/strong><\/p>\n<p>&nbsp;<\/p>\n<p><strong>Conclusion<\/strong><\/p>\n<p>The new equation to predict ERA is:<\/p>\n<p style=\"font-size: 11px; padding-left: 40px;\"><strong>ERA = 17.876 \u2013 0.143*(start_speed) \u2013 0.0001266*(spin_rate) \u2013 0.0003309*(spin_rate_cu)<\/strong><\/p>\n<p>The very small p-value associated with start speed shows that fastball speed <strong>does<\/strong> have a statistically significant relationship with ERA. Spin rate on curveballs and sliders is also significant at the 0.05 level (p = 0.0444), while spin rate on fastballs is not (p = 0.648). However, the adjusted R-squared value is still only 0.0825, so there is still a lot that this model is unable to explain. So, while fastball speed is an important factor, it is certainly not the only one that is necessary to be a good pitcher.<\/p>\n<p>One could continue to improve the model from here. 
Some factors that we did not include but intuitively would influence performance are: location of pitches, variance of speed and\/or spin between pitches, pitch selection, and many others that may have yet to be discovered.<a href=\"https:\/\/3cloudsolutions.com\/get-started\/\"> Connect with us<\/a> to start the conversation and discover how 3Cloud expertise can transform your business\u2019s future.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Editor\u2019s Note: The post was originally published in [November, 2017] and has been updated for&mldr;<\/p>\n","protected":false},"author":74,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"content-type":"","footnotes":""},"categories":[260],"tags":[429],"class_list":["post-11554","post","type-post","status-publish","format-standard","hentry","category-data-ai","tag-data-and-ai","topics-blog"],"acf":[],"_links":{"self":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts\/11554","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/users\/74"}],"replies":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/comments?post=11554"}],"version-history":[{"count":0,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts\/11554\/revisions"}],"wp:attachment":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/media?parent=11554"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/categories?post=11554"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/tags?post=11554"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}