How to Build a Line Database
Apologies for not having provided any content lately (my tweets have certainly offended about ten users). I would have written this months ago, but I didn’t.
Let me preface this further by saying that, assuming one will be building a database on a local web server, I highly recommend running the server on a computer other than the primary one. I have an old Toshiba laptop that runs Debian (Debian 6.0 is the latest version) and sits in the back of my closet.
In a previous post I uploaded an Excel file that automatically extracts lines from Pinnacle and inserts them into an Access database on open (keep the file open and invoke the “Application.OnTime” VBA function for a recurring call, or set up a Windows Task Scheduler event). But that requires Windows, and ideally one would want a solution that works across operating systems. PHP with MySQL is one such solution. Linux users can simply install Apache, PHP, and MySQL from the repository. Windows or Mac users might want to look into downloading XAMPP. PHP is a server-side scripting language, so it operates via some sort of web server, such as Apache (if PHP is unfamiliar, just read the code carefully and it should be pretty straightforward). MySQL provides the database structure and a query language that can be interfaced with most programming languages. I would also suggest setting up an SSH connection from one computer on the network to the one running the server.
Here is my SQL table structure configured for baseball lines from Pinnacle (assuming a database has already been created):
CREATE TABLE IF NOT EXISTS `LINES` (
  `Date` varchar(55) NOT NULL,
  `vRot` varchar(5) NOT NULL,
  `Away` varchar(55) NOT NULL,
  `vListed` varchar(55) NOT NULL,
  `vLine` varchar(12) NOT NULL,
  `vTotal` varchar(12) NOT NULL,
  `vML` int(11) NOT NULL,
  `hRot` varchar(5) NOT NULL,
  `Home` varchar(55) NOT NULL,
  `hListed` varchar(55) NOT NULL,
  `hLine` varchar(12) NOT NULL,
  `hTotal` varchar(12) NOT NULL,
  `hML` int(11) NOT NULL,
  `nowTime` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  UNIQUE KEY `ID` (`Date`,`vRot`,`vListed`,`hRot`,`hListed`,`hML`,`hTotal`,`hLine`,`vML`,`vTotal`,`vLine`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
The ‘nowTime’ column automatically records the current time on data insert. This table is meant to accommodate those interested in tracking line movement, because Pinnacle’s XML updates every time new information is added. To take advantage of this, an intermittent call (Pinnacle requires at least 60 seconds between calls) can be made in whatever fashion is most convenient for the programmer (cron job, delayed loop…). And to avoid redundant database inserts, a unique key covering every line column, combined with the ‘INSERT IGNORE’ SQL command, is essential.
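To make the deduplication idea concrete, here is a minimal sketch in JavaScript (used only for illustration; the real work is done by MySQL). It mimics a unique key plus INSERT IGNORE with an in-memory set. The `createLineStore` helper and the sample row are hypothetical, not part of the actual setup.

```javascript
// Columns that make up the UNIQUE KEY of the LINES table.
const KEY_COLUMNS = ["Date", "vRot", "vListed", "hRot", "hListed",
                     "hML", "hTotal", "hLine", "vML", "vTotal", "vLine"];

// Hypothetical in-memory stand-in for the table: a snapshot is stored only
// when its key columns differ from every snapshot already stored.
function createLineStore() {
  const seen = new Set();
  const rows = [];
  return {
    rows,
    // Returns true if the row was stored, false if it was ignored.
    insertIgnore(row) {
      const key = JSON.stringify(KEY_COLUMNS.map((col) => String(row[col])));
      if (seen.has(key)) return false;
      seen.add(key);
      rows.push({ ...row, nowTime: new Date() });
      return true;
    },
  };
}

// Made-up snapshot, for illustration only.
const snapshot = {
  Date: "2012-04-05 23:05", vRot: "901", vListed: "A. Pitcher",
  hRot: "902", hListed: "B. Pitcher", hML: -120, hTotal: "8.5 -105",
  hLine: "-1.5 150", vML: 110, vTotal: "8.5 -115", vLine: "1.5 -170",
};

const store = createLineStore();
store.insertIgnore(snapshot);                             // stored
store.insertIgnore({ ...snapshot });                      // identical, ignored
store.insertIgnore({ ...snapshot, hML: -125, vML: 115 }); // line moved, stored
console.log(store.rows.length); // 2
```

Polling the feed every couple of minutes therefore only grows the table when something about a game's line has actually changed.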
Again, I’m using PHP, and here is my PHP code to grab MLB lines from Pinnacle and insert them into the above SQL table (my database name is ‘MLB’):
<?php
// Grab the MLB lines from Pinnacle's XML feed and store every new snapshot in MLB.LINES.
//error_reporting(0);
$host = 'localhost';
$username = 'USER';
$pswrd = 'PASS';

// mysqli replaces the old mysql_* functions, which were removed in PHP 7.
$con = mysqli_connect($host, $username, $pswrd, 'MLB');
if (!$con) {
    die('Could not connect: ' . mysqli_connect_error());
}

$xmldoc = new DOMDocument();
$url = 'http://xml.pinnaclesports.com/pinnacleFeed.aspx?sporttype=baseball&sportsubtype=MLB';
if (!$xmldoc->load($url)) {
    die('Could not load the Pinnacle feed');
}
$doc = $xmldoc->documentElement;
$event = $doc->getElementsByTagName("event");

// Text of the n-th <$tag> element inside $node, or "" when it is missing.
function tagValue($node, $tag, $n = 0) {
    $item = $node->getElementsByTagName($tag)->item($n);
    return $item ? $item->nodeValue : "";
}

// A prepared statement handles the quoting; INSERT IGNORE skips rows that
// would duplicate the table's UNIQUE KEY.
$sql = "INSERT IGNORE INTO MLB.LINES (Date,vRot,Away,vListed,vLine,vTotal,vML,hRot,Home,hListed,hLine,hTotal,hML) VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?)";
$stmt = mysqli_prepare($con, $sql);
if (!$stmt) {
    die('Could not prepare insert: ' . mysqli_error($con));
}

foreach ($event as $ev) {
    $ml_v = tagValue($ev, "moneyline_visiting");
    $ml_h = tagValue($ev, "moneyline_home");
    // Skip games with no moneyline posted yet.
    if ($ml_h == "") {
        continue;
    }
    $total_v = tagValue($ev, "total_points") . " " . tagValue($ev, "over_adjust");
    $total_h = tagValue($ev, "total_points") . " " . tagValue($ev, "under_adjust");
    $d = tagValue($ev, "event_datetimeGMT");
    // Item 0 is the visiting side and item 1 the home side.
    $name_v = str_replace("'", "", tagValue($ev, "participant_name", 0));
    $name_h = str_replace("'", "", tagValue($ev, "participant_name", 1));
    $rotv = tagValue($ev, "rotnum", 0);
    $roth = tagValue($ev, "rotnum", 1);
    $pitch_v = tagValue($ev, "pitcher", 0);
    $pitch_h = tagValue($ev, "pitcher", 1);
    $spread_v = tagValue($ev, "spread_visiting") . " " . tagValue($ev, "spread_adjust_visiting");
    $spread_h = tagValue($ev, "spread_home") . " " . tagValue($ev, "spread_adjust_home");

    // Bind all thirteen values as strings; MySQL converts the moneylines to integers.
    mysqli_stmt_bind_param($stmt, str_repeat('s', 13), $d, $rotv, $name_v, $pitch_v, $spread_v, $total_v, $ml_v, $roth, $name_h, $pitch_h, $spread_h, $total_h, $ml_h);
    if (!mysqli_stmt_execute($stmt)) {
        die('Could not insert values: ' . mysqli_stmt_error($stmt));
    }
}

mysqli_stmt_close($stmt);
mysqli_close($con);
You can use whatever language you want; some are more comfortable with Python, Perl, JavaScript, brainfuck, etc.
What is important is knowing how to access your MySQL database from the script and how to navigate the Pinnacle XML file.
As a parenthetical, I previously mentioned running a cron job. In Windows, one may have to use the Task Scheduler. On Mac or Linux, cron should already be set up; just edit the crontab file. For example, a Linux user simply has to type in a terminal:
crontab -e
And add the line:
*/2 * * * * /usr/bin/php path/to/php/file.php
This simply means that every two minutes (“*/2”) the PHP file will be run by the program “php.”
Now if everything works, we can start to present the lines in a nice HTML table. First, create a PHP file to query the database, grabbing the latest lines for each game listed at Pinnacle, and outputting the information in JSON format. This can be a bit tricky, but here is my solution (after connecting to a database with the name ‘MLB’):
...
// $con is the mysqli connection to the 'MLB' database opened earlier in this file.
// Latest snapshot for each game: join every row against the newest nowTime for its vRot.
$sql = "SELECT m.*
        FROM MLB.LINES AS m
        INNER JOIN (
            SELECT c.vRot, MAX(c.nowTime) AS maxtime
            FROM MLB.LINES AS c
            GROUP BY c.vRot
        ) AS a ON m.vRot = a.vRot AND m.nowTime = a.maxtime
        WHERE NOW() <= DATE_SUB(m.Date, INTERVAL 4 HOUR)";
$results = mysqli_query($con, $sql) or die('Error while executing query: ' . mysqli_error($con));

// Collect one entry per game, renaming Date to D for the front end.
$array = array();
while ($row = mysqli_fetch_assoc($results)) {
    $array[] = array(
        'D'       => $row['Date'],
        'hRot'    => $row['hRot'],
        'vRot'    => $row['vRot'],
        'Away'    => $row['Away'],
        'Home'    => $row['Home'],
        'vListed' => $row['vListed'],
        'hListed' => $row['hListed'],
        'vML'     => $row['vML'],
        'hML'     => $row['hML'],
        'vTotal'  => $row['vTotal'],
        'hTotal'  => $row['hTotal'],
        'vLine'   => $row['vLine'],
        'hLine'   => $row['hLine'],
    );
}
header('Content-type: application/json');
echo json_encode($array);
...
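The "latest line per game" idea behind that query can be sketched outside of SQL as well. This is a minimal JavaScript illustration, assuming each row carries `vRot` and a `nowTime` string in "YYYY-MM-DD HH:MM:SS" form (so plain string comparison orders them correctly). The `latestPerGame` name and the sample rows are hypothetical.

```javascript
// Keep only the most recent snapshot for each visiting rotation number,
// which is what the GROUP BY / MAX(nowTime) join in the SQL does.
function latestPerGame(rows) {
  const latest = new Map();
  for (const row of rows) {
    const current = latest.get(row.vRot);
    if (!current || row.nowTime > current.nowTime) {
      latest.set(row.vRot, row);
    }
  }
  return [...latest.values()];
}

// Made-up snapshots, for illustration only.
const snapshots = [
  { vRot: "901", vML: 110, nowTime: "2012-04-05 14:00:00" },
  { vRot: "901", vML: 115, nowTime: "2012-04-05 16:30:00" },
  { vRot: "903", vML: -130, nowTime: "2012-04-05 15:10:00" },
];

console.log(latestPerGame(snapshots));
```

Doing this in the database rather than in the browser keeps the JSON payload small, since only one row per game is sent.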
The advantage of jQuery is that it can make background calls to another file on the server and, at the same time, read and parse the information returned by that file. The “.getJSON” method makes this possible; wrapped in a “setInterval” timer, it calls the PHP file “MLB_Pinny.php” every 120 seconds (120000 milliseconds):
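// The date helpers used here (toString with a format, addHours, Date.today,
// getDayName) are not built into JavaScript; they need a library such as Date.js.
jQuery(document).ready(function ($) {
    var timeout;

    function getPinny() {
        // The timestamp in the query string keeps the browser from caching the response.
        $.getJSON('php/MLB_Pinny.php?' + new Date().getTime(), function (json_data) {
            var update = new Date();
            $('#update td:first').text('LAST UPDATE: ' + update.toString("yyyy-MM-dd h:mm"));
            // Remove every row except the header row of each section.
            $('#today tr:not(:first)').remove();
            $('#tomorrow tr:not(:first)').remove();
            $.each(json_data, function (i, item) {
                // Game times are stored in GMT; shift them back four hours.
                var d = Date.parse(item.D).addHours(-4);
                var cur = (d.getDayName() == Date.today().getDayName()) ? "today" : "tomorrow";
                // Two rows per game: visiting team first, home team second.
                $("#" + cur).append($(
                    '<tr><td rowspan="2">' + d.toString("yyyy-MM-dd h:mm") + '</td>' +
                    '<td>' + item.vRot + '</td>' +
                    '<td class="team">' + item.Away + '</td>' +
                    '<td class="pitch">' + item.vListed + '</td>' +
                    '<td class="ml">' + item.vML + '</td>' +
                    '<td>' + item.vTotal + '</td>' +
                    '<td>' + item.vLine + '</td></tr>' +
                    '<tr><td>' + item.hRot + '</td>' +
                    '<td class="team">' + item.Home + '</td>' +
                    '<td class="pitch">' + item.hListed + '</td>' +
                    '<td class="ml">' + item.hML + '</td>' +
                    '<td>' + item.hTotal + '</td>' +
                    '<td>' + item.hLine + '</td></tr>'
                ));
            });
        });
    }

    // Load once immediately, then refresh every 120 seconds.
    getPinny();
    timeout = setInterval(getPinny, 120000);
});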
Mine looks like this:
I have two “tbody” sections, one with id=“today” and the other with id=“tomorrow”. This should be self-explanatory.
Feel free to add some table enhancements (in this case, a toggle):
// Clicking the header row of a section shows or hides the rows below it.
$('#today').children('tr:eq(0)').click(function () {
    $('#today').children('tr:gt(0)').toggle();
});
$('#tomorrow').children('tr:eq(0)').click(function () {
    $('#tomorrow').children('tr:gt(0)').toggle();
});
It would be nice to include the ability to query a pitcher’s closing lines for each start:
...
// $con is the mysqli connection to the 'MLB' database opened earlier in this file.
if (isset($_GET['pitch'])) {
    // The front end sends "F%Last"; turn it back into "F. Last".
    $query = str_replace("%", ". ", $_GET['pitch']);
    // Escape the value before placing it in the query.
    $safe = mysqli_real_escape_string($con, $query);
    $sql = "SELECT *
            FROM MLB.LINES AS m
            WHERE vListed LIKE '%$safe%' OR hListed LIKE '%$safe%'
            ORDER BY nowTime DESC";
} else {
    header('HTTP/1.1 400 Bad Request');
    die('Please use correct parameters');
}
$result = mysqli_query($con, $sql) or die('Error while executing query: ' . mysqli_error($con) . "\n");

echo '<html><head></head><body> <table border=1><thead><tr><th>Date</th><th>Away</th><th>vListed</th><th>vML</th><th>Home</th><th>hListed</th><th>hML</th></tr></thead><tbody>';
$d = null;
while ($row = mysqli_fetch_assoc($result)) {
    // Rows are newest first, so the first row seen for a game date is its closing line.
    if ($d == $row['Date']) continue;
    // Bold whichever side the queried pitcher started for.
    if ($row['vListed'] == $query) {
        $boldvml = "<strong>" . $row['vML'] . "</strong>";
        $boldvnm = "<strong>" . $row['vListed'] . "</strong>";
        $boldvtm = "<strong>" . $row['Away'] . "</strong>";
        $boldhml = $row['hML'];
        $boldhnm = $row['hListed'];
        $boldhtm = $row['Home'];
    } else {
        $boldhml = "<strong>" . $row['hML'] . "</strong>";
        $boldhnm = "<strong>" . $row['hListed'] . "</strong>";
        $boldhtm = "<strong>" . $row['Home'] . "</strong>";
        $boldvml = $row['vML'];
        $boldvnm = $row['vListed'];
        $boldvtm = $row['Away'];
    }
    $d = $row['Date'];
    // Shift the GMT start time back four hours for display.
    echo '<tr><td>' . date("Y-m-d g:i", strtotime($row['Date'] . ' -4 hours')) . '</td><td>' . $boldvtm . '</td><td>' . $boldvnm . '</td><td>' . $boldvml . '</td><td>' . $boldhtm . '</td><td>' . $boldhnm . '</td><td>' . $boldhml . '</td></tr>';
}
echo '</tbody></table></body></html>';
...
Additionally, the cell with the starter’s name in the HTML table needs to be clickable. jQuery can do this:
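// Delegated handler, so it also covers rows added after each refresh.
$(document).on('click', '.pitch', function () {
    // "F. Last" becomes "F%Last"; encodeURIComponent keeps that "%" intact in the URL.
    var name = $(this).text().replace(" ", "%").replace(".", "");
    window.open('php/linedb.php?pitch=' + encodeURIComponent(name));
});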
Occasionally, Pinnacle lists a starter in the format “F LAST” rather than “F. Last”; this usually occurs when there is a late change in the listed starter or the pitcher is making his/her first start. Hence, there are some minor whitespace and trimming issues that for now seem to be resolved by some of the above code.
Hopefully what all this accomplishes is a personal Pinnacle line service, one that updates every 60+ seconds without having to refresh the browser or re-run a query. One could easily adapt the PHP code for different sports. Obviously, basketball and football do not have listed starters; other than that, the PHP code should work fine once pointed to the relevant Pinnacle XML file (or any other sportsbook’s).
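One way to handle the two name formats more explicitly would be to normalize them before storing or querying. This is only a sketch, not what the code above does: `normalizeStarter` is a hypothetical helper, and it assumes names consist of a single initial followed by a surname.

```javascript
// Normalize "F LAST" or " F.  Last " to the "F. Last" form.
// Anything that does not look like "initial + surname" is returned trimmed.
function normalizeStarter(name) {
  const trimmed = name.trim().replace(/\s+/g, " ");
  const match = trimmed.match(/^([A-Za-z])\.?\s+(.+)$/);
  if (!match) return trimmed;

  const initial = match[1].toUpperCase();
  let last = match[2];
  // Only re-case surnames that arrive fully upper-case, so that names
  // like "McCarthy" keep their own capitalization.
  if (!/[a-z]/.test(last)) {
    last = last
      .toLowerCase()
      .replace(/(^|[\s'-])([a-z])/g, (m, sep, ch) => sep + ch.toUpperCase());
  }
  return initial + ". " + last;
}

console.log(normalizeStarter("F LAST"));      // "F. Last"
console.log(normalizeStarter("  F.  Last ")); // "F. Last"
```

A fully upper-case "B MCCARTHY" would come out as "B. Mccarthy", so a lookup table of known spellings would still be needed for names like that.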
I haven’t updated this in a while, but on my github account there is a “SP-DATABASE” project. More importantly, various PHP and MySQL files are provided that can be used independently of the html front-end, and provide a template to abuse Pinnacle.
NCAA Tourney KP vs Pinny
Posted by Rufio Magillicutty in Betting, NCAAB, Pinnacle on March 15, 2012
Same thing as conference tournaments. SEC Field hit at 3/1 odds, the other four lost. A brief survey of a hypothetical bankroll outcome demonstrated the prodigious and frightening force of the Kelly Criterion and all the emotional turmoil likely to beget its constituency. Flat bettors would have come away in the negative, but with an air of optimism and satisfaction having lingered for hitting a future.
KenPom’s LOG5 predictions are here. If you don’t know what that means, to wit:
LOG5 = (a - a * b) / (a + b - 2 * a * b)
“a” and “b” here are winning percentages. KenPom uses his Pythagorean winning percentages calculated by PPP and tempo rather than just points scored for and against, with an exponent of around 12.
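As a quick illustration of the formula, here is a minimal JavaScript sketch. The `log5` function follows the expression above directly; `pythagorean` is the generic Pythagorean form with the exponent passed in, and the efficiency numbers in the example are made up rather than taken from KenPom.

```javascript
// Probability that a team with winning percentage a beats a team with
// winning percentage b, per the LOG5 formula.
function log5(a, b) {
  return (a - a * b) / (a + b - 2 * a * b);
}

// Generic Pythagorean winning percentage from scoring for and against.
// The exponent is an input; about 12 is the figure quoted in the text.
function pythagorean(scoredFor, scoredAgainst, exponent) {
  const f = Math.pow(scoredFor, exponent);
  const a = Math.pow(scoredAgainst, exponent);
  return f / (f + a);
}

// Hypothetical points-per-possession figures, for illustration only.
const teamA = pythagorean(1.15, 0.95, 12);
const teamB = pythagorean(1.08, 1.00, 12);
console.log(log5(teamA, teamB));
```

Two useful sanity checks: against a .500 opponent the result is just the team's own winning percentage, and `log5(a, b) + log5(b, a)` always equals 1.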
(Numbers in each cell represent percentages sans the non-obligatory “%” symbol).
| TOP 5 | | | |
| REGION | KP-P | CHAMP | KP-P |
| Ohio St | 10.54 | Ohio St | 3.55 |
| Mich St | 7.64 | Wisconsin | 2.24 |
| Wisconsin | 6.97 | Mich St | 1.98 |
| Kansas | 6.8 | Kansas | 1.74 |
| Indiana | 3.38 | Indiana | 0.67 |
Mr. Pomeroy “likes” the Big Ten, Pinnacle doesn’t.
| SOUTH | | | | | | |
| | KP | | PINNY | | KP-P | |
| TEAM | REGION | CHAMP | REGION | CHAMP | REGION | CHAMP |
| Kentucky | 47.9 | 19.7 | 47.4 | 27.78 | 0.5 | -8.08 |
| Wichita St | 11.8 | 2.6 | 8.43 | 2.32 | 3.37 | 0.28 |
| Indiana | 9.2 | 1.7 | 5.82 | 1.03 | 3.38 | 0.67 |
| Baylor | 10.9 | 1.7 | 12.08 | 2.82 | -1.18 | -1.12 |
| Duke | 9.5 | 1.7 | 12.08 | 4.8 | -2.58 | -3.1 |
| UNLV | 3 | 0.2 | 3.51 | 0.73 | -0.51 | -0.53 |
| Iowa St. | 1.7 | 0.1 | 1.31 | 0.42 | 0.39 | -0.32 |
| Notre Dame | 1.9 | 0.1 | 1.96 | 0.44 | -0.06 | -0.34 |
| Uconn | 0.9 | 0.06 | 2.58 | 1.07 | -1.68 | -1.01 |
| Xavier | 0.09 | 0.04 | 1.32 | 0.43 | -1.23 | -0.39 |
| S Dakota St. | 0.8 | 0.03 | 0.41 | 0.29 | 0.39 | -0.26 |
| VCU | 0.5 | 0.02 | 0.79 | 0.29 | -0.29 | -0.27 |
| Colorado | 0.4 | 0.01 | 0.67 | 0.29 | -0.27 | -0.28 |
| NMSU | 0.3 | 0.01 | 0.41 | 0.35 | -0.11 | -0.34 |
| Lehigh | 0.3 | 0.007 | 0.4 | 0.21 | -0.1 | -0.203 |
| WKY | 0.001 | | 0.82 | 0.32 | -0.819 | -0.32 |
| MIDWEST | | | | | | |
| | KP | | PINNY | | KP-P | |
| TEAM | REGION | CHAMP | REGION | CHAMP | REGION | CHAMP |
| UNC | 28.5 | 6.6 | 32.95 | 13.64 | -4.45 | -7.04 |
| Kansas | 33.7 | 9.1 | 26.9 | 7.36 | 6.8 | 1.74 |
| Gtown | 9.7 | 1.4 | 7.31 | 1.45 | 2.39 | -0.05 |
| Michigan | 5.7 | 0.5 | 4.57 | 0.88 | 1.13 | -0.38 |
| Temple | 2.3 | 0.1 | 3.92 | 0.64 | -1.62 | -0.54 |
| SDSU | 0.9 | 0.03 | 2.65 | 0.52 | -1.75 | -0.49 |
| St. Mary’s | 1.2 | 0.05 | 2.65 | 0.59 | -1.45 | -0.54 |
| Creighton | 2 | 0.1 | 1.61 | 0.43 | 0.39 | -0.33 |
| Alabama | 3.1 | 0.2 | 2.04 | 0.57 | 1.06 | -0.37 |
| Purdue | 3.9 | 0.3 | 3.92 | 0.73 | -0.02 | -0.43 |
| NC State | 1.5 | 0.07 | 4.57 | 0.73 | -3.07 | -0.66 |
| USF | 0.3 | 0.008 | 0.81 | 0.66 | -0.51 | -0.652 |
| Ohio | 0.5 | 0.01 | 0.81 | 0.29 | -0.31 | -0.28 |
| Belmont | 4 | 0.03 | 3.92 | 0.85 | 0.08 | -0.82 |
| Detroit | 0.07 | | 0.54 | 0.21 | -0.47 | -0.21 |
| Vermont | 0.03 | | 0.84 | 0.39 | -0.81 | -0.39 |
| WEST | | | | | | |
| | KP | | PINNY | | KP-P | |
| TEAM | REGION | CHAMP | REGION | CHAMP | REGION | CHAMP |
| Mich St | 35.2 | 12.4 | 27.56 | 10.42 | 7.64 | 1.98 |
| Missouri | 23.1 | 5.3 | 22.63 | 8.31 | 0.47 | -3.01 |
| Memphis | 8.2 | 1.7 | 5.67 | 1.61 | 2.53 | 0.09 |
| New Mexico | 7.1 | 1 | 7.84 | 1.33 | -0.74 | -0.33 |
| Marquette | 7.5 | 0.9 | 9 | 2.34 | -1.5 | -1.44 |
| Loserville | 4.7 | 0.5 | 9.08 | 2.61 | -4.38 | -2.11 |
| Florida | 4.4 | 0.5 | 3.97 | 0.8 | 0.43 | -0.3 |
| St. Louis | 3.4 | 0.5 | 2.2 | 0.57 | 1.2 | -0.07 |
| Virginia | 2.5 | 0.2 | 1.78 | 0.43 | 0.72 | -0.23 |
| Murray St. | 1.4 | 0.07 | 3.05 | 0.73 | -1.65 | -0.66 |
| LBSU | 1 | 0.06 | 1.3 | 0.29 | -0.3 | -0.23 |
| BYU | 0.5 | 0.02 | 3.91 | 0.97 | -3.41 | -0.95 |
| Davidson | 0.3 | 0.009 | 0.71 | 0.29 | -0.41 | -0.281 |
| Colorado St. | 0.4 | 0.008 | 0.52 | 0.29 | -0.12 | -0.282 |
| LIU | 0.003 | | 0.39 | 0.17 | -0.387 | -0.17 |
| Norfolk St | 0.0001 | | 0.39 | 0.21 | -0.3899 | -0.21 |
| EAST | | | | | | |
| | KP | | PINNY | | KP-P | |
| TEAM | REGION | CHAMP | REGION | CHAMP | REGION | CHAMP |
| Syracuse | 17.5 | 4.4 | 18.22 | 5.72 | -0.72 | -1.32 |
| Ohio St | 45.9 | 19.3 | 35.36 | 15.75 | 10.54 | 3.55 |
| FSU | 3.9 | 0.5 | 9.29 | 4.08 | -5.39 | -3.58 |
| Wisconsin | 16.2 | 4.2 | 9.23 | 1.96 | 6.97 | 2.24 |
| Vanderbilt | 4.9 | 0.8 | 7.92 | 2.81 | -3.02 | -2.01 |
| Cincinnati | 1.8 | 0.2 | 4.39 | 1.03 | -2.59 | -0.83 |
| Gonzaga | 1.7 | 0.1 | 2.4 | 0.59 | -0.7 | -0.49 |
| Kansas St | 3.4 | 0.4 | 4.39 | 0.98 | -0.99 | -0.58 |
| S. Miss | 0.2 | 0.006 | 0.98 | 0.34 | -0.78 | -0.334 |
| WVU | 0.8 | 0.05 | 2.4 | 0.59 | -1.6 | -0.54 |
| Texas | 2.3 | 0.2 | 2.2 | 0.52 | 0.1 | -0.32 |
| Harvard | 0.7 | 0.04 | 1.11 | 0.29 | -0.41 | -0.25 |
| Montana | 0.09 | 0.002 | 0.79 | 0.29 | -0.7 | -0.288 |
| St. Bona | 0.6 | 0.03 | 0.53 | 0.29 | 0.07 | -0.26 |
| Loyola | 0.02 | | 0.4 | 0.17 | -0.38 | -0.17 |
| UNC-Ashe | 0.03 | | 0.4 | 0.17 | -0.37 | -0.17 |
AL MVP Update and Voting Trends
Posted by Rufio Magillicutty in Betting, MLB on August 27, 2011
The AL is actually much easier to deal with because there is no “Barry Bonds” factor. Regardless, the formula has been changing daily, and after some thoughtful and sensible analysis I’ve arrived at the conclusion that voters are not consistent evaluators of MVP candidacy. There are relationships to be found between the distribution of voting points and the metrics that we use to gauge player performance, but that is only because there are only about five players each season that could even be considered. From there the selection of the ultimate winner is mostly driven by the motives of the people voting, and where their loyalties lie (See 2006 AL MVP). To elucidate this concept, I created a graph showing WAR for each winner and average WAR for the top 5 since 1990. Now obviously I wouldn’t expect a straight line from left to right, nor a steady increase. The concept being elucidated is not one to show fault of the voters, but of the unpredictability of how voters view an MVP winner. It appears to change from year to year.
At first glance, one might think this simply is a representation of fluctuating talent. The statistic itself adjusts for league wide scoring trends for each particular season, and with each team having access to the diverse international talent pool, the average bio-mechanical limits of players are at a league-wide equilibrium, and have been since the talent pool expanded decades ago. Other than the steroid jerk from around 1998-2004, player ability, as betrayed by the left side of the graph above, hasn’t increased nor decreased drastically in any given year. The year 2000 appears to be the only anomaly on the graph, steroids notwithstanding, and Pedro Martinez and his ridiculous 10.3 WAR (4th MVP) is enough to explain the spike. Stephen Jay Gould would be proud.
Statistics are becoming more and more sophisticated, and writers/bloggers are doing whatever they can to appear more sophisticated. Thus many of them have embraced WAR among other saber-stats. Because of this general propensity, I anticipated the lower WAR values for MVP winners to be from the 90s. To some degree this is true. Dennis Eckersley won the MVP in 1992, with a WAR of 3, outstanding for a relief pitcher (WAR is a counting statistic, so relievers have lower WARs by default). And Bill James will be happy to know swing-happy Juan Gonzalez has the lowest WAR for any MVP winner since 1990, at 2.8. But there is nothing else one can take from the graph other than randomness. Even the two highest WARs are from 1990 and 1991, Henderson and Ripken respectively. Obviously I didn’t expect the creation of WAR to bring an overall increase in player ability, which would just be silly. I don’t know what I expected. Though it seems I should increase my sample size to span those years dating back to 1990, and probably further, rather than only using the last eleven seasons. Having said that, the current formula correctly selected eight of the last eleven MVP winners, so all the extra effort would probably be wasted energy. I’m only doing this to find value.
As I said in the previous post, I separated the MVP candidates into three groups: hitters, starting pitchers, and relief pitchers. This should be obvious enough, as the metrics used to define the best players in each category are drastically different.
I had been entertaining the idea of including WPA (Win probability added). Intuitively it makes sense that WPA is strongly linked to standard measures of offensive ability (AVG, HRS, RBI), as most events within a game occur when the run differential is within three runs. However, pitchers aren’t always in control of their statistical fate. At the same time WPA is taken directly from each individual event. Imagine a starting pitcher up 3-2 in the 7th inning with two outs leaving the game with runners on first and second. His replacement promptly surrenders a three-run HR. Two runs are charged to the starting pitcher, hurting his ERA, and he is now in line for the loss. At the same time, his WPA has not changed from another pitcher’s event. The last measurement taken for his WPA was whatever occurred with the batter before being replaced. Because of this, there is a conspicuous asymmetry in the relationship between raw statistics and WPA.
Obviously there are situations when hitters could see an increase in WPA while seeing a reduction in AVG, perhaps due to an error by the defense. But the impact is not as severe.
Team wins is another variable I had used, but are team wins indicative of voting trends or merely a by-product of the best players playing on the best teams? If I replace team wins with just a binary appropriation of playoff outlook (0 for no, 1 for yes), the table is more in agreement with intuition while possessing similar descriptive statistics. Take a quick gander at the last MVP update and you’ll understand why I replaced team wins with a yes/no playoff variable. Human thought can occasionally outwit statistics, as long as it suits one’s agenda.
Batter: Playoff, WAR, WPA, BA, HR, RBI, C
Pitcher: Playoff, WAR, WHIP, W%, C
| NAME | TEAM | PROB | ODDS |
| Adrian Gonzalez | BOS | 25.61% | 290 |
| Jose Bautista | TOR | 23.61% | 324 |
| Jacoby Ellsbury | BOS | 21.97% | 355 |
| Curtis Granderson | NYY | 19.79% | 405 |
| Dustin Pedroia | BOS | 15.84% | 531 |
| Robinson Cano | NYY | 13.25% | 654 |
| Miguel Cabrera | DET | 13.20% | 658 |
| Justin Verlander | DET | 12.32% | 711 |
| David Ortiz | BOS | 11.48% | 771 |
| Josh Hamilton | TEX | 11.03% | 806 |
| Michael Young | TEX | 9.27% | 979 |
| Jered Weaver | LAA | 8.65% | 1056 |
The variables above represent a trend from 2000-2010, therefore some statistics, like ERA, do not translate to voting points to a certain degree. On a couple of occasions, a pitcher with a 4+ ERA received voting points, and the only reason WHIP is included is due to its lower overall variance. Nonetheless it works much better in this particular formula, and I can’t control what the voters decide. Again, I’m trying to find value based on historical data. Those who think Verlander is too low should consider that I only used data from 2000-2010, a span in which no pitcher won the MVP. If/when I include seasons dating back thirty years, Verlander’s odds may increase slightly.
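For anyone wondering how the PROB and ODDS columns relate: the ODDS values look consistent, up to rounding of the displayed percentages, with the standard conversion from a probability to American (moneyline) odds. That is an inference from the numbers, not something stated in the post. A minimal JavaScript sketch of the conversion in both directions:

```javascript
// Convert a win probability (0 < p < 1) to fair American odds.
// Underdogs get a positive number, favorites a negative one.
function probabilityToAmericanOdds(p) {
  if (!(p > 0 && p < 1)) {
    throw new RangeError("probability must be strictly between 0 and 1");
  }
  return p < 0.5
    ? Math.round((100 * (1 - p)) / p)
    : -Math.round((100 * p) / (1 - p));
}

// Convert American odds back to the implied probability.
function americanOddsToProbability(odds) {
  return odds > 0 ? 100 / (odds + 100) : -odds / (-odds + 100);
}

console.log(probabilityToAmericanOdds(0.2561)); // 290, matching the first row of the table
console.log(americanOddsToProbability(290));
```

Comparing these fair odds against the prices a sportsbook actually posts is where any value would show up.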
SP Line, WAR, and WPA
Posted by Rufio Magillicutty in Betting, MLB on August 23, 2011
Before I compared the three statistics (Line, WAR, WPA), I wanted to remove as many performance-independent factors that go into a pitcher’s average line as I possibly could. There are some things that are just out of the pitcher’s control. A pitcher who started 10 games at home and 6 on the road will have about a 20% advantage in their Vegas probability before anything else is taken into account. To adjust for home/road start discrepancy, I just multiplied the difference in home/road starts by .025, took the aggregate line, and divided by number of starts. Since HFA is set at 5%, each pitcher will have an increase or decrease of 2.5% in their line based on where they are pitching. I also had to adjust for opponents faced. This was fairly easy; the information is already in the SP report table, and an average pitcher has a Vegas probability of .5. From there the calculation is elementary.
Obviously there are other things that go into line appropriation. One being public perception, which is hard to quantify. Linemakers have a panoply of information from which to draw, I would assume. I wouldn’t be surprised if there are some that keep a database of blink duration for each player, and any peaks or troughs in duration that a player may endure throughout the course of the year. Perhaps there is some relationship between change in blink duration and performance? It’s a curious thing, simultaneous blinking. Five percent of our lives are spent walking around with our eyes closed. A sequential blinker may have an advantage in avoiding any impending danger projected, such as a spear or a rock. Why there are no sequential blinkers I don’t know. One would think sequential blinkers would reproduce differentially and would be victorious in pairwise contests with simultaneous blinkers. Or perhaps not? Maybe the sequential blinking mutation just never occurred. It’s possible blinking sequentially is an impossibility, an insight deeply rooted in the bilateral symmetry of vertebrates, or the eye protein of all organisms that are motile through a transparent spectrum.
A severe tangent, a devastating yet fascinating ramble. I can say whatever I want; it’s my blog. I would actually be willing to do a research project on the correlation between blink duration and player ability; unfortunately nobody is stupid enough to commission such an important and groundbreaking research project, and I’m not going to do it for free.
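A rough sketch of the home/road adjustment in JavaScript, under one reading of the description above: sum the line-implied probabilities, subtract .025 for each start by which home starts outnumber road starts, and divide by the number of starts. The function name, the input shape, and the sample numbers are all hypothetical, and the opponent adjustment is left out.

```javascript
// Half of the 5% home-field advantage, applied per start.
const HOME_FIELD_SHIFT = 0.025;

// starts: array of { prob: line-implied win probability, home: boolean }.
// Returns the average probability with the home/road imbalance removed.
function homeRoadAdjustedLine(starts) {
  if (starts.length === 0) {
    throw new Error("at least one start is required");
  }
  const aggregate = starts.reduce((sum, s) => sum + s.prob, 0);
  const homeStarts = starts.filter((s) => s.home).length;
  const roadStarts = starts.length - homeStarts;
  return (aggregate - HOME_FIELD_SHIFT * (homeStarts - roadStarts)) / starts.length;
}

// Made-up example: a pitcher who would be a coin flip on a neutral field.
const sampleStarts = [
  { prob: 0.525, home: true },
  { prob: 0.525, home: true },
  { prob: 0.475, home: false },
];
console.log(homeRoadAdjustedLine(sampleStarts));
```

In the example the raw average is a little above .5 only because two of the three starts were at home; the adjustment brings it back to roughly .5.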
The graphs below are actually pretty interesting, as the three statistics measure player ability from three different angles. I extracted the WAR and WPA stats from Fangraphs, using only qualified players to limit any variance and outliers. As expected, the three appear to be highly correlated with one another. WAR measures raw performance, WPA measures situational performance, and SP Line, though enigmatic, can be seen as a measure of public perception. Again, three different angles of assessing player ability. The R value is for all qualified players. Descriptive statistics at this point are limited by sample size but I don’t see any reason why with more data comes a lower proportion of variance that can be explained with the relationship. Especially with what statistics are being looked at here.
The graphs basically have the same topographical qualities, which is interesting because WPA explicitly handles quantifying specific events during the course of a game, and fundamentally does not resolve player ability unlike WAR. However, since most events during the game occur while the run differential is plus or minus three, a player’s statistics will in all likelihood indicate what kind of WPA is to be expected. There are exceptions, of course (cough cough Arod cough cough, it should be noted Arod’s best WPA season was in 2007, finishing first that year in WPA and winning his third AL MVP award, further validating my inclusion of WPA into the MVP odds formula).
AL/NL MVP Update with WPA
Posted by Rufio Magillicutty in Betting, MLB, MVP on August 8, 2011
The way the formula works, if a player is having an above average season on a great team, then they project favorably in the MVP predictor. Having said that, the formula allows for players having exceptional seasons on mediocre teams to make an impact on the distribution of voting points.
Last post was filled with out-loud ruminations on WPA and how it appears to correlate highly with the eventual MVP winners. Appearances can be converted to number form thanks to the invention of statistics. The linear correlation coefficient for the AL was .47, and for the NL it was .54. That’s a statistically significant relationship, which suggests that at some point voting points and WPA diverge from being independent data sets. The same can be said for WAR, which has a coefficient of .49 for the AL, and .57 for the NL.
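For reference, the linear (Pearson) correlation coefficient can be computed in a few lines. This JavaScript sketch is generic and is not the exact procedure used for the figures above. Since the voting-point data is not reproduced here, the example pairs the WPA and WAR values of the top five players in the AL table below, purely to exercise the function.

```javascript
// Pearson correlation coefficient between two equally long number arrays.
function pearson(xs, ys) {
  if (xs.length !== ys.length || xs.length < 2) {
    throw new Error("need two arrays of equal length with at least two values");
  }
  const n = xs.length;
  const meanX = xs.reduce((sum, x) => sum + x, 0) / n;
  const meanY = ys.reduce((sum, y) => sum + y, 0) / n;

  let sxy = 0;
  let sxx = 0;
  let syy = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - meanX;
    const dy = ys[i] - meanY;
    sxy += dx * dy;
    sxx += dx * dx;
    syy += dy * dy;
  }
  return sxy / Math.sqrt(sxx * syy);
}

// WPA and WAR for Ellsbury, Gonzalez, Bautista, Pedroia, and Granderson.
const wpa = [4.01, 3, 6.14, 2.42, 2.45];
const war = [5.6, 5.1, 6.8, 6.2, 3.5];
console.log(pearson(wpa, war));
```

A value near 1 or -1 means the two series move together almost linearly, while a value near 0 means there is little linear relationship.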
WAR (bref version) was already included in the set of variables used for regression, and adding WPA appears to resolve more of the variance in voting points than before.
Here is the new AL MVP table:
| NAME | TEAM | PROB | ODDS | WPA | WAR |
| Jacoby Ellsbury | BOS | 27.69% | 261 | 4.01 | 5.6 |
| Adrian Gonzalez | BOS | 27.68% | 261 | 3 | 5.1 |
| Jose Bautista | TOR | 24.03% | 316 | 6.14 | 6.8 |
| Dustin Pedroia | BOS | 19.82% | 405 | 2.42 | 6.2 |
| Curtis Granderson | NYY | 18.52% | 440 | 2.45 | 3.5 |
| Miguel Cabrera | DET | 16.29% | 514 | 3.85 | 4.1 |
| Mark Teixeira | NYY | 12.64% | 691 | 0.46 | 2.4 |
| Kevin Youkilis | BOS | 11.03% | 807 | 2.04 | 4.3 |
| Robinson Cano | NYY | 10.50% | 853 | 1.06 | 2.9 |
| Josh Hamilton | TEX | 9.26% | 980 | 3.84 | 1.9 |
| Michael Young | TEX | 6.29% | 1491 | 2.48 | 2.2 |
| David Ortiz | BOS | 6.21% | 1511 | 0.05 | 2.2 |
| Alex Rodriguez | NYY | 5.21% | 1819 | -0.16 | 3.2 |
| Paul Konerko | CHW | 3.21% | 3011 | 1.56 | 3.2 |
| Adrian Beltre | TEX | 1.64% | 6007 | 0.5 | 3.9 |
Jacoby Ellsbury has made quite a surge lately, and when compared to the formula that doesn’t use WPA, his probability almost doubles. I should point out that I recently added stolen bases to the equation as well, which would explain why his chances increase by 100%, while Bautista, who has the highest WPA in the AL, only increases 2%.
NL:
| NAME | TEAM | WAR | WPA | PROB | ODDS |
| Ryan Braun | MIL | 5.4 | 7.61 | 23.08% | 333 |
| Prince Fielder | MIL | 4 | 5.63 | 20.09% | 397 |
| Matt Kemp | LAD | 6.5 | 9.32 | 20.05% | 398 |
| Ryan Howard | PHI | 2 | 2.84 | 17.59% | 468 |
| Hunter Pence | PHI | 3.3 | 4.69 | 13.10% | 663 |
| Lance Berkman | STL | 3.6 | 5.07 | 11.95% | 737 |
| Albert Pujols | STL | 3.4 | 4.79 | 11.72% | 753 |
| Shane Victorino | PHI | 4.3 | 6.11 | 9.77% | 923 |
| Jonny Venters | ATL | 3.3 | 4.65 | 9.65% | 936 |
| Justin Upton | ARI | 3.6 | 5.12 | 8.52% | 1073 |
| Jimmy Rollins | PHI | 3 | 4.26 | 6.87% | 1356 |
| Vance Worley | PHI | 2.3 | 3.27 | 6.83% | 1363 |
| Joey Votto | CIN | 4.3 | 6.11 | 5.87% | 1603 |
| Ryan Madson | PHI | 1.6 | 2.27 | 5.80% | 1624 |
| Cole Hamels | PHI | -0.2 | -0.28 | 5.53% | 1708 |
| Matt Holliday | STL | 4.6 | 6.48 | 5.43% | 1742 |
| Roy Halladay | PHI | -0.4 | -0.57 | 4.56% | 2092 |
| Brian McCann | ATL | 2.7 | 3.80 | 4.01% | 2396 |
| Antonio Bastardo | PHI | 0 | 0.00 | 3.98% | 2412 |
| Cliff Lee | PHI | 0.4 | 0.57 | 3.43% | 2817 |
| Troy Tulowitzki | COL | 4.5 | 6.34 | 0.86% | 11546 |
| Freddie Freeman | ATL | 1.4 | 1.97 | 0.84% | 11787 |
| Rickie Weeks | MIL | 2.7 | 3.80 | 0.47% | 20961 |


