Loading...


NBA Data from Covers

Here is a little PHP script that grabs NBA data from covers.com (for example, here) and throws it over to a MySQL database. Each game is identified with a unique ‘GmNAME’ so duplicates are ignored. However this doesn’t happen until the MySQL query. It would be ideal if PHP handled this operation, would speed up the script considerably, but I just threw it together and only thought of that while writing this.

Open the php file in a text editor, enter your MySQL connection parameters, and the desired span of years for which to get data. Uncomment the lines that create the table structure.

I may mess around with it at length.

nba_covers

No Comments

AL/NL Cy Young

Again, last post on this.  I’ll attach necessary files at bottom.

Hit both pitchers last year.  Kershaw was sharp, he won us money (be weary of “max payout” disclaimers on Cy Young futures emailed to you after the fact by some unspecified sportsbooks, though no worries we got paid in full).

AL and NL formulas:

 V_{AL} = 9.22(WAR) - 21.94(ERA) + 0.25(K) - 108.5  V_{NL} = 8.71(WINS) + 6.72(WAR) - 39.88(ERA) + 0.14(K) - 46.79

And a different formula is required for NL relief pitchers:

 V_{NLR} = 14.72(WAR) + 2.19(SV) - 108.24(WHIP) - 7.95

bWAR is baseball-reference WAR projected to the end of the season.

AL CY YOUNG

NAME TEAM bWAR PROB ODDS
Justin Verlander DET 8.70 47.30% 111
Chris Sale CHW 7.76 46.00% 117
David Price TBR 5.80 43.85% 128
Matt Harrison TEX 7.14 31.07% 222
Jered Weaver LAA 4.95 26.24% 281
Yu Darvis TEX 3.48 5.39% 1755
CC Sabathia NYY 3.62 0.14% 70129

NL CY YOUNG

NAME TEAM bWAR PROB ODDS
J. Cueto CIN 7.07 27.21% 267
C. Kimbrel ATL 3.1 25.96% 285
R. Dickey NYM 5.34 25.49% 292
S. Strasburg WSN 4.53 18.79% 432
W. Miley ARI 4.65 17.01% 488
G. Gonzalez WSN 4.01 17.00% 488
C. Hamels PHI 4.26 13.77% 626
L. Lynn STL 3.62 12.65% 690
M. Bumgarner SFG 2.76 12.56% 696
M. Cain SFG 3.79 10.33% 868
A. Chapman CIN 3.27 6.90% 1350
J. McDonald PIT 3.83 6.03% 1559
J. Zimmermann WSN 6.97 4.51% 2119
K. Jansen LAD 2.9 1.79% 5493

CY_YOUNG

No Comments

NL/AL MVP

Sportsbooks haven’t convened MVP odds yet because I haven’t posted them myself.  This is an obvious observation to anybody that visits this blog on a yearly basis.  I think we’d all agree on this.  (I use the terms “we’d all” and “nobody in particular” interchangeably).

The formula behind setting a probability on a given player’s chances can be expressed as:

 P(v_i) = \begin{Bmatrix}  \displaystyle \frac{2v_i}{\sum _j^n v_j}, & \mbox {if } v_i>0 \\  0, & \mbox {if } v_i \leq 0  \end{Bmatrix}

If a player doesn’t register a positive number of MVP points, the variable v, then he is simply ignored.  The points are calculated slightly differently in the NL and AL, and the years 2000-2010 were used to fit the data.  This has already been explained on multiple occasions.

For AL batters and pitchers:

 V_{ALb} = 57.28(PLAYOFFS) + 12.58(WAR) + 11.08(WPA) + 1.81(HR) +  1212.25(AVG) + 0.58(RBI) - 500  V_{ALp} = 50(PLAYOFFS) + 25(WAR) + 15(WPA) - 25(ERA) - 100

The “PLAYOFFS” variable is either 1 or 0, and in season playoff projections are essentially current standings.

For all NL batters and pitchers:

 V_{NLb} = 78.41(playoffs) + 8.98(WAR) + 10.97(WPA) + 975.41(AVG) + 3.79(HR) + 0.93(RP) + 1.07(SB) - 553  V_{NLp} = 30(playoffs) + 10(WPA) + 25(WAR) - 150

The motivation for using WAR and WPA as primary coefficients stemmed from this post, which I found quite interesting.

At the bottom of the post I’ve attached some relevant excel files.  I’m not going to post anymore about this (I’ll do Cy Young this weekend and attach the necessary files), there really shouldn’t be any reason for me to have to.  I also never want to have to use or look at an excel file ever again.  But if I get enough requests via twitter/email/comments I’ll make a dedicated page that updates daily, probably using my own WAR calculations instead of bRef’s mess of drivel, and some server-side scripting.

Last year the formula picked Ryan Braun and Miguel Cabrera.  Verlander I think can we all agree should not have won the MVP.

NL MVP

NAME Team bWAR WPA PROB ODDS
Andrew McCutchen PIT 5.1 3.2 52.15% -108
Ryan Braun MIL 3.9 3 33.73% 196
Joey Votto CIN 4.5 5.2 33.42% 199
Melky Cabrera SFG 3.8 2.7 18.64% 436
Johnny Cueto CIN 4 2 15.18% 559
Carlos Gonzalez COL 1.6 1.8 8.73% 1045
Carlos Beltran STL 2.3 1.8 6.49% 1441
Matt Holliday STL 3.6 2.8 6.32% 1482
Buster Posey SFG 2.8 1.5 5.44% 1738
Ian Desmond WSN 2.3 3.5 5.41% 1748
Pedro Alvarez PIT 2 1.1 5.29% 1790
Jay Bruce CIN 1.1 0.2 4.13% 2321
Giancarlo Stanton MIA 3 2.5 2.34% 4174
Ryan Vogelsong SFG 2.8 2.1 2.18% 4487
Brandon Phillips CIN 2.2 0.8 0.55% 18082

AL MVP

NAME TEAM bWAR WPA PROB ODDS
Mike Trout TBR 5.3 0.5 34.66% 188
Robinson Cano NYY 5 1.6 31.16% 221
Josh Hamilton TEX 3.2 1.2 22.88% 337
Adrian Beltre TEX 3 1.6 22.13% 352
Mark Trumbo TBR 3.2 0 18.48% 441
Josh Reddick NYY 3.8 4.2 14.96% 568
Alex Rios TEX 2.6 1.7 14.61% 584
Miguel Cabrera DET 3.5 2.2 14.42% 593
David Ortiz BOS 2.7 2.5 6.46% 1448
Matt Harrison TEX 4.1 2.5 5.95% 1581
Fernando Rodney TBR 1.9 2.8 5.87% 1604
Justin Verlander DET 5 3.1 3.51% 2749
Chris Sale CHW 4.7 3 3.18% 3045
Edwin Encarnacion TOR 3 2.5 1.72% 5716

Here are the files. The “NLMVP_ODDS” and “ALMVP_ODDS” files require a data refresh and some sorting.  Feel free to change the coefficients, I don’t care.  Some files may be irrelevant, not sure.  I just threw a bunch of seemingly related files in an archive.

MLB_MVP_FILES.tar

No Comments

Iterative Simultaneous Kelly Solution

I like to use LaTeX, so this is mostly an excuse for me to practice my LaTeX skills. Whether one is familiar or unfamiliar with the Kelly Criterion, the concepts outlined below help to give insight to how one can make possible improvements when certain conditions are met. Generally, improvements from a sequential to simultaneous adjustment of the Kelly stakes are slight. However, as will be shown, slightness can turn considerable when extrapolated over longer time frames.  Feel free to just jump here.

Why a Log-Utility

The Kelly Criterion generalizes a model that maximizes the growth of one’s bankroll using a logarithmic utility. Why does this work? Well, its due to the inherent properties of logarithms. For instance, log2 (8) tells you how many times to divide eight by two to get one. Keep dividing by two in perpetuity and the number never reaches zero. Thus, using the Kelly Criterion, one has infinite bankroll, assuming there is no minimum placed on wager sizes. With infinite bankroll, there is no maximum placed on how big the bankroll can grow to.

Using a log-utility, bankroll is now the sum of the probabilities of the log-returns for any wager. Concretely, the return is simply the payout times the percentage of bankroll wagered. For one wager, with p as the win probability, w as the payout, and x as the wager size, the rate of growth is:

 G(x) = p \log (1+wx) + (1-p) \log (1-x)

The payout, w, on a win is the decimal odds – 1. Since a losing bet results in w = -1, the return on a losing bet is 1-x. And when x = 1, the logarithm is undefined, and ruin is guaranteed. (Financial conditions when x > 1 have yet to be of concern.)

The optimal wager size, or stakes, for any given bet is now the partial derivative of the function G with respect to x.

 \frac { \partial G}{ \partial x} = \frac { pw }{1 + wx} + \frac { 1-p } { 1-x }

Conveniently, the function G is a concave function, and at some point on the interval  0 \le x \le 1, function G converges to a maximum, the tangent slope of the curve equals zero. Setting the partial derivative to zero gives:

 x = \frac { pw - 1 + p }{ w } = \frac { dp - 1 } { w }

(d = decimal odds)

This can also be expressed in terms of probabilities, since the win probability divided by the implied probability, 1/d, is roughly equal to the return on a winning wager:

 1 + wx \approx \frac { p }{ p_{imp}}

And the optimal stakes can also be found with the formula:

 x = \frac { p - p_{imp} } { 1 - p_{imp} }


Simultaneity

For sequential stakes, the previous holds true. However, making wagers simultaneously results in a set of stakes that does not maximize log-utility. Since the bankroll isn’t re-adjusted every time a wager wins or loses, an adjustment has to be made. For example, with two independent simultaneous events (wagers placed at the same time), the function G equals:

 G(x _{1}, x_{2}) = p _{1}p _{2}log(1+w _{1}x _{1} + w _{2}x _{2}) + q _{1}p _{2}log(1 - x _{1} + w _{2}x _{2}) + p _{1}q _{2}log(1+w _{1}x _{1} - x _{2}) + q _{1}q _{2}log(1 - x _{1} - x _{2})  q _{1}, q _{2} = 1 - p _{1}, 1 - p _{2}

The size of possible joint outcomes is mn, m being the single game outcomes (typically binary), and n the number of simultaneous events. In this case, there are four possible joint outcomes. When n becomes larger than two, the space of joint outcomes becomes too large to write down completely, so in more compact notation, the function G can be reduced to:

 G(x _{1}, x _{2}, \dots, x _{n}) = \sum _{i}^{m^n} \textbf{P} _{i} log(1+\sum _{j}^n w _{ij} x_{j})

Where P is an mn x 1 vector, and  \prod _{j}^n p _{j} = P _{i}

In Vectorized form:

 G = \textbf{P}^\intercal \log(1 + \textbf{WX})

W is an mn x n matrix,  w \in \textbf{W} _{i}, x \in \textbf{X}

Iterative Solution

As mentioned before, there are implied constraints placed on x. The sum of all stakes can not exceed one, and negative stakes are yet unknown to commerce. With that in mind, if one is fortunate enough to have a rather size-able edge across multiple simultaneous bets, the stakes can be re-calculated with elementary math:

 x _{i} := \frac {x _{i}}{\sum _{i}^{n} x_{i}}

Whether this is a necessary step or not will be obvious given a set of events. Regardless, the assumption here is the function G is not maximized using single game stakes, because after a result is final, the initial bankroll changes. Any number r <> ar if a <> 1. (When the stakes are zero, the number of simultaneous events simply becomes n – 1, and does not effect the current bankroll).

Through a derivative-free, iterative solution, the goal is to move up a curve until the point reaches an apex, so arbitrary steps are suitable enough to reach the desired goal. The result will be the optimal kelly fraction across all stakes, which has the benefit of maintaining proportionality. Conveniently, the time-complexity of such an algorithm has a lower bound of 1 and an upper bound of N, number of iterations, which should terminate, at the most, when the kelly fraction = 0.

Assuming the matrices P, W, and X have been populated (pseudocode):

alpha = 1
N = 100
G_temp = -1
for i = 1 to N
    X := X * alpha  ##alpha = fraction
    G = P_transpose * log(1+W*X)
    if G < G_temp
        break
    else
    G_temp = G
    alpha = alpha - .01  ##Sufficient step size, alpha terminates at zero
    end
end

This should guarantee a solution. Two things to mention, the statement “Assuming the matrices P, W, and X have been populated” is of major importance, and should not be reduced to assumptions. However one chooses to populate an array, matrix, etc…, this adds considerably to the time-complexity of the algorithm. The size is enormous, and assigns  \Theta (c*m^n) the new lower bound.

I’ve created an example with random probabilities and payouts, to visualize the algorithm.

 \begin{tabular}{|l|l|l|}  \hline  \textbf{implied p}&\textbf{p}&\textbf{stake}\\\hline  0.5776&0.6250&0.1122\\  0.5560&0.5988&0.0965\\  0.5994&0.6427&0.1080\\  0.5273&0.5682&0.0866\\  0.5235&0.5719&0.1015\\  0.4070&0.4375&0.0514\\  0.4964&0.5431&0.0927\\  0.4738&0.5323&0.1112\\  0.4175&0.4691&0.0886\\  0.4150&0.4676&0.0899\\\hline  \end{tabular}

 

In Octave, the entire process took 0.03 seconds. That was worth a .07% increase in bankroll.  The more events there are the more one will benefit.

As mentioned before, the improvements may be slight. But setting G as the rate of growth in the equation, Y = ert, can translate into drastic increases in bankroll. A brief example of the time it takes to triple bankroll given r:

 3 = e^{rt}

 r = .030, \ t = \frac {\ln(3)}{.030} \approx 36.62

 r = .025, \ t = \frac {\ln(3)}{.025} \approx 43.94

A 0.5% increase in bankroll in this case means if one makes a series of wagers simultaneously, each meeting similar conditions, approximately seven such wagers are unnecessary to reach three times initial bankroll. If n = 10, that’s 70 different bets. Performed weekly and that’s over two months worth of extra money ever year. Over ten years and a bettor can secure over 18 months of additional bankroll had one considered simultaneous outcomes.

Direct Calculation

The iterative solution has the advantage of maintaining proportionality. This can be restated as using some scalar, s, to optimize stakes, X:

 s \begin{pmatrix}x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} \Rightarrow \textbf{max}(G)

Now the function, G, becomes:

 G(s) = \textbf{P}^\intercal \log (1 + \textbf{W} \times s \textbf{X})

And the derivative of G:

 \frac { dG}{ ds} = \textbf{P}^\intercal (\textbf{WX} \div (1+\textbf{W} \times s \textbf{X}))

Here, division is element-by-element division. In most programming languages, this is syntactically equivalent to “(W*X)./(1+W*X).”

I’ve reduced G to a one variable equation, so there should be a direct solution without having to iterate by setting the derivative of G to zero and solving for s, the kelly fraction.  (N = mn )

 s = \frac { \sum _i^{N} P_i \sum _j^{n} w_{ij} x_j}{ \sum _i^{N} P_i ( \sum _j^{n} w_{ij} x_j)^2 }   = \frac { \textbf{P}^\intercal (WX)}{ \textbf{P}^\intercal (WX \circ WX) }

(Hadamard Product)

Mutual Exclusivity

Much work has been done on simultaneous events whose outcomes are dependent on each other (i.e. horse racing, futures). A general algorithm for kelly stakes across mutual exclusive events is outlined at wikipedia.

Further Reading
SBRForum Kelly Calculator
Algorithms for optimal allocation of bets on many simultaneous events
Optimal Betting Strategies for Simultaneous Games

No Comments

Pitch F/X with PHP and Octave

Generally, sabermetric nerds use R (or Excel lmfao) to appropriate their desire to waste time. I, however, prefer Octave, an open-source alternative to MATLAB. Its more intellectually stimulating than R, in my opinion. By that I mean, there isn’t a built in mechanism to download thousands of user submitted R functions and scripts (Believe it not, there is even a pitch fx library to use with R). There are degrees of laziness, and while I fall in the upper percentile, people who use R really would rather remain rigid regarding routes to regale their ridiculous yet reasonably remunerative run value research.

Actually, the primary reason I’m using Octave is its the program of choice for ML-Class. Its also more syntactically convenient.

This isn’t important. What is important, however, is something that is very easy to implement. The motivation behind using PHP with Octave is the lack of MYSQL compatability with Octave. I was having trouble getting it to work, as have others, based on a google search. Fortunately, one can call Octave, or open up a shell to essentially run anything, from PHP, with:

system('path/to/bin path/to/file');

Visit php.net for more info.

For Octave users on Linux (Windows users may have to specify the full path to Octave):

system('octave -q path/to/file');

Octave script files use the extension “.m”, and in PHP we can easily write to a file and save as “filename.m”:

$octave_file = fopen('octave_file.m','w');
	fwrite($octave_file,"This is octave code to do stuff really cool");
	fclose($octave_file);

This is elementary, though vital, PHP code.

I’m going to assume the reader has a Pitch F/X database already. If not, grab my script at github here (its setup to grab from yesterday, comment that section and change the start and end dates), or download here (pbp2.zip).

Octave is going to be used for all the math, so the Octave script will include code to open a data file, a file that is initialized in the PHP code:

$file = 'pfx_data.txt';
unlink($file);
$f = fopen($file,'a');

It is of major importance that the ‘unlink’ function is called, and the ‘a’ property be assigned to the file. The data is going to be appended to the file each time the PHP code loops through the query results. And when the PHP code is run the file itself is going to be deleted before written to.

After connecting to a database, here is a rather dull example query:

$query = "SELECT p.px, p.pz, SUM(IF(p.event IN ('Single', 'Double', 'Triple', 'Home Run'),1,0))/SUM(IF(p.event NOT LIKE '%Sac %',1,0)) AS BA FROM pitches as p JOIN (SELECT id FROM batters WHERE name_display_first_last LIKE '%$player_name%' LIMIT 1) AS t ON t.id = p.ab_id WHERE LEFT(p.gameName,4) BETWEEN ".$yr_start." AND ".$yr_end." AND p.type = 'X' AND p.px IS NOT NULL GROUP BY p.px, p.pz";

This is extracting a specified player’s contact Batting Average. (Different from BABIP, which excludes Home Runs, hence the phrase “Balls in Play.” Any sort of “Sacrifice” will be undefined, and ignored below.)

Subsequently on each loop, the query results are appended to the file “pfx_data.txt”.

$result = mysql_query($query);
while($row = mysql_fetch_assoc($result)){
	if ($row['BA']=='')
              continue
        else
              fwrite($f,$row['px'].','.$row['pz'].','.$row['BA']."\n");
mysql_free_result($result);
mysql_close($connection);
fclose($f);

I’ll provide a file so one can decipher all the Octave code themselves. For now, here’s an example plot in Octave (default is the gnuplot graphing library):

data=load('".$file."'); 
X = data(:, [1, 2]); 
y = data(:, 3); 
hits=find(y==1); 
outs=find(y==0); 
plot(X(hits,1),X(hits,2),'rx','MarkerSize',7)
hold on
plot(X(outs,1),X(outs,2),'bo')
title('".$player_name." pitch position on cBA (".$yr_start." - ".$yr_end.")');
set(gca,'xlim',[-2,2]); set(gca,'ylim',[0,5]);
legend('hits','outs'); 
xlabel('Horizontal position (ft)'); 
ylabel('Vertical position (ft)'); 
text(-1.8,4.8,['N = ',num2str(length(data))]); 
text(-1.8,4.5,['hits = ',num2str(length(hits)),', outs = ',num2str(length(outs))],'fontsize',6)

Obviously, some of the variables are PHP assignments, and wouldn’t be used in Octave. In PHP, just condense what would have been written in a script or in the Octave terminal into the call to ‘fwrite’, over-writing any text that was there before:

$file1 = 'pfx_octave.m';
$f1 = fopen($file1,'w');
fwrite($f1,"data=load('".$file."');\nX=data(:,1:2); y=data(:,3);\nplot(X(hits,1),X(hits,2),'rx','MarkerSize',7)\nhold on;\nplot(X(outs,1),X(outs,2),'bo')\ntitle('".$player_name." pitch position on cBA (".$yr_start." - ".$yr_end.")'); set(gca,'xlim',[-2,2]); set(gca,'ylim',[0,5]);\nlegend('hits','outs'); xlabel('Horizontal position (ft)'); ylabel('Vertical position (ft)'); text(-1.8,4.8,['N = ',num2str(length(data))]); text(-1.8,4.5,['hits = ',num2str(length(hits)),', outs = ',num2str(length(outs))],'fontsize',6)

(The ‘\n’ character is just to make ‘pfx_octave.m’ look pretty. One can still use the file after closing PHP.)

The result is something like this:

I threw together a virtual strike zone and some hot and cold zones (TY to Mike Fast for his regularized strike zones, the file at the bottom provides code for LHB and RHB strike zone deviations):

hold on
"plot([-1,1],[1.5,1.5],'k--')
plot([-1,1],[3.5,3.5],'k--')
plot([-1,-1],[1.5,3.5],'k--')
plot([1,1],[1.5,3.5],'k--')
plot([-1,1],[1.5+2/3,1.5+2/3],'k--')
plot([-1,1],[1.5+4/3,1.5+4/3],'k--')
plot([-1+2/3,-1+2/3],[1.5,3.5],'k--')
plot([-1+4/3,-1+4/3],[1.5,3.5],'k--')
hold off

To calculate and plot hot/cold zones:

z=x=x1=x2[];
for i = 1:3
x=find(X(:,1)>=-1 & X(:,1)<=-1+2/3 & X(:,2)>=1.5+(i-1)*2/3 & X(:,2)<=1.5+i*2/3);
x1=find(X(:,1)>=-1+2/3 & X(:,1)<=-1+4/3 & X(:,2)>=1.5+(i-1)*2/3 & X(:,2)<=1.5+i*2/3)
x2=find(X(:,1)>=-1+4/3 & X(:,1)<=1 & X(:,2)>=1.5+(i-1)*2/3 & X(:,2)<=1.5+i*2/3);  
z=[mean(y(x,1)) mean(y(x1,1)) mean(y(x2,1)) z];  
end
hold on
text(-.8,3.5-1/3,num2str(z(1),3));
text(-1+2/3+.2,3.5-1/3,num2str(z(2),3));
text(1-2/3+.2,3.5-1/3,num2str(z(3),3));
text(-.8,2.5,num2str(z(4),3));
text(-1+2/3+.2,2.5,num2str(z(5),3));
text(1-2/3+.2,2.5,num2str(z(6),3));
text(-.8,1.5+1/3,num2str(z(7),3));
text(-1+2/3+.2,1.5+1/3,num2str(z(8),3));
text(1-2/3+.2,1.5+1/3,num2str(z(9),3));
hold off

Here is a multi-plot figure with a surface plot of Jeter’s data:

Rick Reed’s strike zone from 2008-2011:

Just an example of some of the things Octave can do. Copy the code directly into the PHP file and insert into a variable or directly into ‘fwrite’ and the Octave script can be opened directly from PHP.

All of this should work fine with MatLab, or even FreeMat (another open-source alternative).

pfx_octave_example

No Comments