Archive for category NCAAF

Posted NCAAF Home Field Advantage

Extracted from the 2002-2009 Seasons.

HFA is the difference between a team’s average line and its average home line, weighted by the number of games.
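A minimal Python sketch of that calculation, under two assumptions of mine: each team contributes one number (average line minus average home line), and the league figure weights each team by its home games. The function names and sample numbers are hypothetical, not taken from the database.

```python
def team_hfa(avg_line, avg_home_line):
    """Home field advantage for one team: overall average line minus home average line."""
    return avg_line - avg_home_line


def weighted_hfa(teams):
    """Games-weighted mean HFA across (avg_line, avg_home_line, home_games) tuples."""
    total_games = sum(games for _, _, games in teams)
    if total_games == 0:
        raise ValueError("no games to weight by")
    weighted = sum(team_hfa(line, home) * games for line, home, games in teams)
    return weighted / total_games


# Hypothetical teams: (average line, average home line, home games)
sample = [(-7.0, -10.0, 6), (3.0, 1.0, 5)]
print(round(weighted_hfa(sample), 2))
```

A positive number here means the team is favored by more (or is a smaller underdog) at home than overall, given that favorites carry negative lines.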

The numbers are deceiving since the HFA is contingent on the home/road schedule.  120 teams in NCAA FBS breeds an unbalanced schedule.  For example, LA Monroe plays SEC teams out of conference on the road and then welcomes their Sun Belt rivals at home during conference season.

The asymmetry is undeniable, which leaves the calculated HFA, as it stands, a not very manageable statistic.  What I should do is constrain the data to conference games, to get a better indicator of how Vegas sees a team’s home advantage versus similar competition.

The page is here.


College Football Line Regression

I mentioned earlier how the difference in team philosophies skews the aggregation of line data, and to account for this I created a z-score for each team with regard to rushing attempts and passing attempts.  I’ll discuss that approach later.  With the remaining 892 teams in the sample of eight years of football, I continued the regression with one singular framework.

I decided to isolate the different statistical variables I have collected into three distinct groups.  Isolate in essence anyway.  The factors that distinguish one variable from another are based on quality and value content.  The isolation was somewhat arbitrary.  It’s arbitrary in the sense that the delimiters between one group and another are preferential, though arbitrary in this case certainly does not imply without reason.

I consider a variable to be any statistic that, after being exposed to the conditions for which it functions the most, persists long enough to serve as a unit of regression.  Uniform color is not a variable.  Though I should admit that turnovers and penalties, based purely on my judgment, I consider to be accumulations of a wide array of intangibles and randomness.  Also they were omitted for convenience.  Additionally I did not include special teams: when I first built the database I forgot about special teams, and upon attempting to include them it was clear there was little correlation.  Not enough for me to go through the trouble of modifying my database to fit in special teams.  Though in a way, through an indirect stream of logic, yards per point is shaped by a team’s performance on special teams.  The reasons are fairly obvious and I don’t feel it necessary to expand on that point here.

Moving on, variables that are ‘indicative’, and in this instance the only such variable is points per game, conspicuously determine the fundamental concept of the line itself.  Obviously the only factor that determines an ATS win or loss is the final score.  And the point differential by which any team beats its opponents reflects the particular appropriation of the respective team’s line.  This is simple enough, but I felt it essential to place the variable ‘ppg’ in a state to which the other variables coalesce.

Now it would be too easy just to regress average line on ‘ppg’, and doing such results in a sound enough measure of an interval with which one might expect a team’s line to fall.  But a single variable regression can ultimately be refined by its co-adaptive performance enhancers.  These are basically the other statistics that can lead to the realization of a team’s ‘ppg’ differential.

I call these variables ‘imperative’.  ‘Imperative’ statistics provide a framework for the nature of a team’s viability.  ‘Indicative’ statistics are susceptible to various degrees of randomness and luck.  ‘Imperative’ variables are a marker of team performance that strongly correlate to wins and losses, but without the luck factor.  When one thinks of efficiency measures of performance, these can be considered ‘imperative’.  A more glaring facsimile would be WHIP and FIP in baseball, which are measures of pitcher acuity devoid of luck and randomness.  For example, Pitcher A has a sequence of batters with the following results:

Single, Strikeout, Homerun, Groundout,  Strikeout

Pitcher B produces similar results with a slight variation in order:

Single, Double Play, Homerun, Strikeout

These two scenarios result in the exact same WHIP, and Pitcher A even has a higher strikeout ratio over his sequence of batters, but the one extra strikeout actually penalized Pitcher A compared to B, and therefore the ‘indicative’ performance was not an accurate reflection of the ‘imperative’ performance, when the two pitchers are placed side by side.
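The two sequences can be tallied with a short Python sketch. The base-running rules are my own simplification (a single puts one runner on, a home run scores everyone on base plus the batter, a double play erases one runner), and the function name is hypothetical.

```python
def summarize(events):
    """Return (whip, strikeout_rate, runs) for a list of plate-appearance results."""
    hits = outs = strikeouts = runs = runners = 0
    for event in events:
        if event in ("single", "homerun"):
            hits += 1
        if event == "single":
            runners += 1
        elif event == "homerun":
            runs += runners + 1  # everyone on base scores, plus the batter
            runners = 0
        elif event == "strikeout":
            outs += 1
            strikeouts += 1
        elif event == "groundout":
            outs += 1
        elif event == "double play":
            outs += 2
            runners = max(runners - 1, 0)  # one runner is erased
    innings = outs / 3
    whip = hits / innings  # no walks occur in these sequences
    return whip, strikeouts / len(events), runs


pitcher_a = ["single", "strikeout", "homerun", "groundout", "strikeout"]
pitcher_b = ["single", "double play", "homerun", "strikeout"]
print(summarize(pitcher_a))  # WHIP 2.0, strikeout rate 0.4, 2 runs
print(summarize(pitcher_b))  # WHIP 2.0, strikeout rate 0.25, 1 run
```

Same WHIP, more strikeouts for Pitcher A, yet one more run allowed: the ‘indicative’ result diverges from the ‘imperative’ one purely on sequencing.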

Aside from this severe digression, the ‘imperative’ variables have high quality information content, and can be used to assess team viability.  And I assigned the following to the label ‘imperative’:

Yards, Rushing Yards per Attempt, Rushing yards, Completion Percentage, Passing Yards per Attempt,  Passing yards, Yards per play, Yards per point, Plays.

This seems reasonable enough.  So I’ve given authority to a number of variables.  And now take into consideration that each of the aforementioned statistics has three different levels: Offense, Defense, and Differential.  This creates a very complex and sophisticated multifaceted regression schematic.

The other statistics I call ‘sterile’.  While ‘sterile’ statistics can give some indication of a team’s systematic gameplan, or philosophy, the numbers themselves have zero privilege over the nature of team viability.  These include passing attempts, passing completions, and rushing attempts, amongst some others.  (Even penalties I consider ‘sterile’.  However to the contrary, as has been said before, a high number of penalties could lead one to think that a particular team is very aggressive, i.e. Southern Cal.)  Now I guess the argument can be made that a team that has a high number of rushing attempts per game controls the clock, does not turn the ball over, and can shorten the game, but this argument is combated by the powerful mechanism known as correlation.  I posted the correlation matrix in one of my earlier college football dirges, and the concepts were explained to capacity previously.

Three levels of each of the twelve variables, which makes for a combinatorics spectrum that exceeds the scope of human comprehension.  To decode every possible permutation by hand is virtually unattainable.  That is where the serrying of different groups has its advantage.  Even with the distinction into desirable groups, the number of permutations of possible elements is extremely large, and it would still be exhausting to decode each scenario.  Therefore, with a little luck, I had to find the ‘zone of attraction’, or where certain ‘imperative’ values betray a higher sense of determination.
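To put a rough number on that, here is a small Python sketch. It assumes, as my reading of the paragraph, that twelve statistics at three levels give 36 candidate regressors and that any subset could enter a model. The function names are hypothetical.

```python
from math import comb

CANDIDATES = 12 * 3  # twelve statistics, each as offense, defense, and differential


def models_with_k_terms(k, n=CANDIDATES):
    """Number of distinct regressions that use exactly k of the n candidates."""
    return comb(n, k)


def all_models(n=CANDIDATES):
    """Number of non-empty subsets of the n candidates."""
    return 2 ** n - 1


print(models_with_k_terms(9))  # models with nine regressors
print(all_models())            # every possible non-empty combination
```

Counting subsets rather than ordered permutations is the relevant measure here, since the order of regressors does not change a fit.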

I won’t expound further on the process, and let the wonders of Stata convey the results through its ingenious data processing system.

      Source |       SS       df       MS              Number of obs =     892
-------------+------------------------------           F(  9,   882) =  600.03
       Model |  64934.9358     9  7214.99287           Prob > F      =  0.0000
    Residual |  10605.4221   882  12.0242881           R-squared     =  0.8596
-------------+------------------------------           Adj R-squared =  0.8582
       Total |  75540.3579   891  84.7815465           Root MSE      =  3.4676

------------------------------------------------------------------------------
     avgline |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         ppg |  -.2327069   .0268559    -8.67   0.000    -.2854158    -.179998
        ydsd |    .035118    .005302     6.62   0.000     .0247119    .0455241
        yppo |    -4.1972   .2556919   -16.42   0.000    -4.699036   -3.695365
       rypad |   1.016908   .2712062     3.75   0.000     .4846229    1.549193
       pypad |   .5005141   .2467422     2.03   0.043     .0162437    .9847845
       yppto |   .4438111   .0748342     5.93   0.000     .2969371     .590685
       ypptd |  -.4937034   .0727098    -6.79   0.000    -.6364077    -.350999
        pctd |   6.502121   3.446404     1.89   0.060    -.2619893    13.26623
      playso |  -.1727749   .0300915    -5.74   0.000    -.2318342   -.1137157
       _cons |   11.73068   3.509905     3.34   0.001     4.841943    18.61943
------------------------------------------------------------------------------

A brief survey of the table shows the results are very encouraging. The particular combination of variables, as I said, is similar to the ‘zone of attraction’ method. Some ‘imperative’ variables offered a greater sense of co-adaptation and mutual assistance with the ‘indicative’ points per game. The brilliance of the Stata program lies in the quickness and efficiency with which different regressions can be run. For those fortunate to have Stata on their computer, I provided the file below so you can manipulate the data however you please.

Now how do we decipher these coefficients? Let’s look at points per game. It seems logical that a decrease in points per game differential would decrease the line (or in fact increase the line since we are dealing with a scale of negative for a favorite to positive for an underdog). The average line, the dependent variable, will increase or decrease by the value of the coefficient for every one-unit increase or decrease in each variable. And the model demonstrates by the coefficients that the results are governed by reason as well as the optimal combination of independent variables that creates a number that closely resembles the dependent variable.

What I wanted to accomplish was to manifest a new line whose descriptive statistics embody a strong resemblance to the descriptive statistics of the average line. I think the resulting product is highly encouraging. Not only shaped up by the regression results, but by the value of the content as well. Here are the descriptive statistics comparing the average line and the regressed line.

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     newline |       892    .4234546    8.536902  -24.99476   28.78546
     avgline |       892    .4234641     9.20769     -24.69      29.29

The mean, standard deviation, and range are almost identical, with only the slightest discrepancies. Here is the absolute average difference between the actual line and the new line.

	 Mean estimation                     Number of obs    =     892

--------------------------------------------------------------
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
    line_mad |   2.746412    .069846      2.609331    2.883494

Using the coefficients above, this is the equation to formulate an accurate estimation of a team’s average line:

Average Line = (-.2327069ppg) + (.035118ydsd) + (-4.1972yppo) + (1.016908rypad) + (.5005141pypad) + (.4438111yppto) + (-.4937034ypptd) + (6.502121pctd) + (-.1727749playso) + 11.73068
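For anyone without Stata, the fitted equation can be applied with a few lines of Python. This is only a sketch: the coefficients are copied from the 892-team regression table above, while the function name and the dict-based input are my own choices.

```python
# Coefficients copied from the Stata output for the 892-team regression.
COEFFICIENTS = {
    "ppg": -0.2327069,
    "ydsd": 0.035118,
    "yppo": -4.1972,
    "rypad": 1.016908,
    "pypad": 0.5005141,
    "yppto": 0.4438111,
    "ypptd": -0.4937034,
    "pctd": 6.502121,
    "playso": -0.1727749,
}
INTERCEPT = 11.73068


def regressed_line(stats):
    """Apply the fitted equation to a dict of team statistics keyed by variable name."""
    missing = set(COEFFICIENTS) - set(stats)
    if missing:
        raise KeyError(f"missing statistics: {sorted(missing)}")
    return INTERCEPT + sum(coef * stats[name] for name, coef in COEFFICIENTS.items())
```

Passing a dict with every statistic set to zero returns the constant term, which is a quick way to check that the coefficients were typed in correctly.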

The remaining 57 teams have been left untouched by the previous regression. How did I isolate these 57 teams? I found the z-score of each team’s offensive passing attempts and rushing attempts and then separated those with a z-score greater than two on either measure.  Two being again rather arbitrary, though it should be said passing attempts resemble a normal distribution as the sample size increases, allowing the z-score of two to be an accessible marker. Rushing attempts are more asymmetric, though perhaps as the sample approaches infinity the central limit theorem applies.  Over time the league changes as a whole and the replication of ideas oscillates from one extreme to another, therefore an asymptotic system is probably the average.  Regardless, I used two as the line of demarcation between a “typical” and an “atypical” gameplan. To find the z-score, divide the difference between the value x_i and the mean of the population by the standard deviation of the population.
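The z-score split can be sketched in Python as follows. The function names are hypothetical, and I am assuming the population (not sample) standard deviation, as the paragraph describes.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Population z-score for each value: (x_i - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    return [(x - mu) / sigma for x in values]


def is_atypical(z_rush, z_pass, cutoff=2.0):
    """Flag a team-season whose rushing or passing attempts z-score exceeds the cutoff."""
    return z_rush > cutoff or z_pass > cutoff
```

A team-season with a rushing z-score of 3.95 and a passing z-score of -3.06 is flagged on the rushing side alone; the strongly negative passing score does not need to count for it to land in the atypical group.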

Just as a frame of reference, here are the z-scores for Air Force for each season from 2002-2009 (‘ra’ – rushing attempts, ‘att’ – passing attempts):

year zra zatt
2002 3.95 -3.06
2003 4.34 -2.68
2004 2.7 -2.17
2005 3.17 -2.3
2006 3.44 -2.1
2007 3.27 -2.35
2008 3.75 -2.89
2009 2.87 -2.48

Then, using similar methods from above, finding a ‘zone of attraction’ which relies upon logic, luck,  and the content of the variable, related to the aggregation of teams, this is the equation for predominantly running offenses (Air Force, Navy, etc…):

      Source |       SS       df       MS              Number of obs =      34
-------------+------------------------------           F(  4,    29) =   38.41
       Model |  1406.92746     4  351.731865           Prob > F      =  0.0000
    Residual |  265.570477    29  9.15760265           R-squared     =  0.8412
-------------+------------------------------           Adj R-squared =  0.8193
       Total |  1672.49794    33  50.6817557           Root MSE      =  3.0262

------------------------------------------------------------------------------
     avgline |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         ppg |  -.5942952    .117771    -5.05   0.000     -.835164   -.3534264
         ryd |   .0578676   .0316136     1.83   0.077    -.0067894    .1225246
        rypa |   2.399145   1.128734     2.13   0.042     .0906247    4.707666
        ydso |  -.0430914   .0162273    -2.66   0.013    -.0762799   -.0099029
       _cons |   7.322725   5.051622     1.45   0.158    -3.009002    17.65445
------------------------------------------------------------------------------

It’s not necessary for me to point out the massive flaws in logic here. The system appears to break down with teams inclined to rush the football more often than not. It’s counter-intuitive to think a high rushing yards per attempt differential can have an inverse effect on the line, certainly with a running team.

Fortunately, the model for teams with a high z-score in passing attempts is more constrained to a thoughtful and sensible train of thought.

      Source |       SS       df       MS              Number of obs =      23
-------------+------------------------------           F(  6,    17) =   54.49
       Model |  1612.32956     6  268.721594           Prob > F      =  0.0000
    Residual |  83.8426096    17  4.93191821           R-squared     =  0.9506
-------------+------------------------------           Adj R-squared =  0.9331
       Total |  1696.17217    23  73.7466163           Root MSE      =  2.2208

------------------------------------------------------------------------------
     avgline |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        ppgo |  -.3231369   .2326192    -1.39   0.183    -.8139204    .1676467
        yppd |   7.833663    1.62701     4.81   0.000     4.400972    11.26635
        yppo |  -1.753175   1.266587    -1.38   0.184    -4.425439     .919089
        ydsd |  -.0242419   .0198519    -1.22   0.239    -.0661258    .0176419
         pyd |  -.0657155   .0261637    -2.51   0.022    -.1209162   -.0105149
        yppt |   1.137832   .4093928     2.78   0.013     .2740884    2.001575
------------------------------------------------------------------------------
Average Line = (-.3231369ppgo) + (7.833663yppd) + (-1.753175yppo) + (-.0242419ydsd) + (-.0657155pyd) + (1.137832yppt)

The previous two conditions, based on the z-score of passing and running schemes, are inherently tethered to the sample size, or lack thereof. Notwithstanding, when discriminating the data using a z-score, the results provide a more precise equation compared to the league as a whole. Those teams with a high z-score, using a re-configured regression tactic, produce a new line that is closer to the actual average line than the base framework applied to the 892 teams that have more or less a gameplan in line with the prevailing wisdom of the eight year sample.

One may think I operate in a spectrum where one takes solace in overly devoting one’s self to correlations.  Regardless, I should point out that I’m still standing.  I place wagers every day and I’m still here; it’s been a year and a half.  Most would have already quit ten times over by now.

At length I’ll try to put into practice this new equation by selecting a few games at random over the last eight years and create a line based on the line difference between two teams in a given matchup, similar to what I did here, with the BCS bowls.  Additionally I plan on regressing some mixture of pythagorean, line pythagorean, and linear line to win prediction, with some year n+1 wins.  Maybe do something similar with the NFL.

Stata file:

NCAA Football 2002-2009.dta


NCAAF Correlation Coefficients

Dynamics for the following provided here and here

When undergoing a rigorous statistical evaluation of a wide array of factors, there will obviously be variables that directly impact other variables.  Imagine a team executing an offensive scheme that only calls for a dozen rushing plays per game.  In proportion, the rest of the plays have to come from somewhere, probably via the forward pass variety.  Therefore rushing attempts and passing attempts have a fairly inverse relationship.
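A minimal Python sketch of that relationship, using the Pearson correlation coefficient. The attempt counts are hypothetical numbers chosen so that each team runs roughly 70 plays; they are not from the database.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 long")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-game attempts for five teams, each running about 70 plays.
rush_attempts = [55, 45, 38, 30, 22]
pass_attempts = [15, 26, 31, 41, 48]
print(round(pearson(rush_attempts, pass_attempts), 3))
```

When total plays are nearly fixed, the coefficient comes out close to -1, which is why the two attempt counts carry almost the same information.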

Now where these relationships break down is when variables are thrown into a spectrum of a multi-correlated underpinning.  Multiple variables when isolated can highly correlate to one particular variable, but when evaluated together they can lead to an incompatible aggregation.

I’ve discovered this ineffably unwelcome behavior occurs whenever an array of college football factors is forced to interact in the realization of how a line is appropriated.

Below are the highest average line correlation coefficients for each year from 2002-2009, mostly using the offense to defense differentials:

One thing that is very encouraging is that some of the coefficients above can be seen as measures of efficiency.  Yards per attempt, yards per play, yards per point, completion percentage, all have effectual connotations.  This is good.

Where and why might these inconsistencies of significance occur?  To simplify, not every team uses the same offensive or defensive philosophies.  Let’s use Texas Tech and Army.  Diametric opposites in terms of offensive style.  For Texas Tech, passing offense is far more crucial to team wins and team performance, whereas Army’s option style offense induces very little passing, much more rushing, and unfortunately far fewer wins.  This is just one example, but teams throughout the league all have their different styles, offensively and defensively, so different measures of performance carry more weight.

When mixing a set of independent variables into a system to calculate an alternative possibility for the value of one single variable (average line), the aggregation undergoes an intense struggle for explanatory superiority.  The explanatory factors are easily manipulated by the inclusion of one other variable which has a high correlation, and in this scenario there is no delicate balance between any combination of variables that would provide a nice equilibrium.  The one constant that preserves its level of consistency is PPG differential, which is very unsettling since it’s overly simple, when complexity is what whets the appetite.

Again with Texas Tech, passing statistics easily dominate rushing statistics on the scale of significance, yet the opposite is true for other teams.  And this isn’t just true for passing and rushing metrics.  Imagine a team with immaculate special teams play, a horrible offense, and a solid defense, or in other words, Virginia Tech.  Virginia Tech is often rated out of the top 100 in all offensive numbers.  However, their superb special teams and overall defense account for a solid showing year in and year out. (I should note, special teams statistics throughout the league as a whole are not highly correlated with average line or winning percentage.)  Though when running their statistics into a regression model, very large inconsistencies materialize, and to extrapolate an average line based on highly correlated statistics offers zero indication of what the linesmakers think of the Hokies, or what the line shows, or even how good the Hokies are.  Now we have reached a point where offense and defense are approaching mathematical singularity.  Therefore it would seem regressions, or other tests of determination of the average line variable, would have to reflect the actual philosophy and makeup of each individual team.  Or perhaps I should serry the teams into categories based on style of play, offensive and defensive schemes.  This would in turn breed smaller sample sizes, so we’ve disappointingly arrived at a paradoxical stipulation.

More at length.


NCAAF 2002-2009 BCS Bowls in Retrospect

This is by no means data snooping, I can assure you. 

Looking ahead to College Football provides a salient oasis amidst the unbearable grind and dilatory nature that is the marathon baseball season. And now with having a CFB database via statfox appropriated and organized, I can use the data as a respite for the few relentless summer months that are yet to be endured.

For now, I’m just curious to see how the matchups in the BCS bowl games over the last eight seasons materialized under the conditions provided by the database I have at hand.  I’m not falling into the trap of fitting statistics that may only have best suited the structure of randomness of any given time.  What I am merely doing is pacifying my appetite for curiosity, using all the statistics I have aggregated.

What I did was only use regular season lines (essentially the line leading up to the BCS bowl game), and created a line based on matchup and location, without the inclusion of any other variables (i.e., USC vs Illinois in the Rose Bowl is worth around 3.5 pts for USC’s HFA, Georgia is given 2.5 pts vs WVU in the Sugar Bowl).  By doing so I make a plethora of assumptions, however (injuries, for one).  This is for the sake of time, because I didn’t feel like recreating the exact array of tangibles and intangibles that surrounded the respective BCS game for the given seasons.
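One way to read "a line based on matchup and location" is the gap between the two teams' adjusted lines plus a location credit, rounded to the nearest half point. That reading is my assumption, not a confirmed description of the method, and the function and parameter names below are hypothetical.

```python
def predicted_line(adj_line_a, adj_line_b, hfa_for_b=0.0):
    """Line for team A against team B, rounded to the nearest half point.

    Lines are negative for favorites. hfa_for_b is the number of points
    credited to team B for playing near home (negative if A has the edge).
    """
    raw = adj_line_a - adj_line_b + hfa_for_b
    return round(raw * 2) / 2


# Adjusted lines for Cincinnati and Florida (2009 season), with an assumed
# 2.5-point location credit for Florida in the Sugar Bowl.
print(predicted_line(-13.85, -24.81, hfa_for_b=2.5))
```

Team B's line is simply the negative of the value returned for team A.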

The tables below are for each year from 2002-2009, showing the BCS Bowl matchups and the average line, adjusted line, predicted line, and the actual line for each team.  The last column is the result of the game.  Again the line is the average line for each team leading up to the game. (Cells in Green/Red indicate whether or not the predicted line is higher or lower than the actual relating to the result)


2009
BOWL    TEAM            AVG LINE  ADJ LINE  PRED LINE  GAME LINE  RESULT
ROSE    OHIO ST           -14.92    -15.71         -2        4.5      26
        OREGON             -7.92    -10.64          2       -4.5      17

FIESTA  BOISE ST          -22.25    -18.74          0          7      17
        TCU               -20.59    -18.43          0         -7      10

SUGAR   CINCINNATI        -13.83    -13.85       13.5         12      24
        FLORIDA           -24.17    -24.81      -13.5        -12      51

ORANGE  GEORGIA TECH       -4.96     -5.84          0         -6      14
        IOWA               -5.14     -7.37          0          6      24

        ALABAMA           -14.08    -16.88         10       -3.5      37
        TEXAS             -23.85    -25.53        -10        3.5      21

2008
BOWL    TEAM            AVG LINE  ADJ LINE  PRED LINE  GAME LINE  RESULT
ROSE    PENN ST           -17.32    -17.01       11.5         10      24
        USC               -25.92    -25.44      -11.5        -10      38

FIESTA  OHIO ST           -11.59    -15.07        4.5          9      21
        TEXAS             -16.92    -17.75       -4.5         -9      24

SUGAR   ALABAMA           -10.81    -11.82         -5       -9.5      17
        UTAH              -13.36    -10.63          5        9.5      31

ORANGE  CINCINNATI         -4.21     -4.85          0        2.5       7
        VIRGINIA TECH      -2.96     -2.96          0       -2.5      20

        FLORIDA           -19.08    -21.01       -3.5         -4      24
        OKLAHOMA          -17.54    -21.00        3.5          4      14

2007
BOWL    TEAM            AVG LINE  ADJ LINE  PRED LINE  GAME LINE  RESULT
        ILLINOIS              -3     -4.88       15.5         13      17
        USC             -18.4167    -16.69      -15.5        -13      49

FIESTA  OKLAHOMA          -21.69    -19.97         -4         -8      28
        W VIRGINIA        -18.83    -18.03          4          8      48

SUGAR   GEORGIA            -4.55     -7.07          3         -8      41
        HAWAII            -19.65    -13.65         -3          8      10

ORANGE  KANSAS            -13.50     -8.20          3          3      24
        VIRGINIA TECH      -8.08    -10.82         -4         -3      21

        LSU                -18.5    -18.82       -5.5       -3.5      38
        OHIO ST           -15.36    -16.38        5.5        3.5      24

2006
BOWL    TEAM            AVG LINE  ADJ LINE  PRED LINE  GAME LINE  RESULT
        (missing)         -15.50    -17.82        1.5        2.5      18
        USC               -14.04    -15.48       -1.5       -2.5      32

FIESTA  BOISE ST          -16.82    -13.43       -1.5          7      43
        OKLAHOMA           -9.88    -10.83        1.5         -7      42

SUGAR   LSU               -18.33    -19.16         -9         -9      41
        NOTRE DAME        -12.46    -13.79          9          9      14

ORANGE  LOUISVILLE        -18.17    -15.14      -14.5      -10.5      24
        WAKE FOREST        -0.21     -0.57       14.5       10.5      13

        FLORIDA           -12.08    -17.15        3.5          7      41
        OHIO ST           -19.33    -20.84       -3.5         -7      14

2005
BOWL    TEAM            AVG LINE  ADJ LINE  PRED LINE  GAME LINE  RESULT
        TEXAS             -24.29    -24.49        3.5          7      41
        USC               -25.46    -26.50       -3.5         -7      38

FIESTA  NOTRE DAME         -7.73    -10.15        6.5          5      20
        OHIO ST           -13.23    -16.95       -6.5         -5      34

SUGAR   GEORGIA           -11.33    -12.37         -9         -7      35
        W VIRGINIA         -5.15     -5.73          9          7      38

        FLORIDA ST         -7.73    -10.87         -1          9      23
        PENN ST           -11.23    -11.76          1         -9      26

2004
BOWL    TEAM            AVG LINE  ADJ LINE  PRED LINE  GAME LINE  RESULT
        OKLAHOMA          -24.58    -25.13         -2         -1      19
        USC               -22.38    -23.09          2          1      55

SUGAR   AUBURN            -14.59    -14.99      -10.5         -6      16
        VIRGINIA TECH      -7.05     -6.04       10.5          6      13

FIESTA  PITTSBURGH          0.10      0.37         20         14       7
        UTAH              -18.55    -18.00        -20        -14      35

        MICHIGAN          -10.05    -13.38        4.5        7.5      37
        TEXAS             -16.00    -16.59       -4.5       -7.5      38

2003
BOWL    TEAM            AVG LINE  ADJ LINE  PRED LINE  GAME LINE  RESULT
        MICHIGAN          -15.08    -14.34        4.5          7      14
        USC               -15.75    -15.87       -4.5         -7      28

ORANGE  FLORIDA ST        -15.25    -16.20          2       -1.5      14
        MIAMI             -17.33    -18.22         -2        1.5      16

FIESTA  KANSAS ST         -16.04    -17.53         -7         -7      28
        OHIO ST            -9.50    -12.92          7          7      35

        LSU               -10.88    -10.85          9          6      21
        OKLAHOMA          -22.77    -23.65         -9         -6      14

2002
BOWL    TEAM            AVG LINE  ADJ LINE  PRED LINE  GAME LINE  RESULT
        FLORIDA ST        -15.61    -18.73         -5        7.5      13
        GEORGIA            -8.88    -11.90          5       -7.5      26

FIESTA* MIAMI             -22.14    -23.40         -9        -11      24
        OHIO ST           -14.53    -14.22          9         11      31

ROSE    OKLAHOMA          -18.27    -17.35         -7         -5      34
        WASHINGTON ST      -7.82     -8.26          7          5      14

        IOWA               -9.88     -9.01        1.5        4.5      17
        USC                -6.71    -11.26       -1.5       -4.5      38

I feel it necessary at this point to digress and direct the reader’s attention to a post I made at Cappers Mall during my formative yet more ambitious years, as a write-up to the 2007 title game between the Gators and Buckeyes, which a brief survey of the tables above will show as being one of a certain green.

Results seem promising. There is a lot of green illuminating from the contiguously right-hand portion of the tables. However, one can’t really find any resolution in translating a model to bowl games. It’s a different environment altogether. Teams have over a month to prepare, motivation may be a factor, some teams may be operating under new coaches at the time of the game. Bowl games are an atypical scenario when compared with the regular season.

One thing that is at times conspicuously ill-configured is the adjusted line; it may need some major tweaking. The adjustment will invariably be a work-in-progress, and this is betrayed by my predicted line in last year’s championship game, amongst some others (Hawaii vs Georgia lol).

I anticipated the table to exhibit some irregularities with what average line shows, and to magnitudinal levels. But had one taken the concept of the regular season average line as an appendage to gambling on the games, the results would have shown consequential profit.

Since I have embraced the average line as the method of pedagogy (me being as well the benefactor), I feel the resilience to pecuniary dispersion is to a very low degree ephemeral, and an inclination to cluster around a break even point or better is probably my long term expectation. I’ve said before the average line is an indirect, yet constructive way of tapping into the sophisticated warehouse of linesmaker information.


College Football Spread Correlations and Riff Raff for 2010

This is how I’ve decided to approach any and all measures of performance, statistics, and various other factors.  I use the line that Vegas convenes and triangulate all my data to the line.  I did it with Basketball, Baseball, and now the next step is football.  Here I’m focusing on College Football.  Not sure I’ll even attempt to model an NFL database, the sport is a different animal.  For the NFL, I’ll just stick with intuition and finding good fades on the internetz.  And the fades emanate with resplendent fervor if you search the forum spectrum for CAPS LOCKS AND !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Back to College Football.  First I want to find how substantial is the line itself.  Meaning how does a team’s overall line compare to team wins.  Since for gamblers, the spread is the only thing that matters, this may seem meaningless.  But in order to appropriate a spread with precision and a marked sense of sharpness, how well a team performs over the season is highly correlated to wins and losses.  Which subsequently can be expressed, with a measure of consequence, via margin of victory, using the proverbial Pythagorean Win Percentage calculation.

Having extracted data from Statfox using the methods described here and here (I can assure you, regardless of the tendency for my computer to show a personified resentment by way of heating up, because of all the work it has been given, this is still a very economical method of extraction), the next step is to serry the data in a fashion that is conducive to evaluation and analysis.  And for me this involves calculating the average line for each team, then translating the line into wins using a least squares line.  The relationship between average line and wins should obviously be highly correlated, and with a low margin of error, the wins formulated via where a line falls on the highly linear trend can be used as a starting point.
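The least squares step can be sketched in plain Python. The function names are hypothetical and the five data points are invented for illustration; the real fit would use every team-season in the database.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 long")
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values are all identical")
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope


def linear_wins(avg_line, intercept, slope):
    """Wins implied by an average line on the fitted trend."""
    return intercept + slope * avg_line


# Hypothetical team-seasons: average line and regular season wins.
avg_lines = [-20.0, -10.0, 0.0, 10.0, 20.0]
wins = [11, 9, 6, 3, 1]
intercept, slope = fit_line(avg_lines, wins)
print(round(linear_wins(-14.0, intercept, slope), 2))
```

Because favorites carry negative lines, the fitted slope should come out negative.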

With the actual wins, and the linear wins from average line, there are a couple more ways to find expected season wins.  A direct Pythagorean expectation, points scored^x / (points scored^x + points allowed^x), and the Pythagorean formula replacing the observed points scored with the average line.  (I did this with college basketball, and explained the calculation to capacity here; it’s pretty straightforward and obvious: average the team points scored and points allowed, and subtract or add half the average line margin; if average line is below zero, add half the line to points scored, etc…).
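Both versions can be sketched in Python. The line-based variant follows my reading of the parenthetical (center both sides on the team's mean points, then shift each by half the average line); the function names are hypothetical.

```python
def pythagorean(points_for, points_against, exponent):
    """Expected winning percentage: PF^x / (PF^x + PA^x)."""
    pf = points_for ** exponent
    pa = points_against ** exponent
    return pf / (pf + pa)


def line_pythagorean(points_for, points_against, avg_line, exponent):
    """Pythagorean expectation with the scoring margin replaced by the average line.

    Assumption: both sides start at the mean of points scored and allowed,
    then move apart by half the average line each (negative line = favorite).
    """
    center = (points_for + points_against) / 2
    return pythagorean(center - avg_line / 2, center + avg_line / 2, exponent)


# Hypothetical team: 30 points scored, 20 allowed, average line of -7.
print(round(pythagorean(30, 20, 2.24), 3))
print(round(line_pythagorean(30, 20, -7, 2.24), 3))
```

The second number uses only the market's opinion of the margin, so comparing the two shows whether a team out-scored what its lines implied.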

The one thing with Pythagorean expectation that is immediately taken under consideration is the value of the exponent.  Some prefer a constant using observed past data; here I used the exponent that corresponded to the lowest absolute average difference between observed and expected, moving from season to season.  Different seasons induced various exponents; higher scoring environments generally relate to higher exponents, and vice versa.

In order to isolate the most ideal exponent for each season, the lowest absolute average difference (often referred to as the mean absolute difference, or MAD), is used rather than the actual difference because what I am concerned with is the error difference from actual percentage, rather than a true difference.  I want everything to compare to zero.  So a difference of -.05 and .05 won’t average to zero, instead both will have an absolute difference of .05.  (Again I laid all the framework for this out in my college basketball dirges).  There are other ways to find the best exponent, but MAD, despite its inferences, is a sound mathematical tool.
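A Python sketch of the exponent search: try each candidate exponent and keep the one with the lowest mean absolute difference between actual and expected winning percentage. The grid bounds, step size, and function names are my assumptions.

```python
def pythagorean(points_for, points_against, exponent):
    """Expected winning percentage: PF^x / (PF^x + PA^x)."""
    pf = points_for ** exponent
    pa = points_against ** exponent
    return pf / (pf + pa)


def mean_abs_diff(actual, expected):
    """Mean absolute difference (MAD) between two equal-length sequences."""
    return sum(abs(a - e) for a, e in zip(actual, expected)) / len(actual)


def best_exponent(teams, candidates):
    """Pick the exponent with the lowest MAD.

    teams is a list of (points_for, points_against, actual_win_pct) tuples.
    """
    actual = [win_pct for _, _, win_pct in teams]

    def mad_for(exponent):
        expected = [pythagorean(pf, pa, exponent) for pf, pa, _ in teams]
        return mean_abs_diff(actual, expected)

    return min(candidates, key=mad_for)


# Assumed search grid: exponents from 1.00 to 4.00 in steps of 0.01.
EXPONENT_GRID = [round(1 + 0.01 * i, 2) for i in range(301)]
```

Errors of -.05 and .05 give a MAD of .05 rather than cancelling to zero, which is the property the search relies on.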

All the data from 2002-2009 were calculated with this exact framework in mind.  One more thing I did in order to rate the teams from the last eight years.  This will be just a slight digression, but one that I think is very meaningful and explanatory.  Instead of only finding the average line, I decided to do an adjustment to opponent’s average line.  I extracted every opponent’s line for each team (more VBA code), found the average line for that opponent, and subtracted the difference.  It’s overly simple, and a more precise formulation is probably waiting to be found, but regardless, once sorted by adjusted line, the order of the teams took on a different look, similar to what I did with Starting Pitcher Line Weight.  For example, in 2009 instead of Nevada being in the top 25 in average line, once adjusted their ranking moved down to 40 or so.  Same with Boise State or TCU, the teams in the second tier conferences move down a few slots after the adjustment, which is what I set out to do.  The formula is direct and simple, and the result is substantial enough to warrant an inclusion of the formula into a sort of ranking system.
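The exact arithmetic is not spelled out, so the Python sketch below is one plausible reading and nothing more: shift a team's average line by the mean of its opponents' average lines. Under that assumed sign convention a schedule full of underdogs (positive opponent lines) pulls a heavy favorite back toward zero. The function name is hypothetical.

```python
def adjusted_line(team_avg_line, opponent_avg_lines):
    """Shift a team's average line by the mean of its opponents' average lines."""
    if not opponent_avg_lines:
        raise ValueError("need at least one opponent")
    opponent_mean = sum(opponent_avg_lines) / len(opponent_avg_lines)
    return team_avg_line + opponent_mean


# Hypothetical favorite that played mostly underdogs.
print(adjusted_line(-20.0, [5.0, 3.0, 1.0]))
```

A team that faced mostly favorites (negative opponent lines) would instead see its line pushed further from zero.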

I’ll just show the table for 2009, but here is the average line and adjusted line top 10:

AVERAGE LINE 2009

ADJUSTED LINE 2009

(For you Hokie fans out there, VaTech was rated 11th both with and without the adjustment.)

The lightish purplish pinkish (or how about just pink) cells show the descriptive statistics.  It’s amazing, and disgusting at the same time, that the average line for 120 NCAAF teams approaches zero.  Linemakers are the most underappreciated operation on the market in terms of sophistication and intelligence.  Another thing that should be mentioned is the variance of the opponents’ line.  For the most part, teams schedule neutral teams, or the schedule is such that there is an equilibrium between the degree of difficulty and the degree of cupcake.  Maybe a team like Troy will schedule Georgia and Florida and be underdogs of 20 to 30, but once they get into Sun Belt conference play they may be favorites of around 7-10 points on average.  The weight of conference games is merely showing its effect here.  Because the distribution of league wins (and the average line) is for the most part Gaussian (a tendency to cluster around a 50% winning percentage, or a line of zero), I can make the assumption that about 2/3 of the teams schedule opponents whose average line falls between -2.27 and 2.27, that is, within about one standard deviation of zero, which in my opinion is an indication of how well balanced conference play is.
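As a quick check on the 2/3 figure, here is a small Python sketch.  It assumes the opponents’ average line is roughly normal with a mean of zero, and it reads the 2.27 quoted above as the standard deviation, which is an assumption on my part.  The function names are hypothetical.

```python
import math


def share_within(k):
    """Share of a normal distribution within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))


def one_sigma_range(mean, std_dev):
    """Interval one standard deviation either side of the mean."""
    return mean - std_dev, mean + std_dev


share = share_within(1.0)               # about 0.68, i.e. roughly 2/3
low, high = one_sigma_range(0.0, 2.27)  # 2.27 is the figure quoted in the text
print(f"{share:.1%} of teams between {low} and {high}")
```

If the real distribution has fatter tails than a normal curve (a few teams with brutal or very soft schedules), the true share inside that range would differ somewhat from 2/3.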

With the above data serving merely as a nice ranking system, and perhaps a starting point for future team metric formulations, here are all the actual vs expected win measures from 2002 to 2009 (which I mentioned before the digression; BTW, I used the actual average line to find expected wins, rather than adjusted, for arbitrary reasons), sorted by adjusted line and only showing the top ten.  The exponent used for each season is in the yellow header beside the year.  It’s pointless to wait for the 2010 season to end to find the optimal exponent, since by then the team evaluations would be finished without the data ever being exploited for the sake of gambling on the teams.  Judging by the table, I think an exponent of around 2.24 should suffice in Pythagorean calculations to assess team by team scenarios for the 2010 season.  If you want all the data, email me and we will have to work something out.  Perhaps a data swap, or sign up for one of the affiliates and I’ll send you some of my Excel sheets.  I need a list of returning starters (already have 2008 and 2009) and preseason sportsbook future win totals back to 2002 (have 2009 for 47 teams).

*The Pink Cells are average for all teams for the respective seasons

*Texas and USC just beat the shit out of everybody in 2005

*Surprised by Arizona?  Me too.  Here is their 2008 schedule
*VaTech finished 11th in average adjusted line in three of the last eight seasons.

*Games for which no line was posted were excluded from win evaluations.  Therefore a Team A with a higher line than Team B can still have fewer Pythagorean line wins and linear line wins.

A brief survey of all the data demonstrates that team wins predicted by a least squares fit are, for the sport as a whole, the best way to compare to actual wins, or at least the more accurate method.  This is clear purely by looking at the average wins.  The Pythagorean formulas seem to invariably overshoot expected wins, averaged across all teams, in each season.  I’ve always thought the wins found by the linear relationship between line and wins are one of the best ways to measure how overrated or underrated a team is.  Like I said before, this sort of model worked very well for college basketball and the NBA, even though one year’s success rate could be completely random fluctuation, and has performed adequately enough in MLB.  (Remember I was down 15x before implementing the all-encompassing starting pitcher behemoth, which I only allow a handful of people to use.)
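For anyone who wants to try the linear version, the least squares fit between average line and winning percentage can be sketched as below.  This is a minimal sketch, not my actual spreadsheet: it assumes a single predictor (average line, positive meaning favored), an ordinary least squares fit, and made-up sample numbers.  The function names are hypothetical.

```python
def fit_least_squares(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def expected_wins(avg_line, games, slope, intercept):
    """Expected wins from average line, with the win pct clamped to [0, 1]."""
    pct = intercept + slope * avg_line
    pct = min(1.0, max(0.0, pct))
    return pct * games


# Made-up sample: each team's average line and actual winning percentage.
avg_lines = [-14.0, -6.5, 0.0, 5.5, 15.0]
win_pcts = [0.17, 0.33, 0.50, 0.67, 0.83]

slope, intercept = fit_least_squares(avg_lines, win_pcts)

# A hypothetical team: average line of 7, 12 lined games, 10 actual wins.
linear_wins = expected_wins(7.0, 12, slope, intercept)
wins_over_expected = 10 - linear_wins
```

A positive `wins_over_expected` means the team won more games than its average line implied, and a negative value means it won fewer.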

Vegas (or nowadays Pinnacle) has a collection of the best and most sophisticated equalizers for team performance, one that even seems to transcend how the team performs and what the team thinks of itself.  They’ve created an entirely new way to judge how good teams are, though their results are largely ignored by the MSM because of the stigma surrounding sports gambling.  Vegas knows best, always.

Eventually, I’ll post how the different statistical variables impact, or correlate to, the average line (yards per play, yards per point, rushing yards per attempt differential?), and perhaps find a nice and easy formula using coefficients as weights to find an expected line.  Then I can start running year to year regressions, how each variable extrapolates to the following season’s average line, etc.

This could (will?) allow me to produce an expected ATS W/L record to compare against the observed record (again, I did something similar with college basketball in trying to predict ATS records, though not nearly as involved).
