By the Numbers: Is Corsi or Goals a better evaluator of defensive success?

Abstract

All truth passes through three stages. First, it is ridiculed. Second, it is violently opposed. Third, it is accepted as being self-evident.

Arthur Schopenhauer (1788-1860)

Schopenhauer's statement concerns how people naturally resist dramatic changes in knowledge. Examination and re-examination, however, are not negative qualities. It is through discussion, scrutiny and evidence that we come to separate truth from fiction.

Recently I've had some good back-and-forth debate with a fellow Jets fan by the name of Scott. The discussion was about marginal differences among the Jets' several third-pairing defensemen, and we came into conflict over how to statistically evaluate a defenseman's defensive impact on a game. At its core, the conflict boiled down to the familiar question of looking at goals versus including all shot attempts (i.e., Corsi) as an evaluator, the hook being which one is the better predictor of future goals against.

The most interesting bit was that Scott had some numbers on the correlation between shot attempts and goals against for one year versus goals against for the following season. I wondered whether this data was influenced by score effects, so I asked if he minded my repeating the study with score-close data, and he gave me his blessing.

Here is a comparison of how well Corsi events and goals predict future goals, specifically for events against the player's team.

Background Information

Long ago, there were a few underground bloggers who were trying to learn more about hockey: how to make accurate predictions of future success and how to create context for how players were used. They started off by trying to improve on the traditional +/- statistic, using goals for and against while the player was on or off the ice.

Then, at the start of the 2007-08 season, NHL.com started increasing the amount of statistics available to the public. The newly available data included missed and blocked shot attempts. The biggest breakthrough was discovering that Corsi differentials predicted future goal differentials (and wins) more accurately than past goal differentials did. This breakthrough led to Corsi eventually overtaking goal differentials in the hockey analytics realm and to the start of the Behind The Net era.

Soon thereafter, it was discovered that shot metrics were influenced by teams changing their strategies when leading or trailing significantly. From these score effects, CorsiTied and CorsiClose arose. They simply look at Corsi only in situations when the score is either tied or close (close being within one goal in the first two periods or tied in the third). By doing this, analysts were able to increase the predictive power of shot metrics.
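
As a minimal sketch of the score-close filter described above: the function names, argument names and the treatment of overtime are illustrative assumptions, not taken from any data source, and the early-period margin is passed in rather than hard-coded.

```python
def is_score_close(period: int, goal_diff: int, early_margin: int) -> bool:
    """Return True if a game state counts as 'close'.

    period       -- 1, 2 or 3 (anything later is treated like the third
                    period, which is an assumption of this sketch)
    goal_diff    -- team goals minus opponent goals at the time of the event
    early_margin -- largest lead or deficit still counted as close in the
                    first two periods
    """
    if period <= 2:
        return abs(goal_diff) <= early_margin
    # From the third period on, only tied situations count.
    return goal_diff == 0


def is_score_tied(goal_diff: int) -> bool:
    """CorsiTied keeps only events that occur while the score is level."""
    return goal_diff == 0
```

A shot attempt would then be counted toward CorsiClose only if `is_score_close(...)` is true at the moment it is taken.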

Procedure

Five full seasons of data, from 2007 to 2012, were pulled from stats.hockeyanalysis.com. The data was cut down to 5v5 Corsi events against per 20 minutes (CA/20) and 5v5 goals against per 20 minutes (GA/20) during score-close situations, and it was limited to defensemen with 400+ minutes of ice time.

Then each defenseman's CA/20 for one season was plotted against his GA/20 for the following season. For example, one graph would include every defenseman's 2007-08 CA/20 versus his 2008-09 GA/20.

This was then repeated with GA/20 from the earlier season in place of CA/20, creating two graphs for each season.

The process was then repeated using two successive seasons for each player instead of one (for example, 2007-09 was compared to 2009-11), with the ice time threshold raised from 400+ to 800+ minutes.

All data from the single seasons was then pooled into one aggregate of single seasons, and the same was done for the two successive-season comparisons. A third aggregate data pool was created combining all sets, both single and successive seasons.

Altogether this procedure created eighteen plots covering nine comparisons of how well Corsi against and goals against predict future goals against in close-game situations.
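
The pairing step can be sketched in a few lines of code. This is only an illustration under assumptions: the dictionary layout, the function names and the toy numbers are invented here, and the sketch assumes the ice time threshold applies to both seasons being compared.

```python
from math import sqrt


def r_squared(xs, ys):
    """Coefficient of determination of a simple linear fit (Pearson r squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return (sxy / sqrt(sxx * syy)) ** 2


def pair_seasons(earlier, later, min_toi=400):
    """Pair each player's earlier-season value with his later-season value.

    earlier / later map player name -> (toi_minutes, per20_value).
    Players below min_toi in either season are dropped (an assumption).
    """
    pairs = []
    for name, (toi_a, value_a) in earlier.items():
        if name in later:
            toi_b, value_b = later[name]
            if toi_a >= min_toi and toi_b >= min_toi:
                pairs.append((value_a, value_b))
    return pairs


# Toy numbers, invented purely to show the call pattern.
ca_2007 = {"A": (900, 17.0), "B": (850, 19.5), "C": (300, 21.0), "D": (700, 18.2)}
ga_2008 = {"A": (880, 0.70), "B": (910, 0.95), "C": (650, 0.80), "D": (720, 0.78)}

pairs = pair_seasons(ca_2007, ga_2008)
xs = [x for x, _ in pairs]
ys = [y for _, y in pairs]
print(len(pairs), round(r_squared(xs, ys), 3))
```

Swapping the earlier-season CA/20 values for earlier-season GA/20 values gives the second plot of each pair, and raising `min_toi` to 800 mirrors the two-season comparison.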

Final Data

*Not all raw data is included here, but if you wish to see the spreadsheet which includes all information feel free to ask (warning: it is a bit messy)

Figure 1: Coefficient of determination (R²) against following-period GA/20 for each predictor, sample size, and Corsi's predictive strength as a percentage of that of goals

Data Set	CA/20 R²	GA/20 R²	Players in Sample	Corsi Strength
2007-08	0.08383	0.00747	140	1122%
2008-09	0.03275	0.01009	140	324%
2009-10	0.08419	0.01614	141	522%
2010-11	0.03910	0.01120	142	349%
2007-09 (2YR)	0.04889	0.02248	97	217%
2008-10 (2YR)	0.07308	0.02379	98	307%
SINGLE SEASONS	0.04243	0.00760	563	558%
SUCCESSIVE SEASONS	0.05725	0.02275	195	252%
ALL DATA	0.04401	0.00952	758	462%
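
The "Corsi Strength" column appears to be the CA/20 coefficient of determination divided by the GA/20 one, expressed as a percentage. A quick check of that reading on two rows follows; the function name is made up for this sketch, and because the printed coefficients are rounded, the last digit can differ from the table on other rows.

```python
def corsi_strength(r2_corsi: float, r2_goals: float) -> float:
    """Ratio of the two coefficients of determination, as a percentage."""
    return 100.0 * r2_corsi / r2_goals


# R-squared values copied from Figure 1.
print(round(corsi_strength(0.08383, 0.00747)))  # 2007-08 row
print(round(corsi_strength(0.04401, 0.00952)))  # ALL DATA row
```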

Figure 2: One graphical example, 'ALL DATA' CA/20 versus future GA/20

Discussion

It seems to be pretty unanimous. In all cases, a defenseman's success in suppressing shot attempts against is the stronger predictor of how successful he will be in suppressing future goals against. Now, this does not mean you can look at Corsi against on its own and claim that the defenseman with the lower number is better defensively. It does mean he will likely have lower goals against given no changes in usage, ceteris paribus.

While Corsi's coefficient of determination is much larger, neither is extraordinarily strong. In part this is likely due to the small sample sizes. Jared Likens previously used larger sample sizes for comparing Corsi% and goals%, improving the correlation by a full decimal point but with similar results, Corsi being by far the superior predictor.

Another reason is simply that goals are extremely variable. While Corsi tends to be stable, with a split-half season reliability shown to be around 70%, goals are around 40% for the same sample. This is essentially the root reason why Corsi is a stronger predictor. When you predict goals from goals, you are trying to predict one heavily variance-influenced variable with another heavily variance-influenced variable. When you use Corsi instead, you at least have one variable, shot attempts, that is steady.
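
One way to put rough numbers on this argument is the classical attenuation relationship, under which an observed correlation shrinks by the square root of the product of the two variables' reliabilities. The sketch below is only an illustration: it assumes, purely for the sake of argument, that Corsi and goals would track future goals equally well if neither were noisy, and it borrows the 70% and 40% split-half figures quoted above as the reliabilities. The function name and the 0.5 "true" correlation are invented.

```python
from math import sqrt


def observed_r(true_r: float, rel_predictor: float, rel_target: float) -> float:
    """Classical attenuation: noise in either variable shrinks the correlation."""
    return true_r * sqrt(rel_predictor * rel_target)


# Future goals against is the target in both cases, so its reliability
# (taken as 0.40 here) is shared; only the predictor's reliability differs.
r_from_corsi = observed_r(0.5, rel_predictor=0.70, rel_target=0.40)
r_from_goals = observed_r(0.5, rel_predictor=0.40, rel_target=0.40)

# Ratio of the two coefficients of determination.
r2_ratio = r_from_corsi ** 2 / r_from_goals ** 2
print(round(r2_ratio, 2))
```

Under those assumptions the ratio depends only on the two predictor reliabilities (0.70 divided by 0.40), not on the assumed "true" correlation.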

The 2007-08 data set seems to be an extreme, with Corsi having one of its stronger predictive seasons while goals had their worst in the set. If you want to place blame for those years, Adrian Aucoin, Andrew Ference, Chris Campoli, Sean O'Donnell, Dion Phaneuf, Chris Pronger, and Mathieu Schneider were the most extreme change offenders for GA/20 from 2007-08 to 2008-09 that stood out on a quick glance. I am not sure, though, why this season was worse than the others for GA predicting GA.

With the two successive-season data sets, the difference between goals and Corsi is cut nearly in half. It should be noted that while each individual player's sample size improves in the successive-season data sets, the number of players in the sample is diminished. Further investigation would be needed to see at what sample size goals become more effective, if ever. There is a chance, though, that the sample needed may be so large that aging curves end up disrupting the data.

It would be nice to repeat Jared Likens's data-mining experiment specifically for against or for numbers as opposed to differentials, although it is unlikely to bring different results. While Likens's data looked at differentials for teams as opposed to exclusively one side for a player, a player's differential is simply his own against (or for) numbers compared to his opposition's against (or for) numbers, and a team is just the aggregate of all its players. Also, the added information is probably quite minimal, as the aim in hockey is not simply to score more goals or to limit the other team, but rather to out-score the opponent by doing a combination of both.

What does this mean for the Winnipeg Jets?

Mark Stuart, Adam Pardy and Keaton Ellerby have seen similar sample sizes of ice time and have been battling for places on the third pair.

Figure 3: 2013-14 Score-Close 5v5 Corsi for Pardy, Ellerby and Stuart

Name	TOI	CF	CA	CF/20	CA/20	CD/20	CF%
Adam Pardy	131.9	133	107	20.167	16.224	+3.942	51.36%
Keaton Ellerby	136.3	128	132	18.782	20.625	-1.843	49.23%
Mark Stuart	146.9	131	172	17.835	22.260	-8.424	43.23%
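
For reference, the rate columns in this kind of table are normally derived from the raw counts: events divided by minutes of ice time, scaled to 20 minutes, with CF% being the share of all shot attempts that went the player's way. The sketch below shows that arithmetic using Adam Pardy's TOI, CF and CA from Figure 3. The function name and return layout are assumptions, and values recomputed this way will not necessarily match every printed cell.

```python
def per20_rates(toi_minutes: float, cf: int, ca: int) -> dict:
    """Derive per-20-minute rates and CF% from raw on-ice shot-attempt counts."""
    cf20 = cf / toi_minutes * 20
    ca20 = ca / toi_minutes * 20
    return {
        "CF/20": cf20,
        "CA/20": ca20,
        "CD/20": cf20 - ca20,         # Corsi differential per 20 minutes
        "CF%": 100 * cf / (cf + ca),  # share of attempts taken by the player's team
    }


pardy = per20_rates(131.9, 133, 107)
print({name: round(value, 3) for name, value in pardy.items()})
```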

We can therefore estimate that Stuart is the most likely of the three defensemen to have the worst goal differential and goals against at year's end, ceteris paribus. But, as mentioned previously, this is only perfectly comparable if their situations are all perfectly the same, which they are not… but we can take into account much of their contextual nuances.

All three have seen similar quality of teammates when looking at Jets forwards and defense TOI. Their zone starts, however, are quite different, with Pardy having the easiest zone starts and Stuart the toughest. The quality of competition is also different, with Ellerby seeing the most top-6 matchups and Pardy the fewest.

The effect of zone starts can be reduced by looking only at open play, deleting any events that occur within 10 seconds directly after an offensive or defensive zone faceoff. This gives enough time for the comparative advantage of a zone start to severely diminish. The results, though, do not change much if you do so.
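
A minimal sketch of that open-play filter is shown here. The tuple layouts, zone codes and function name are assumptions made for illustration, all times are taken to be seconds within a single game, and an event exactly 10 seconds after the faceoff is treated as still inside the window.

```python
def open_play_events(events, faceoffs, window=10.0):
    """Drop events within `window` seconds after an offensive or defensive zone faceoff.

    events   -- list of (game_seconds, label) tuples
    faceoffs -- list of (game_seconds, zone) tuples, zone in {"O", "D", "N"}
    """
    # Only offensive ("O") and defensive ("D") zone faceoffs open a window;
    # neutral-zone faceoffs are left alone, matching the description above.
    starts = [t for t, zone in faceoffs if zone in ("O", "D")]
    kept = []
    for t, label in events:
        in_window = any(0 <= t - start <= window for start in starts)
        if not in_window:
            kept.append((t, label))
    return kept


# Toy data, invented to show the behaviour.
events = [(5, "shot"), (12, "shot"), (40, "miss"), (65, "block")]
faceoffs = [(0, "D"), (38, "N"), (60, "O")]
print(open_play_events(events, faceoffs))
```

Corsi for and against would then be tallied from the surviving events only.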

Currently there is no proven, reliable way to adjust for match-up difficulty. While there is a difference between the three defensemen, it is minimal, and all three have predominantly been used as sheltered third-pairing defensemen.

If the Jets truly wanted to dress the best players, Stuart would be sat in favour of Pardy and Ellerby. While Stuart does bring intangibles with veteran presence, shot blocking and playing with an edge, those qualities play a far smaller role than out-chancing opponents. Intangibles are real but should be considered more of a tie-breaker than a deal-breaker. He may still be a 5/6 defenseman, but he has not been the Jets' 5th or 6th best defenseman.

You don't get points for being a stable or calming presence or a veteran. You don't get points for blocking shots or giving out hits or having heart or being gritty or being good in the dressing room. These are all abilities and tools that a player may possess that make him a better overall player in helping the team win; however, it is the on-ice results that cause the wins.

Style is how you seek results. The results are whether you are effective or not.

Conclusion

Discussions are not limited to the areas of academia. Discussions occur between you and your friends over drinks in the bar or with the friendly stranger sitting next to you at the arena. For an out-of-town fan such as myself, discussions commonly occur over the internet in comment sections, on message boards and on Twitter. I am a passionate person and so I debate passionately, but this does not mean I'm always right. Whether wrong or right, though, I always remain open to learning, and this mindfulness is where I find the most self-improvement.

Recent advancements in hockey analytics have gone, and still are going, through this process of constant re-evaluation. Many actively resist or question the power of "advanced statistics" and what they represent. While the front lines of hockey analytics are currently in researching contextual nuances (such as what makes a strong "possession player", by looking into zone entries and exits, or how to properly evaluate players given different situations in deployment and usage), effort should always be put into revisiting concepts to help solidify their meaning.

In this way we can help educate others in new discoveries and improve our own understanding by checking for errors or outliers.

Sources

Objective NHL – Shots, Fenwick and Corsi by Jared Likens

Objective NHL – Predicting Future Success by Jared Likens

Arctic Ice Hockey – More on Score Effects by Gabriel Desjardins

Broad Street Hockey – How to Evaluate Defensemen by Eric Tulsky
