Beating save percentage to death: Logistic Regression.

I reran my Even Strength Save Data using logistic regression. The results again show a significant difference between goalies.

First, I converted the data to sparse format (each line is a single observation). There are over 600k shots in the database. When I tried to run a logistic regression on the sparse data, R politely coughed up a hairball and died. Unable to allocate a 1.6Gb data vector? Weak!
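For comparison, a grouped binomial fit keeps one row per goalie-season rather than one row per shot, which usually sidesteps the memory problem. This is a minimal sketch on made-up numbers, not the analysis run in this post; the data frame `toy` and its columns `goalie`, `saves`, `shots` and `goals` are hypothetical.

```r
# Hypothetical toy data: one row per goalie-season, not one row per shot.
toy <- data.frame(
  goalie = c("A", "A", "B", "B"),
  saves  = c(920, 915, 905, 900),
  shots  = c(1000, 1000, 1000, 1000)
)
toy$goals <- toy$shots - toy$saves

# Grouped binomial response: cbind(successes, failures).
fit <- glm(cbind(saves, goals) ~ goalie, family = binomial, data = toy)

# Analysis of deviance for the goalie term.
anova(fit, test = "Chisq")
```

The coefficient for goalie B is the difference in log-odds of a save relative to goalie A.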

So I went back to the dense data and constructed the logits, ln(p/(1-p)), myself. I first corrected the observed save percentages for year.
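Constructing the logits from dense data might look roughly like the sketch below. The numbers are made up, `ESSV` is a hypothetical name for even-strength saves, and `ESA` is assumed to mean even-strength shots against. The year correction is left out because its details are not given here.

```r
# Hypothetical dense data: one row per goalie (or goalie-season).
Dataset <- data.frame(
  last.first = c("Goalie, A", "Goalie, B"),
  ESSV = c(920, 905),    # hypothetical column: even-strength saves
  ESA  = c(1000, 1000)   # assumed: even-strength shots against
)

# Observed save percentage and its logit, ln(p/(1-p)).
p <- Dataset$ESSV / Dataset$ESA
Dataset$CalcLogit <- log(p / (1 - p))   # equivalent to qlogis(p)
```

A goalie with a perfect save percentage would give an infinite logit, so very small samples would need special handling.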

> LinearModel.1 <- lm(CalcLogit ~ last.first, data=Dataset, weights=ESA)

> anova(LinearModel.1)

Analysis of Variance Table

Response: CalcLogit
            Df  Sum Sq Mean Sq F value    Pr(>F)
last.first 231  6041.3  26.153  1.6455 6.833e-07 ***
Residuals  693 11014.5  15.894

Even with the inherent selection bias, the weighted analysis shows a difference that is highly statistically significant. The difference in goalies accounts for about 35% of the variability seen in the calculated logits.
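The 35% figure follows directly from the sums of squares in the ANOVA table. As a quick check, using only the numbers printed above (the variable names are just for illustration):

```r
# Sums of squares and degrees of freedom copied from the ANOVA table.
ss_goalie   <- 6041.3
ss_residual <- 11014.5
df_goalie   <- 231
df_residual <- 693

# Share of the weighted variability in the logits attributed to goalie.
r_squared <- ss_goalie / (ss_goalie + ss_residual)

# F statistic: ratio of the two mean squares.
f_value <- (ss_goalie / df_goalie) / (ss_residual / df_residual)

# Upper-tail p-value for that F statistic.
p_value <- pf(f_value, df_goalie, df_residual, lower.tail = FALSE)

round(r_squared, 3)   # 0.354
```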

One problem with logistic regression is the difficulty in translating regression results into effect sizes. Beyond that is the question of the practical significance of the differences.
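One way to get a feel for effect size is to push a logit difference back through the inverse logit and read it as a save percentage. The baseline and the shift below are made-up illustrative values, not estimates from this data.

```r
# Hypothetical values, for illustration only.
baseline_sv <- 0.920   # assumed league-average even-strength save percentage
logit_shift <- 0.10    # assumed goalie effect on the logit scale

# Move to the logit scale, apply the shift, and convert back.
shifted_sv <- plogis(qlogis(baseline_sv) + logit_shift)

# Difference in saves per 1000 even-strength shots.
extra_saves_per_1000 <- 1000 * (shifted_sv - baseline_sv)
```

Because the inverse logit is nonlinear, the same logit shift corresponds to a different change in save percentage at a different baseline.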

Conclusion

Using weighted logistic regression, there is a statistically significant difference between NHL goalies in even strength save percentages over the period 1997-2010.

Talking Points