Evaluate on test set
28 Nov 2021
Evaluate these models on the holdout test set you set aside in part 1.
Question 1
Test your 5 models on the untouched 2019/20 regular season dataset. Do your models perform as well on the test set as you did on your validation set when building your models? Do any models perform better or worse than you expected?
Here are the metrics of the models on the regular-season test set:
Model | accuracy | log loss | AUC | mutual info | f-score |
---|---|---|---|---|---|
XGBoost | 0.890 | 0.397 | 0.701 | 0.000 | 0.193 |
NN | 0.910 | 0.276 | 0.757 | 0.000 | 0.081 |
SVM | 0.592 | 0.621 | 0.721 | 0.000 | 0.268 |
Log Reg on distance | 0.905 | 0.297 | 0.702 | 0.005 | 0.000 |
Log Reg on angle | 0.905 | 0.314 | 0.508 | 0.003 | 0.000 |
Log Reg on distance & angle | 0.905 | 0.297 | 0.703 | 0.003 | 0.000 |
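For reference, here is a minimal sketch of how the metrics in this table could be computed with scikit-learn, assuming each fitted model exposes `predict` and `predict_proba`. The function name is illustrative, and the choice of `adjusted_mutual_info_score` is only a guess (the slightly negative mutual-information values in the playoff table below hint at an adjusted estimator); it is not necessarily the exact code used for the project.

```python
# Minimal sketch (assumed names) of how the reported metrics could be
# computed for one fitted model on the held-out test set.
from sklearn.metrics import (accuracy_score, adjusted_mutual_info_score,
                             f1_score, log_loss, roc_auc_score)

def evaluate_model(model, X_test, y_test):
    """Return the five reported metrics for a fitted binary classifier."""
    y_pred = model.predict(X_test)               # hard 0/1 predictions
    y_prob = model.predict_proba(X_test)[:, 1]   # predicted goal probability
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "log loss": log_loss(y_test, y_prob),
        "AUC": roc_auc_score(y_test, y_prob),
        "mutual info": adjusted_mutual_info_score(y_test, y_pred),
        "f-score": f1_score(y_test, y_pred),
    }
```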
Our two preferred models keep performing well on the metrics where they were already strongest. As discussed before, the logistic regression models still predict the "no goal" class for every shot, which yields about 90% accuracy but an f-score of 0. XGBoost still performs well across the reported metrics.

There are a few factors to consider when analyzing these results. Since no games from the 2019/20 season appear in the training set, all rookie players are assigned an average rank. We saw previously that the two ranking features we created were important, so we can expect to lose some precision on events involving new shooters and new goalies. We also know that the 2019-20 regular season was cut short by Covid-19, so this test set is not as representative as a full, normal season would have been.

As reported in the table above, the best accuracy (0.910) and the best AUC (0.757) both come from the neural network, which produces roughly the same metrics as on the validation set. The best f-score (0.268) was obtained by the SVM, as was already the case on the validation set.
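As a hedged illustration of the rookie-ranking fallback mentioned above, here is one way unseen shooters could be mapped to a neutral average rank at test time; the column names, `fill_unseen_ranks`, and the overall structure are assumptions rather than the project's actual code.

```python
import pandas as pd

def fill_unseen_ranks(test_df, train_ranks, id_col="shooter_id",
                      rank_col="shooter_rank"):
    """Attach training-time ranks to test events, with a neutral fallback.

    train_ranks is a Series indexed by player id, built from the training
    seasons only. Rookies from 2019/20 are missing from it, so their
    events fall back to the average training rank.
    """
    default_rank = train_ranks.mean()
    test_df = test_df.copy()
    test_df[rank_col] = test_df[id_col].map(train_ranks).fillna(default_rank)
    return test_df
```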
Question 2
Test your 5 models on the untouched 2019/20 playoff games. Discuss your results and observations on this test set. Are there any differences to the regular season test set or do you get similar ‘generalization’ performance?
This test set is very different from the training set for two main reasons:
- post-season games follow slightly different rules; in particular, games go to overtime until one team scores a goal.
- the 2019-20 playoffs were played in August (!), after a five-month break (!!)
Given those two points alone, it is to be expected that our models might not perform as well here as they would on a regular-season test set.
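For context, the sketch below shows one way the regular-season and playoff test sets could be separated, assuming the NHL game-id convention in which digits 5-6 encode the game type (02 = regular season, 03 = playoffs); the column name is an assumption, not necessarily the one used in part 1.

```python
import pandas as pd

def split_test_sets(df: pd.DataFrame, game_id_col: str = "game_id"):
    """Split 2019/20 test events into regular-season and playoff games.

    NHL game ids follow the pattern YYYYTTNNNN, where the TT digits give
    the game type: "02" for regular season, "03" for playoffs.
    """
    game_type = df[game_id_col].astype(str).str[4:6]
    regular = df[game_type == "02"]
    playoffs = df[game_type == "03"]
    return regular, playoffs
```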
Here are the metrics of the models on the playoff test set:
Model | accuracy | log loss | AUC | mutual info | f-score |
---|---|---|---|---|---|
XGBoost | 0.890 | 0.383 | 0.711 | 0.000 | 0.206 |
NN | 0.910 | 0.264 | 0.767 | 0.000 | 0.068 |
SVM | 0.592 | 0.600 | 0.728 | 0.001 | 0.264 |
Log Reg on distance | 0.909 | 0.285 | 0.710 | -0.000 | 0.000 |
Log Reg on angle | 0.909 | 0.304 | 0.536 | 0.000 | 0.000 |
Log Reg on distance & angle | 0.909 | 0.285 | 0.711 | -0.004 | 0.000 |
The results are almost exactly the same as with the regular season (question above):
- The best models are still the NN for AUC and accuracy, and the SVM for f-score.
- The logistic regression models still predict the "no goal" class for every shot.
- XGBoost is still solid across all metrics.
This is somewhat surprising, since both test sets already differ from the training set in several ways, and the particularities of the 2019-20 playoff games push this test set even further from the training distribution. It does, however, suggest that the models generalize well.
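To make the logistic-regression behaviour noted above concrete, the sketch below scores a trivial baseline that always predicts "no goal": with roughly 9% of shots being goals, it reaches about 0.91 accuracy but an f-score of 0, the same pattern seen in both tables (the goal rate here is illustrative, not taken from the dataset).

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Illustrative labels: roughly 9% of shots are goals.
rng = np.random.default_rng(0)
y_test = (rng.random(10_000) < 0.09).astype(int)

# Baseline that never predicts a goal (always the majority class).
y_pred = np.zeros_like(y_test)

print("accuracy:", accuracy_score(y_test, y_pred))            # ~0.91
print("f-score:", f1_score(y_test, y_pred, zero_division=0))  # 0.0
```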