Advanced Models
22 Nov 2021
Question 1
Train an XGBoost classifier on the same dataset, using only the distance and angle features. Briefly discuss your training/validation setup, and compare the results to the Logistic Regression baseline.
Links to the experiments in comet:
We simply used the train_test_split function from sklearn with a 20% validation split. In this basic model, we did not apply any extra preprocessing step such as normalization or encoding.
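For reference, here is a minimal sketch of this setup (the file path and the column names shot_distance, shot_angle, and is_goal are assumptions, not necessarily the names used in our repository):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

df = pd.read_csv("train.csv")  # hypothetical path to the training data

X = df[["shot_distance", "shot_angle"]]  # assumed feature column names
y = df["is_goal"]                        # assumed label column name

# 80/20 train/validation split, no normalization or encoding
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = XGBClassifier()
model.fit(X_train, y_train)

val_probs = model.predict_proba(X_val)[:, 1]
print("Validation AUC:", roc_auc_score(y_val, val_probs))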
Differences between these models and the logistic regression baseline:
- XGBoost is already performant with only the angle feature (not the case for the baseline).
- The improvements coming from adding the angle feature to the distance one are greater in the XGBoost model than in the baseline.
- The best basic XGBoost model outperforms the baseline, but only slightly (AUC difference of ~0.01).
Question 2
Train an XGBoost classifier using all of the features you created in Part 4 and do some hyperparameter tuning to try to find the best performing model with all of these features.
To tune the hyperparameters (max depth, eta, and alpha), a Bayesian search was performed. At the start, we were using accuracy as the metric. However, the data is known to be imbalanced, and we found out early enough that accuracy was not a good metric to use by itself in this situation. Therefore, we chose the F-score as the metric. This choice will be detailed in the section Give it your best shot.
We can see that the XGBoost model with all features has a slightly better AUC (0.02 more), so we are slowly getting closer to a model that can correctly predict goals.
The maximum probability output by this model (0.9) is higher than the baseline's (0.7) and higher than that of any other model implemented so far, as can be seen on the calibration plots. However, this model is not as well calibrated as the previous XGBoost.
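For reference, a minimal sketch of how such a calibration plot can be produced with sklearn, assuming y_val and the model's predicted probabilities val_probs as in the earlier sketch:

from sklearn.calibration import calibration_curve
import matplotlib.pyplot as plt

# Fraction of actual goals per bin of predicted probability
prob_true, prob_pred = calibration_curve(y_val, val_probs, n_bins=10)

plt.plot(prob_pred, prob_true, marker="o", label="XGBoost (all features)")
plt.plot([0, 1], [0, 1], "k--", label="Perfectly calibrated")
plt.xlabel("Mean predicted probability")
plt.ylabel("Fraction of positives")
plt.legend()
plt.show()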
Here is the code snippet for the tuning:
# Comet Optimizer configuration: Bayesian search maximizing the validation F-score
config = {
    "algorithm": "bayes",
    "spec": {
        "maxCombo": 0,  # 0 = no limit on the number of combinations tried
        "objective": "maximize",
        "metric": "f_score_val",
        "minSampleSize": 100,
        "retryLimit": 20,
        "retryAssignLimit": 0,
    },
    "parameters": {
        "max_depth": {"type": "integer", "min": 2, "max": 25},
        "eta": {"type": "float", "min": 0.1, "max": 1},
        "alpha": {"type": "float", "min": 0, "max": 1},
    },
}
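This config is consumed by Comet's Optimizer, which samples a new combination of hyperparameters for each experiment. A minimal sketch of the surrounding search loop (the project name and the X_train/X_val/y_train/y_val variables are assumptions):

from comet_ml import Optimizer
from sklearn.metrics import f1_score
from xgboost import XGBClassifier

opt = Optimizer(config)

for experiment in opt.get_experiments(project_name="advanced-models"):  # hypothetical project name
    # eta and alpha correspond to XGBoost's learning_rate and reg_alpha
    model = XGBClassifier(
        max_depth=experiment.get_parameter("max_depth"),
        eta=experiment.get_parameter("eta"),
        alpha=experiment.get_parameter("alpha"),
    )
    model.fit(X_train, y_train)
    # Log the metric that the Bayesian search maximizes
    experiment.log_metric("f_score_val", f1_score(y_val, model.predict(X_val)))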
Here are the best hyperparameters, resulting in a validation F-score of 0.195 and a validation accuracy of 0.889 (link to experiment):
- max_depth = 11,
- eta = 0.9101,
- alpha = 0.4451.
And here are the validation errors of the Bayesian search:
Link to the experiment with the logged model: XGBoost tuned
Question 3
Explore using some feature selection techniques to see if you can simplify your input features.
To evaluate the different features, we decided to use filter methods, as wrapper methods would have been too computationally expensive. With this in mind, we used the chi-square test, a variance estimator (with the threshold set at 0.1), the ANOVA test, and mutual information. We used cross-validation throughout the process.
Method | Feature 1 | Feature 2 | Feature 3 | Feature 4 | Feature 5
---|---|---|---|---|---
Variance Threshold | shot_distance | rebound | change_shot_angle | goalie_rank | shooter_rank
Chi2 | time_from_last_event | shot_distance | rebound | goalie_rank | shooter_rank
ANOVA | coord_x | coord_y | shot_distance | goalie_rank | shooter_rank
Mutual information | coord_x | last_event_coord_x | rebound | None | None
Except for the variance threshold, we used SelectKBest from sklearn with k=5 in order to see which features would be selected by the different algorithms. While the results are similar, there is some disparity in the selected features, indicating that we cannot blindly select the features from these results. Note: due to the nature of these algorithms, we preferred to avoid encoding the categorical features (shot_type and last_event_type) as either one-hot or ordinal, as they would have had a mean and variance bias compared to the other features.
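A minimal sketch of these filter methods, assuming a numeric feature matrix X_train (a pandas DataFrame) and binary labels y_train:

from sklearn.feature_selection import SelectKBest, VarianceThreshold, chi2, f_classif, mutual_info_classif

# Variance threshold: drop features whose variance is below 0.1
vt = VarianceThreshold(threshold=0.1).fit(X_train)
print("Variance Threshold:", X_train.columns[vt.get_support()].tolist())

# SelectKBest with k=5 for the three scoring functions
# (note: chi2 requires non-negative inputs, so some features may need shifting first)
for name, score_fn in [("Chi2", chi2), ("ANOVA", f_classif), ("Mutual information", mutual_info_classif)]:
    skb = SelectKBest(score_func=score_fn, k=5).fit(X_train, y_train)
    print(name, X_train.columns[skb.get_support()].tolist())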
To assess the performance of these subsets of features, we applied them to our previously tuned XGBoost model.
Using SHAP, we can see the impact of each feature on the prediction.
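A minimal sketch of how these SHAP plots can be produced for a fitted XGBoost model (variable names are assumed):

import shap

explainer = shap.TreeExplainer(model)       # tree-based explainer suited to XGBoost
shap_values = explainer.shap_values(X_val)  # per-feature contribution for each shot

# Global summary: magnitude and direction of each feature's impact
shap.summary_plot(shap_values, X_val)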
SHAP applied on XGBoost trained on features selected by variance threshold
SHAP applied on XGBoost trained on features selected by chi-square
SHAP applied on XGBoost trained on features selected by ANOVA
SHAP applied on XGBoost trained on features selected by mutual information
SHAP applied on XGBoost trained with all features
XGBoost feature selection
By analyzing the table above and these SHAP plots, we were able to discard some features and keep the following ones (a retraining sketch follows the list):
- “game_seconds”,
- “coord_y”,
- “coord_x”,
- “time_from_last_event”,
- “shot_angle”,
- “distance_from_last_event”,
- “shot_distance”,
- “rebound”,
- “change_shot_angle”,
- “goalie_rank”,
- “shooter_rank”.
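A sketch of retraining the tuned XGBoost on this subset, reusing the hyperparameters found by the Bayesian search (the data variables are assumed):

from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

subset = [
    "game_seconds", "coord_y", "coord_x", "time_from_last_event",
    "shot_angle", "distance_from_last_event", "shot_distance",
    "rebound", "change_shot_angle", "goalie_rank", "shooter_rank",
]

model = XGBClassifier(max_depth=11, eta=0.9101, alpha=0.4451)
model.fit(X_train[subset], y_train)

val_probs = model.predict_proba(X_val[subset])[:, 1]
print("Validation AUC:", roc_auc_score(y_val, val_probs))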
Here are the plots of XGBoost with this subset:
This XGBoost model performs better than the previous ones. The calibration curve is very close to the perfectly calibrated benchmark, and this time the gain in AUC is much bigger (+0.044).
From the goal rate plots, we can observe that the goal rate in the highest percentile is again higher than in the previous experiments (baseline: 22%, tuned XGBoost: 31%, XGBoost with subset: 38%).
We also tried adding to this best subset the following three categorical features (which we had previously removed):
- “team_info”,
- “shot_type”,
- “last_event_type”.
These decisions were based on our domain knowledge and the previous milestone. It is known that different teams have different strategies, and some teams are more aggressive, which leads to more goals scored, e.g. the Tampa Bay Lightning (185 more goals than the average on the training set). We also observed in Milestone 1 that the shot type is strongly correlated with a shot being a goal or not.
- Link to comet experiment: XGBoost trained on best subset + 3 categorical features (added by domain knowledge)
However, after experimenting with these additional subsets, we found that the best AUC score was obtained by not adding any of them (0.757 vs. 0.769). The optimal set of features with respect to AUC (0.769) is therefore the one described above. Its F-score on the validation set is 0.119.
- Link to comet experiment: XGBoost with best subset