) from a “real” D-C model, but it is a hack. Short of designing a better class of model, that might be the best you can do. Let me know if you have any better ideas on this front.
Exactly how much worse does the totals-only model (i.e. the one without per-shot xG information) perform compared to the xG-resimulation approach?
As before, I ran a backtest to evaluate model performance. In a backtest, we train each model as of each past matchday, make predictions with those models, and measure how well they would have performed using only the information available at the time. The following chart shows the backtest results for the English Premier League between 2016/17 and 2020/21, inclusive.
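For illustration, here's a minimal sketch of that walk-forward loop. The `train_model` and `predict_match` helpers are hypothetical stand-ins for whichever model is being evaluated, not the actual wingback code:

```python
import pandas as pd

def backtest(matches: pd.DataFrame, train_model, predict_match) -> pd.DataFrame:
    """Walk-forward backtest: for each matchday, train on all earlier
    matches, then predict that matchday's fixtures.

    `train_model` and `predict_match` are hypothetical stand-ins for
    the model being tested (goals, xG-totals, or xG-resimulation).
    """
    predictions = []
    for matchday in sorted(matches["matchday"].unique()):
        train = matches[matches["matchday"] < matchday]
        test = matches[matches["matchday"] == matchday]
        if train.empty:
            continue  # nothing to train on yet
        model = train_model(train)
        # predict_match is assumed to return a dict of predictions per match
        predictions.extend(predict_match(model, match) for _, match in test.iterrows())
    return pd.DataFrame(predictions)
```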
When looking at match outcomes (home/draw/away), the totals-only model performs very slightly worse than the model with full xG-resimulations. This makes sense: it has access to less information about teams’ shooting patterns. However, it still performs a lot better than the goals-based model.
Interestingly, relative to the xG-resimulation approach, the xG-totals approach performs much worse on total goals than on per-match goal difference. For each match prediction, I took each model’s average predicted home and away goals, then calculated total goals (home + away) and goal difference (home - away). Over the whole backtest dataset, I took the mean squared error of each quantity against the actual result.
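As a sketch, the metric calculation looks something like this (the column names are my own assumption, not the wingback schema):

```python
import pandas as pd

def goals_mse(df: pd.DataFrame) -> dict:
    """Mean squared error for total goals and goal difference.

    Assumes columns `pred_home` and `pred_away` (a model's average
    predicted goals) and `home_goals` and `away_goals` (actual goals).
    """
    pred_total = df["pred_home"] + df["pred_away"]
    pred_diff = df["pred_home"] - df["pred_away"]
    actual_total = df["home_goals"] + df["away_goals"]
    actual_diff = df["home_goals"] - df["away_goals"]
    return {
        "total_goals_mse": ((pred_total - actual_total) ** 2).mean(),
        "goal_difference_mse": ((pred_diff - actual_diff) ** 2).mean(),
    }
```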
The xG-totals model performs almost exactly as well as the xG-resimulation model at predicting match goal difference. However, it performs markedly worse on total goals (though still much better than the goals-based model).
This suggests that the xG-totals method is almost as good as the xG-resimulation approach at estimating the relative strength of each team. However, it is worse at predicting match outcomes because of a bias it introduces: the xG totals underestimate the total number of goals scored in any given match.
My hunch is that you could recover some of this underperformance by using the number of shots in a game (also widely available) to get a mean xG per shot. You could then combine that mean xG per shot with the number of shots to simulate match outcomes as though you had the individual shot data.
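Something like the sketch below, where each team’s shots are treated as independent Bernoulli trials with probability equal to the mean xG per shot. Flattening every shot to the same quality is a simplifying assumption, so this only approximates true shot-level resimulation:

```python
import numpy as np

def simulate_scorelines(home_xg, home_shots, away_xg, away_shots,
                        n_sims=10_000, rng=None):
    """Simulate scorelines from xG totals and shot counts.

    Each shot is a Bernoulli trial with p = mean xG per shot, which
    ignores shot-quality variation within a match (an approximation).
    """
    rng = rng or np.random.default_rng()
    home_goals = rng.binomial(home_shots, home_xg / home_shots, size=n_sims)
    away_goals = rng.binomial(away_shots, away_xg / away_shots, size=n_sims)
    return home_goals, away_goals

# Example: outcome probabilities from the simulated scorelines
home, away = simulate_scorelines(home_xg=1.8, home_shots=15,
                                 away_xg=0.9, away_shots=8)
p_home = (home > away).mean()
p_draw = (home == away).mean()
p_away = (home < away).mean()
```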
You could also gain some extra performance by ensembling an xG-totals model with a goals-based model.
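The simplest version of this is a weighted average of the two models’ outcome probabilities; the weight below is a placeholder that would need tuning on backtest data:

```python
def ensemble_outcome_probs(xg_totals_probs, goals_probs, weight=0.7):
    """Blend (home, draw, away) probabilities from two models.

    `weight` is the share given to the xG-totals model; 0.7 is a
    placeholder, not a tuned value.
    """
    return tuple(
        weight * p_xg + (1 - weight) * p_goals
        for p_xg, p_goals in zip(xg_totals_probs, goals_probs)
    )
```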
But at the end of the day, if this kind of performance drop is a big issue for you, it’s probably worth investing in a model with better foundations.
If you would like to reproduce this analysis, or dig a bit more into the technical details, the code can be found in my wingback repository. This repo is built on top of understat-db, a project for pulling data from Understat.com into a database. It uses a Python library called mezzala for modelling team strengths and match predictions.
I still have some Liverpool 2019/20 “data scarves” left for sale (shipping to the UK). If football+data is your thing, take a look.
This is something I’ve touched on before, and I think there’s a lot of potential for incorporating different data sources, like non-shot xG, post-shot xG, betting odds, and so on, into the model. ↩︎