Modelling the World Cup with regista
Recently, I started working on an R package for soccer analytics and modelling, regista, and the upcoming World Cup seemed like aa good an opportunity as any to demo what regista does and how it can be used.
So, in this post, I’m going to go through using regista to estimate the strengths of international teams.
About regista
At the time of writing, functionality is limited to estimating team strengths. However, I’m working to expand and extend the package. If you have any ideas on things you’d like to see, file an issue on github or drop me a message on twitter.
regista is not currently available on CRAN but can be downloaded from github like so:
if (!require(devtools)) { |
The data
Mart Jürisoo has handily compiled some international soccer data over on Kaggle. It’s free to download (you need a kaggle account but that’s free as well) and goes back all the way to 1872.
library(tidyverse) |
## # A tibble: 15 x 6
## date home_team away_team home_score away_score hfa
## <date> <chr> <chr> <int> <int> <lgl>
## 1 2018-06-02 El Salvador Honduras 1 0 FALSE
## 2 2018-06-02 Thailand China 0 2 TRUE
## 3 2018-06-02 England Nigeria 2 1 TRUE
## 4 2018-06-02 Montenegro Slovenia 0 2 TRUE
## 5 2018-06-02 Mexico Scotland 1 0 TRUE
## 6 2018-06-02 Sweden Denmark 0 0 TRUE
## 7 2018-06-02 Ireland USA 2 1 TRUE
## 8 2018-06-02 Belgium Portugal 0 0 TRUE
## 9 2018-06-02 Austria Germany 2 1 TRUE
## 10 2018-06-02 Niger Uganda 2 1 TRUE
## 11 2018-06-02 Iceland Norway 2 3 TRUE
## 12 2018-06-02 Zambia Namibia 0 0 FALSE
## 13 2018-06-02 Lesotho Swaziland 1 0 FALSE
## 14 2018-06-02 Latvia Estonia 1 0 TRUE
## 15 2018-06-02 Kenya New Zealand 2 1 FALSE
Of course, we don’t want to use all the data from 1872. Instead we’ll use matches from the past two years.
Two years may seem like a long time, but unfortunately there just aren’t that many games in international football. Using two years worth of games allows us to build a passable number of games for most major international teams.
library(lubridate) |
We don’t want to estimate strengths for teams with a really small number of games played, either. We only want teams with enough games played to get a decent estimate of their strength. For this reason, I’ve chosen to exclude teams with fewer than 20 games played.
teams <- |
## Number of teams: 80
## Number of games: 615
The model
Then, we have to fit the model. At the moment, regista implements the Dixon-Coles model. The original paper explaining the model is available online.
I recommend reading the paper to fully understand what’s going on under the hood. However, there are many online resources outlining the model available via search if the explanation provided doesn’t work for you.
The most important thing to understand about the model for now is that each team’s attack and defence are estimated separately (along with home advantage). This allows a variety of teams to be modelled more accurately.
For instance, Atlético Madrid and Tottenham Hotspur might be similarly good teams, but Atlético are more defensive (score fewer goals and concede fewer goals). The Dixon-Coles model allows these differences to be accounted for.
With that in mind, let’s fit the model for the international teams over the past 2 years.
We’re going to use the dixoncoles_ext
function to fit the model, which allows more flexible model specification with formulae. This is necessary because of the variable home advantage in this dataset. However, a simpler version (dixoncoles
) is available too.
# Use regista to fit the model |
Team strength
You can think of the resulting parameter estimates as the expected number of goals scored or conceded versus the average team in the dataset.
For example, we would expect a team with an attack parameter of 1.4 to score 1.4 goals (on average) versus the average defence (a defence parameter of 1).
So, what do the estimates look like?
The following plot shows the attack and defence estimates. Higher (top) attack parameters and lower (left) defence paameters are better.
estimates %>% |
There seems to be a rough correlation between attack and defence, which is encouraging; we broadly expect good attacking teams to be good defending teams as well. However, the output is still a bit noisy. Peru in particular stand out.
I think part of this is down to the data. International teams don’t play each other that often, and when they do, they tend to play more games against teams in the same region. Ideally we want more ‘cross-pollination’ when estimating team strength.
Home advantage
We’ve also estimated a home advantage parameter. This is a simple multiplicative factor on goalscoring. In other words, teams score hfa
times as many goals when playing at home:
hfa <- exp(res$par[["hfa"]]) |
## Home field advantage: ~1.33
Problems and improvements
There are a few issues that it’s worth highlighting here:
- Extra time and penalties
- The dataset used doesn’t discriminate between 90 minute games and 120 minute games. This is clearly an issue; model assumes all games are the same length.
- Time discounting
- Games in the distant past are clearly not as relevant to estimating team strength as games played yesterday. In fact, a method for incorporating this is included in the original Dixon-Coles paper. The ability to add custom weighting based on time (or other factors) is available with
regista::dixoncoles
’sweight
parameter. One application of the game weighting is demonstrated here.
- Games in the distant past are clearly not as relevant to estimating team strength as games played yesterday. In fact, a method for incorporating this is included in the original Dixon-Coles paper. The ability to add custom weighting based on time (or other factors) is available with
- Friendly weighting
- Friendly games should probably not be weighted as highly as competitive games. Again, this is perfectly feasible with regista, but has not been performed here for the sake of simplicity.
Wrap-up
I hope this has been interesting and/or useful. The regista package is in early development and will continue to improve. Any and all contributions are welcome, from bug reports to pull requests.
Thanks for reading.