
The 2018 FIFA World Cup started Thursday, pitting Russia (the host country) against Saudi Arabia in the tournament’s opening match. On the whole, most bookmakers are picking Brazil as the favorite to win the World Cup outright. One model that combines most of the major bookmakers’ odds, the so-called “bookmakers’ consensus model,” puts Brazil’s chances at 16.6 percent, followed closely by Germany at 15.8 percent, Spain at 12.5 percent, and France at 12.1 percent. These days, bookmakers use a wide variety of statistical techniques and qualitative expertise to pin down the most realistic odds for each team.
Those odds are established by human statisticians and experts, who in recent years have slowly begun to work artificial intelligence systems into their analytical techniques to support the numbers they settle on. Much of our data-crunching is now done by machines, which raises the question: Why don’t we put machines in charge of generating odds for sports matches? After all, even if the systems aren’t perfect, it’s a low-risk endeavor that can only improve through continuous real-world testing.
There are quite a few groups spearheading this approach. One in particular, a team of scientists from Germany and Belgium, just released World Cup odds derived from a machine learning system they created and tested, and their conclusions put them at, well, odds with most bookmakers around the world.
“The collaboration really is a joint effort by putting our respective ideas together: a ‘Belgian idea’ and a ‘German idea,’” says Christophe Ley, a researcher based at Ghent University and the coauthor of a new paper that details the work behind this new predictive machine learning system. The two German researchers behind the system have worked in the past to develop models that could predict big soccer tournaments.
But those, like most other conventional models, have limitations of one kind or another. One solution is to use an approach called random forest, which is what brought the Belgian researchers into the project. A random forest is a machine learning method that’s basically an extension of decision trees, in which an outcome is predicted step by step: at each “branch,” the tree checks one feature of the data and follows the matching path. Conventional decision tree approaches are plagued by a big problem: overfitting, whereby the model fits the data set it was trained on a little too closely and struggles to adjust to additional data or unusual circumstances. The random forest approach attempts to correct this issue by building many trees at once, each from a random sample of the data and a random selection of features, instead of relying on a single tree. The final predicted outcomes are essentially the average of thousands and thousands of decision trees constructed randomly in this way.
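To make the averaging idea concrete, here is a minimal sketch in plain Python. It is not the researchers’ system: each “tree” is cut down to a single split, and the match data, the two feature names (ability gap and ranking gap), and the function names are all made up for illustration. What it does show is the core recipe of drawing a random sample of the data, picking a random feature, fitting a small tree, and averaging the predictions of many such trees.

```python
import random

def fit_stump(rows, targets, feature):
    """Fit a one-split tree on a single feature by minimizing squared error."""
    best = None
    values = sorted({row[feature] for row in rows})
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [t for r, t in zip(rows, targets) if r[feature] <= threshold]
        right = [t for r, t in zip(rows, targets) if r[feature] > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:
        # Every sampled row had the same value: fall back to the sample mean.
        mean = sum(targets) / len(targets)
        return (feature, float("inf"), mean, mean)
    return (feature, best[1], best[2], best[3])

def fit_forest(rows, targets, n_trees=200, seed=0):
    """Train many stumps, each on a bootstrap sample and a random feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        picks = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        feature = rng.randrange(len(rows[0]))             # random feature choice
        forest.append(fit_stump([rows[i] for i in picks],
                                [targets[i] for i in picks], feature))
    return forest

def predict(forest, row):
    """Average the predictions of all trees in the forest."""
    votes = [left if row[feature] <= threshold else right
             for feature, threshold, left, right in forest]
    return sum(votes) / len(votes)

# Made-up training data: [ability gap, ranking gap] -> goals scored.
rows = [[-2.0, -30], [-1.5, -20], [-1.0, -12], [-0.5, -5], [0.0, 0],
        [0.5, 4], [1.0, 10], [1.5, 22], [2.0, 35]]
targets = [0, 0, 1, 1, 1, 2, 2, 3, 3]

forest = fit_forest(rows, targets)
strong_side = predict(forest, [2.0, 35])
weak_side = predict(forest, [-2.0, -30])
print(round(strong_side, 2), round(weak_side, 2))
```

No single one-split tree is much of a predictor, but the average over a couple hundred of them gives a smoother estimate than any one tree alone, which is the effect the random forest relies on.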
Besides correcting for overfitting, the random forest approach can also highlight which factors play the biggest role in the outcome being predicted (in the case of the FIFA World Cup, the outcome of each match). In Ley and his colleagues’ new paper, those factors include each country’s GDP per capita, population size, and FIFA rank. The relevant team statistics include the number of players who were in the Champions League and Europa League semifinals, the largest and second-largest number of players who play for the same club, the age structure of the players, the age and tenure of the coach, and a few more factors. According to Ley, the estimated ability of each team, which is statistically derived, ends up being the most important factor, and for that reason they call the approach random-forests-with-abilities. “Random forests alone, without the abilities, cannot beat the bookmakers,” says Ley.
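One common way to ask a model which factors matter most is permutation importance: shuffle one factor across the matches and see how much worse the predictions get. The sketch below illustrates that idea only. The article does not say which importance measure the researchers used, and the `toy_model` function, its coefficients, and the data are hypothetical stand-ins for a trained forest.

```python
import random

FEATURES = ["ability", "fifa_rank", "gdp_per_capita"]  # hypothetical names

def toy_model(row):
    """Hand-written stand-in for a trained forest; it ignores GDP entirely."""
    ability, fifa_rank, _gdp = row
    return 1.0 + 0.8 * ability - 0.01 * fifa_rank

def mean_squared_error(model, rows, targets):
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)

def permutation_importance(model, rows, targets, column, n_repeats=50, seed=0):
    """Average rise in error when one column is shuffled across the rows."""
    rng = random.Random(seed)
    baseline = mean_squared_error(model, rows, targets)
    increases = []
    for _ in range(n_repeats):
        shuffled = [r[column] for r in rows]
        rng.shuffle(shuffled)
        permuted = [r[:column] + [value] + r[column + 1:]
                    for r, value in zip(rows, shuffled)]
        increases.append(mean_squared_error(model, permuted, targets) - baseline)
    return sum(increases) / n_repeats

# Made-up teams: [ability, FIFA rank, GDP per capita in thousands].
rows = [[1.5, 2, 40.0], [1.0, 8, 12.0], [0.5, 15, 55.0],
        [-0.5, 30, 9.0], [-1.0, 45, 30.0], [-1.5, 60, 20.0]]
targets = [toy_model(r) for r in rows]  # pretend the model fits perfectly

importance = {name: permutation_importance(toy_model, rows, targets, i)
              for i, name in enumerate(FEATURES)}
for name, score in sorted(importance.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.4f}")
```

In this toy setup, scrambling the ability column hurts the predictions most and scrambling GDP does nothing, because the stand-in model never looks at GDP. A real forest would be scored the same way, just with a trained model in place of `toy_model`.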
“Trying out our model on previous World Cups, the model has beaten the bookmakers, which is extremely rare,” he says. “Our model stands out, as it not only allows predictions like ‘win-draw-loss,’ but we can predict precise match outcomes.”
Okay, interesting enough, but the real question is, what team does random-forests-with-abilities predict will win? After 100,000 simulations of the 2018 tournament, the approach pegs Spain as the most likely winner, with a 17.8 percent probability of taking home the World Cup, followed by Germany with 17.1 percent, Brazil with 12.3 percent, France with 11.2 percent, Belgium with 10.4 percent, and all other teams below 8 percent.
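Turning match predictions into title odds by simulation works roughly like this: play out the whole bracket at random many times and count how often each team lifts the trophy. The sketch below uses a four-team knockout with invented teams and ability values, Poisson-distributed goals, and a coin flip to settle draws. All of those are simplifying assumptions for illustration, not the paper’s setup.

```python
import math
import random
from collections import Counter

# Invented ability scores for four placeholder teams.
ABILITIES = {"Team A": 1.0, "Team B": 0.6, "Team C": 0.2, "Team D": -0.4}

def sample_poisson(rng, rate):
    """Draw a Poisson-distributed goal count using Knuth's method."""
    limit = math.exp(-rate)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

def play_match(rng, first, second):
    """Simulate one knockout match and return the winner."""
    gap = ABILITIES[first] - ABILITIES[second]
    goals_first = sample_poisson(rng, math.exp(0.2 + 0.5 * gap))
    goals_second = sample_poisson(rng, math.exp(0.2 - 0.5 * gap))
    if goals_first == goals_second:
        # Assumption: a coin flip stands in for extra time and penalties.
        return rng.choice([first, second])
    return first if goals_first > goals_second else second

def simulate_tournament(rng):
    """Two semifinals followed by a final."""
    finalist_one = play_match(rng, "Team A", "Team D")
    finalist_two = play_match(rng, "Team B", "Team C")
    return play_match(rng, finalist_one, finalist_two)

def title_probabilities(n_runs=20000, seed=0):
    """Share of simulated tournaments won by each team."""
    rng = random.Random(seed)
    wins = Counter(simulate_tournament(rng) for _ in range(n_runs))
    return {team: wins[team] / n_runs for team in ABILITIES}

probabilities = title_probabilities()
for team, share in sorted(probabilities.items(), key=lambda kv: -kv[1]):
    print(f"{team}: {share:.1%}")
```

Because the shares come from counting simulated tournaments, they depend on the bracket as well as on raw strength: a strong team with a hard path can end up with a lower title share than its ability alone would suggest.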
But there’s a potential wrench that can upend this prediction: Germany. See, Germany is going to have quite a tough road ahead once it makes it out of the group stage: the random forest approach gives it only a 58 percent chance of reaching the quarter-finals if it gets out of its group (Spain, meanwhile, has a 73 percent chance of getting to the quarter-finals). But if Germany does make it to the round of 16 (the first knockout round), the two teams eat into each other’s odds of winning. Essentially, Spain is favored at the beginning of the tournament, but if Germany makes it to the quarter-finals, it becomes the favorite.
The research team has already received quite a bit of buzz over its paper. All that’s left now is to see whether the random-forests-with-abilities predictions hold true. “We never expected this and are happily overwhelmed,” says Ley. “Now we hope even stronger that our predictions are good!” If not, we might have to go back to using psychic octopuses to have our soccer fortunes read. Or, if things pan out well enough, psychic cats.