PredictWise Blog

I now have the full World Cup probabilities listed. This includes the likely outcome of every game and the likelihood of every team making it to every round. The predictions are as accurate as possible, based on historical correlations in both the World Cup specifically and my methods in general, and answer the question that most stakeholders ultimately care about: who is going to win. The predictions update every few minutes, allowing us to examine the impact of events during a game, and of early games, on later games. There are many other predictions to follow, but I am confident that my accuracy and continuous updating make my predictions the most useful to interested stakeholders.

Two predictions that have been forwarded to me several times are by Bloomberg and Goldman Sachs. First, let me concede that, while I try my best, these two strictly dominate me in style; these are really pretty reports! But I do beat them as far as being a useful prediction.

Both are pure fundamental models. Bloomberg does not provide much detail about its model, but Goldman uses most of the same variables I use in my fundamental model: Elo rankings, goals for, goals against, a dummy for type of match, home field, and home continent. There are a few small differences: I include friendly matches, I do not include home continent, etc. All of this is going to add up to noise for the predictions (i.e., not really make much of a difference). In short, both of our methods are completely sound.

The problem with pure fundamental models is that even the best fundamental models are lacking because the World Cup is an event held just once every four years without any regular season: there is a lot of idiosyncrasy in the event which is hard to capture in historical datasets. Goldman states, “To be clear, our model does not use any information on the quality of team or individual players that is not reflected in a team’s track record. For example, if a key player who was responsible for a team’s recent successes is injured, this will have no bearing on our predictions.” While this can be corrected for, to an extent, with individual-level data, they are referring to the idiosyncratic data that is included in the prediction market data. Thus, my prediction market-based predictions are going to be more accurate and updatable as the event progresses.

Goldman goes on to say that “There is no role for human judgment as the approach is purely statistical.” They should be applauded for that, but chided for not recognizing that the data exists; they do not need to add human judgment to note the effect of an injury.

This point is reflected well in the scatterplot of my forecasts for the 32 teams in reaching the round of 16 and the round of 8. If the predictions were the same, they would run on the 45 degree diagonal; by definition the average prediction is 50% for reaching the round of 16 and 25% for reaching the round of 8. Notice that predictions from Bloomberg and Goldman are much flatter than mine: favorites are less favored and underdogs are more favored. This makes sense: it means we are all well calibrated, in that the fundamental-based models accept that they have less information and more uncertainty.

There are two further peculiarities about the Goldman report: they compare their predictions to a broker, not a market, and they have Brazil as 48.5% likely to win. First, given that they are one of the leading investment banks in the world, I am surprised they would compare their probabilities to the bid price of a broker. Ladbrokes needs to make a profit and they do so by selling their predictions for more than they are worth. To guarantee a $1 return you would need to invest $1.18 to buy all of the teams to win. Second, the very, very under-identified home field and continent advantage is what drives their prediction for Brazil to 48.5%. There are not that many World Cups, so it is hard to identify the true advantage of hosting one. It is similar to home state in the presidency, which is also poorly identified. Brazil is extremely likely to win; I have them at 23%. But would Goldman advise their clients to buy Brazil long at 48.5%?!?
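The $1.18 figure is what is usually called the bookmaker's overround. As a minimal sketch of why broker prices overstate probabilities, the snippet below rescales a set of prices so they sum to 1. The team prices are hypothetical, chosen only so that they add up to $1.18, and proportional rescaling is just one simple way to remove the margin; it is not necessarily how any of the forecasts discussed here are built.

```python
def remove_overround(prices):
    """Rescale bookmaker prices (cost of a $1 payout) so they sum to 1.

    Returns the rescaled probabilities and the original sum of prices.
    """
    total = sum(prices.values())
    return {team: price / total for team, price in prices.items()}, total


# Hypothetical prices for illustration only (not actual Ladbrokes quotes).
prices = {
    "Brazil": 0.27,
    "Argentina": 0.22,
    "Germany": 0.18,
    "Spain": 0.16,
    "Field": 0.35,  # every other team lumped together
}

probabilities, cost_of_all_teams = remove_overround(prices)
print(f"Cost to buy every team: ${cost_of_all_teams:.2f}")
for team, p in probabilities.items():
    print(f"{team}: price {prices[team]:.2f} -> probability {p:.3f}")
```

Every rescaled probability is lower than the quoted price, which is the sense in which the broker sells predictions for more than they are worth.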

With that context, let me rewrite the entire column in a different way: if Goldman Sachs had a model for the price of an asset (e.g., MSFT stock in a month or Colombia to win the World Cup), but something just happened that shifts the underlying value of that asset far away from the model (e.g., a new CEO for MSFT or an injury to Colombia’s star player), would Goldman advise their clients to value the asset at the model’s price or the price on the open market? I would go with the market price …

Full coverage of the World Cup at:

Team-by-Team Predictions

There is no new data here, but some new organization. I thought it would be interesting to view the tournament by team with all games and likelihoods in one table. I hope to add the likelihood of reaching any given round by Monday morning:

World Cup at a Glance

Predicting the World Cup: a quick primer

Predicting the World Cup is not that much different from predicting other sports outcomes or even economic indicators or awards shows. First, we determine what the stakeholders want to know. For the World Cup, we determine that it is the likelihood of win, loss, or draw for either team in any game and the likelihood of any team advancing to any round (including winning the tournament); for reasons of tie-breakers and expediency we also consider goal differential. Second, as always, we ensure that these forecasts update as the games progress. Finally, we always consider the same set of data to ensure accuracy.

In the course of our regular forecasting we always review four different data types: fundamental data, online and social media, prediction markets, and polls of experts. Online and social media data are not significant for the World Cup at this point. This type of data clearly provides value in understanding the support and interest of people from around the world, but lacking historical context, it is impossible to identify whether it has any predictive power relative to more traditional data. And, while polls of experts can be useful in predicting sports, we are going to keep things simple and transparent for this World Cup and focus on fundamental data and prediction markets.

I am going to walk through the fundamental data at some length before describing the prediction market data quickly. That is because the fundamental data is much more interesting and the prediction market data is the same as it always is, in all domains. But the prediction market data is a lot more predictive than the fundamental data and, despite my fun in running the fundamental data, it forms the basis of all of the forecasts we are going to generate.

Using fundamental data to predict how teams will do across a season, or in an upcoming game, is a relatively stable task across major sports. The key fundamental variables are always the same: scoring differential, home and away, and wins/losses in the past season. Of course, different sports count scores in different ways (e.g., American football has scores that range from 1 point to 6 points) and count wins/losses differently as well (e.g., soccer has outcomes that range from 0 to 3 points). Generally, home and away (and strength of schedule) are balanced, but that too is not always the case (e.g., baseball loads the schedule heavily with teams in the same division); home field is a huge advantage in soccer (e.g., in a fun example, this article notes that injury time in Spain heavily favors the home team). That being said, give me the scoring differentials of each team from the previous year, their schedule including home and away, and their final outcome in wins/losses and I can predict both season and game-by-game outcomes with precision.

We can improve upon this baseline prediction in several ways: account for shifts in personnel and factor out luck. All of the major sports now have models of wins or points above replacement; an idea that was generated out of baseball’s sabermetric community. This metric describes how valuable a certain player is compared with a baseline player in his/her position. There is still some debate on this metric and it varies a lot by sport, but a reasonable version of it will allow a researcher to get pretty close to quantifying the impact of a substitution of one player for another. Further research has examined the role of luck in the wins/points of any team in a given year to factor out what was in the control of the players and what was either lucky or unlucky.

Soccer is a standard case, as I just described: goal differential, home/away, and the previous year’s points will get you pretty far in predicting future outcomes. Add in wins over replacement for changes in the team, factor in luck, and you can be as good as anything.

Playoff predictions are just a compilation of game-by-game predictions, using the current regular season’s data. There are two small quirks to consider: effort and playoff design. First, in certain sports there are definable times when teams are not at full strength or maximum effort, such as late in the season for teams with nothing to play for; in those situations we need to account for this differential effort. Second, compiling the likelihood of a team advancing in any given round depends on the design of the playoffs. Single eliminations are straightforward applications of the likelihood-to-win-a-game formula between two teams, but best of seven series and round robins have their quirks (e.g., NBA teams are more likely to win game 2 if they lost game 1 than if they won game 1).
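To make the playoff-design point concrete, here is a minimal sketch of how a single-game win probability compiles into a series win probability. It assumes a constant per-game probability and independent games, which is exactly the simplification the NBA example above warns about, so treat it as a baseline rather than a finished model. The function name is my own.

```python
from math import comb


def series_win_probability(p, wins_needed=4):
    """Probability of winning a first-to-`wins_needed` series.

    Assumes each game is won with the same probability `p`, independently
    of the other games. wins_needed=1 is a single-elimination game and
    wins_needed=4 is a best of seven.
    """
    q = 1.0 - p
    # The series is won on the clinching game after exactly k losses,
    # where k can be 0 .. wins_needed - 1.
    return sum(
        comb(wins_needed - 1 + k, k) * p**wins_needed * q**k
        for k in range(wins_needed)
    )


for p in (0.50, 0.55, 0.60):
    print(f"per-game {p:.2f} -> single game {series_win_probability(p, 1):.3f}, "
          f"best of seven {series_win_probability(p, 4):.3f}")
```

The longer the series, the more a modest per-game edge is amplified, which is why a single low-scoring soccer game leaves so much more room for upsets than a best of seven.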

In short, major team sports from around the world are all pretty similar in predicting regular season and playoff success, but the World Cup has one crazy quirk: it has no regular season. There are directly comparable variables, but they are noisier (i.e., much less precise). Countries compete in three types of matches with other countries on a semi-regular basis between World Cups: friendly matches, regional tournaments, and World Cup qualification tournaments. All of the games combined are a fraction of what a regular season is in most leagues.

These games provide similar data to what we normally have: there is a goal differential, there is home/away, and, in lieu of past season wins/points, we have world rankings (compiled by FIFA based on a team’s performance in the last four years) and Elo rankings (which are based on head-to-head matches). Unlike a regular season where the choice of opponents and location is balanced (or the choice set is transparent), the schedule of any team is endogenously chosen by the countries to maximize the return for their team, and more wins in a tournament means more games against better teams. Also, there are major personnel changes over any four year period, especially with players going in and out for friendlies and lesser tournaments.

Specifically we start with the following:

1) Average goal differential broken up by home/away/neutral, and friendly/tournament/World Cup qualifier. The friendly/tournament/World Cup qualifier split lets us examine the predictive power of games that are likely to have lower effort and more variable personnel.

2) World ranking and Elo score act as the equivalent of points/wins from previous years, and the Elo score absorbs the strength of the schedule a team has played.

We take this data for past World Cup cycles and regress this on all of the World Cup games to get coefficients for the various variables. We can then plug in the 2014 data to get baseline forecasts for any given game going into the World Cup, both goal differential and likelihood of win, loss, or draw in any game.

The differences in goal differential swamp the rankings in predicting both goal differential and probability of victory in any game. This is not surprising, as these rankings are just reflections of win/loss/draw (slightly adjusted for strength of opponent), which is trumped by goal differential. Further, the away games are slightly more predictive than home games, which is not surprising, as there is just one home team in the World Cup.

Yet, these predictions for the World Cup games are a lot less precise than the predictions for a regular season or playoff soccer game. With all of the idiosyncratic variables of a World Cup, where teams with no regular season play at neutral sites, the fundamental data is going to provide forecasts of scores with larger margins of error and probability of victories that tend more towards toss-up than we would normally produce.

That is where prediction market data comes into play; it does its best when there is idiosyncratic data to incorporate. Prediction markets buy and sell contracts that are, canonically, worth $1 if true and $0 if not. Thus, the price on a contract for Brazil to win the World Cup or any particular game is highly predictive of the probability of the outcome occurring. Massive amounts of historical data help us translate raw prediction market prices into very precise probabilities of outcomes; this is especially true in the World Cup, where the prediction markets have very robust action on all games.

Armed with fundamental data and prediction market-based forecasts for every game, we jump into the actual World Cup action. The tournament setting for the first round is a round robin with four teams playing three games each for a total of six games. After that there is a standard 16 team single elimination tournament where the winner of a paired group plays the second place of the other paired group (e.g., the winner of group A plays the second of group B and the winner of B plays the second of A.)

The easiest way to think about the round robin is that there are 729 possible outcomes in a six game round robin (3 outcomes over 6 games is 3^6). Assuming independence between games (that the outcome of one game does not affect the outcome of another) we can easily determine the likelihood of any of the 729 possible outcomes from the likelihood of any of the three outcomes of the six games.
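A minimal sketch of that enumeration follows. The team labels and the win/draw/loss probabilities are hypothetical placeholders, and ties on points are split evenly among the tied teams, which is my own simplification standing in for the real goal differential tie-breakers.

```python
from itertools import product

TEAMS = ["A", "B", "C", "D"]

# Hypothetical (win, draw, loss) probabilities for the first-named team.
GAMES = {
    ("A", "B"): (0.50, 0.30, 0.20),
    ("C", "D"): (0.35, 0.30, 0.35),
    ("A", "C"): (0.70, 0.20, 0.10),
    ("B", "D"): (0.55, 0.25, 0.20),
    ("A", "D"): (0.70, 0.20, 0.10),
    ("B", "C"): (0.60, 0.22, 0.18),
}


def advance_probabilities(games, teams, slots=2):
    """Enumerate all 3^n outcomes, assuming the games are independent."""
    advance = {team: 0.0 for team in teams}
    matchups = list(games)
    # Result codes: 0 = first team wins, 1 = draw, 2 = first team loses.
    for results in product(range(3), repeat=len(matchups)):
        prob = 1.0
        points = {team: 0 for team in teams}
        for (first, second), result in zip(matchups, results):
            prob *= games[(first, second)][result]
            if result == 0:
                points[first] += 3
            elif result == 1:
                points[first] += 1
                points[second] += 1
            else:
                points[second] += 3
        for team in teams:
            above = sum(points[o] > points[team] for o in teams)
            tied = sum(points[o] == points[team] for o in teams)
            # Slots left for this team's tied group, shared evenly.
            available = max(0, min(slots - above, tied))
            advance[team] += prob * available / tied
    return advance


for team, p in advance_probabilities(GAMES, TEAMS).items():
    print(f"Team {team}: {p:.1%} to advance")
```

Because exactly two teams advance in every one of the 729 scenarios, the four advancement probabilities always sum to 2, which is a handy sanity check.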

At that point we have the second round set, with certain probability, and can determine the likely winners between potential second round teams, and so forth. This provides both the likely outcome in any game and the likelihood of any team reaching any given round.

Of course, independence is not necessarily the correct choice for the World Cup; early games in the round robin affect later games in the round robin. I already noted that in the NBA some teams are more likely to win after a loss (due to either increased effort or referees’ calls). The opposite effect would be that we may learn that a team is better than we thought ex-ante due to them winning an earlier game. In the NBA they play 82 regular season games, so we do not learn much if they happen to win a game in the playoffs, but in the World Cup they play 0 regular season games, so we learn a lot when they win a game. Thus, the consensus in our data is that we should slightly update teams after they win in the group stage. This is not significant in the later rounds, where all teams are winners, but it is in the round robin.

Prediction markets shine when there is a lot of idiosyncratic data making imprecise fundamental predictions. That is when we need the wisdom of the crowd to quantify the likely outcome. Thus, while we work through both the fundamental data and prediction market-based forecast, we put the weight of our prediction on the prediction market data.

Check out all of our World Cup coverage at:

Conditional Probability and the NBA

The San Antonio Spurs are 48% likely, and the Miami Heat are 40% likely, to win the NBA championship. But the Heat, at 1-1 in the Eastern Conference finals, are just 77% likely to make the NBA finals, and the Spurs, at 2-0 in the Western Conference finals, are 92% likely to make it to the NBA finals. What does that mean if the two teams make it past the Indiana Pacers and Oklahoma City Thunder respectively? They will likely enter the finals with the Heat as the slightest of favorites.

The probability that the Heat win the NBA finals, should they make it, is derived by taking their likelihood of winning and dividing by their likelihood of making it. That makes them 51% likely to win the finals, conditional on making it. The same math makes the Spurs 52% likely to win the finals, conditional on making it. But there is an 8% chance the Heat face the Thunder and a 23% chance the Spurs face the Pacers.

The Pacers are extremely unlikely to win, should they make the finals, while the Thunder are slightly more likely than the Spurs to win, should they make the finals. Thus the Spurs 52% likelihood is inflated relative to the likelihood of them winning against the Heat and the Heat’s 51% is deflated relative to the likelihood of them winning against the Spurs.
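The reasoning above can be sketched in a few lines. The title and finals probabilities are the rounded figures quoted in this post, so the conditionals may differ by a point from the ones stated above. The 75% chance of the Spurs beating the Pacers is a purely hypothetical input, and the sketch assumes the two conference finals are independent of each other.

```python
# Rounded figures quoted in the post.
p_title = {"Spurs": 0.48, "Heat": 0.40}
p_finals = {"Spurs": 0.92, "Heat": 0.77}

# P(win title | make finals) = P(win title) / P(make finals)
conditional = {team: p_title[team] / p_finals[team] for team in p_title}


def implied_head_to_head(p_cond, p_main_opponent, p_vs_other):
    """Back out P(beat the main opponent) from a blended conditional.

    p_cond = p_main_opponent * x + (1 - p_main_opponent) * p_vs_other,
    solved for x. Assumes the other conference final is independent.
    """
    p_other_opponent = 1.0 - p_main_opponent
    return (p_cond - p_other_opponent * p_vs_other) / p_main_opponent


# ASSUMPTION: the Spurs would beat the Pacers 75% of the time (hypothetical).
spurs_vs_heat = implied_head_to_head(
    conditional["Spurs"], p_finals["Heat"], p_vs_other=0.75
)

for team, p in conditional.items():
    print(f"{team}: {p:.1%} to win the finals, conditional on making it")
print(f"Implied Spurs over Heat: {spurs_vs_heat:.1%}")
print(f"Implied Heat over Spurs: {1 - spurs_vs_heat:.1%}")
```

Any assumed Spurs-over-Pacers probability above the blended conditional pushes the implied Spurs-over-Heat number below it, which is the sense in which the Spurs’ headline figure is inflated.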

The estimate from the current numbers is that the Heat will be the slightest of favorites if they play the Spurs in the NBA finals; this is true despite the Spurs having home court advantage. But, the NBA plays a 2 home-3 away-2 home schedule in the finals, which is not as favorable as the 2-2-1-1-1 played in the previous three rounds.

Follow my NBA coverage at:

World Cup: US's Group Stage

The United States is in a group with Germany, Portugal, and Ghana. The two of four teams with the most points after a round robin will advance to the second round. I am giving the US about 25% to advance out of the group stage and this seemed high to some of my readers. This initial reaction is not surprising when you consider that Germany and Portugal are ranked two and three respectively in the FIFA World Rankings and Ghana beat the US in two straight World Cups. But, the numbers make sense.

This 25% is actually a very complicated calculation and it starts with the six individual games that will be played in Group G’s round robin. A win gets a team 3 points and a draw 1 point.

Germany and Portugal are heavy favorites in their games, but this is not American football or a best of seven series in baseball, hockey, or basketball; single low scoring games leave open reasonable probabilities of upsets or draws. Thinking about these games independently, Germany is between 65 and 75% to beat Ghana and the US, while Portugal is between 55 and 60% to beat Ghana and the US. But this is soccer, where draws happen: Germany is about 20% to draw Ghana and the US, while Portugal is between 20 and 25% to draw Ghana and the US.

There are six games with three possible outcomes each, leaving a total of 729 possible overall outcomes after the games are played. Knowing the independent likelihood of any of these three outcomes for the six games, I can compute the probability of any of the 729 overall outcomes and which teams would qualify in any of them. With this independence assumption, the US and Ghana are both a little over 25% to qualify with Portugal at about 65% and Germany at about 85%.

Thinking about the progression of games, the first set of games is the US versus Ghana and Portugal versus Germany. Germany is about 50% likely to win, Portugal is about 20% to win, and there is a 30% likelihood of a draw. The US is about 33% likely to win versus Ghana.

1) If the US loses against Ghana they have a negligible chance of advancing.

2) If the US draws against Ghana they are about 15% to advance.

3) But there is a 33% likelihood they beat Ghana and, if they do, they are about 50% likely to advance! Two more points will guarantee they advance and one more point puts them at just over 50% to advance. No team with five or more points has failed to advance (think about it; that means they are either 3-0-0, 2-0-1, or 1-0-2) and a little over half the teams with four points advance (they are 1-1-1). The US will play Portugal then Germany. Let’s do the math:

Guaranteed to Advance (36%): win and win (2.7%), win and draw (4.7%), draw and win (2.7%), draw and draw (4.7%), win and loss (13.9%), or loss and win (7.4%). A win and win, win and draw, or draw and win would also be enough to advance if they tied Ghana. A win and loss, or a loss and win, would give them about a 50% likelihood of advancing if they tied Ghana.

Over 50% to Advance (26.5%): draw and loss (13.8%) or loss and draw (12.7%).

Guaranteed to not-Advance (37.5%): loss and loss (37.5%).
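The three buckets above can be reproduced with a short enumeration. The single-game probabilities below are not stated anywhere in this post; they are values backed out from the joint percentages listed above under the independence assumption, so treat them as approximate. The 50% advancement rate for four-point teams is likewise a rounded stand-in for "a little over half."

```python
from itertools import product

POINTS = {"win": 3, "draw": 1, "loss": 0}

# ASSUMPTION: US single-game probabilities implied by the joint figures above.
vs_portugal = {"win": 0.213, "draw": 0.212, "loss": 0.575}
vs_germany = {"win": 0.128, "draw": 0.221, "loss": 0.651}

POINTS_FROM_GHANA = 3  # conditional on the US beating Ghana

buckets = {"guaranteed": 0.0, "about_even": 0.0, "out": 0.0}
for first, second in product(POINTS, repeat=2):
    prob = vs_portugal[first] * vs_germany[second]
    total = POINTS_FROM_GHANA + POINTS[first] + POINTS[second]
    if total >= 5:
        buckets["guaranteed"] += prob
    elif total == 4:
        buckets["about_even"] += prob
    else:
        buckets["out"] += prob

# ASSUMPTION: four-point teams advance about half the time.
p_advance_given_win = buckets["guaranteed"] + 0.5 * buckets["about_even"]

for name, prob in buckets.items():
    print(f"{name}: {prob:.1%}")
print(f"P(advance | beat Ghana) is roughly {p_advance_given_win:.1%}")
```

The result lands close to the "about 50%" quoted in point 3, which is a useful check that the game-by-game figures and the headline number agree.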

I will talk more about possible deviation from independence in future blog posts (i.e., how I expect my game-by-game predictions to shift as the earlier games unfold).

Check out all of our World Cup coverage at: