PredictWise Calibrated Polling

Horse Race with PredictWise methodology

As you have seen elsewhere, PredictWise’s track record on horse race predictions is, to say the least, very good. We believe a few key cornerstones underlie successful predictions:

  • Our data collection, based on Random Device Engagement: we target Ad IDs on cell phones throughout the country and pick up respondents where they are organically (i.e., engaged in cell-phone applications). This also means we can reach the sample sizes necessary to produce granular estimates quickly, with coverage of more than 3 million unique respondents in the US and growing (fast)

  • Our ability to separate turnout from sentiment, i.e., the question of whom you will vote for conditional on turning out. We think the second question is easier to answer this year. The political environment heading into the midterms is highly nationalized: whether or not you turn out, you have likely made up your mind about whom to support by now. The fraction of voters who intend to vote Democratic but can still be persuaded to vote Republican late in the cycle is always small, and it is especially small this year. Treating turnout and directional vote conditional on turnout as separate lets us “focus” uncertainty on turnout (a minimal sketch of how the two pieces combine follows this list)

  • Our analytics, and the ability to parse out true swings from noise related to different people answering our poll at different times. You might remember this chart from the 2012 election cycle:

gallup.gif
  • This chart tells you more about what kind of respondents answered the poll each day than about who is leading in the horse race. The 7-point swing over a few days in April is almost certainly an effect of the former, not the latter. Our methodology corrects for that.
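
To make the turnout/sentiment separation above concrete, here is a minimal R sketch of how turnout probabilities and conditional vote choice combine into a top-line estimate, and why shifting turnout assumptions moves that estimate. All probabilities and the subgroup flag below are simulated stand-ins, not our production inputs.

```r
# Minimal sketch: combining turnout and conditional vote choice.
# All quantities are simulated; in practice both probabilities come
# from survey- and behavior-based models.
set.seed(1)
n <- 1000
young            <- runif(n) < 0.3                     # hypothetical subgroup flag
p_turnout        <- runif(n, 0.2, 0.95)                # P(respondent votes)
p_dem_given_vote <- runif(n, 0.3, 0.7) + 0.15 * young  # P(votes Democratic | votes)

# Expected Democratic share among likely voters: turnout probabilities
# act as weights on the conditional vote choice.
dem_share <- sum(p_turnout * p_dem_given_vote) / sum(p_turnout)

# "Focusing" uncertainty on turnout: depress turnout for one subgroup
# while holding conditional vote choice fixed, and observe the shift.
p_turnout_alt <- ifelse(young, 0.7 * p_turnout, p_turnout)
dem_share_alt <- sum(p_turnout_alt * p_dem_given_vote) / sum(p_turnout_alt)

round(c(baseline = dem_share, depressed_youth_turnout = dem_share_alt), 3)
```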

Introducing: PredictWise Calibrated Polling

But our predictions come with differing levels of accuracy. In our most accurate projection, the 2017 Montana special election, we did one thing differently. As a reminder, here is our timestamped prediction of the MT special election. The actual result was Gianforte 49.7, Quist 44.1.

Figure1-1.png

What did we do differently here? For one, we had a real-world outcome that we could calibrate against: the 2016 gubernatorial election. That allowed us to address uncertainty in the turnout space by calibrating it so that it reproduced the “true” margin of the 2016 gubernatorial race, assuming the partisan breakdown of the Montana electorate had not changed since 2016 (though, of course, absolute turnout numbers certainly changed; after all, 2017 was “just” a special Congressional election, not a presidential election).
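
As a stylized illustration of that calibration idea, the toy R snippet below solves for a single reweighting factor on one partisan group so that the implied margin matches a known result. The group shares, within-group margins, and target margin are made up, and the real calibration adjusts many demographic dimensions at once, as described below.

```r
# Stylized, one-dimensional version of the calibration idea:
# find a reweighting factor for one partisan group so that the
# modeled margin matches a known election result.
# All numbers are illustrative, not actual PredictWise inputs.
grp_share  <- c(dem = 0.38, rep = 0.44, ind = 0.18)   # modeled electorate mix
grp_margin <- c(dem = 0.90, rep = -0.85, ind = 0.05)  # D-minus-R margin within group
target_margin <- -0.04                                # hypothetical "true" margin

implied_margin <- function(w_rep) {
  w <- grp_share * c(1, w_rep, 1)   # scale only the Republican group
  w <- w / sum(w)                   # renormalize to a proper composition
  sum(w * grp_margin)
}

# uniroot() finds the scaling factor that reproduces the target margin
fit <- uniroot(function(x) implied_margin(x) - target_margin,
               interval = c(0.5, 2))
fit$root
```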

Now we have improved and formalized this method, which we dub PredictWise Calibrated Polling. And we have something uniquely good to calibrate against: early-vote data, updated in real time. Take Texas, for example: more than 5 million early votes have been cast in TX as of this morning. Of course, we know nothing about which candidate early voters voted for, but we have detailed demographic information taken from the voter file (though keep in mind that Texas does not keep partisan registration). And we have our national horse race model (generic ballot), built on more than 3 million responses from 200,000 survey respondents and well over 30 million behavioral data points. So, here is what we do:

  1. We match early-voting records back to the voter file and source the demographic information available. For demographics that are not kept on the voter file (think education, party), we use probability scores derived from large-scale survey models.

  2. We use the demographic information to predict the probability of voting Democratic, based on the coefficients from our Bayesian dynamic horse race model. Again, this model is built on more than 3 million responses from 200,000 survey respondents and well over 30 million behavioral data points. For demographics for which we only have probability scores, we use partial “random effect” coefficients weighted by the probability scores. For example, if we think a respondent is 60% likely to be African American and 40% likely to be White, we use 0.6*beta[race==Black] + 0.4*beta[race==White]. In reality, this is more complicated because our Bayesian model includes uncertainty, so the betas are really matrices of posterior draws, not vectors (a sketch of this weighting follows the list below). In essence, this gives us a pretty good idea of the partisan breakdown of early votes. Here, we have modeled out all early votes cast around the same time our survey was in the field, October 30-31: more than 4 million votes in total (to date, almost 5 million votes have been cast in Texas). We think the generic Republican candidate is (slightly) ahead among early voters.

TX1.png
tx_early_vote2.jpg
  3. We use this information on the early vote to calibrate our actual poll of the Senate race, Beto O’Rourke vs. Ted Cruz. Specifically, we included an item on early voting in the survey and asked early voters whom they voted for in the House of Representatives. This gives us a way to calibrate (read: readjust) the demographics of our estimated composition of the turnout space for early voters (remember: a lot of this is modeled as opposed to ground truth), such that our modeled early-vote margins match the early-vote margins we are getting from our survey respondents. In practice, this is complicated because we are calibrating all the demographic probability scores in the turnout space, derived from the voter file, at once. So, we are dealing with a multidimensional optimization problem. Thankfully, R offers pretty powerful off-the-shelf general-purpose optimization, based (in our case) on the Nelder-Mead algorithm (a toy version of this calibration step follows the list below).

  4. Finally, we apply these demographic adjustments (especially on partisan identity) to our likely-voter space in Texas. And this adjustment is doing a fair amount of work in our Beto-Cruz poll.
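
To make step 2 concrete, here is a minimal R sketch of the probability-weighted coefficients. The posterior draws, probability scores, and intercept are simulated stand-ins, not values from our model; the point is that carrying the full posterior turns the weighted coefficient into a vector of draws rather than a single number.

```r
# Sketch of step 2: probability-weighted "random effect" coefficients.
# beta_draws holds hypothetical posterior draws (rows) for the race
# random effects (columns) on the logit scale; p_race holds one voter's
# modeled race probabilities from the voter file. All values are made up.
set.seed(2)
n_draws <- 4000
beta_draws <- cbind(black = rnorm(n_draws,  1.4, 0.15),
                    white = rnorm(n_draws, -0.4, 0.10))
p_race <- c(black = 0.6, white = 0.4)

# Weighted coefficient per posterior draw: 0.6*beta[Black] + 0.4*beta[White].
beta_weighted <- beta_draws %*% p_race

# Turn a made-up linear predictor into P(vote Democratic) per draw,
# then summarize with a posterior mean and interval.
intercept <- 0.1
p_dem <- plogis(intercept + beta_weighted)
c(mean = mean(p_dem), quantile(p_dem, c(0.05, 0.95)))
```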
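
And here is a toy version of the calibration in step 3, using R’s built-in optim() with the Nelder-Mead method. The voter-file scores, coefficients, and survey margin below are simulated, and the objective is deliberately simplified (two scores, one target, plus a small penalty to keep the problem well posed); the production calibration adjusts many more demographic scores at once.

```r
# Sketch of step 3: calibrate voter-file probability scores so that the
# modeled early-vote margin matches the margin reported by survey
# respondents who said they already voted. All inputs are simulated.
set.seed(3)
n <- 5000
early <- data.frame(
  p_dem_id  = runif(n, 0.2, 0.8),   # modeled P(Democrat) from the voter file
  p_college = runif(n, 0.1, 0.9)    # modeled P(college degree)
)

# Modeled probability of a Democratic early vote given calibrated scores;
# shift is a vector of logit-scale adjustments, one per calibrated score.
p_dem_vote <- function(shift) {
  dem_id  <- plogis(qlogis(early$p_dem_id)  + shift[1])
  college <- plogis(qlogis(early$p_college) + shift[2])
  plogis(-0.2 + 2.5 * dem_id + 0.6 * college)   # made-up coefficients
}

survey_early_margin <- 0.01   # hypothetical D-minus-R margin among survey early voters

# Objective: squared distance between modeled and surveyed early-vote
# margins, plus a small penalty that keeps the adjustments near zero.
objective <- function(shift) {
  margin <- 2 * mean(p_dem_vote(shift)) - 1
  (margin - survey_early_margin)^2 + 0.001 * sum(shift^2)
}

fit <- optim(par = c(0, 0), fn = objective, method = "Nelder-Mead")
fit$par   # calibrated logit-scale adjustments to the probability scores
```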

PredictWise Calibrated Polling, TX, October 30-31, 2018: Beto Down by Two

Applying this methodology to our polling in Texas, capturing 1,000 likely voters between October 30 and October 31, we have Republican Ted Cruz up by less than 2 percentage points (to be more precise, 1.2 percentage points). We will likely poll Texas again starting tomorrow. Stay tuned for more!

TX3A-If-the-election-for-the-U.S.-House-of-Representatives-was-today-who-would-you-vote-for3F.png

When it comes to the racial breakdown, Cruz dominates among Whites, though not by as much as in some other polling that breaks estimates down by race in TX. On the other hand, Beto dominates among African Americans and, especially, Hispanics.

Results broken down by age are a bit more surprising. Sure, Cruz leads the 55+ age category comfortably, as expected, but 18-24-year-olds, by our estimates, will not be the most pro-Beto age bracket. In fact, his support is strongest among 25-34-year-olds (58%), followed by 35-44-year-olds (52%).

Tobias Konitzer

Tobi co-founded PredictWise out of a desire to bring disparate streams of data and machine learning together to help build bespoke audiences for targeting in the progressive ecosystem.
