Forecasting the New Hampshire Republican Primary with LLMs
By Wojciech Gryc on January 24, 2024
Introduction
The vision behind Emerging Trajectories is that a semi-competent analyst with access to all of the
world's information — polling data, articles, models, and more — should be able to generate better
forecasts than experts, superforecasters, or professional analysts.
As part of this exploration, we've built a forecasting platform to log LLM-powered forecasts and track real-world
events. Our goal was to see how well LLMs could do in forecasting the results of an upcoming election or event.
Our Experiment
Emerging Trajectories is a relatively new project, so bear with us! We began forecasting via ChatGPT and GPT-4
(gpt-4-1106-preview) on January 17 and 18, respectively. We generated forecasts every day around 7pm ET. We
specifically asked both models to predict (a) the proportion of votes to be cast for Trump, (b) the proportion of
votes to be cast for Haley, and (c) the difference between the two.
A few important points:
- Both models had access to the Internet. ChatGPT used GPT-4 powered by Bing, while GPT-4 used PhaseLLM's
web search agent, which performed a Google search prior to generating results.
- We kept the prompts identical from day to day; there was no user intervention in the workflow.
- We asked the LLMs to justify their predictions, in the hopes of understanding (a) what data they were using
and (b) what insights we could generate from these results.
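To make the setup concrete, here is a minimal Python sketch of one day's forecasting step: assembling a prompt from search snippets and parsing the numeric forecast out of the model's free-text answer. The function names, prompt wording, and answer format are all hypothetical — the actual prompts and parsing used in the experiment are not shown in this post.

```python
import re

def build_prompt(snippets):
    """Assemble the daily forecasting prompt from web-search snippets.

    Hypothetical structure; the real prompt used in the experiment
    is not published.
    """
    context = "\n".join(f"- {s}" for s in snippets)
    return (
        "You are forecasting the 2024 New Hampshire Republican primary.\n"
        f"Recent search results:\n{context}\n"
        "Predict: (a) the proportion of votes for Trump, (b) the proportion "
        "of votes for Haley, and (c) the difference between the two. "
        "Justify your prediction.\n"
        "Answer with lines like 'Trump: 52%' and 'Haley: 38%'."
    )

def parse_forecast(response):
    """Extract numeric forecasts from the model's free-text answer."""
    forecast = {}
    for name in ("Trump", "Haley"):
        match = re.search(rf"{name}\s*:\s*([\d.]+)\s*%", response)
        if match:
            forecast[name] = float(match.group(1))
    if "Trump" in forecast and "Haley" in forecast:
        forecast["difference"] = forecast["Trump"] - forecast["Haley"]
    return forecast
```

In the real pipeline, `build_prompt` would receive the day's Google (or Bing) results and the prompt would be sent to the model once per day, around 7pm ET.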
Results
This was an interesting week, with Ron DeSantis dropping out of the
race on January 21. Nikki Haley was also gathering momentum over the course of the week.
The results for both Trump and Haley are shown below, respectively.
[Figure: Forecasts for Trump's % of Vote]
[Figure: Forecasts for Haley's % of Vote]
Analysis
Interestingly, our approach using Google Search results and PhaseLLM/GPT-4 performed better than ChatGPT.
While we don't know what's happening under the ChatGPT hood, so to speak, we can analyze the responses from ChatGPT to
understand a bit more about what's going on.
- ChatGPT focused on polling numbers, while PhaseLLM/GPT-4 explored broader topics. This is seen in the
general text responses from both models. ChatGPT used Bing to find polling results and summarized them, while
PhaseLLM/GPT-4 discussed momentum, news, developments, and other data in addition to polling numbers.
- Google versus Bing. ChatGPT used Bing search results while PhaseLLM/GPT-4 used Google's search API. As a
result, differences in data could have yielded the different results above. This is particularly important to note
given the importance of recency in content updates — even a 12-hour difference in content being updated on
Google versus Bing would likely result in vastly different forecasts, given the developments in the primary race
itself.
- Neither model took 'current events' into account. Notice that the predictions for Trump and Haley didn't
add up to 100%, but were closer to 90%. An analyst would likely have explicitly updated their results to account for
Ron DeSantis's departure. This sort of logic is incredibly difficult for LLMs to do.
- Summarizing analysis versus doing analysis. Looking at the sites that were referenced by the LLMs,
it's clear that those sites do a lot of analysis and make their own projections. What's unclear is how much of the
LLMs' results are driven by any sort of internal thinking versus simply summarizing the sites they are provided.
Running experiments to test how much of the forecast is coming from the LLM, versus from the content it's provided,
would be useful.
- Updates based on new information. Forecasts were updated once per day. A better approach would be to update
forecasts if we receive any new information — this might mean running forecasts multiple times per day, or
waiting for several days in the case of longer-term forecasts.
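On the 'current events' point above, the analyst-style fix is simple arithmetic: rescale the remaining candidates' shares so that, together with a residual 'other' bucket, they sum to 100%. Here is a hedged sketch of that renormalization; this is an illustrative helper, not a step either model actually performed.

```python
def renormalize(forecast, other_share=0.0):
    """Rescale candidate shares so they plus a residual 'other' share
    sum to 100%, as an analyst might do after a candidate drops out.

    Illustrative only; neither ChatGPT nor PhaseLLM/GPT-4 did this.
    """
    total = sum(forecast.values())
    if total <= 0:
        raise ValueError("forecast shares must be positive")
    scale = (100.0 - other_share) / total
    return {name: round(share * scale, 1) for name, share in forecast.items()}
```

For example, a raw forecast of Trump 52% and Haley 38% (summing to 90%), with 2% reserved for write-ins and minor candidates, would be rescaled so the three pieces sum to 100%.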
Conclusion and Next Steps
Both ChatGPT and PhaseLLM/GPT-4 were able to make forecasts regularly and incorporated the information provided
in their prompts to generate predictions. That information was, however, very limited — content from top-ranking
search results, and nothing more. It's very possible that the results of this forecasting process would have been much
better with information from specific news outlets, deeper RAG-based fact extraction from content, or even
external world models to help validate and guide the LLMs' conclusions.
We will be running more thorough experiments around other upcoming events in the next few weeks. Stay tuned as we
develop a more thorough framework and modeling approach!
Questions?
If you have any questions or if you want to get involved, please email us at hello --at-- phaseai --dot-- com.