By Wojciech Gryc on April 15, 2024
Over the course of March 2024, we pitted large language models against each other to see which ones could summarize facts and information most productively, and thus predict the CPI figure for March 2024 most accurately. Llama 2, GPT-4, Claude 2.1, and Gemini 1.0 Pro each competed, updating their forecasts whenever new information was found via search engines. The Bureau of Labor Statistics released the official CPI numbers on April 10.
The official CPI figure for March 2024 was 0.4% on a seasonally adjusted basis. Below are the final predictions made by the LLMs on March 31:
Perhaps more interesting is how and when the LLMs arrived at their forecasts. The chart of predictions over time is below.
It's worth noting that nearly all models converged on the view that inflation would come in higher than expected, though they clearly differed on the specific numbers. The speed at which they made those projections is also interesting: by March 7, Gemini and Llama were already projecting within a small range of their final predictions, while GPT-4 and Claude showed much more variance in their early forecasts. All models had access to the same underlying facts, so the differences suggest the models themselves behave differently when making forecasts.
A few important observations will shape our work and roadmap moving forward:
Model self-moderation and guardrails can prevent forecasting. Models, particularly Claude and GPT-4, will self-moderate and avoid speculative content. This prevents them from consistently making forecasts, and it is a good example of why open models (or more flexible guardrail options) are so important.
Models do better at summarizing experts than generating new ideas. This shouldn't come as a major surprise given the criticisms of LLMs across academia and industry. LLMs are very good at uncovering facts and summarizing them into a forecast, but they don't come up with new hypotheses or ideas. This is also a function of our fairly simple forecasting process, where we provide facts and use a single prompt that asks the LLM to summarize its prediction. Turning this into a more iterative and agentic process could lead to more creativity.
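For context, that process is essentially one prompt over a list of retrieved facts. The sketch below illustrates the idea; the prompt wording, the forecast_from_facts helper, and the use of the OpenAI client here are assumptions for illustration, not the exact setup used in these runs.

```python
# A minimal sketch of the single-prompt forecasting step described above.
# The prompt wording and client choice are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def forecast_from_facts(facts: list[str], prior_forecast: float) -> str:
    """Ask the model to turn retrieved facts into an updated CPI forecast."""
    fact_block = "\n".join(f"- {fact}" for fact in facts)
    prompt = (
        "You are forecasting the month-over-month US CPI change (seasonally adjusted).\n"
        f"Your current forecast is {prior_forecast:+.2f}%.\n"
        "Newly retrieved facts:\n"
        f"{fact_block}\n\n"
        "Update your forecast if the facts warrant it, and state the new number "
        "with a short justification."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```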
Forecast justifications vary in quality, and that quality appears to correlate with forecast accuracy. Models that provide justifications for their forecasts tend to be more accurate. The narratives also help explain why a forecast was reached, and they surface patterns or observations that are useful to those consuming the forecasts, regardless of the final prediction.
Models tend to fall into a "linear update trap," regularly increasing or decreasing their forecasts by the same amount each time. For example, five successive forecasts might each shift the projected inflation by 0.03% and cite provided facts to justify the change. This pattern seems to appear across a number of forecasting exercises. It can be addressed via prompt engineering and the development of sub-forecasts.
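As a hedged illustration of what that prompt engineering might look like, the sketch below asks the model to rebuild the headline number from component-level sub-forecasts rather than nudging its previous answer. The component list and wording are assumptions, not a production prompt.

```python
# Illustrative prompt for breaking the "linear update trap": the model must
# re-derive the headline figure from component sub-forecasts instead of
# shifting its previous number by a fixed increment. Components and wording
# are assumptions made for this sketch.
COMPONENTS = ["shelter", "energy", "food", "core goods", "core services ex-shelter"]

def build_subforecast_prompt(facts: list[str]) -> str:
    fact_block = "\n".join(f"- {fact}" for fact in facts)
    component_block = "\n".join(f"- {name}: <MoM % change and reasoning>" for name in COMPONENTS)
    return (
        "Forecast the month-over-month US CPI change (seasonally adjusted).\n"
        "Do not start from your previous headline number. Instead:\n"
        f"1. Provide a sub-forecast for each component:\n{component_block}\n"
        "2. Combine the sub-forecasts into a headline figure, stating the weights you assume.\n"
        "3. Only then compare against your previous forecast and explain any revision.\n\n"
        f"Facts retrieved so far:\n{fact_block}"
    )
```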
The biggest opportunity for improving forecasting agents is to provide more flexible fact bases and to enable the agent itself to generate sub-forecasts, scenarios, and so on. We're working on a revamped front-end that should support some of this in the coming weeks.
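To make that direction a bit more concrete, here is one possible, purely illustrative shape for a fact base entry and a sub-forecast; the field names are assumptions rather than a finalized schema.

```python
# Sketch of data structures a more flexible fact base could use.
# Field names and types are assumptions, not a finalized schema.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Fact:
    text: str                 # the fact itself, as retrieved
    source_url: str           # where it was found
    retrieved_on: date        # when the agent found it
    tags: list[str] = field(default_factory=list)  # e.g. ["energy", "shelter"]

@dataclass
class SubForecast:
    component: str            # e.g. "energy"
    value: float              # projected month-over-month % change
    rationale: str            # the model's justification
    supporting_facts: list[Fact] = field(default_factory=list)
```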
In the meantime, we'll be running a number of other forecasting exercises, which are listed here. If you have forecasts you want to explore, don't hesitate to reach out!