Testing Time: Assessing LLM Performance on Time-Series Reasoning for Intelligence Analysis

October 8, 2025

Overview

AI technologies have already transformed how businesses and organizations operate by rapidly processing large datasets and generating targeted analysis, and model reasoning capabilities have scaled quickly over the last few years. This makes artificial intelligence seem like a natural tool for the complex work of intelligence analysis.

However, today’s AI models struggle with temporal reasoning—the ability to draw logical conclusions based on timing and how data points relate in time. Yet any effective intelligence analyst must be able to reconstruct sequences of events, identify causality, and anticipate future developments using imperfect information.

Despite this shortcoming, AI is developing rapidly; its current inability to reliably reason across time does not mean it will never be able to do so, and national security professionals must monitor its progress on this front to derive the full benefit of this emerging technology. So, how can intelligence analysts regularly measure AI models’ temporal reasoning abilities and guide responsible development? And where do we see specific paths to improvement from where this technology stands today?  

NSPDI researchers address these questions and more in this new paper assessing the time-series reasoning of 13 large language (LLM), vision–language (VLM), and time series–language (TSLM) models. While our researchers find that many AI systems struggle with the temporal sequencing that is central to national security and intelligence analysis, we also introduce a novel tool for evaluating progress in this domain: BEDTime, or Benchmark for Automatically Describing Time Series. As AI models continue to advance, BEDTime will be a crucial diagnostic aid to help analysts assess the temporal reasoning capabilities of future AI models for intelligence tradecraft.

Key Takeaways

  • AI models' judgments often become more convoluted and unreliable when the models are given more time to reason, and the models struggle to articulate uncertainty and to document how they arrived at their conclusions.
  • When BEDTime tested the 13 AI models on recognition, differentiation, and generation tasks, popular language-only models performed worst; VLMs, which process visual patterns effectively, performed best (though imperfectly) at describing time series; and TSLMs improved on LLMs but still have a long way to go.
  • BEDTime offers the intelligence community a standardized, interpretable, and low-barrier method to assess how well AI models conduct temporal reasoning, a vital ability for operational deployment.
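To make the task types above concrete, the sketch below shows what a recognition-style item might look like: a raw numeric series is rendered as text and paired with candidate descriptions, and the model under test must pick the one that matches. This is purely illustrative; the series generator, option wording, and prompt format here are assumptions for exposition, not BEDTime's actual data or format.

```python
import random

def make_series(n=24, slope=0.5, noise=0.3, seed=7):
    """Generate a synthetic upward-trending series with small noise
    (illustrative stand-in for a benchmark time series)."""
    rng = random.Random(seed)
    return [round(slope * t + rng.uniform(-noise, noise), 2) for t in range(n)]

def recognition_prompt(series, options):
    """Format a recognition-style multiple-choice question: given the
    raw values, which natural-language description fits best?"""
    values = ", ".join(str(v) for v in series)
    lines = [f"Time series: {values}", "Which description fits best?"]
    lines += [f"({chr(65 + i)}) {opt}" for i, opt in enumerate(options)]
    return "\n".join(lines)

series = make_series()
prompt = recognition_prompt(series, [
    "a steady upward trend",
    "a flat series with a single spike",
    "a repeating seasonal cycle",
])
print(prompt)
```

A differentiation item would instead present several candidate series for one description, and a generation item would ask the model to write the description itself; all three can be scored automatically, which is what makes this style of benchmark low-barrier to run.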