Temporal Logic Enhancement System for AI Agents

By Basile Etienne, Pierre Runavot
December 18, 2024

Abstract

We present TLES (Temporal Logic Enhancement System), a specialized framework designed to enhance Large Language Models' (LLMs) performance on temporal reasoning, targeting some of Google's Test of Time (ToT) challenges [1] as well as more complex temporal reasoning tasks. While generating API requests, we observed that LLMs are unreliable at translating natural language date expressions (we'll call them NL dates) into absolute dates in a specified format. This is especially challenging when the date is relative (e.g. "yesterday" is today's date shifted by a -1 day delay). TLES yields significant improvements on these temporal reasoning tasks.

Introduction

In the realm of LLM-driven API interactions, accurate temporal reasoning is crucial. TLES is a date system that provides a sophisticated solution for handling a wide range of date-related queries and calculations. This system is designed to process both absolute and relative date expressions, offering flexibility and precision in temporal operations.
Our focus on data security and confidentiality led us to test TLES primarily with the Llama v3.1 family of models (among other open-source LLMs) hosted in-house, with a closed-source solution as a benchmark for comparison.
Though we specifically target one aspect of temporal reasoning, namely deriving a precise absolute date (yyyy-mm-dd) from a natural language expression, TLES could be adapted to other tasks seen in the ToT benchmark.

This development resulted from the following observations:

  • LLMs cannot reliably compute, but they can reliably use a calculator given appropriate techniques and tools (see OpenAI's structured outputs, function calling, and grammars for open-source LLMs).
  • Temporal reasoning is a kind of computation, e.g. "in 476 days" means computing a delay of 476 days from today's date.
  • The first approach seen in Google's ToT article is to let the LLM "reason" its way to the right absolute date. But even when that reasoning is valid, this does not address the LLM's inability to compute reliably.

Our approach was to build a tool that the LLM can call to compute a date reliably, as sketched below.
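In practice, the tool can be exposed through structured outputs or function calling. Below is a minimal Python sketch of a tool definition in OpenAI's function-calling format; the tool name tles_resolve_date and the per-field descriptions are our illustrative choices, and the full input structure is given in the annex.

tles_tool = {
    "type": "function",
    "function": {
        "name": "tles_resolve_date",  # hypothetical name for this sketch
        "description": "Deterministically convert a structured temporal "
                       "expression into an absolute yyyy-mm-dd date.",
        "parameters": {
            "type": "object",
            "properties": {
                "reference_date": {
                    "type": "string",
                    "description": "YYYY-MM-DD anchor date; defaults to today",
                },
                "date_delay": {"type": "object"},
                "specific_month": {"type": "object"},
                "additional_week_delay": {"type": "object"},
            },
        },
    },
}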

How it works

TLES functions as follows:

  • It has a fixed structured input with an anchor date and several delay period markers (day, week, month, quarter, and year).
  • Each marker is accompanied by a START/END modifier.
  • Additional modifiers selecting a specific month and weekday allow for more complex computations than simple delays.
  • Besides, we built a labeled list of NL dates (e.g. yesterday, tomorrow, in 3 months, etc.), each associated with its corresponding TLES input (see the sample below). This allows for few-shot learning or fine-tuning, but also for thorough testing and evaluation.
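For instance, one labelled sample could look like the following Python sketch (the field names nl_date and tles_input are illustrative; the TLES input structure itself is detailed in the annex):

sample = {
    "nl_date": "yesterday",
    "tles_input": {
        "date_delay": {"marker": "day", "delay": -1, "START_OR_END": "NONE"}
    },
}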

We tested TLES on available samples in the ToT benchmark that expected a yyyy-mm-dd format. Further modifications to return a specific component of the date, or another format, could easily be implemented if needed. While this benchmark gives a good idea of the performance improvement, it remains quite trivial at its core, as the available samples are only simple delays (e.g. "Knowing today is 1st of February 1934, and X will happen in Y days/weeks/months/years, what date will it be?").

As a result of this observation, we tested TLES on our own dataset of more complex temporal reasoning tasks. In addition to ToT-like delays, it notably includes the following (a sketch of the period case follows the list):

  • Periods (e.g. "last year" knowing we are in 2024 => start_date="2023-01-01", end_date="2023-12-31")
  • Weekdays (e.g. "on Monday" when we are Tuesday the 19th of November 2024 => "2024-11-18")
  • Month names (e.g. "in April" knowing we are in 2024 => "2024-04-01" to "2024-04-30")
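For instance, the period case decomposes into two TLES inputs, one per bound; this Python sketch assumes the pipeline fills the start_date and end_date parameters separately:

# "last year", asked in 2024:
last_year_start = {"date_delay": {"marker": "year", "delay": -1, "START_OR_END": "START"}}
last_year_end = {"date_delay": {"marker": "year", "delay": -1, "START_OR_END": "END"}}
# -> start_date="2023-01-01", end_date="2023-12-31"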

TLES Architecture

Compounding Logic Process

TLES accepts a structured input consisting of several key components. It processes these components sequentially, allowing for complex date calculations while keeping a fixed structure to minimize the failure rate (a code sketch of this logic follows the list):

  1. Reference Date:

    • If provided, it uses the specified reference date, else defaults to the current date.
    • Format: "YYYY-MM-DD"
    • Purpose: Serves as the starting point for calculations or specifies an absolute date when set alone.
  2. Date Delay:

    • Shifts the reference date using the specified marker and delay.
    • Positive or negative integers can be used to move forward or backward in time.
    • Components:
      • marker: Specifies the unit of time (amongst "year", "month", "quarter", "week", "day").
      • delay: The number of units to shift (signed integer).
      • START_OR_END: Determines whether to return the start or end of the resulting period. Preserves the day when set to “NONE” (e.g. if today is the 9th of February, a -1 month delay with “NONE” set will yield the 9th of January, “START” will yield the 1st of January and “END” will yield the 31st of January).
  3. Specific Month:

    • Moves to the specified month while preserving the year from the previous step.
    • Components:
      • month: The name of the month in lowercase.
      • START_OR_END: Determines whether to return the start or end of the specified month (same logic as previously, with “NONE” preserving the day).
  4. Additional Week Delay:

    • Refines the date by applying an optional additional week offset.
    • Selects a specific day of the week within this new reference week.
    • Components:
      • weekday: The desired day of the week.
      • week_delay: The number of weeks to shift (signed integer).
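The following Python sketch illustrates this compounding logic with the dateutil library. The helper names are ours, weeks are assumed to start on Monday, and we read the weekday/week_delay pair as "the Nth such weekday on or after the current date", which reproduces the examples in the next section; the production implementation may differ in its details.

from datetime import date, timedelta
from dateutil.relativedelta import relativedelta  # pip install python-dateutil

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]
MONTHS = ["january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"]

def snap(d: date, marker: str, mode: str) -> date:
    # Move d to the START or END of its containing period; NONE preserves the day.
    if mode == "NONE" or marker == "day":
        return d
    if marker == "week":  # weeks assumed to start on Monday
        monday = d - timedelta(days=d.weekday())
        return monday if mode == "START" else monday + timedelta(days=6)
    if marker == "month":
        first = d.replace(day=1)
        return first if mode == "START" else first + relativedelta(months=1, days=-1)
    if marker == "quarter":
        first = d.replace(month=3 * ((d.month - 1) // 3) + 1, day=1)
        return first if mode == "START" else first + relativedelta(months=3, days=-1)
    # year
    return d.replace(month=1, day=1) if mode == "START" else d.replace(month=12, day=31)

def resolve(tles: dict, today: date | None = None) -> date:
    # 1. Reference date: explicit if provided, otherwise today's date.
    d = (date.fromisoformat(tles["reference_date"])
         if tles.get("reference_date") else (today or date.today()))
    # 2. Date delay: shift by a signed number of units, then snap START/END/NONE.
    if dd := tles.get("date_delay"):
        months_per_unit = {"month": 1, "quarter": 3, "year": 12}
        if dd["marker"] in months_per_unit:
            d += relativedelta(months=dd["delay"] * months_per_unit[dd["marker"]])
        else:  # "day" or "week"
            d += timedelta(days=dd["delay"] * (7 if dd["marker"] == "week" else 1))
        d = snap(d, dd["marker"], dd["START_OR_END"])
    # 3. Specific month: move to the named month, preserving the year
    #    (relativedelta's absolute "month" argument clamps the day if needed).
    if sm := tles.get("specific_month"):
        d += relativedelta(month=MONTHS.index(sm["month"]) + 1)
        d = snap(d, "month", sm["START_OR_END"])
    # 4. Additional week delay: Nth occurrence of the weekday on or after d.
    if wd := tles.get("additional_week_delay"):
        d += timedelta(days=(WEEKDAYS.index(wd["weekday"]) - d.weekday()) % 7)
        d += timedelta(weeks=wd["week_delay"] - 1)
    return d

print(resolve({"reference_date": "2024-01-24",
               "specific_month": {"month": "april", "START_OR_END": "START"},
               "additional_week_delay": {"weekday": "friday", "week_delay": 2}}))
# -> 2024-04-12, i.e. "the second Friday of April 2024"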

Default Behavior and Fallback Mechanism

To ensure robustness, the system implements a fallback mechanism that covers two cases:

  • No reference date is provided.
  • The date system encounters an error during processing.

We implemented a verifier that handles these cases through a feedback loop to the LLM, capped at a set maximum number of iterations. If processing still fails despite this, the system automatically defaults to using the current date (today's date) as the result; a minimal sketch of this loop follows. See our article on the API request generation pipeline [2] for more information on this verifier.
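A minimal sketch of this loop, assuming a hypothetical generate_tles wrapper around the LLM call and the resolve function from the previous sketch:

from datetime import date

def resolve_with_fallback(generate_tles, question: str, max_iters: int = 3) -> date:
    feedback = None
    for _ in range(max_iters):
        tles = generate_tles(question, feedback)  # LLM produces a candidate TLES input
        try:
            return resolve(tles)
        except (KeyError, ValueError, TypeError) as err:
            # e.g. reference_date="strawberry" raises ValueError; loop back with feedback
            feedback = f"Invalid TLES input ({err}); please fix it and try again."
    return date.today()  # last-resort fallback: today's date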

Flexible Application Examples

Natural language date | Example of TLES input | Description
January 24, 2024 | reference_date: "2024-01-24" | Absolute date specification.
The third Monday of April in two years | date_delay: {marker: "year", delay: 2, START_OR_END: "START"}; specific_month: {month: "april", START_OR_END: "START"}; additional_week_delay: {weekday: "monday", week_delay: 3} | Complex relative date calculation. The reference date here is today by default.
The second Friday of April 2024 | reference_date: "2024-01-24"; specific_month: {month: "april", START_OR_END: "START"}; additional_week_delay: {weekday: "friday", week_delay: 2} | Mixed absolute and relative calculation.
The 2nd Monday of February | reference_date: "strawberry" | Fallback scenario: the reference_date is invalid, so the fallback mechanism activates and the actual result is today's date.

Key Advantages

  1. Unified Framework: Handles both absolute and relative date expressions within the same system.
  2. Flexibility: Allows for many combinations of absolute and relative date components.
  3. Precision: Enables exact date specification when needed, while still supporting complex relative expressions.
  4. Robustness: Always returns a valid date, even in error scenarios, ensuring system stability.
  5. Intuitive Mapping: Easily translates natural language date expressions into structured input.

Note on temporal ambiguities

Temporal expressions in natural language are inherently ambiguous, requiring sophisticated systems like TLES to handle their interpretation reliably. This ambiguity manifests in several fundamental ways that challenge both human understanding and computational processing.
The most common form of ambiguity occurs in reference frame interpretation. When someone mentions "on Monday" in conversation, the intended Monday depends heavily on context. If it is currently Wednesday, does "on Monday" refer to the previous Monday that just passed, or the upcoming Monday? The answer typically lies in a combination of contextual clues such as the tense of the sentence and the broader conversation context. Similarly, when discussing months, saying "in April" could reference different years depending on the current date and conversation context. If it is December, "in April" likely refers to the upcoming April, but if it is May, it might refer to either the past April or the next one.

Another significant dimension of temporal ambiguity involves boundary definitions for time periods. When someone refers to "last month," they might mean the entire previous calendar month, or they could be describing a 30-day period counting backward from today. The same applies to yearly references – "last year" could indicate the previous calendar year, fiscal year, or a rolling 12-month period. This boundary ambiguity becomes particularly relevant in business contexts where fiscal periods might not align with calendar periods.

Cultural and regional differences add another layer of complexity to temporal interpretation. Different cultures may have varying definitions of when a week starts (Sunday versus Monday), different fiscal year periods, or distinct academic calendar structures.

TLES does not yet fully address these challenges; handling them will be part of a future implementation.

Performance

The experiments run here consist of three prompt types:

  • The first one requires the LLM to reason about its answer before giving it. This is the base setup provided in Google's article, and will be referred to as the Chain-of-Thought (CoT) approach.
  • The second one adds TLES (referred to as CoT+TLES).
  • The third one adds Few-Shot (FS) learning to improve performance further; it helps the model better understand the input structure of TLES, which is especially visible for smaller models (an illustrative prompt skeleton follows the list). Note that the samples used for FS are not from the ToT dataset, but from our own dataset. It will be referred to as CoT+TLES+FS.
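For illustration, a CoT+TLES+FS prompt could be skeletonized as in this Python sketch (the wording and the few-shot samples shown here are hypothetical, not our production prompt):

prompt = """Today is 2024-11-19. Think step by step, then output the TLES
input (JSON) for the user's date expression.

Example: "yesterday" -> {"date_delay": {"marker": "day", "delay": -1, "START_OR_END": "NONE"}}
Example: "in April" -> {"specific_month": {"month": "april", "START_OR_END": "START"}}

Expression: "the third Wednesday of October"
"""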

In our study, we utilized the Wilson score interval [3] to compute 95% confidence intervals for model accuracy across different datasets and configurations, taking into account the rather small sample size of the ToT dataset. The Wilson score interval is particularly advantageous for estimating confidence intervals of binomial proportions, especially with small to moderate sample sizes, owing to its ability to provide more accurate interval estimates than the traditional normal approximation interval [4]. A short computational sketch is given below.
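For reference, the interval can be computed as in the following Python sketch, which reproduces the values reported in the annex:

import math

def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    # Wilson score interval for a binomial proportion (z=1.96 for a 95% CI).
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return center - half, center + half

lo, hi = wilson_interval(0.76, 156)  # GPT-4o CoT on ToT: 76% over 156 samples
print(f"{lo:.1%} - {hi:.1%}")        # 68.7% - 82.0%, as in the detailed results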

These intervals provide a statistically sound basis for evaluating model performance, highlighting the effectiveness of TLES in enhancing temporal reasoning capabilities in LLMs. The use of Wilson score intervals thus substantiates the reliability and stability of our results, ensuring that the observed improvements are not artifacts of random variability but are indicative of genuine performance gains.

Note that for security and confidentiality purposes regarding financial data, the generative AI products that Kyriba is developing will be based on open source models that can be hosted on our servers, as opposed to OpenAI’s closed source solution. Our goal is to match the performance of the benchmark set by GPT-4o, while guaranteeing the safety of our customers’ data.

Google’s ToT dataset results

We focus here on the AddSubtract category in the Test of Time (ToT) benchmark, which represents a fundamental dimension of temporal arithmetic reasoning, focusing on the ability of Large Language Models (LLMs) to perform basic temporal calculations involving the addition or subtraction of time units (days, weeks, months, etc.) from reference dates. In experimental evaluations from the article, this category demonstrated relatively robust performance across multiple frontier models, with GPT-4 achieving 76.28% accuracy, followed by Gemini 1.5 Pro at 71.14%, and Claude-3-Sonnet at 58.57%.

However, we were not satisfied with this raw performance, as it is not robust enough for a production-level application. We found the same results in our own tests, which focus on other models (GPT-4o, Llama-405B, Llama-70B and Llama-8B, all Llama models being v3.1).

The AddSubtract component of temporal reasoning in Large Language Models (LLMs) demonstrates significant performance variations across model architectures and enhancement techniques. In baseline Chain-of-Thought (CoT) implementations, performance scales notably with model size, ranging from 25% accuracy in Llama 8B to 76% in GPT-4o. However, the integration of TLES yields remarkable improvements across all models, with both GPT-4o and Llama 405B achieving perfect accuracy in our tests.

The addition of few-shot learning further enhances performance, particularly for smaller models, enabling even Llama 8B to achieve 90% accuracy, a drastic improvement from its baseline 25%. This pattern suggests that temporal arithmetic capabilities are not solely dependent on model scale, but can be significantly augmented through structured approaches to temporal reasoning. The consistent improvement pattern (CoT < CoT+TLES < CoT+TLES+FS) across all model sizes indicates that temporal arithmetic reasoning benefits substantially from explicit computational frameworks like TLES and few-shot learning.

Model name | CoT | CoT + TLES | CoT + TLES + FS
OpenAI GPT-4o | 76% | 100% | 100%
Llama 405B | 58% | 100% | 100%
Llama 70B | 49% | 92% | 100%
Llama 8B | 25% | 53% | 90%

Our dataset results

To extend beyond simple temporal calculations, we developed a dataset incorporating complex date expressions through TLES's structured framework. The dataset tests four key temporal patterns: relative dates (like “yesterday”), period expressions (e.g. "last year" → "2023-01-01" to "2023-12-31"), weekday references (e.g. "on Monday" → contextual date calculation), and month-based queries (e.g. "in April" → full month range). It also includes complex compositions like “the third Wednesday of October”. Unlike the ToT benchmark's focus on basic delays, this dataset tests LLMs' ability to process natural language date expressions while maintaining standardized yyyy-mm-dd outputs. The inclusion of a labeled natural language date corpus facilitates both few-shot learning and systematic evaluation of temporal reasoning capabilities.
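As an example, "the third Wednesday of October" maps to the following TLES input, with the year anchored by the reference date (today by default):

third_wednesday_of_october = {
    "specific_month": {"month": "october", "START_OR_END": "START"},
    "additional_week_delay": {"weekday": "wednesday", "week_delay": 3},
}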

The generation of labelled questions was structured as follows regarding dates:

  • A set of natural language expressions related to dates.
  • The corresponding TLES input.

The questions include variations in tone, politeness, verbal/non-verbal sentences, and other API-related variations.
For more information on how the dataset was generated, see our related article on the API request generation pipeline [2].

Note that for now, we have only tested the performance of OpenAI's GPT-4o model. We will provide updates on future tests of open-source models.
Contrary to the previous section, we did not test Chain-of-Thought here, primarily to reduce latency in our API request generation application; this choice would need further testing.

With these more complex temporal reasoning tasks, the raw performance of GPT-4o is drastically worse than its ToT benchmark results, dropping from 76% to 49%, though the lack of Chain-of-Thought may play a role here. The use of TLES raises accuracy to 85%, and the addition of few-shot samples improves the model's ability to use the tool, bringing accuracy to 93% and the model much closer to production-level performance.

Model name | Raw | TLES | TLES + FS
OpenAI GPT-4o | 49% | 85% | 93%

NB: these results include a few questions that yielded the right date prediction but the wrong parameter name. This occurred when several date parameters were available for the endpoint (e.g. updateDate and creationDate) but the question was ambiguous as to which to use. These samples were labelled as mistakes, so the purely temporal reasoning performance would be slightly higher here.

Conclusion

TLES addresses ToT-Arithmetic tasks by decomposing complex temporal queries into atomic operations and utilizing specialized algorithms for each temporal reasoning pattern. In experimental evaluations, TLES-augmented LLMs demonstrated significant improvements over baseline models, achieving 100% accuracy on ToT-Arithmetic (vs. a 76% baseline).

The system showed particular strength in handling complex timeline queries and multi-operation arithmetic problems, previously identified as challenging areas. Our results suggest that specialized temporal reasoning modules can effectively complement LLMs' general reasoning capabilities, providing a promising direction for enhancing AI systems' temporal understanding.

This enhanced date system, with its sophisticated compounding logic and built-in fallback mechanism, provides a powerful and reliable solution for processing a wide range of temporal expressions. It bridges the gap between precise date requirements in APIs and the varied ways users might express dates in natural language. As a core component of our LLM-driven API interaction architecture, it significantly enhances the system's ability to handle complex, time-sensitive queries and operations.

Several future improvements could make TLES an even more capable tool:

  • Handling the ambiguities mentioned earlier, which may be user- or customer-dependent.
  • Probing the limits of the system's compounding logic (for instance, TLES cannot currently represent the sentence "3 days after the 3rd Wednesday of April 3 years ago").
  • Building an extensive labelled dataset in multiple languages that encompasses the subtleties and expressions of different cultures.

Annex

The input structure of TLES follows this schema:

{
    "reference_date": "YYYY-MM-DD",
    "date_delay": {
        "marker": ["day"|"week"|"month"|"quarter"|"year"],
        "delay": integer,
        "START_OR_END": ["START"|"END"|"NONE"]
    },
    "specific_month": {
        "month": string,
        "START_OR_END": ["START"|"END"|"NONE"]
    },
    "additional_week_delay": {
        "weekday": [
"monday"|"tuesday"|"wednesday"|"thursday"|"friday"|"saturday"|"sunday"
        ],
        "week_delay": integer
    }
}


Detailed results for ToT and our dataset with 95% confidence intervals:

Dataset | Model | Method | Accuracy (%) | CI lower (%) | CI upper (%) | # samples
ToT | GPT-4o | CoT | 76 | 68.7 | 82.0 | 156
ToT | GPT-4o | CoT+TLES | 100 | 97.6 | 100.0 | 156
ToT | GPT-4o | CoT+TLES+FS | 100 | 97.6 | 100.0 | 156
ToT | Llama 405B | CoT | 58 | 50.2 | 65.5 | 156
ToT | Llama 405B | CoT+TLES | 100 | 97.6 | 100.0 | 156
ToT | Llama 405B | CoT+TLES+FS | 100 | 97.6 | 100.0 | 156
ToT | Llama 70B | CoT | 49 | 41.3 | 56.8 | 156
ToT | Llama 70B | CoT+TLES | 92 | 86.7 | 95.3 | 156
ToT | Llama 70B | CoT+TLES+FS | 100 | 97.6 | 100.0 | 156
ToT | Llama 8B | CoT | 25 | 18.9 | 32.3 | 156
ToT | Llama 8B | CoT+TLES | 53 | 45.2 | 60.7 | 156
ToT | Llama 8B | CoT+TLES+FS | 90 | 84.3 | 93.8 | 156
Ours | GPT-4o | Raw | 49 | 44.6 | 53.4 | 500
Ours | GPT-4o | TLES | 85 | 81.6 | 87.9 | 500
Ours | GPT-4o | TLES+FS | 93 | 90.4 | 94.9 | 500


Detailed results from our API-oriented dataset:

Endpoint | Parameter | GPT-4o (raw) | GPT-4o + date system | GPT-4o + date system + few shot
bank-balances | date | 69% | 92% | 98%
cash-balances (POST) | datePeriod.endDate | 27% | 78% | 80%
cash-balances (POST) | datePeriod.startDate | 27% | 76% | 80%
cash-balances (GET) | endDate | 38% | 91% | 99%
cash-balances (GET) | startDate | 41% | 92% | 99%
cash-flows | endDate | 57% | 88% | 100%
cash-flows | startDate | 63% | 86% | 96%
accounts | closingDate | 66% | 94% | 94%
accounts | creationDate | 48% | 76% | 85%
accounts | updateDate | 50% | 80% | 100%

References

  1. Fatemi, B., Kazemi, M., Tsitsulin, A., Malkan, K., Yim, J., Palowitch, J., Seo, S., Halcrow, J., & Perozzi, B. (2024). Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning. arXiv:2406.09170.

  2. Etienne, B., & Runavot, P. (2024). GenAPI: A Robust Framework for Generating API Requests from Natural Language.

  3. Wilson, E. B. (1927). Probable Inference, the Law of Succession, and Statistical Inference. Journal of the American Statistical Association, 22(158), 209-212.

  4. Agresti, A., & Coull, B. A. (1998). Approximate Is Better than 'Exact' for Interval Estimation of Binomial Proportions. The American Statistician, 52(2), 119-126.