Generative AI Models for Cash Forecasting

By Imad Barakat, Abdellah Kaissari and Pierre Runavot

In today's dynamic financial landscape, while generative AI models are gaining attention for their potential in cash forecasting and liquidity planning, established specialized models continue to demonstrate their value and effectiveness. Recent research by industry leaders has highlighted the emergence of probabilistic time series forecasting using foundation models. However, our analysis reveals that carefully crafted domain-specific solutions, particularly those developed with deep understanding of cash flow patterns, often outperform these newer approaches in their base form. This finding underscores the importance of domain expertise in financial forecasting, while also pointing to the future potential of foundation models when properly fine-tuned.

The evolution of cash forecasting technology represents a balance between proven methodologies and innovative approaches. While generative AI models bring promising capabilities through their ability to learn from vast datasets, specialized models like Kyriba's Liquidity Planning (LQP) and Cash Management AI (CMAI) demonstrate superior performance in many scenarios, particularly when dealing with specific seasonality patterns in cash flows. This reality suggests that the future of cash forecasting may lie in hybrid approaches that combine the strengths of both traditional and emerging technologies.

Why Generative AI Models Stand Out

  • Universal Application: Generative AI models are designed for versatility, applying a single model framework across a diverse range of clients and industries. This universality eliminates the need for training a new model for each specific time series, thereby streamlining the forecasting process and reducing operational costs.

  • Performance and Efficiency: These models are not only performant but also offer rapid inference times, making them ideal for applications requiring real-time data processing and decision-making. Their efficiency is crucial for businesses that need quick turnaround times for financial forecasting.

  • Fine-Tuning Capabilities: The ability to fine-tune generative AI models enhances their accuracy and reliability. Fine-tuning allows these models to adjust to specific forecasting needs, improving their performance in varied financial environments and making them adaptable to the unique challenges faced by different organizations.

  • Incorporation of External Data: Some models support the integration of additional covariates, enabling more nuanced and refined forecasts by incorporating external data sources. This capability allows for a more holistic view of financial forecasting, taking into account broader influences that might impact cash flows.

Benchmarking foundation models

Nixtla has introduced the foundation-time-series-arena, a comprehensive benchmarking platform evaluating foundation models across more than 30,000 time series from diverse domains and frequencies. The primary metric used is the Mean Absolute Scaled Error (MASE).
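
For reference, MASE is commonly defined as the out-of-sample mean absolute error divided by the in-sample mean absolute error of a (seasonal) naive forecast. The form below is the standard textbook definition, not a formula taken from the benchmark itself. Two symbols are introduced here as assumptions: T, the length of the training series, and m, the seasonal period (m = 1 gives the non-seasonal case).

```latex
\mathrm{MASE} =
\frac{\dfrac{1}{n}\sum_{t=T+1}^{T+n} \lvert y_t - \hat{y}_t \rvert}
     {\dfrac{1}{T-m}\sum_{t=m+1}^{T} \lvert y_t - y_{t-m} \rvert}
```

A value below 1 means the forecast beats the naive baseline on average.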


Where:

  • y_t is the actual value at time t.
  • ŷ_t is the forecasted value at time t.
  • n is the number of observations.

Nixtla’s proprietary model, TimeGPT-1, is compared against the latest foundation models, including TimesFM, Chronos, Moirai, and Lag-Llama. This benchmark highlights the potential of foundation models, with some outperforming established statistical, machine learning, and deep learning models.


Benchmarking on treasury datasets

Building upon Nixtla's benchmark, we extended the evaluation to include treasury time series, focusing on daily forecasts. We assessed 172 time series, categorizing them into Strong Seasonality, Weak Seasonality, and No Seasonality, to ensure a comprehensive evaluation. A composite Symmetric Mean Absolute Percentage Error (SMAPE) metric was employed, combining daily, weekly, and overall period forecasts over 30-, 60-, and 90-day horizons.

The Symmetric Mean Absolute Percentage Error (SMAPE) is calculated as follows:
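
Several SMAPE variants are in use. One common form, assumed here, expresses the error as a percentage between 0 and 200:

```latex
\mathrm{SMAPE} = \frac{100}{n}\sum_{t=1}^{n}
\frac{\lvert \hat{y}_t - y_t \rvert}{\left(\lvert y_t \rvert + \lvert \hat{y}_t \rvert\right)/2}
```

Terms where both y_t and ŷ_t are zero, which is frequent in cash series on weekends, are undefined in this form. A usual convention (again an assumption) is to count them as zero error.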

Where:

  • y_t is the actual value at time t.
  • ŷ_t is the forecasted value at time t.
  • n is the number of observations.

Our Agg. SMAPE is calculated as follows:


Where:

  • SMAPE_daily is the SMAPE calculated using daily forecasted and actual values.
  • SMAPE_weekly is computed by first summing the daily forecasts and actual values over each week, and then calculating SMAPE on these weekly aggregates.
  • SMAPE_period is determined by summing the daily forecasts and actuals over the entire forecast period, then computing SMAPE on these total sums.

This approach allows for a comprehensive evaluation of forecast accuracy across different time scales, providing a balanced view of performance from daily to longer-term periods.
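
The three-level aggregation described above can be sketched as follows. This is a minimal illustration, not the exact metric used in the study: the equal weighting of the three components, the use of consecutive 7-day blocks as "weeks", and the zero-over-zero convention are all assumptions, and the function names are hypothetical.

```python
from statistics import mean


def smape(actual, forecast):
    """SMAPE in percent (0 to 200); terms where both values are 0 count as 0."""
    terms = []
    for y, y_hat in zip(actual, forecast):
        denom = (abs(y) + abs(y_hat)) / 2
        terms.append(0.0 if denom == 0 else abs(y_hat - y) / denom)
    return 100 * mean(terms)


def weekly_sums(values, week_length=7):
    """Sum consecutive blocks of `week_length` days (the last block may be shorter)."""
    return [sum(values[i:i + week_length]) for i in range(0, len(values), week_length)]


def agg_smape(actual, forecast):
    """Equal-weight mean of daily, weekly and period SMAPE (weighting is an assumption)."""
    daily = smape(actual, forecast)
    weekly = smape(weekly_sums(actual), weekly_sums(forecast))
    period = smape([sum(actual)], [sum(forecast)])
    return mean([daily, weekly, period])
```

A forecast that alternates above and below the actuals is penalized at the daily level but much less at the weekly and period levels, where the errors cancel out. This is the balancing effect the composite metric is meant to capture.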

The evaluation encompassed statistical, machine learning, deep learning, and foundation models.

Forecasting models overview

Statistical Models

  • Model Architecture: Common architectures include ARIMA (AutoRegressive Integrated Moving Average), Seasonal Decomposition, and Exponential Smoothing.

  • Training Approach: These models rely on identifying patterns such as trends and seasonality in the time series data and can be trained on the fly.

  • Hardware Requirement: Generally lightweight and can be run on standard CPUs, making them accessible and efficient for smaller datasets.

  • Models evaluated: AutoARIMA, Prophet, LQP_Model

The Liquidity Planning model (LQP) is an enhanced version of Prophet, designed to accommodate not only daily, weekly, monthly, and yearly seasonality but also custom seasonal patterns specific to each time series. For instance, it can capture unique periodicities, such as a 19-day cycle, if relevant. Additionally, LQP incorporates holidays into its forecasts and employs a sophisticated post-processing step to ensure that forecasts reflect zero values during weekends and holidays. This post-processing intelligently analyzes historical patterns to determine whether zero values have consistently occurred during these times. If such patterns are present, the model applies them; otherwise, it allows the model to predict naturally.
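
The post-processing idea can be sketched as below. This is not the LQP implementation: the function names, the `threshold` parameter, and the rule that "consistently" means "in every historical observation" (threshold of 1.0) are assumptions made for illustration.

```python
from datetime import date, timedelta


def consistently_zero_weekdays(history, threshold=1.0):
    """Weekdays (0=Monday) whose historical values were zero in at least
    `threshold` share of observations. `history` is a list of (date, value)."""
    counts, zeros = {}, {}
    for day, value in history:
        wd = day.weekday()
        counts[wd] = counts.get(wd, 0) + 1
        zeros[wd] = zeros.get(wd, 0) + (value == 0)
    return {wd for wd in counts if zeros[wd] / counts[wd] >= threshold}


def apply_zero_rule(forecast, history, holidays=frozenset(), threshold=1.0):
    """Zero out weekend/holiday forecasts only when history supports it;
    otherwise keep the model's own prediction."""
    zero_weekends = consistently_zero_weekdays(history, threshold) & {5, 6}
    past_holiday_values = [v for d, v in history if d in holidays]
    zero_holidays = bool(past_holiday_values) and (
        sum(v == 0 for v in past_holiday_values) / len(past_holiday_values) >= threshold
    )
    adjusted = []
    for day, value in forecast:
        if day.weekday() in zero_weekends or (zero_holidays and day in holidays):
            value = 0.0
        adjusted.append((day, value))
    return adjusted


# Usage: four weeks of history with empty weekends, one week of raw forecasts.
start = date(2024, 1, 1)  # a Monday
history = [(start + timedelta(days=i), 0.0 if i % 7 >= 5 else 100.0) for i in range(28)]
raw = [(start + timedelta(days=28 + i), 50.0) for i in range(7)]
adjusted = apply_zero_rule(raw, history)
```

With this history, the Saturday and Sunday forecasts are set to zero and the weekday forecasts are left untouched. If the weekends had carried non-zero flows in the past, the raw forecast would be returned unchanged.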

Statistical models have long been a staple in time series forecasting due to their simplicity and interpretability. However, they often struggle with capturing complex patterns in large datasets or those with irregular seasonality.

Machine Learning Models

  • Model Architecture: Popular models include Random Forests, Gradient Boosting, and Linear Regression.

  • Training Approach:

    • Models are trained on historical data with features engineered from the time series, such as lagged values and moving averages, which are then used to predict future values.
    • Require feature engineering and hyperparameter tuning but can generalize well across different datasets.
  • Hardware Requirement: Can be computationally intensive but typically require less power than deep learning models, often running effectively on CPUs.

  • Models evaluated: AutoLGBM, CMAI

The CMAI model is built on the LightGBM framework, leveraging lagged values and moving average windows over different periods, such as the past month, 14 days, and a week. It also incorporates date-based features, including the day of the week, day of the month, and month. Initially, CMAI runs a hyperparameter optimization to identify the best parameters, followed by a final refit to enhance accuracy. The model is specifically trained to predict the next value in the series. It starts by predicting the next value, which is then added to the historical data. This process involves recalculating features and predicting subsequent values iteratively until the desired forecast horizon is achieved.
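
The iterative procedure described above is often called recursive (or iterated) one-step forecasting. The sketch below shows only the loop, under stated assumptions: `predict_next` is a hypothetical stand-in for the trained LightGBM model, the lag and window choices are illustrative, and date-based features are left out for brevity.

```python
def build_features(history, lags=(1, 7, 14), windows=(7, 14, 30)):
    """Lagged values followed by trailing moving averages.
    Requires at least max(lags) observations in `history`."""
    feats = [history[-lag] for lag in lags]
    feats += [sum(history[-w:]) / len(history[-w:]) for w in windows]
    return feats


def recursive_forecast(history, predict_next, horizon):
    """Predict one step, append it to the history, recompute features, repeat."""
    extended = list(history)
    for _ in range(horizon):
        extended.append(predict_next(build_features(extended)))
    return extended[len(history):]


# Usage with a toy "model" that simply returns the lag-7 feature.
weekly_pattern = [1, 2, 3, 4, 5, 6, 7] * 3
forecast = recursive_forecast(weekly_pattern, lambda feats: feats[1], horizon=10)
```

Because each prediction feeds the features of the next one, errors can accumulate over long horizons. This is a known trade-off of the recursive strategy compared with training one model per horizon step.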

Machine learning models bring the advantage of flexibility and can handle non-linear relationships better than statistical models. However, they require significant preprocessing and feature engineering, which can be resource-intensive.

Deep Learning Models

  • Model Architecture: The architecture is based on transformers, which are known for their efficiency in capturing temporal patterns.

  • Training Approach: A unique model is trained for each time series, ensuring tailored predictions for each dataset.

  • Hardware Requirement: Training these models effectively requires GPU resources to handle the computational demands efficiently.

  • Models evaluated: AutoTFT

Deep learning models, especially those based on transformers, have shown significant promise in capturing complex temporal patterns. They are particularly useful in scenarios where traditional models fall short.

Foundation Models

  • Model Architecture: These models often use transformer architectures, similar to large language models like GPT or Llama, but with far fewer parameters (710M for Chronos-large versus 175B for GPT-3).

  • Training Approach: Initially trained on generic data, these models can be fine-tuned with specific time series data for forecasting.

  • Hardware Requirement: Training and fine-tuning these models require substantial computational resources, typically utilizing multiple GPUs or TPUs for efficiency.

  • Models evaluated: TimesFM, Chronos, ChronosFT (fine-tuned), Moirai, Lag-Llama

These models are capable of zero-shot forecasting, generating predictions for time series they have never encountered before. The approach is analogous to providing a large language model with the beginning of a sentence and receiving the continuation. Fine-tuning is also possible, allowing models to be tailored to specific data. We fine-tuned Chronos on 138 time series with Strong or Weak Seasonalities, comparing it with zero-shot Chronos and other models.

Foundation models represent a significant leap forward in forecasting technology, offering unprecedented flexibility and scalability. Their ability to handle diverse datasets without extensive retraining makes them particularly appealing for enterprises looking to streamline their forecasting processes.

Results

We present the results across the three horizons. The best model in each category is underlined, while the overall best model is highlighted in bold.


  • Strong Seasonality: ChronosFT leads, followed closely by Kyriba models, with TimesFM and Chronos also performing well.

  • Weak Seasonality: LQP emerges as the best model, with ChronosFT second and CMAI third, while TimesFM shows promise.

  • No Seasonality: None of the models performed satisfactorily.

In conclusion, foundation models like Chronos and TimesFM deliver competitive results with zero-shot forecasting, and fine-tuning improves their performance significantly, particularly for series with strong seasonality. Our focus now shifts to Chronos, a foundation model developed by Amazon, whose fine-tuned version performed best on strongly seasonal series.

Chronos

Chronos is built on the T5 architecture, which is widely used in large language models (LLMs) and has proven effective in text generation and natural language processing. It turns a time series into tokens through scaling and quantization, then generates the subsequent tokens from a fixed vocabulary, effectively treating the time series as a "language." This approach is detailed in the paper "Chronos: Learning the Language of Time Series".
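
To make "scaling and quantization" concrete, here is a minimal sketch that scales a series by its mean absolute value and maps each scaled value to one of a fixed number of uniform bins. The bin count, the value range, and the function names are illustrative assumptions; they are not the settings of the actual Chronos tokenizer.

```python
def tokenize(series, n_bins=100, low=-15.0, high=15.0):
    """Mean-scale a series, then map each value to a uniform bin index (token)."""
    scale = sum(abs(v) for v in series) / len(series) or 1.0  # avoid dividing by 0
    width = (high - low) / n_bins
    tokens = []
    for v in series:
        scaled = min(max(v / scale, low), high)  # clip to the covered range
        tokens.append(min(int((scaled - low) / width), n_bins - 1))
    return tokens, scale


def detokenize(tokens, scale, n_bins=100, low=-15.0, high=15.0):
    """Map bin indices back to bin centers and undo the scaling."""
    width = (high - low) / n_bins
    return [(low + (tok + 0.5) * width) * scale for tok in tokens]


# Usage: a short cash-flow series round-trips with a bounded quantization error.
flows = [120.0, 310.0, 45.0, 260.0, 95.0]
tokens, scale = tokenize(flows)
recovered = detokenize(tokens, scale)
```

The round trip is lossy: each recovered value sits at the center of its bin, so the error is at most half a bin width times the scale. Scaling first is what lets one shared vocabulary serve series of very different magnitudes.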


Similar to LLMs, Chronos predicts the distribution of the next token and samples a specified number of values from this distribution, adjusted by a parameter called temperature. This parameter controls the randomness of predictions: lower values result in more predictable outputs, while higher values allow for more varied ones. After sampling, the model computes quantiles and uses the median as the forecast for each time step, continuing this process in an auto-regressive manner.

When working with probabilistic models like Chronos, stability in predictions is crucial for reliable cash forecasting. In this context, "stability" refers to the consistency of forecast results across multiple runs. To test this, we conducted experiments varying the "number of samples" parameter, which influences the model's prediction process.

Testing stability with the number of samples

The "number of samples" parameter determines how many predictions the model generates for each time step. By sampling multiple times, the model effectively explores a range of potential future values, allowing it to capture the inherent uncertainty in time series data. The median of these samples is typically used as the final forecast for each time step, which helps in reducing the impact of outliers and achieving a more stable prediction.

For each configuration of the number of samples, we conducted 100 prediction runs using Chronos. This multiple-run approach allowed us to observe the variance in predictions, which serves as an indicator of stability. We also calculated the average Symmetric Mean Absolute Percentage Error (SMAPE) across runs to assess whether increasing the number of samples improves the predictive performance.
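
The experiment design can be illustrated without running Chronos. In the sketch below, a seeded Gaussian sampler is a hypothetical stand-in for the model's predictive distribution at a single time step, and the helper names are assumptions. The forecast is the median of `num_samples` draws, and stability is measured as the variance of that median across 100 repeated runs.

```python
import random
from statistics import median, pvariance


def median_forecast(sample_fn, num_samples):
    """Draw `num_samples` values for one time step and return their median."""
    return median(sample_fn() for _ in range(num_samples))


def run_to_run_variance(sample_fn, num_samples, runs=100):
    """Variance of the median forecast across repeated prediction runs."""
    return pvariance([median_forecast(sample_fn, num_samples) for _ in range(runs)])


rng = random.Random(0)  # fixed seed so the sketch is reproducible


def sampler():
    """Hypothetical predictive distribution for one time step."""
    return rng.gauss(100.0, 20.0)


variances = {n: run_to_run_variance(sampler, n) for n in (5, 20, 100)}
```

In this toy setting the run-to-run variance shrinks as the number of samples grows, while the cost grows linearly with the number of draws. It is the same stability versus inference time trade-off, in miniature.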

Here are the results on one of the time series:


Key observations

  • Variance Reduction: As the number of samples increases, the variance of the predictions at each time step decreases. This indicates that with more samples, the model's outputs become more consistent, reducing the risk of erratic forecasts.

  • Improved Metrics: Both the average SMAPE and its variance improve with a higher number of samples. This suggests that more samples contribute to more accurate and reliable forecasts, as the model averages out anomalies and captures the general trend more effectively.

  • Inference Time Trade-off: While increasing the number of samples enhances stability and accuracy, it also leads to longer inference times. This trade-off is important to consider, especially for applications requiring real-time predictions.

  • Consistent Results Across Series: The improvements in stability and performance were observed consistently across all tested series, highlighting the robustness of this approach.

In conclusion, the default parameter of 20 samples strikes a good balance between prediction stability and inference time. However, organizations with the computational capacity and need for even greater stability might consider increasing this parameter, recognizing the associated increase in computational demand.

This plot shows the variance of the predictions at each time step:


We can see that the variance at each time step decreases when we increase the number of samples and converges to a certain minimum. We also notice that for points where we have peaks in the time series, the variance seems to be higher than other time points, which means that the model predicts high values some of the time, but these values are averaged out by taking the median of the samples.

Testing stability with the temperature

The "temperature" parameter is another critical component in probabilistic forecasting models like Chronos. It controls the randomness of the predictions during sampling, influencing the model's exploration of potential outcomes.

In the context of generative models, temperature adjusts the distribution from which predictions are drawn. A lower temperature results in predictions that are closer to the most probable outcome, leading to more deterministic forecasts. Conversely, a higher temperature allows the model to explore a wider range of possibilities, introducing more variability and creativity into the predictions.
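
The sketch below shows how temperature is typically applied in LLM-style sampling: the model's logits are divided by the temperature before the softmax. It illustrates the general mechanism and is not code taken from Chronos; the function name and example logits are made up for the illustration.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities; temperature rescales the logits first."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Usage: the same logits under three temperature settings.
logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, temperature=0.5)
default = softmax_with_temperature(logits, temperature=1.0)
flat = softmax_with_temperature(logits, temperature=2.0)
```

At a temperature of 0.5 the most likely token takes a larger share of the probability mass than at 1.0, and at 2.0 the distribution is flatter. Sampled values are therefore more concentrated at low temperatures and more dispersed at high ones.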

Similarly to the number of samples, we varied the temperature parameter and conducted multiple prediction runs to evaluate its impact on forecast stability and accuracy. We measured the variance in SMAPE and the consistency of predictions across different time series.

Here are the results on one of the time series:


Key observations

  • Metric Variance: While we observed an increase in prediction variance as temperature increases, this impact remains marginal. The model maintains relatively stable predictions across different temperature settings.

  • Metric Performance: Our analysis shows that the default temperature of 1 yields the best SMAPE scores across different time series, suggesting this value optimally balances prediction accuracy and stability.

  • Inference Time: Unlike the number of samples, changing the temperature does not impact inference time. This allows for flexibility in adjusting the temperature without affecting the speed of predictions.

  • Default Parameter Choice: The default temperature of 1 appears to offer a good balance, providing stable and accurate forecasts for most series. However, fine-tuning this parameter for specific datasets could further optimize performance.

This plot shows the variance of the predictions at each time step:


Fine-tuning Chronos

While foundation models like Chronos are capable of zero-shot forecasting on unseen time series, fine-tuning them on specific datasets can significantly enhance their performance. Fine-tuning involves training the model further on a particular subset of data, allowing it to adapt its parameters to better capture the unique patterns and nuances of the series.

Using a Databricks guide, we fine-tuned Chronos on 138 selected time series, chosen for their strong or weak seasonality. This process involved adjusting the model's weights to minimize forecasting errors on the training data, thereby improving its accuracy for similar future predictions.

Here are some individual forecast comparisons between Chronos and ChronosFT:


  • Improved Long-Term Forecasts: While the original Chronos model struggled with longer prediction lengths, fine-tuning resolved this issue, providing accurate forecasts over extended periods.


  • Accurate Peak Predictions: The fine-tuned model demonstrated an improved ability to predict peaks in the data, though it missed the exact timing. This discrepancy may result from sudden shifts in the timing of peaks, which can be challenging to capture precisely.

In summary, fine-tuning enhances the performance of generative AI models like Chronos, making them more adept at capturing the specific dynamics of the datasets they are trained on. This process underscores the potential of foundation models to be tailored to meet the unique forecasting needs of different organizations, leading to more reliable and actionable insights.

Conclusion

Our evaluation reveals a nuanced picture of cash forecasting technologies. Kyriba's specialized models, LQP and CMAI, remain highly competitive: LQP is the best model on series with weak seasonality, and both follow the fine-tuned Chronos closely on series with strong seasonality. This highlights the continued value of domain-specific expertise and targeted model development in cash forecasting. At the same time, Chronos and other foundation models show clear potential, especially after fine-tuning, pointing to a future in which GenAI is used for cash forecasting. The significant improvements observed with fine-tuned models, combined with their scalability and adaptability, make a compelling case for their adoption in production environments.

To accelerate this transition towards GenAI-powered forecasting, several key steps are crucial. First, we should explore deploying the fine-tuned model into production environments to assess its real-world applicability. Additionally, studying the integration of these models within the Databricks environment could streamline their usage and increase efficiency. Fine-tuning efforts should also expand to include training on more series, alongside optimizing hyperparameters for better accuracy. Finally, implementing support for external covariates could further enhance forecasting capabilities, allowing for a more comprehensive analysis of cash flow dynamics.