Large Language Models for Treasury Requests using the Kyriba API

Image generated with Bing Chat/DALL-E using the prompt: “AI helping treasury team pixel art style”.

Authors:
Basille Etienne, Data Scientist
Pierre Runavot, Data Scientist Manager

Demo

First, a little demo of the prototype we developed; technical details are provided in the article below:


Introduction

This article delves into the groundbreaking application of Large Language Models (LLMs) in treasury management solutions and explores how the strategic use of APIs can unlock a new era of efficiency, accuracy and intelligence in financial operations. While LLMs have undoubtedly shown great promise, understanding their limitations, such as hallucinations or the knowledge cutoff date, is vital for leveraging their benefits effectively.

Our research focuses on this topic and on the proof of concept we have built. We are cognizant of the data privacy challenges associated with LLMs hosted by companies like OpenAI, but for the purposes of research and experimentation, we set those important issues aside here.

The most obvious strength of LLMs is their understanding of natural language, across many different languages. Our idea is to translate natural language into actions that are executed in the software. Finance teams, such as Kyriba users, have questions. They translate those questions into clicks in the Kyriba UI: launching a report, navigating through screens, or setting up inquiries to get results. Those results are then interpreted by the users to answer the initial questions.

What if we could use natural language to ask those questions, and get answers?

For that, we will translate natural language into Kyriba API calls, and use the results to provide an answer.

In this project, we focused on three API endpoints: cash balances, accounts and bank statements. The end product should take as input a related question from a treasurer (for example, “Which accounts have been closed last month?”), make the appropriate API call and return a curated answer.

The goals of this project can be summarized in three parts:

  • Choose the appropriate API endpoint to use for making the API call.

  • Build an API request URL with the help of the documentation of the selected endpoint: extract the parameters provided in the treasurer’s input prompt, whether they are path parameters, query parameters or filters, and put them in the precise format needed to build the URL (an illustrative example follows this list).

  • Curate the API response, using the LLM’s code-generation capabilities (for example, plotting a time series rather than displaying only the raw response table with dozens of columns, or computing an average).
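As an illustration of the second goal, here is a sketch of what a built request URL could look like. The base URL, endpoint path, parameter names and RSQL fields below are hypothetical placeholders, not the exact Kyriba API schema:

    # Question: "Which accounts have been closed last month?"
    base_url = "https://demo.kyriba.com/gateway/api/v1"  # placeholder base URL

    # Hypothetical RSQL filter on a status field and a closing-date range
    request_url = (
        f"{base_url}/accounts"
        "?filter=statusCode==CLOSED;closingDate=ge=2023-06-01;closingDate=le=2023-06-30"
    )
    print(request_url)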

Building an API request URL using a natural language model isn’t trivial, notably because of each model’s limited context window.

We defined three main criteria that have to be met to attain our goals:

  • Accuracy: the product shall choose the right API endpoint, return a correct request URL including all the information given in the prompt, warn if any required parameter is missing, and not add (or hallucinate) unwanted fields. It shall also generate code that runs properly and returns the desired information. Finally, any plotted values shall be relevant to the user.

  • Robustness: the product shall consistently return accurate responses. This means that variations of the same prompt (containing the same information) must always meet the criteria cited above.

    Example: if we consider the task of selecting an endpoint, a completely random system that returns one of the possible endpoints might be accurate on its first try (with a probability of ⅓ in our case) but will be wrong ⅔ of the time, so it is not robust.

  • Execution speed: the product shall be sufficiently quick to respond, so that the user doesn’t have the time to wonder whether s/he should have used another method to obtain the desired information.

The Tools Used

The recent rise in popularity of LLMs is undoubtedly due to OpenAI’s ChatGPT product. Their API gives access to the most advanced models of the moment, and it is the foundation on which we built this project.

We mainly used text-davinci-003 (a GPT-3.5 model), and ran some tests on gpt-3.5-turbo, a lighter version of the former. We plan to explore other solutions, including open-source ones, in the future. However, our main focus now is to properly “guide” the model towards the expected result, and we have used the following two approaches:

  • Langchain: a library that implements various frameworks (like chains and agents) to help build responses to complex tasks.

  • DIY & Guidance: faced with the challenges we encountered using Langchain regarding execution time, robustness and cost (that’s a spoiler), we adopted a more custom approach to meet our criteria. We used guidance to help format the LLM’s responses, which helps with both accuracy and robustness.

Langchain

APIChain

As its name suggests, the APIChain is a chain. A chain allows us to sequentially put components together to build a final response.

In this specific case, the chain will build the request URL, make the API call and return a summary of the API response. This is done given the API documentation, authentication and the user prompt as input.

In the version of Langchain we used, getting the raw JSON response was not implemented and we had to do a quick workaround to retrieve it (see the class myAPIChain below).

Code example:
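A minimal sketch of how the APIChain can be set up (illustrative: the documentation string, token and question are placeholders, and the myAPIChain workaround for retrieving the raw JSON is not reproduced here):

    from langchain.llms import OpenAI
    from langchain.chains import APIChain

    llm = OpenAI(model_name="text-davinci-003", temperature=0)

    # Plain-text documentation of the selected endpoint (placeholder)
    kyriba_docs = """GET /cash-balances ... (endpoint documentation goes here)"""

    chain = APIChain.from_llm_and_api_docs(
        llm,
        kyriba_docs,
        headers={"Authorization": "Bearer <token>"},  # authentication for the API calls
        verbose=True,
    )

    # The chain builds the request URL, calls the API and summarizes the response
    answer = chain.run("What were the cash balances of my EUR accounts yesterday?")
    print(answer)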

The output contains the built URL, the API response and, if its length fits within the available context size, a summary of this response:

Although quick to execute, the APIChain does have some limitations:

  • For more complex prompts, it might not get the right URL.

  • It doesn’t tell the user that parameters are missing until the response is received, so the treasurer isn’t informed at a glance of all the variables s/he has to provide and has to make several tries to get them all.

  • Because of the limited context, and since the API documentation must be passed as one string, it can’t handle long documentation or multiple endpoints.

  • The handling of dates can be finicky.

  • Filters, when needed, are not always well handled, as an RSQL expression has to be built.

To resolve the context problem, we then used the JSONtoolkit, which dives into the JSON documentation instead of ingesting it as a plain string.

JSONtoolkit

The Agent interface enables applications that require a flexible chain of calls to LLMs and other tools based on user input. It offers the necessary flexibility for such applications by providing agents with access to a suite of tools. Agents can selectively choose which tools to use based on user input, and they can even utilize multiple tools in a sequential manner, where the output of one tool becomes the input for the next.

Additionally, action agents make decisions on the next action by considering the outputs of previous actions at each timestep. To facilitate specific use cases, toolkits wrap together collections of tools. For example, an agent interacting with a SQL database might need tools for query execution and table inspection.

In our case, we use the JSONtoolkit, which can read through a JSON document and decide at each step what to do (go deeper in the JSON tree, retrieve a piece of information, or build the request URL from what it extracted).

Code example:
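A minimal sketch of how the JSON agent can be created over the endpoint documentation (the file name and question are placeholders):

    import json

    from langchain.llms import OpenAI
    from langchain.agents import create_json_agent
    from langchain.agents.agent_toolkits import JsonToolkit
    from langchain.tools.json.tool import JsonSpec

    llm = OpenAI(model_name="text-davinci-003", temperature=0)

    # Load the API documentation as a dict so the agent can explore it key by key
    with open("kyriba_openapi.json") as f:
        api_spec = json.load(f)

    toolkit = JsonToolkit(spec=JsonSpec(dict_=api_spec, max_value_length=4000))
    agent = create_json_agent(llm=llm, toolkit=toolkit, verbose=True)

    # The agent navigates the JSON tree step by step and returns the built URL
    agent.run("Build the request URL listing the accounts that were closed last month.")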

Note that for the same prompt, we get two different results, even though the temperature of the LLM is set to 0. Here are two example outputs, one that accurately builds the URL, the other that fails:

Good output:

Wrong output:

The JSONtoolkit has some limitations. One drawback is the extended execution time, which can reach a minute when the agent gets stuck in a loop, something that happens quite often. This makes the agent less robust, as it struggles to recover and may find it challenging to navigate the documentation effectively. Another issue is the higher cost associated with the JSONtoolkit, which can be more than 10 times the cost of using an APIChain, ranging from $0.10 to $1 per URL built!

The same issues associated with dates and filters also occur as they do with the APIChain. Given these limitations in terms of execution time and cost, it is important to explore more resilient and cost-effective approaches to address these challenges.

Custom Step-by-Step URL Construction

To cope with the limitations of Langchain, we decided to build a custom pipeline, using mainly guidance as a helper library. A demo of the prototype is available at the beginning of the article.

Guidance for selections/classifications and formatting

The need for guidance:

  • Cleaning special characters, additional unwanted punctuation, etc.

  • Avoiding potential introductory sentences before giving the result.

  • Easy classification tasks using logprobs: when a specific path is selected, the model has to return exactly that path (robustness), which guidance makes straightforward (see the sketch below).
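As an illustration, here is a minimal sketch of an endpoint-selection step with guidance, assuming the template-based API of the guidance version we used (the prompt wording and endpoint names are simplified):

    import guidance

    guidance.llm = guidance.llms.OpenAI("text-davinci-003")

    # The select command constrains the model to return exactly one of the options
    endpoint_selector = guidance("""Select the Kyriba API endpoint best suited to answer the question.
    Question: {{question}}
    Endpoint: {{select 'endpoint' options=endpoints}}""")

    out = endpoint_selector(
        question="Which accounts have been closed last month?",
        endpoints=["accounts", "cash-balances", "bank-statements"],
    )
    print(out["endpoint"])  # guaranteed to be one of the three endpoints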

Evaluation

Our evaluation sheet shows 92.8% on the score defined below:

  • Score 0: the call cannot be made from the request URL.

  • Score 1: the call can be made from the request URL, but wrong parameters are passed or much of the information from the prompt is lost.

  • Score 2: at most one parameter is wrongly passed (example: currency.code==euro instead of EUR), or there is one logical error (example: 'or' instead of 'and').

  • Score 3: one parameter is wrong because the prompt and the documentation do not contain the necessary information, or unspecified information is added but it does not impact the final result much.

  • Score 4: perfect request.
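For illustration, rubric scores like these can be turned into a percentage by normalizing against the maximum score of 4 (the values below are made up, not our actual evaluation data):

    scores = [4, 4, 3, 2, 4, 4, 1, 4]  # illustrative rubric scores, one per test prompt
    percentage = 100 * sum(scores) / (4 * len(scores))
    print(f"{percentage:.1f}%")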


Sample of the data used:

Our solution considerably reduces costs and execution time for the construction of the URL compared to the JSONtoolkit. We are getting close to the speed of the APIChain, but with much more robust performance, especially regarding dates, filters, extraction of parameters and checking of required ones. All that with documentation that can be as long as needed.

Next Steps

In our pursuit of empowering natural language interfaces for interacting with APIs, we are setting our sights on the development of Agents for multiple API calls. This paradigm exhibits unique complexities that we aim to tackle. For instance, consider a situation where a user wants to calculate the sum of their euro balances. To fulfill this request, the system first has to make an API call to retrieve the account numbers of the euro-denominated accounts. For each of these accounts, it then has to make subsequent API calls to retrieve the balance and, finally, compute the sum of all the balances.

The inherent challenge here is the dependency of subsequent API calls on the results obtained from the preceding ones. Traditional chains won't suffice for these operations. Instead, we need to employ agents, capable of handling the order of execution based on data.
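A schematic sketch of this dependency (function names, endpoint paths and values are purely illustrative stubs, not an actual agent implementation):

    def get_euro_account_numbers() -> list[str]:
        # First call, e.g. an accounts endpoint filtered on currency (illustrative)
        return ["ACC-001", "ACC-002"]  # stubbed result

    def get_balance(account_number: str) -> float:
        # Dependent call, one balance lookup per account (illustrative)
        return {"ACC-001": 1_250_000.0, "ACC-002": 430_000.0}[account_number]  # stubbed

    # The second step depends on the output of the first, the third on the second
    total_eur = sum(get_balance(acc) for acc in get_euro_account_numbers())
    print(total_eur)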

Performance evaluation in this scenario also becomes a complex task. We need to measure not just the correctness of the system’s responses, but also the efficiency of our agent-based API calling mechanism. This will require innovative methods of testing, benchmarking and optimization to ensure a fluid and responsive user experience.

At the forefront of our concerns is ensuring the security and privacy of our clients. Kyriba values privacy and commits to maintaining the highest standards for our customers. As the Agents for multiple API calls would require the large language model to read the results of the API calls, this must be executed securely, ensuring total confidentiality. One viable solution we are considering is the use of open-source models, whose transparency could allow us to reassure our clients of complete security.

Furthermore, we are not just limiting ourselves to existing open-source models. We plan to rigorously test the best the market has to offer, and potentially fine-tune some of these models to achieve optimal performance. By these means, we hope to align our goals with the privacy-oriented future of data interaction and drive the evolution of secure, natural language processing in API engagements.