GenAPI: A Robust Framework for Production-Ready API Integration with Large Language Models
By Basile Etienne, Pierre Runavot
December 18, 2024
Abstract
We present GenAPI, a comprehensive framework designed to bridge the gap between Large Language Models (LLMs) and production APIs. Our system introduces a novel approach combining automated dataset generation, few-shot learning, and robust verification mechanisms to create a reliable pipeline for API interactions. Through extensive testing across multiple endpoints, we demonstrate remarkable performance, achieving up to 98.3% accuracy with state-of-the-art language models while maintaining production-grade reliability. This paper details the architecture, implementation, and performance characteristics of GenAPI, offering insights into building robust LLM-powered API integration systems.
1. Introduction
The integration of Large Language Models (LLMs) with production APIs represents a significant challenge in modern software architecture. While LLMs excel at understanding natural language, their direct integration with structured API endpoints raises numerous challenges in parameter validation, endpoint selection, and response handling. GenAPI addresses these challenges through a systematic approach that combines the natural language understanding capabilities of LLMs with rigorous software engineering practices. Our focus on data security and confidentiality notably led us to use the Llama v3.1 family of models (among other open-source LLMs) hosted in-house, with a closed-source solution as a benchmark for comparison.
The primary contributions of this work include: an automated dataset generation pipeline derived from OpenAPI specifications, a robust endpoint selection and parameter inference system, and a comprehensive verification framework ensuring API compliance. Our approach demonstrates significant improvements in accuracy and reliability compared to naive, single-prompt inference of an API request.
2. System Architecture
2.1 Overview
GenAPI employs a multi-stage architecture where each component is specifically designed to handle a distinct aspect of the API integration challenge. The system begins with the selection of the endpoint relevant to the user's question, proceeds with the inference of the selected endpoint's parameters, deterministically builds the URL/body to make the API call, and concludes with response generation, augmented by automated calculations based on the retrieved data. Each stage incorporates verification mechanisms to ensure reliability and accuracy. The endpoint selection and parameter inference steps use few-shot learning [1] techniques (with a generated dataset; see the next section). In this article, we focus mainly on the parameter inference stage.
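To make the stage boundaries concrete, here is a minimal, structure-only Python sketch of the pipeline; every name below is an illustrative placeholder, not GenAPI's actual internals.

```python
# Structure-only sketch of the pipeline stages; all names are
# illustrative placeholders, not GenAPI's actual internals.
def select_endpoint(question: str, spec: dict) -> dict:
    """Pick the OpenAPI operation relevant to the question (few-shot LLM call)."""
    ...

def infer_parameters(question: str, endpoint: dict) -> dict:
    """Translate the question into a verified parameter set (few-shot LLM call)."""
    ...

def build_request(endpoint: dict, params: dict) -> tuple:
    """Deterministically assemble the URL and, for POST endpoints, the JSON body."""
    ...

def answer(question: str, spec: dict, call_api, generate_response) -> str:
    endpoint = select_endpoint(question, spec)
    params = infer_parameters(question, endpoint)
    url, body = build_request(endpoint, params)
    data = call_api(url, body)                 # the actual API call
    return generate_response(question, data)   # final LLM call + computed aggregates
```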
2.2 Dataset Generation
The foundation of GenAPI's reliability lies in its sophisticated dataset generation system. Unlike older approaches that relied on manual dataset creation, our system automatically generates comprehensive training data from OpenAPI/Swagger specifications. This process involves parsing the API specification to account for parameter types and other requirements.
An extract of one of our OpenAPI specifications, for a bank-balance endpoint, is given in the Annex.
From such specifications, our system generates diverse, valid parameter combinations while maintaining compliance with the API's requirements. The generation process accounts for various data types, formats, and enums while enforcing required parameters, creating a rich dataset that captures the full complexity of the API's interface.
This step is crucial for both few-shot learning techniques and evaluation of the system.
2.3 Data Type Handling and Generation Strategies
Primitive Types Generation
The generation of API parameters in GenAPI follows a sophisticated approach rooted in the OpenAPI specification's type system. While OpenAPI defines basic types such as integers, strings, booleans, arrays and objects, the complexity lies in handling the various formats and constraints associated with these types. Our implementation employs specialized generators that not only respect the basic type constraints but also incorporate format-specific rules.
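As an illustration, the following sketch shows what type-dispatched generation of primitive values can look like. The constraint names (enum, minimum, maximum) follow the OpenAPI/JSON Schema vocabulary; the fallback ranges and string length are illustrative choices, not GenAPI's exact implementation.

```python
import random
import string

# Illustrative type-dispatched generator for primitive OpenAPI values.
def generate_primitive(schema: dict):
    if "enum" in schema:
        return random.choice(schema["enum"])
    type_ = schema.get("type", "string")
    if type_ == "integer":
        return random.randint(schema.get("minimum", 0), schema.get("maximum", 10_000))
    if type_ == "number":
        return round(random.uniform(schema.get("minimum", 0.0),
                                    schema.get("maximum", 10_000.0)), 2)
    if type_ == "boolean":
        return random.choice([True, False])
    # Plain strings; format-specific strings are handled by dedicated generators.
    return "".join(random.choices(string.ascii_lowercase, k=8))

print(generate_primitive({"type": "string", "enum": ["END_OF_DAY", "INTRADAY"]}))
print(generate_primitive({"type": "integer", "minimum": 1, "maximum": 31}))
```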
Format-Specific String Generation
String parameters in OpenAPI present unique challenges due to their format specifications. Unlike simple string generation, format-specific strings require dedicated generators that understand and implement the underlying format rules. Our system implements specialized generators for common formats including date-time, UUID, email, and ISO-specific formats such as currency codes and country identifiers. These generators ensure that the produced values not only match the format specifications but also represent realistic and meaningful data. This is complemented by a format verifier that we will come back to in section 2.5.
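A minimal sketch of such format-specific generators is shown below; the value pools (the email domain, the currency list, the one-year date window) are illustrative.

```python
import random
import uuid
from datetime import datetime, timedelta, timezone

# Dedicated generators for a few common string formats.
FORMAT_GENERATORS = {
    "uuid": lambda: str(uuid.uuid4()),
    "date": lambda: (datetime.now(timezone.utc)
                     - timedelta(days=random.randint(0, 365))).strftime("%Y-%m-%d"),
    "date-time": lambda: (datetime.now(timezone.utc)
                          - timedelta(hours=random.randint(0, 8760))).isoformat(),
    "email": lambda: f"user{random.randint(1, 999)}@example.com",
    "iso-currency": lambda: random.choice(["USD", "EUR", "GBP", "JPY"]),
}

def generate_string(schema: dict) -> str:
    formatter = FORMAT_GENERATORS.get(schema.get("format"))
    return formatter() if formatter else "sample"

print(generate_string({"type": "string", "format": "uuid"}))
print(generate_string({"type": "string", "format": "date"}))
```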
We wrote a separate article specifically on the handling of dates, which present temporal reasoning challenges of their own [4]. It allows the system to properly handle questions that include relative dates (e.g., "What was my cash balance in May of last year?").
Complex Type Composition
The handling of complex types, including objects and arrays, requires a hierarchical generation approach. Our system implements a recursive generation strategy that respects nested structure constraints while maintaining referential integrity across the generated dataset. This is particularly important for endpoints that accept nested JSON objects or arrays of complex types (in POST requests like cash-balances for instance), where internal consistency must be maintained across all levels of the data structure.
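The following sketch illustrates this recursive strategy, reusing the generate_primitive and generate_string helpers sketched above; the 50% inclusion rate for optional fields and the array length bounds are illustrative choices.

```python
import random

# Recursive generation for object/array schemas; reuses the
# generate_primitive and generate_string helpers sketched above.
def generate_value(schema: dict):
    type_ = schema.get("type", "string")
    if type_ == "object":
        required = set(schema.get("required", []))
        return {
            name: generate_value(sub_schema)
            for name, sub_schema in schema.get("properties", {}).items()
            # Required fields are always generated; optional ones with
            # 50% probability (an illustrative choice).
            if name in required or random.random() < 0.5
        }
    if type_ == "array":
        return [generate_value(schema["items"]) for _ in range(random.randint(1, 3))]
    if type_ == "string":
        return generate_string(schema)
    return generate_primitive(schema)
```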
Question Generation
We used a strong model (Llama 405B) to generate, for each parameter set, a meaningful question that a user wanting to use the API could ask. We enforced variations in tone, politeness, verbal vs. non-verbal sentence structure, and the inclusion of data-analysis operations (such as "average", "sum", "highest", etc.) to obtain a diverse dataset of questions. We performed manual checks and strict value-inclusion tests with feedback (e.g., verifying that a UUID is included verbatim in the question), as sketched below.
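A minimal sketch of such an inclusion test follows; for simplicity it assumes every string value must appear verbatim in the question, whereas in practice the strict check may be restricted to identifiers such as UUIDs.

```python
# Strict value-inclusion test: string values are required to appear
# verbatim in the generated question; missing values trigger a
# regeneration with feedback.
def missing_values(question: str, params: dict) -> list:
    return [str(v) for v in params.values()
            if isinstance(v, str) and v not in question]

params = {"ref": "52009bf1-5c1a-4384-8ebc-0b0bd66109bf", "type": "INTRADAY"}
question = "What is the intraday balance of account 52009bf1-5c1a-4384-8ebc-0b0bd66109bf?"
print(missing_values(question, params))  # ['INTRADAY'] -> regenerate with feedback
```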
A future step towards industrialization would be to use techniques such as LLM-as-a-Judge [3] to verify that the generated question is relevant to the set of parameters.
2.4 Parameter Inference and Verification Engine
The parameter inference engine represents a crucial component of GenAPI, employing few-shot learning techniques to translate natural language queries into valid API parameters. This system operates through a series of orchestrated steps, each designed to ensure accuracy and reliability.
The system prompt consists of the following (a minimal sketch of the prompt assembly is given after the list):
- The OpenAPI specification description.
- The list of parameters with their requirements.
- Some instructions on the default values to be used.
- A few-shot section where each example consists of:
  - A generated question.
  - The corresponding set of parameters.
- Finally, the question to answer (itself a generated question during evaluation).
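Here is a structure-only sketch of how such a prompt can be assembled; it mirrors the list above, but the exact wording of GenAPI's instructions is not reproduced.

```python
import json

# Structure-only sketch of the system prompt assembly.
def build_prompt(spec_description: str, parameter_docs: str,
                 default_instructions: str, examples: list, question: str) -> str:
    parts = [spec_description, parameter_docs, default_instructions]
    for example_question, example_params in examples:
        parts.append(f"Question: {example_question}\n"
                     f"Parameters: {json.dumps(example_params)}")
    parts.append(f"Question: {question}\nParameters:")
    return "\n\n".join(parts)
```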
Once an answer is generated, it is parsed into JSON. Note that "JSON mode" was used with OpenAI models, while a feedback mechanism was in place for models producing free-form output. In future work, the OpenAPI specifications will be translated into a JSON schema that can serve as a grammar reference to enforce the desired structure and types, which should measurably improve the performance and robustness of the open-source models we used.
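For models without a native JSON mode, a lenient extraction step of the kind sketched below can be applied before parsing; the greedy regular expression is an illustrative heuristic, and on failure the parsing error is fed back to the model (see the error-recovery loop in section 2.5).

```python
import json
import re

# Lenient JSON extraction for free-form completions: take the span from
# the first '{' to the last '}' and parse it.
def extract_json(completion: str) -> dict:
    match = re.search(r"\{.*\}", completion, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    return json.loads(match.group(0))

print(extract_json('Sure! Here are the parameters:\n{"type": "INTRADAY"}'))
# {'type': 'INTRADAY'}
```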
The JSON is then verified before the URL (and, for POST requests, the body) is deterministically built.
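The sketch below gives a more concrete version of this deterministic construction, assuming an illustrative route template: each validated parameter is routed to the path, query string, or JSON body according to its "in" location in the specification.

```python
from urllib.parse import urlencode

# Deterministic request construction from validated parameters.
def build_request(route: str, params: dict, spec_params: list):
    location = {p["name"]: p["in"] for p in spec_params}
    path = {k: v for k, v in params.items() if location.get(k) == "path"}
    query = {k: v for k, v in params.items() if location.get(k) == "query"}
    body = {k: v for k, v in params.items() if location.get(k) == "body"} or None
    url = route.format(**path)
    if query:
        url += "?" + urlencode(query)
    return url, body

spec_params = [{"name": "ref", "in": "path"},
               {"name": "date", "in": "query"},
               {"name": "type", "in": "query"}]
params = {"ref": "52009bf1-5c1a-4384-8ebc-0b0bd66109bf",
          "date": "2024-11-30", "type": "INTRADAY"}
url, body = build_request("/bank-balances/{ref}", params, spec_params)
print(url)  # /bank-balances/52009bf1-...?date=2024-11-30&type=INTRADAY
```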
2.5 Verification Architecture
GenAPI implements a validation architecture that ensures generated parameters meet syntactic requirements. Semantic verification is left as future work, for example leveraging LLM-as-a-Judge [3].
Schema Enforcement and Constraint Validation
The verification system employs a strict schema enforcement mechanism derived directly from the OpenAPI specification. This includes validation of the following (a minimal validation sketch is given after the list):
- Data Types: strings, numbers, booleans, arrays, and objects are checked.
- Required vs. Optional Parameters: The system maintains strict enforcement of required parameters while appropriately handling optional ones.
- Enumeration Constraints: For parameters with enumerated values, the verification system ensures all generated values belong to the specified set.
- Numeric Constraints: For numeric parameters, the system validates minimum, maximum, and multiple-of constraints.
- Pattern Matching: String parameters with specific formats have associated regex patterns and are validated against them.
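As a minimal validation sketch, the jsonschema package can express most of these checks. The schema below is hand-translated from the bank-balances extract in the Annex; it is illustrative of, not identical to, GenAPI's actual verifier, which is richer (format-specific regex patterns, numeric constraints, etc.).

```python
from jsonschema import Draft202012Validator

# Hand-translated schema for the bank-balances parameters in the Annex.
schema = {
    "type": "object",
    "required": ["date", "ref", "type"],
    "properties": {
        "date": {"type": "string", "pattern": r"^\d{4}-\d{2}-\d{2}$"},
        "ref": {"type": "string"},
        "type": {"type": "string", "enum": ["END_OF_DAY", "INTRADAY"]},
    },
}

def validation_errors(params: dict) -> list:
    """Collect human-readable constraint violations to feed back to the model."""
    validator = Draft202012Validator(schema)
    return [f"{'/'.join(map(str, e.path))}: {e.message}"
            for e in validator.iter_errors(params)]

print(validation_errors({"date": "2024-11-30", "ref": "abc", "type": "WEEKLY"}))
# ["type: 'WEEKLY' is not one of ['END_OF_DAY', 'INTRADAY']"]
```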
Error Recovery and Feedback Mechanisms
When validation fails, the system implements an error-recovery mechanism. Instead of simply rejecting invalid parameters, the system provides structured feedback to the parameter inference model, which then attempts to understand the validation failure and correct it.
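A minimal sketch of such a loop follows; call_llm is an assumed chat-completion helper, extract_json and validation_errors refer to the illustrative helpers sketched earlier, and the retry budget of three trials is an illustrative choice.

```python
# Error-recovery loop: on a parsing or validation failure, structured
# feedback is appended to the conversation and the model is asked again.
def infer_with_feedback(prompt: str, call_llm, max_trials: int = 3) -> dict:
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_trials):
        completion = call_llm(messages)
        try:
            params = extract_json(completion)
        except ValueError as err:
            feedback = (f"Your answer could not be parsed as JSON ({err}). "
                        "Answer with JSON only.")
        else:
            errors = validation_errors(params)
            if not errors:
                return params
            feedback = "Invalid parameters: " + "; ".join(errors) + ". Please correct them."
        messages += [{"role": "assistant", "content": completion},
                     {"role": "user", "content": feedback}]
    raise RuntimeError("parameter inference failed after retries")
```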
3. Experimental Results
Our experimental evaluation of GenAPI demonstrates its effectiveness across various metrics and models. Let's examine the performance in detail.
3.1 Model Performance Analysis
We conducted extensive testing across six distinct Kyriba API endpoints (companies, bank-balances, banks, cash-flows, transfers-status, accounts) using three different language models: GPT-4o as a benchmark and, for our Trusted AI initiative at Kyriba, Llama 405B (v3.1) and Llama 70B (v3.1). The results demonstrate consistently high performance across all models, with GPT-4o achieving the highest overall accuracy at 98.3%, followed by Llama 405B (95.8%) and Llama 70B (93.8%).
Note that for security and confidentiality purposes regarding financial data, the generative AI products that Kyriba is developing will be based on open-source models that can be hosted on our servers, as opposed to OpenAI's closed-source solution.
| Accuracy (%) | companies | bank-balances | banks | cash-flows | transfers-status | accounts | Global |
|---|---|---|---|---|---|---|---|
| GPT-4o | 98.6 | 97.9 | 100 | 96.7 | 100 | 96.6 | 98.3 |
| Llama 405B | 100 | 97.9 | 95.2 | 86.9 | 98.6 | 94.3 | 95.8 |
| Llama 70B | 98.6 | 100 | 91.6 | 86.9 | 100 | 85.1 | 93.8 |
The more detailed analysis (see Annex) reveals some variation in reliability metrics across the evaluated models, all of which remain satisfactory with the whole pipeline in place. GPT-4o demonstrated slightly better reliability, with zero parsing and validation errors (0.00% error rate). Llama 405B exhibited very low error rates (0.03% for JSON parsing, 0.08% for validation), while Llama 70B showed marginally higher ones (0.42% for JSON parsing, 0.31% for validation).
These findings suggest that while all models maintain high operational standards, GPT-4o's error-free performance remains the benchmark for accuracy and robustness. In all fairness to the open-source models, and as suggested before, the use of a grammar to enforce a JSON schema should greatly diminish, if not eliminate, these failure rates, which will need further testing. Note, however, that the feedback loops run by the verifier already greatly reduced the final failure rates for both JSON parsing and parameter validation. Our goal is to match the benchmark performance set by GPT-4o while guaranteeing the safety of our customers' data.
3.2 Latency and Efficiency
Response time analysis reveals significant variation between models. GPT-4o demonstrated the lowest average latency at 2.6 seconds per request, while Llama 405B and Llama 70B showed higher latencies at 7.3 and 5.9 seconds, respectively. These differences become crucial in production environments, where response time directly impacts user experience, and efforts will be made to reduce the latency of the open-source models.
4. Production Considerations and Future Work
The deployment of GenAPI in production environments requires careful consideration of several factors. First, the system must maintain robust error handling mechanisms to manage API failures, authentication issues, and invalid parameter combinations. Our implementation includes comprehensive retry mechanisms to ensure reliable operation under various failure conditions.
Regarding the productization of GenAPI, a strong feedback mechanism will be put in place to take customer preferences into account and improve overall performance. This is essential for adding diverse samples to the few-shot examples: although we made the dataset as diverse as possible, it most likely does not explicitly cover all the subtleties of the English language.
Looking forward, we identify several promising directions for future research and development. The integration of more sophisticated computation capabilities through restricted Python interpretation could expand the system's analytical capabilities. Additionally, the implementation of LLM-as-a-Judge techniques [3] could further improve response quality and reliability, while making users aware of these aspects and thus reducing the system's potential to mislead. Fine-tuning could also be an effective way to use smaller models, for cost and latency reasons, while maintaining a high level of performance [2]; this would be made possible by our existing data generation pipeline.
5. Conclusion
GenAPI demonstrates the feasibility of creating robust, production-ready API integration systems using Large Language Models. Through careful architecture design and comprehensive verification mechanisms, we achieve high accuracy while maintaining practical response times. The framework's success in handling complex API interactions suggests that similar approaches could be applied to other domains requiring structured interaction with external services.
This work provides a foundation for future research in LLM-based API integration systems and offers practical insights for organizations looking to implement similar solutions. The demonstrated performance improvements and reliability mechanisms make GenAPI a viable solution for production environments requiring sophisticated API interaction capabilities.
References
[1] Brown et al. (2020). "Language Models are Few-Shot Learners." NeurIPS 2020.
[2] Raffel et al. (2019). "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer."
[3] Chen et al. (2023). "LLM-as-a-Judge: A Framework for Enhancing LLM Response Quality."
[4] Etienne, B., Runavot, P. (2024). "Temporal Logic Enhancement System for AI Agents."
Annex
Extract from Kyriba’s Bank Balances endpoint
"parameters": \[
{
"name": "date",
"in": "query",
"description": "Balance date",
"required": true,
"allowEmptyValue": false,
"schema": {
"type": "string",
"format": "date"
},
"example": "2020-09-29T00:00:00.000Z"
},
{
"name": "ref",
"in": "path",
"description": "The reference (uuid or code) of the account",
"required": true,
"schema": {
"type": "string"
},
"example": "123e4567-e89b-12d3-a456-426655440001"
},
{
"name": "type",
"in": "query",
"description": "Balance type",
"required": true,
"allowEmptyValue": false,
"schema": {
"type": "string",
"default": "END\_OF\_DAY",
"enum": \[
"END\_OF\_DAY",
"INTRADAY"
\]
}
}
\],
Such a specification would yield, for example, the following generated sample:
```json
{
  "type": "INTRADAY",
  "ref": "52009bf1-5c1a-4384-8ebc-0b0bd66109bf",
  "date": "2024-11-30"
}
```
Detailed Results
| Model | Endpoint | # Samples | Accuracy (%) | Avg. Time per Request (s) | Endpoint Selection Accuracy (%) | Avg. JSON Parsing Trials | Avg. Validation Trials | JSON Parsing Failures | Validation Failures |
|---|---|---|---|---|---|---|---|---|---|
| llama_405b | companies | 71 | 100 | 5.17 | 100 | 1 | 1 | 0 | 0 |
| llama_405b | bank-balances | 94 | 97.9 | 5.39 | 98.9 | 1 | 1 | 0 | 0 |
| llama_405b | banks | 83 | 95.2 | 4.4 | 100 | 1.04 | 1.07 | 0 | 2 |
| llama_405b | cash-flows | 61 | 86.9 | 15.49 | 100 | 1.21 | 1 | 1 | 0 |
| llama_405b | transfers-status | 74 | 98.6 | 6.36 | 100 | 1 | 1 | 0 | 0 |
| llama_405b | accounts | 87 | 94.3 | 9.13 | 100 | 1 | 1.01 | 0 | 0 |
| llama_405b | global | 470 | 95.8 | 7.3 | 99.78 | 1.03 | 1.01 | 0.03% | 0.08% |
| llama_70b | companies | 71 | 98.6 | 4.96 | 100 | 1.28 | 1.15 | 0 | 0 |
| llama_70b | bank-balances | 94 | 100 | 3.92 | 100 | 1 | 1 | 0 | 0 |
| llama_70b | banks | 83 | 91.6 | 4.62 | 100 | 1.21 | 1.12 | 4 | 3 |
| llama_70b | cash-flows | 61 | 86.9 | 10.56 | 100 | 1.36 | 1 | 7 | 0 |
| llama_70b | transfers-status | 74 | 100 | 4.91 | 100 | 1.16 | 1.08 | 0 | 0 |
| llama_70b | accounts | 87 | 85.1 | 7.46 | 100 | 1.29 | 1.18 | 2 | 5 |
| llama_70b | global | 470 | 93.8 | 5.9 | 100.00 | 1.20 | 1.09 | 0.42% | 0.31% |
| openai_4o | companies | 71 | 98.6 | 2.17 | 100 | 1 | 1 | 0 | 0 |
| openai_4o | bank-balances | 94 | 97.9 | 2.18 | 98.9 | 1 | 1 | 0 | 0 |
| openai_4o | banks | 83 | 100 | 1.56 | 100 | 1 | 1 | 0 | 0 |
| openai_4o | cash-flows | 61 | 96.7 | 4.18 | 100 | 1 | 1 | 0 | 0 |
| openai_4o | transfers-status | 74 | 100 | 1.85 | 100 | 1 | 1 | 0 | 0 |
| openai_4o | accounts | 87 | 96.6 | 3.78 | 100 | 1 | 1 | 0 | 0 |
| openai_4o | global | 470 | 98.3 | 2.6 | 99.78 | 1.00 | 1.00 | 0.00% | 0.00% |

The failure columns give counts per endpoint and rates in the global rows.