Why: A data mesh to power up our Active Liquidity Network

Kyriba's mission is to support corporate finance digitization across all liquidity- and cash-related activities, and Kyriba's strategy to deliver on that mission is to build the first and leading open platform for enterprise liquidity management. I manage Kyriba's data engineering team, and part of our work is to build the data platform required to enable this strategy. The role of this platform is to help our data engineers, internal teams, partners, and ultimately our customers get access to actionable insights based on their data. That is our context: we want to enable the Kyriba ecosystem to create more value from its data.

Over the last decades, many companies have pursued the same goals and have built data platforms such as data warehouses and data lakes, with varying levels of success. That is why last year, as we embarked on the journey to build our next generation data platform, we surveyed not only the problems we were trying to solve, but also the way other companies and industries were tackling them. In this context, we have been particularly interested in Zhamak Dehghani's assessment of the current architectural failure modes and of the way to overcome them through the data mesh concepts: pushing data ownership and responsibility to domain teams.

This article is the first of a series intended to present our progress on our journey to build a data mesh to power up our Active Liquidity Network.

Data@Kyriba today

Of course, we are not starting from scratch: we already have a data platform supporting our Business Intelligence solution as well as our data-driven analytics such as fraud detection and cash forecasting. Basically, it is a data warehouse: a star schema data model governed by the data team and fed by a modern data pipeline powered by Apache Spark.

The overall data pipeline actually matches exactly the current state of technology, architecture, and organization described by Zhamak, as depicted in the following diagram extracted from her data mesh principles article.

In our case

Operational Data Plane: These are the Kyriba functional modules (Core, Liquidity, Risk, etc.). They store their data in relational databases and are owned by dedicated functional development teams. These applications export their data as batches, typically daily.

Data Pipelines: An in-house data pipeline owned by the Kyriba data team and powered by Apache Spark.

Analytical Data Plane: A data warehouse based on a star schema, governed by the data team and serving our Business Intelligence solution as well as Kyriba's AI-based services (fraud detection, invoice prediction, etc.).

Visualization/Reporting: Kyriba Visuals, our Business Intelligence solution powered by Qlik Sense.

New Data Platform Motivations

This data platform has proven to bring a lot of value to Kyriba and our customers. Encouraged by this success, and following Kyriba's strategy to build an open platform, we are now ready to double down on our investment in this area so we can be more open and even more productive.

Here are the core motivations for us to build a next generation data platform:

Pervasive: One limitation of our current platform is that our analytical database does not contain all the data. For data to be present in our data warehouse, scheduled tasks need to be created manually for each and every customer interested in our data products. This comes with some advantages in terms of flexibility, control, and storage requirements, but this administrative work is becoming an efficiency bottleneck and a limiting factor for some use cases. We want all the eligible data to be addressable by our data products.

Open: Our data pipeline is somewhat closed to other data producers and consumers. So far, we have expected the data to come from a single source, the Kyriba application, to be ingested into one schema governed by the data team, and to be available solely to the data products developed by the data team. We want our new platform to be more open to external data stores and to external data consumers.

Consistent: Traditionally, it has been accepted, if not encouraged, to have a lag, typically a day, between the time data arrives in our operational data plane and the time it reaches our analytical data plane. That is perfectly fine for some use cases, but we are now seeing more and more situations where our customers would like to use our data products for operational purposes. They want the data to be fresh, if not real time. Our goal for our new data platform is that wherever our customers look, they see fresh and trustworthy data.

Seamless: We want to build the first and leading open platform for Enterprise Liquidity Management. To make this possible, we need to leverage the network effect, and for that we need our platform to be frictionless. For the data platform, that means leveraging open standards and self-service as much as possible.

Cloud scale: In order to support our growth, we need our next generation platform to be cloud native. What that means for us is that it should run on elastic infrastructure, be based on streaming analytics, and provide a high level of automation and reliability.

Data Compliant at Core: Last, but not least, we need our data platform to ensure data compliance. That is critical, first of all to protect our customers, but also to unlock the potential of our data. We want our data platform to be aware of our contractual engagements with our customers and to provide "drinkable tap water" to our data scientists through a data compliant pipeline. It should be easy for our data scientists and data engineers to get access to clean, compliant data: obfuscated and aggregated if need be, in order to meet our customers' or regulatory requirements.

The Data Mesh Architecture

It is in this context, while designing such a platform, that the data mesh concepts caught our interest. But what is a data mesh?

The core principles are described in the picture below:

In short, a data mesh is a decentralized ecosystem of data products enabled by a common self-serve data infrastructure. It is all about pushing data ownership and responsibilities to the domain teams.

How will this help us to deliver on our new data platform promises?

Remove the organizational bottlenecks: We do not want the data team to be the limiting factor in building data products. The data mesh will be a way for us to leverage the power of the whole organization. This will allow us to move faster, but potentially also to be more open, as we will expose our shared data services and products externally.

Faster delivery: Today, most new functional features arrive first in the core products and then require a fair amount of cross-team synchronization: adapting our star schema data models, updating the core product data exports, updating the data pipeline, creating the new BI apps, serving the new data, and updating the smart services to take the new data into account. The domain-oriented organization of the data mesh should allow any data product to ship a new feature in a single quantum of delivery, exposing the new data directly to all the other data products.

Improved consistency: Data consistency between our BI solutions, data products, and our operational application has always been a must for us. But that is not easy to achieve when part of the logic is duplicated across several code bases and teams. Having a clear way to reuse the same logic between both worlds will make this much easier, not only in terms of productivity, but also in terms of consistency.

Easier governed access to the data: Having a single set of services to discover, produce, consume, and process the data will also give us more control and traceability. This governed access will in turn make it easier to serve data compliant use cases.

Overall, the data mesh architecture looks very promising for us to move faster and be more open, but how do you actually build such a platform?

The data mesh momentum is increasing; many engineering teams and even vendors now say they are building or selling a data mesh. However, we are still in the early days. To my knowledge, there is no well-known recipe for building such a modern data platform. That means each team currently on the data mesh journey needs to find its own answers, based on its own experience and priorities, to the organizational and architectural questions raised by the data mesh principles. That is also what we are trying to do.

Organization: How do we manage the transition?

Let's start by looking at the way we are currently managing the organizational aspects. Here are the main stages we have identified in our journey and the way we have decided to approach the very first steps.

Step 0. Proof of Concept:

The first thing we did was a PoC. The use case we selected involved real-time data replication and streaming analytics surfaced in our BI solution. The most important thing we wanted to validate was our capacity to get data products developed by the product teams. The PoC was developed by extracting, for one sprint, one developer from our Liquidity team and one developer from the BI team. Overall, it was successful both in demonstrating the value of the new platform internally and in giving us a better understanding of the design, technologies, and teams it would take to get it done at scale.

Step 1. Create a platform MVP:

The PoC did produce a great demo, but also some messy code with no clear separation of concerns between the platform components and the data products. Our next task was then to clarify the design of the platform: the roles and responsibilities of its core components. One important decision was not to distribute the project too early across multiple teams, but rather to create a platform team in charge of implementing both the platform MVP and the very first data products. This team was cross functional from the start, with talented developers joining internally from the data team and the domain teams, as well as other developers hired on the market.

Step 2. Go To Production:

That is what we are working on at the time of writing this article. We are working hard to get our first workload in production, with a pilot customer, on a brand new tool serving a new use case (a dedicated article should follow as soon as it is live). Another important part is the work we are doing with our DevOps team to get our "data mesh" hosted on its own dedicated infrastructure, alongside the functional operational application infrastructure. The overall goal is to have those services shipped through a new CI/CD pipeline fully decoupled from our core applications.

Step 3. Scale Out:

Obviously, the reason we are investing in a data mesh architecture is not to have a central team implementing all the data products. Once the platform reaches a minimal level of stability, the next step will be to onboard new teams onto the platform. Our plan is to have both data products developed by the core application teams, serving their data to the other data products, and data products developed by the data teams, as analytics leveraging and enriching the core data.

Step 4. Open it up:

We are not there just yet, but our end goal is to open up our data platform as a set of APIs that our partners and consumers can use to integrate with us.

Architecture: How do we "distribute" the ownership?

So, we have learned from our initial PoC, a real-time data pipeline built somewhere in between the data team and the core application teams. Based on this learning and a few months of work after that, here is what we came up with.

In terms of architecture, we are building our data platform as a set of event-based microservices. Those services are exposed as RESTful APIs, documented through OpenAPI specifications, and communicate asynchronously through domain events exchanged as Avro messages governed by a schema registry. We bridge the new services with the current applications using Change Data Capture (CDC) techniques.
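To make the event format a bit more concrete, here is a minimal sketch, in Python with fastavro, of what encoding and decoding one of these Avro-described domain events could look like. The event name, namespace, and fields are illustrative assumptions, not our actual schemas, and in a real deployment the schema would come from the schema registry rather than being declared inline.

```python
# Minimal sketch of an Avro-described domain event (hypothetical schema).
import io
from fastavro import parse_schema, schemaless_writer, schemaless_reader

payment_created_schema = parse_schema({
    "type": "record",
    "name": "PaymentCreated",            # hypothetical domain event
    "namespace": "com.example.liquidity",
    "fields": [
        {"name": "payment_id", "type": "string"},
        {"name": "tenant_id",  "type": "string"},
        {"name": "amount",     "type": "double"},
        {"name": "currency",   "type": "string"},
        {"name": "created_at", "type": "long"},   # epoch millis
    ],
})

def encode(event: dict) -> bytes:
    """Serialize a domain event to Avro bytes before publishing it."""
    buf = io.BytesIO()
    schemaless_writer(buf, payment_created_schema, event)
    return buf.getvalue()

def decode(payload: bytes) -> dict:
    """Deserialize Avro bytes received from a topic, using the same schema."""
    return schemaless_reader(io.BytesIO(payload), payment_created_schema)
```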

In terms of technologies, our events are exchanged through Apache Kafka and processed through Apache Flink. Our state is saved in S3, and our machine learning processes are orchestrated as Apache Spark batches.
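As an illustration of the event exchange, here is a minimal sketch of a service consuming such domain events from Kafka using the confluent-kafka Python client. The broker address, consumer group, and topic name are placeholders, not our actual configuration.

```python
# Minimal sketch of a consumer reading domain events from a Kafka topic.
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",        # placeholder broker address
    "group.id": "liquidity-analytics",        # hypothetical consumer group
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["liquidity.payment.created"])  # hypothetical live port topic

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            raise RuntimeError(msg.error())
        payload = msg.value()  # Avro-encoded bytes; decode with the registered schema
        # ... apply the streaming logic here, e.g. update a cash position aggregate ...
finally:
    consumer.close()
```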

Services: Now, focusing on our data mesh itself, here are the core components along with their roles and responsibilities.

Data Catalog: This is probably the main component; you can see it as the overall directory. It is thanks to the data catalog that data products can discover and understand each other. The mission of the data catalog is to own the registry of data products, their output and input ports, as well as their schemas, documentation, SLOs, and ACLs. Live ports can be registered in the form of a Kafka topic, while point-in-time datasets, or snapshots, can be registered in the form of an S3 bucket.
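As a rough illustration of the kind of interaction we have in mind, here is a hypothetical sketch of a data product registering its output ports with the data catalog over REST. The endpoint, payload shape, and product names are assumptions made for the example, not the actual catalog API.

```python
# Hypothetical sketch: register a data product and its output ports.
import requests

registration = {
    "dataProduct": "liquidity-positions",            # hypothetical product name
    "outputPorts": [
        {
            "name": "positions-live",
            "type": "live",                          # live port backed by a Kafka topic
            "kafkaTopic": "liquidity.positions",
            "schemaSubject": "liquidity.positions-value",
            "slo": {"freshnessSeconds": 60},
            "acl": ["bi-team", "fraud-detection"],
        },
        {
            "name": "positions-daily-snapshot",
            "type": "snapshot",                      # point-in-time dataset backed by S3
            "s3Uri": "s3://data-mesh/liquidity/positions/",
        },
    ],
}

resp = requests.post("https://catalog.internal/api/v1/data-products", json=registration)
resp.raise_for_status()
```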

Replication Service: This one is also very important, but more as a facilitator: it removes the burden of data replication from the data products. Basically, a data product can "register" its own RDBMS table with the replication service and expect the data to be captured automatically thanks to CDC and pushed to a live port of the data catalog. The replication service always acts on behalf of a data product or a user.
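Here is a similarly hypothetical sketch of a data product registering one of its RDBMS tables with the replication service so that changes are captured through CDC and pushed to a live port. Again, the endpoint and fields are assumptions for illustration only.

```python
# Hypothetical sketch: ask the replication service to capture a table via CDC.
import requests

resp = requests.post(
    "https://replication.internal/api/v1/replications",    # placeholder endpoint
    json={
        "dataProduct": "liquidity-positions",               # product the service acts on behalf of
        "sourceDatabase": "liquidity",                       # hypothetical source database
        "sourceTable": "cash_positions",                     # hypothetical source table
        "targetLivePort": "liquidity.cash_positions.cdc",    # live port registered in the catalog
    },
)
resp.raise_for_status()
```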

Data Processing Engine: The go-to service to manage SQL requests, be they continuous queries or batches. Depending on the type of request, the result is automatically registered in the data catalog as either a live port (continuous query) or a snapshot (batch). Like the replication service, the processing engine always acts on behalf of a data product or a user.
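To illustrate the difference between the two modes, here is a hypothetical sketch of submitting a continuous query and a batch query to the processing engine. The endpoint, request fields, table names, and SQL are illustrative assumptions, not the actual API.

```python
# Hypothetical sketch: submit a continuous query and a batch query.
import requests

continuous_query = {
    "mode": "continuous",                    # result registered as a live port
    "name": "cash-position-by-currency",
    "sql": """
        SELECT tenant_id, currency, SUM(amount) AS total_amount
        FROM liquidity_payment_created
        GROUP BY tenant_id, currency
    """,
}

batch_query = {
    "mode": "batch",                         # result registered as a snapshot
    "name": "positions-end-of-day",
    "sql": "SELECT * FROM liquidity_positions WHERE business_date = CURRENT_DATE",
}

for query in (continuous_query, batch_query):
    resp = requests.post("https://processing.internal/api/v1/queries", json=query)
    resp.raise_for_status()
```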

Data Copy Tool: A data management service to export, import, delete, or clone the data of a given tenant. It enables some dynamism in the way our products manage their data, allowing them to back up or restore from a point-in-time backup. Backups are registered as snapshots in the data catalog.
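And one last hypothetical sketch, showing how a product could request a tenant-level export through the data copy tool, with the resulting backup registered as a snapshot in the data catalog. As before, the endpoint and payload are assumptions.

```python
# Hypothetical sketch: request an export (backup) of one tenant's data.
import requests

resp = requests.post(
    "https://datacopy.internal/api/v1/exports",                     # placeholder endpoint
    json={
        "dataProduct": "liquidity-positions",                        # hypothetical product name
        "tenantId": "tenant-42",                                     # hypothetical tenant
        "target": "s3://data-mesh/backups/liquidity/tenant-42/",     # snapshot location
    },
)
resp.raise_for_status()
# The resulting backup would then show up as a snapshot in the data catalog.
```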

Permission Service: Resolves access control levels against the hierarchies of entities described in the data catalog.

Full Text Search: Enables Lucene-like search over the data registered in the data catalog.

This concludes the first article of this series. It was mostly about the why and the what; we plan to continue presenting our progress and to get into the details of the how in upcoming articles.

Hopefully, it gave you some food for thought, especially if you or your company are embarking on the data mesh journey as well! Stay tuned!