The next generation of data architectures
Chief architect at IOTICS explores the questions and challenges the next generation of data architecture faces
In this two-part post I discuss how data architectures can evolve to address the new challenges that enterprises face going forward.
In part 1, I'll examine the current challenges that state-of-the-art data architectures fail to address satisfactorily; in part 2, I'll discuss how data architectures can evolve to overcome their limitations.
Going beyond current data architectures
During my work as Chief Architect at IOTICS, I get asked a lot about what good looks like for data architectures: the pros and cons of the different approaches, and what state of the art really looks like.
I wanted to tackle this in two parts. In this first part, I'll look at the state of play for many organisations: where the frustrations lie, and why current data architectures fail to address some of the new and emergent challenges.
Solving shared problems requires sharing knowledge
Data architectures have been evolving for the last 40 years: from data silos to data warehouses, via data lakes, and now data mesh. Their anatomy is fairly well understood; as a primer, this article by Harmeet Nadrey does a great job of providing a succinct description of the state of the art.
As ever, the world evolves fast and data architectures need to catch up.
Looking outward, in a global world and economy we're all interconnected, and problems that not long ago were solvable within a single enterprise are now "shared problems": from reaching net zero by 2050, to dealing with Covid, to re-planning supply chains when a cargo ship blocks the Suez Canal (to mention a few striking examples).
To solve shared problems, parties need to collaborate and interoperate seamlessly across boundaries (organisational, institutional, geographical). This is evident in the private sector and in the science community alike. Efforts like the EU Strategy for Data and Gaia-X are institutionalising this demand.
Can the current data architecture model cope with solving these “shared problems”?
Data architectures: the state of play
As said, software architectures evolve and play catch-up: current thinking is limited to optimising the enterprise's internal analytics use cases. The technical discourse focuses on describing a very efficient data gathering, transformation and consumption pipeline, glued together by technology that enables cataloguing, security and governance.
Modern data architecture: Efficient data gathering and organisation for analytics
Even the data mesh community – at the forefront of pushing the existing data architecture envelope – limits its emphasis to optimising access to and management of the analytical plane (in Dehghani's own words). In essence, they are streamlining the creation and management of a centralised architecture that optimises the 5 V's of big data: Volume, Velocity, Variety, Veracity and Value.
These architectures are criticised for their complexity and for the expense of scaling them. A centralised architecture prioritises technical excellence that optimises Volume, Velocity and Variety, but lacks focus on Veracity and Value because by the time data reaches its final destination it has lost context. This loss of context makes it difficult to understand the state of the system at the point where the data or event was generated (loss of veracity) and, as a consequence, reduces the ability to discover valuable insights (loss of value).
The data mesh architecture approach limits these problems by "shifting left" the onus of keeping the data "in context" to the production side: data becomes a "product" and product thinking applies, i.e. make data "valuable" from the outset.
Data mesh: shifting left by creating data products
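As an illustrative sketch of "data as a product" (the class and field names below are my own, not a data mesh standard), a data product can be pictured as data packaged together with the context that preserves its veracity downstream:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class DataProduct:
    """Data packaged with the context that keeps it trustworthy downstream."""
    name: str
    rows: list            # the data itself
    schema: dict          # column name -> type, so consumers can interpret it
    owner: str            # the accountable producing team
    source_system: str    # where the data originated (its operational context)
    produced_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

# The producing (operational) team publishes data already "in context":
orders = DataProduct(
    name="orders.daily",
    rows=[{"order_id": 1, "total": 42.0}],
    schema={"order_id": "int", "total": "float"},
    owner="sales-domain-team",
    source_system="erp-eu-west",
)
print(orders.name, orders.owner)
```

The point of the sketch is that ownership, schema and provenance travel with the data from the outset, rather than being reconstructed after the fact by the consuming side.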
Going further, cracks appear
Expanding our focus beyond the analytics use cases we see that challenges emerge with this architecture in other sets of use cases.
Linking the analytical data plane back to the operational data plane: data is data
At the broadest scale, we see that in the current generation of data architectures data always flows in one direction, from operations to analytics: left to right (or even top to bottom, depending on how you imagine your data silo).
Data silos have their place: they offer a simple model which is easy to understand, secure and efficient to operate within. Of course, the downside is that data/information silos, as traditionally understood, don't scale very well for organisations that want to collaborate by sharing data.
In fact, in many ways it feels like looking at the rendering of a fractal function: zoom out and you’ll observe the very same pattern that current data architectures are designed to produce: the enterprise silo.
In fairness, the data mesh architecture attempts to break the “enterprise silos” by promoting the idea that “data products” are the “architecture quantum”, but critics of this approach say that a “data product” is nothing more than a fancy way to describe a “data silo”. The saving grace is the centralisation of infrastructure management and governance. The desired outcome of centralising these functions is to reach a cohesive vision for the enterprise that surpasses the limitations of the narrow viewpoint imposed by “thinking in products”.
Data mesh is a very valid and scalable approach, since it decouples the technical infrastructure management from the data product management; the question, though, is still open: how do we efficiently enable bidirectional data flow between the analytical plane and the operational plane? The question is pertinent because an efficient mechanism to feed insights extracted from analysis back into operations directly affects business velocity.
Interestingly, the proponents of "data-centric architectures" address the concern of linking the analytical and operational planes rather elegantly. A data-centric approach assumes that there's a core common model built on semantically described data, which applications (operational or analytical) tap into and share.
Such a model grows in an agile manner, driven on one end by the need of applications to interoperate with each other and, on the other end, by the evolving nature of the enterprise (applications come and go, data evolves).
Data-centric approach: a common data fabric that unifies access to data for operational and analytical use cases
It’s easy to imagine how the data layer can be thought of as a data fabric (in the original accepted meaning).
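A minimal way to picture such a shared, semantically described core (a toy sketch of the pattern, not IOTICS's or any vendor's implementation) is a triple store that both operational and analytical applications read from and write to:

```python
# Toy semantic data layer: facts are (subject, predicate, object) triples
# that any application, operational or analytical, can tap into.
class DataFabric:
    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))

    def query(self, subject=None, predicate=None, obj=None):
        """Return triples matching the given pattern (None acts as wildcard)."""
        return [t for t in self.triples
                if (subject is None or t[0] == subject)
                and (predicate is None or t[1] == predicate)
                and (obj is None or t[2] == obj)]

fabric = DataFabric()
# An operational app records an event...
fabric.add("pump-17", "hasStatus", "overheating")
# ...an analytical app enriches the same shared model with an insight...
fabric.add("pump-17", "predictedFailureDays", 3)
# ...and the operational side reads that insight straight back.
print(fabric.query(subject="pump-17"))
```

Because both planes share one model, the "feedback loop" from analytics to operations is just another read against the fabric, rather than a separate pipeline.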
Reverse ETL tools
Reverse ETL tools are gaining popularity. They address the need to automate the copying of data from cloud data warehouses or data lakes to popular cloud-based tools used to perform "operational analytics."
Such tools work very well at delivering data for human consumption because they come equipped with predefined connectors for the most common platforms, but they fall short when it comes to automating the delivery of events back to operations.
In practice, these tools work very well for B2C use cases, such as copying data to a Tableau server or to a CRM system so that users can consume the insights.
They fall short for B2B use cases, which typically require machine-to-machine interactions; in these cases a new approach to sharing is required.
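In essence, a reverse ETL job reads a query result out of the warehouse and pushes each row through a destination connector. A minimal sketch, using sqlite3 as a stand-in warehouse and a stubbed CRM connector (real tools ship prebuilt connectors for this step; the table and field names are hypothetical):

```python
import sqlite3

# Stand-in "warehouse": an in-memory SQLite table of computed insights.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE churn_scores (customer_id TEXT, score REAL)")
warehouse.executemany("INSERT INTO churn_scores VALUES (?, ?)",
                      [("c-001", 0.91), ("c-002", 0.12)])

sent_to_crm = []  # stub connector target; a real tool would call the CRM's API

def crm_upsert(record):
    """Stubbed destination connector (hypothetical, not a real CRM API)."""
    sent_to_crm.append(record)

def reverse_etl(conn, query, destination):
    """Copy each row of a warehouse query result to a destination connector."""
    for customer_id, score in conn.execute(query):
        destination({"customer_id": customer_id, "churn_score": score})

reverse_etl(warehouse,
            "SELECT customer_id, score FROM churn_scores WHERE score > 0.5",
            crm_upsert)
print(sent_to_crm)  # only the high-risk customer is synced
```

Note that the flow terminates at a tool for human consumption; there is nothing here that delivers an event back into an operational system, which is exactly the gap described above.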
Interoperate across boundaries
Current data architectures look inwards. The “enterprise” is the boundary delimiting the data architecture.
For efficiency purposes, all cross-cutting concerns are centralised: management of the infrastructure, governance, access policy management, etc. This is a common trait observable across all models (data mesh, data lake, data-centric). Whilst centralisation of some of these aspects makes perfect economic sense, in other cases it is an impediment: specifically, when applications need to access data across boundaries.
Such boundaries may be internal to the enterprise or, in most cases, external. Let’s consider some examples:
1 – Mergers and acquisitions. Enterprises grow by merging with or acquiring other enterprises. Transformation processes take time and cost money, and during this phase access to data is limited.
Mergers and acquisitions
2 – Regulatory constraints. For regulated enterprises, a single centralised infrastructure is impossible (read: illegal). Data can cross boundaries only under certain conditions, yet sharing and exchanging are required for the enterprise to operate.
The regulated enterprise: boundaries exist within. Selective data sharing may occur, with data crossing the boundary
3 – Inter-enterprise sharing and exchanging. Global challenges require multiple parties to exchange and share data. Consortia of enterprises form to cooperate, collaborate or compete on common goals. Similarly to case 2, hard boundaries need to be crossed.
Consortia: cooperate, collaborate and compete on common goals; data is shared selectively among the consortium participants
4 – Interoperate without sharing data. Interoperability refers to the ability of systems or enterprises to exchange or share information. Typically this happens by parties exchanging data, but it is conceivable that in certain circumstances data sharing isn't an option. The root causes can vary from performance (there is simply too much data to ship), to privacy (for example, personal data), to legal constraints (GDPR, etc.). A hybrid model, where an App can be allowed to run in a controlled environment, is conceivable and desirable. In such cases the App is inspected, vetted and approved before being allowed to run. How can one design a data layer that allows any App to find, access and understand the relevant data?
Hybrid model where data can’t be shared but applications are allowed to run in a controlled context (escrow, DMZ, …) to access data
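One way to picture the vetting step (an illustrative sketch of the pattern only, not a security implementation, and every name in it is hypothetical): the controlled environment checks an App's code against an allow-list of approved fingerprints before letting it run over the data, and only the App's result, never the raw data, leaves the boundary.

```python
import hashlib

# Data that must not leave the boundary.
private_records = [{"patient": "p1", "age": 34}, {"patient": "p2", "age": 51}]

def fingerprint(app_source: str) -> str:
    """Identify an App by the SHA-256 hash of its source."""
    return hashlib.sha256(app_source.encode()).hexdigest()

# The inspected-and-approved App: note it returns only an aggregate.
APP_SOURCE = "lambda records: sum(r['age'] for r in records) / len(records)"
APPROVED_HASHES = {fingerprint(APP_SOURCE)}  # populated after vetting

def run_in_escrow(app_source, records):
    """Run an App over in-boundary data only if its fingerprint was approved."""
    if fingerprint(app_source) not in APPROVED_HASHES:
        raise PermissionError("App has not been vetted for this environment")
    app = eval(app_source)   # toy stand-in for a real sandboxed runtime
    return app(records)      # only the result crosses the boundary

print(run_in_escrow(APP_SOURCE, private_records))  # prints 42.5, not the data
```

Hash allow-listing here stands in for the real inspection and sandboxing work; the design point is that approval attaches to a specific, verifiable piece of code, and the data layer releases results rather than records.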
Solving shared problems requires collaboration and competition underpinned by sharing data and, more generally, by interoperating across boundaries, irrespective of whether these boundaries are internal or external to the organisation.
A new way of thinking is in order: where “data” is at the centre; where secure and selective sharing is not an afterthought, but rather an integral part of the architecture.
Here at IOTICS we're addressing these issues head on, and our product, IOTICSpace, is the embodiment of the next generation of data architectures, available now. In my next piece I'll discuss the characteristics of IOTICSpace and its foundations.
Join Our Community
We enable the world’s data to interact safely and securely with other data, of all types, in all places, dynamically.