Comparing Trending Data Management Strategies
This is part of Solutions Review’s Premium Content Series, a collection of reviews written by industry experts in maturing software categories. In this submission, ChaosSearch Founder and CTO Thomas Hazel pits the data lake against the data mesh to compare two trending strategies for managing data.
As modern organizations struggle to manage ever-increasing amounts of enterprise data, many are reassessing their data management strategies to determine the optimal approach to delivering business insights and analytics at scale.
With this goal in mind, most organizations are looking for ways to analyze data without having to spend additional time and resources moving or transforming it. As a result, we have seen data lake and data mesh architectures gain popularity; these approaches promise to meet the accessibility, consistency, data quality, and governance requirements that organizations need to perform large-scale data analytics.
But the question remains: which of these two solutions is better? Whether keeping data distributed in a data mesh or centralizing it in a data lake, every organization must weigh its own set of criteria to determine the best solution for its business.
Setting the scene: the data lake
The rise of Big Data and the challenges it highlighted for traditional enterprise solutions inspired James Dixon to coin the term “data lake” over a decade ago, in 2010. At its core, a data lake promises to break down data silos by serving as a single landing repository that centralizes, organizes, and protects large amounts of data from multiple sources. It follows a schema-on-read approach, meaning data is stored in its raw form and a schema is applied only when the data is read. It can store structured, semi-structured, and unstructured data, typically on cloud storage platforms such as AWS S3.
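To make the schema-on-read idea concrete, here is a minimal sketch in Python. It assumes raw records arrive as JSON lines and that the reader, not the writer, decides which fields and types matter. The names `RAW_EVENTS`, `ORDER_SCHEMA`, and `read_with_schema` are hypothetical and used only for illustration; a real data lake would hold these records as objects in cloud storage rather than in a list.

```python
import json

# Raw records land in the "lake" as-is; nothing validates them on write.
# (A plain list stands in for objects in cloud storage.)
RAW_EVENTS = [
    '{"user": "ana", "amount": "19.90", "country": "PT"}',
    '{"user": "bo", "amount": 5}',
    '{"user": "cy", "note": "free-text record with no amount"}',
]

# Hypothetical schema chosen by the reader: field name -> (type, default).
ORDER_SCHEMA = {"user": (str, ""), "amount": (float, 0.0)}


def read_with_schema(raw_lines, schema):
    """Parse each raw line and coerce it to the schema at read time."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        row = {}
        for field, (cast, default) in schema.items():
            # Missing fields fall back to the default; extra fields are ignored.
            row[field] = cast(record.get(field, default))
        rows.append(row)
    return rows


orders = read_with_schema(RAW_EVENTS, ORDER_SCHEMA)
total = sum(row["amount"] for row in orders)
```

A different consumer could read the same raw records with a different schema, which is the flexibility that distinguishes this approach from validating data against a fixed schema when it is written.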
These flexible storage solutions have become increasingly popular among modern businesses, but a common misconception is that they inherently include analytical features. To perform indexing, transformation, querying, and analysis, the data lake must be connected to a combination of other cloud-based services and software tools. In a typical data lake architecture, a self-service data analytics engine sits on top of a cloud-based data repository. Only then can an organization realize the true benefits of a data lake and act on the full value of its data assets.
The rise of data mesh
Until recently, data warehouses and data lakes were the two main enterprise data management solutions. But a new approach has emerged in the last year: the concept of the “data mesh”. In fact, it is quickly becoming one of the most talked-about buzzwords in the industry. Thoughtworks defines a data mesh as “a shift in modern distributed architecture that applies platform thinking to create a self-service data infrastructure, treating data as the product.”
This type of architecture supports the idea of distributed data, where all data is accessible to those who have the right to access it. One of the main differentiators between a data lake and a data mesh is that in a data mesh, the data does not need to be consolidated into a single data lake and can reside in different databases. For this reason, a data mesh architecture connects various data sources, including data lakes, into a cohesive infrastructure.
Data lake vs. data mesh: what is the difference?
Data lakes have come a long way since the failure of the first instances built on Hadoop. While many industry experts still remember the byproduct of “data swamps”, there have been tremendous innovations in this area since then. New data lakes remove the constraints inherent in traditional approaches to storage, infrastructure, and access for analytics. Today’s modern data lakes are cloud-native and can be enabled to index multiple types of data and make that data readily available and accessible to various business stakeholders.
A query that spans a data mesh is only as fast as the slowest source it touches. For organizations that store data in multiple silos but are looking for more efficient queries, it would make more sense to leverage a data lake platform for analytics within the existing data mesh architecture.
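The latency point can be illustrated with a small simulation. The sketch below assumes a federated query fans out to every source in parallel and must wait for all of them before returning. The source names and latencies are made up, and `asyncio.sleep` stands in for real database calls.

```python
import asyncio
import time

# Hypothetical domain sources and simulated response times in seconds.
SOURCE_LATENCY = {"sales_db": 0.05, "crm_db": 0.10, "logs_lake": 0.30}


async def query_source(name, latency):
    """Stand-in for a remote query; sleeps instead of hitting a database."""
    await asyncio.sleep(latency)
    return name, f"rows from {name}"


async def federated_query(sources):
    """Fan out to every source at once and wait for all of them."""
    tasks = [query_source(name, latency) for name, latency in sources.items()]
    return dict(await asyncio.gather(*tasks))


start = time.perf_counter()
results = asyncio.run(federated_query(SOURCE_LATENCY))
elapsed = time.perf_counter() - start
slowest = max(SOURCE_LATENCY.values())
print(f"elapsed {elapsed:.2f}s, slowest source {slowest:.2f}s")
```

Even with perfect parallelism, the elapsed time cannot drop below the latency of the slowest source, so adding one slow silo to the mesh slows every query that touches it.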
There are solutions that can remove some of these known challenges of data mesh architectures. For example, cloud data platforms can virtually publish logical data views for querying in the data lake without complex extract, transform, and load (ETL) pipelines. This approach improves the democratization of data within an organization, without the need for data scientists or data engineers.
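As a rough analogy for a logical view, the sketch below uses Python's built-in `sqlite3` module as a stand-in for a cloud data platform. The table and view names (`raw_events`, `service_errors`) are hypothetical. The point is only that a view stores a query definition, not a transformed copy of the data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_events (ts TEXT, service TEXT, status INTEGER)")
conn.executemany(
    "INSERT INTO raw_events VALUES (?, ?, ?)",
    [
        ("2022-01-01T10:00", "checkout", 200),
        ("2022-01-01T10:01", "checkout", 500),
        ("2022-01-01T10:02", "search", 200),
    ],
)

# The view stores only a query definition; no rows are copied or transformed.
conn.execute(
    """
    CREATE VIEW service_errors AS
    SELECT service, COUNT(*) AS errors
    FROM raw_events
    WHERE status >= 500
    GROUP BY service
    """
)

before = conn.execute("SELECT service, errors FROM service_errors").fetchall()

# New raw data is visible through the view immediately, with no pipeline run.
conn.execute("INSERT INTO raw_events VALUES ('2022-01-01T10:03', 'search', 503)")
after = conn.execute(
    "SELECT service, errors FROM service_errors ORDER BY service"
).fetchall()
```

Because the view is evaluated at query time, analysts see fresh data without anyone scheduling or maintaining a separate transformation job.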
Since data lakes and data mesh architectures take different approaches to problems such as data integration, the two strategies can be viewed as complementary rather than as a choice of one over the other. However, while everyone loves the vision of ubiquitous data, the truth is that many companies don’t realize what it takes to achieve it. Data mesh and data democratization are one and the same thing: you can’t have a decentralized data architecture when there are gatekeepers limiting who has access. Therefore, to achieve the goal of a distributed data mesh, companies must first enable the free flow of data across the entire organization, an intrinsic byproduct of data lakes.
There is no single solution to becoming a data-driven organization. For some, a data mesh would be useful if they store their data across multiple databases, while those looking for a solution that allows queries without data movement might benefit more from a data lake. Ultimately, the desired goal for most organizations using one of these data management solutions is to have a unified analytics platform that can deliver powerful insights without the need for complex behind-the-scenes support from intermediaries. As more and more organizations develop new approaches to democratize access to data, this space will be an important area to watch in the years to come.