Will the data lake prevail?

By Micha van der Ende, Partner, Data

Reflecting the sheer quantity of data they now hold, many organisations have moved on from data warehouses to data lakes. Data lakes integrate closely with the cloud and its services, and have proved themselves to be an indispensable asset for storing and processing data. They are also valuable tools for historicising data, that is, keeping a record of how data has changed over time.

Structured and unstructured data

The term ‘data lake’ first came about in 2011. The original purpose of data lakes was to extract and store data for analysis in single Hadoop-based repositories. This opened the door to a wider range of data types, bringing the once-static term ‘Big Data’ to life. Unlike traditional data warehouses, which were limited to structured data, the data lake could also handle semi-structured and unstructured data.

However, these technological advances brought new challenges along with the opportunities. The adoption of a ‘schema-on-read’ method, in which structure is applied only when the data is read rather than when it is written, meant a lack of control over the stored data, and lakes quickly turned into ‘data swamps’. In addition, the complexity of managing a Hadoop environment made the data lake less attractive.
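To make the schema-on-read idea concrete, here is a minimal sketch in plain Python. It is an illustration only, not tied to Hadoop or any specific product: the records, the field names and the `read_with_schema` helper are all hypothetical. It shows how anything can be written to the lake unchecked, so that problems only surface when someone reads the data.

```python
import json

# Raw records landed in the lake exactly as they arrived.
# With schema-on-read, nothing is validated at write time.
RAW_LINES = [
    '{"customer_id": 1, "amount": "19.99"}',
    '{"customer_id": "2", "amount": 5}',
    '{"cust": 3, "total": 7.5}',  # a producer quietly changed its field names
]

# Hypothetical schema the reader expects; it is applied only at read time.
SCHEMA = {"customer_id": int, "amount": float}


def read_with_schema(lines, schema):
    """Parse raw lines and cast each field; collect records that do not fit."""
    rows, rejected = [], []
    for line in lines:
        record = json.loads(line)
        try:
            rows.append({name: cast(record[name]) for name, cast in schema.items()})
        except (KeyError, TypeError, ValueError):
            rejected.append(record)
    return rows, rejected


rows, rejected = read_with_schema(RAW_LINES, SCHEMA)
print("usable rows:", rows)
print("rejected at read time:", rejected)
```

The third record was stored without complaint and is only discovered to be unusable by the reader. Multiply that across many producers and many years, and you have a data swamp.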

The rise of the cloud made data lakes more attractive

The rise of the cloud brought a turning point. Many vendors packaged data lake storage into their cloud offerings, such as Amazon S3, Azure Data Lake Storage (ADLS) and Google Cloud Storage. This reduced the complexity of Hadoop management while preserving the benefits of data lakes, particularly as these storage services were cost-effective. Open-source developments and the integration with the cloud also meant that vendor lock-in could be avoided to a certain extent.

Data lakehouses

Today, data lakes play a critical role, especially given the exponential growth of data volumes. The traditional ETL (extract, transform, load) process in data warehouse environments, which processed data overnight and moved it multiple times, is no longer feasible. In response, data lakehouses have emerged: they use data lakes to land data and present it virtually as a database, while keeping up with its volume and velocity.

Technologies such as Delta Lake, Apache Iceberg and Apache Hudi enable data lakes to historicise data through ACID transactions, a functionality normally reserved for SQL databases. Even major players like Microsoft, Databricks and Snowflake have embraced the data lake as their primary storage method. It is also becoming easier to ensure ownership and stewardship, using tools and techniques such as data catalogues and data lineage.
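The link between transactions and historicised data can be sketched with a toy example. The `VersionedTable` class below is hypothetical and deliberately simplified: it is not how Delta Lake, Iceberg or Hudi are implemented, and it models only all-or-nothing commits and reading older versions, not isolation or durability. The idea it illustrates is that if every change is recorded as a complete commit in an append-only log, any earlier state of the table can be rebuilt.

```python
from dataclasses import dataclass, field


@dataclass
class VersionedTable:
    """Toy key-value table backed by an append-only commit log."""

    _log: list = field(default_factory=list)

    def commit(self, upserts=None, deletes=None):
        """Apply a set of changes as one unit and return the new version number."""
        upserts = dict(upserts or {})
        deletes = set(deletes or ())
        # Validate the whole commit first, so a bad commit leaves the log untouched.
        missing = deletes - set(self.read())
        if missing:
            raise KeyError(f"cannot delete unknown keys: {sorted(missing)}")
        self._log.append((upserts, deletes))
        return len(self._log)

    def read(self, version=None):
        """Rebuild the table as of a given version (default: the latest)."""
        if version is None:
            version = len(self._log)
        state = {}
        for upserts, deletes in self._log[:version]:
            for key in deletes:
                state.pop(key, None)
            state.update(upserts)
        return state


table = VersionedTable()
v1 = table.commit(upserts={"c1": "Amsterdam"})
v2 = table.commit(upserts={"c1": "Utrecht", "c2": "Rotterdam"})
v3 = table.commit(deletes={"c2"})

print("latest:", table.read())
print("as of version 1:", table.read(version=v1))
```

Reading `version=v1` returns the customer record as it looked before it was updated, which is the historicising behaviour described above. A commit that fails validation adds nothing to the log, so readers never see a half-applied change.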

The role of data lakes in modern data stacks is far from over. Their ease of deployment and management, and the benefits they offer within today’s data architectures, will ensure that this technology continues to evolve and remains at the heart of the ongoing development of data management. So the answer is a resounding ‘yes’: the data lake will prevail.

Want to know more?

If you want information about Valcon’s data offerings, dive into Valcon’s World of Data. You are also more than welcome to reach out to Micha van der Ende for further information.
