Building a platform to process big data

By Frederick Venter, Senior Principal, Data

Data is the new oil. You’ve all heard it. So imagine a world where there is an abundance of oil – it’s springing leaks all over the place, rivulets of the black stuff are running through towns and villages. But people have no idea how to capture it, how to transport it, how to monetise it.

And when it comes to data, this is where many organisations have found themselves – a world of little control and missed opportunity. Most organisations have tried to solve the problem of storing and processing data in one of two ways. The first is data warehousing, the meticulous curation of every piece of structured data, with strict rules imposed on its format. This well-trodden path served many organisations well for decades, but it was by no means perfect – it was rigid and expensive, and with the dawn of data science it could no longer provide the scale and tools to do an effective job.

The other approach is the data lake, where you dump unfiltered and uncleansed data into a large pit of cheap, commodity storage and, when you want to analyse it, unleash an unholy amount of processing power. This solution has been a revelation – a less cumbersome structure paired with near-unlimited power – and it has underpinned much of the growth of machine learning and AI over the last 15 or so years. But the danger with a lake is that it turns into a data swamp, an oozing mess of undocumented data where no one can find what they're looking for and data security is shaky.

The Databricks lakehouse has proved to be the answer to simplifying data storage and enabling swift analysis on data lakes. Combining the heft of a big data platform with the stability and governance of a data warehouse, a Databricks lakehouse has multiple benefits:

  • Scalability: a lakehouse can handle any amount of data, from terabytes to petabytes, and can scale up or down automatically to meet demand as it changes. You only pay for what you use, and you get the benefit of the effectively unlimited storage and computing resources of the cloud provider (see the cluster configuration sketch after this list).
  • Performance: a lakehouse is built on Apache Spark, the open-source engine for big data processing known for its speed and broad adoption. Databricks enhances it for cloud environments with Delta Lake, a storage layer that brings reliability and quality guarantees to data held on cheap object storage (see the Delta sketch after this list).
  • Collaboration: a lakehouse enables you to work seamlessly with your team members across roles and disciplines. You can use notebooks to write code in Python, Scala, R, or SQL, and share them with others for interactive development and debugging. You can also integrate with popular tools like GitHub, Azure DevOps, Power BI, and MLflow to streamline your workflows and track your experiments (see the MLflow sketch after this list).
  • Security: a lakehouse supports encryption, access control, auditing, and governance, helping to keep your data secure and compliant. Many existing security tools and protocols from your cloud provider can be tightly integrated into your Databricks lakehouse.
  • Reach: Both Spark and Delta Lake originated with the team behind Databricks. Both technologies were subsequently open sourced and are fast becoming the de facto processing and storage engines for the cloud; they are used by the likes of Microsoft, Amazon and Google as part of their own cloud data offerings. This further bolsters the status of these technologies in the market and makes cross-platform use easy.
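
To make the scalability point concrete, here is a minimal sketch of creating an autoscaling cluster through the Databricks Clusters REST API. The workspace URL, token, runtime version and node type are assumptions you would replace with values from your own environment:

```python
import os
import requests

# Hypothetical workspace URL and access token -- substitute your own.
HOST = os.environ["DATABRICKS_HOST"]    # e.g. "https://<workspace>.cloud.databricks.com"
TOKEN = os.environ["DATABRICKS_TOKEN"]  # a personal access token

# A cluster that scales between 2 and 8 workers as load changes, and
# terminates itself after 30 idle minutes -- so you only pay for what you use.
cluster_spec = {
    "cluster_name": "autoscaling-demo",
    "spark_version": "13.3.x-scala2.12",  # assumption: pick a current runtime
    "node_type_id": "i3.xlarge",          # assumption: node types vary per cloud
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "autotermination_minutes": 30,
}

resp = requests.post(
    f"{HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
resp.raise_for_status()
print("Created cluster:", resp.json()["cluster_id"])
```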
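
And to show what the Delta storage layer adds in practice, here is a minimal sketch using the open-source delta-spark package on a local machine. The table path and data are made up for illustration:

```python
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip  # pip install delta-spark pyspark

# Local Spark session with Delta Lake enabled (the documented delta-spark setup).
builder = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

path = "/tmp/projects_delta"  # hypothetical path for this sketch

# Write a small table; Delta adds a transaction log on top of Parquet files.
spark.createDataFrame(
    [("bridge-a", 120.5), ("tunnel-b", 340.0)], ["project", "cost_meur"]
).write.format("delta").mode("overwrite").save(path)

# Updates are ACID: overwriting creates a new table version...
spark.createDataFrame(
    [("bridge-a", 125.0)], ["project", "cost_meur"]
).write.format("delta").mode("overwrite").save(path)

# ...and earlier versions stay queryable ("time travel").
spark.read.format("delta").option("versionAsOf", 0).load(path).show()
```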
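
Finally, a brief sketch of experiment tracking with MLflow, using a hypothetical experiment name and metric. On Databricks, the logged runs appear in the workspace UI, where colleagues can compare them side by side:

```python
import mlflow  # pip install mlflow

mlflow.set_experiment("demo-experiment")  # hypothetical experiment name

with mlflow.start_run():
    # Log a parameter and a metric for this run; anyone with access to the
    # experiment can later filter and compare runs on these values.
    mlflow.log_param("model", "baseline")
    mlflow.log_metric("rmse", 0.42)
```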

The message is: be careful where and how you store your data. It doesn't take too many wrong turns before your data lake becomes a swamp. If you're looking for the scale of a lake but with the stability and governance structure a data warehouse has traditionally offered, then a Databricks lakehouse could be the solution.

Case study: Valcon and Databricks – Lakehouses in action

Valcon has a strong record of implementing lakehouses to ensure customer success. One example is the work Valcon has done over the last three years to help a global construction company based in the Netherlands implement a cloud-based Databricks lakehouse.

The business challenge was to overcome a siloed data landscape: multiple data platforms, varying degrees of data maturity across departments, limited business access to (or overview of) the data, and the underutilisation of data as an asset. On top of that, the organisation had no unified solution for data science and AI.

To address these challenges, Valcon worked with the company to adopt a lakehouse approach, consolidating the existing data warehouses and lakes into one platform. Valcon also helped establish and staff a DevOps team representing all departments, working on the same platform to the same standards. This gave the business an overview of the available data and the ability to use it in a self-service way.

The company was finally able to create information products that delivered value to customers and stakeholders. It also leveraged the tools in Databricks to enable a new data science team to deliver solutions using the latest developments in machine learning and AI.

As a result, the company has achieved consistent data quality and eliminated duplication of effort, and alongside data-driven decision-making it benefits from improvements in efficiency and competitiveness. More people in the organisation now collaborate and share the insights they gain from data.

To become truly data driven, it is important to align your business goals with your data strategy. When it comes to the platform needed to support that strategy, we believe the lakehouse is the right technology to give you a head start.

Want to know more?

If you would like to speak to someone about how a Databricks lakehouse could benefit your organisation, please get in touch with Frederick Venter at [email protected].

If you want information about Valcon’s data offerings, take a read here, or dive into Valcon’s World of Data.
