How to Build a Modern Data Infrastructure Using a Lakehouse
Data continues to be one of the most critical assets and transformation drivers in small and large businesses. However, organizations must contend with ever-growing volumes of data that make it difficult for them to operate and manage their existing data lake and data warehouse architectures.
IT and business teams want a better approach to their data platform strategy, so that architects and engineers can spend less time on installation work (integrating different components and getting them to talk to each other) and more time designing data solutions.
If your organization is serious about modernizing its data management, you need an architecture that can help you achieve that goal. We’ve all heard about lakehouses lately, but how did we get here?
Data Warehouse vs Data Lake
Purushottam Darshankar
Purushottam Darshankar is Chief Data Architect at Persistent Systems based in London, UK. He has over 25 years of IT experience and is responsible for the design, architecture and deployment of big data solutions in the banking, financial services, healthcare, retail and utility industry domains. Prior to his current role at Persistent Systems, Purushottam worked for leading multinational companies such as Wipro, Siemens and Reliance Communication. He earned his master’s degree in electrical engineering from Government College of Engineering, Pune, and in Management Studies from IIM Kolkata.
Before recent advances in data management technologies, data warehouses were the architectural standard for storing and processing business data. While data warehouses for structured data were extremely reliable, the platform architecture faltered with the introduction of unstructured data.
Also, the data warehouse model required a lot of data preparation and data movement in the form of ETL (extract, transform and load) before any analysis could be done, so the time to insight was often very slow.
As a solution, companies began building data lakes, inexpensive repositories that stored any type of raw data, structured or unstructured, in an open file format so that it could later be transformed for various business use cases.
Unlike data warehouses, which enforce a table schema, data lakes do not enforce a schema, which allows them to store data from unknown sources, including videos, text files, images and audio. While the data lake was a promising strategy, inadequate data governance quickly turned many data lakes into “data swamps”: disorganized repositories of stale, underutilized data.
So how can companies get the best of both worlds, including the benefits of data warehouses and data lakes?
A combined approach – Data Lakehouse
Because both data warehouses and data lakes have advantages, companies developed a new data management architecture pattern – a lakehouse, which combines the cost-effectiveness, scalability and flexibility of a data lake with the data management and data structure of a data warehouse.
The separation of storage and compute enables increased availability and scalability at lower cost. Compute runs on separate clusters that can scale independently depending on the type of workload.
Lakehouses support Atomicity, Consistency, Isolation and Durability (ACID) transactions, which ensure data reliability and data integrity in the event of a failure or when different components are running simultaneously.
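To make the atomicity part of ACID concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module purely as a stand-in for a lakehouse table format, and the `orders` table and `write_batch` helper are hypothetical names chosen for illustration. The point is only that a batch either lands completely or not at all.

```python
import sqlite3

def write_batch(conn, rows):
    """Insert all rows or none: the whole batch is a single transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.executemany(
                "INSERT INTO orders (order_id, amount) VALUES (?, ?)", rows
            )
        return True
    except sqlite3.IntegrityError:
        return False

# In-memory database used only as an illustrative stand-in.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL NOT NULL)"
)

first_ok = write_batch(conn, [(1, 10.0), (2, 20.0)])
# The second batch contains a duplicate key, so the whole batch is rolled back,
# including the valid row with order_id 3.
second_ok = write_batch(conn, [(3, 30.0), (1, 99.0)])

row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(first_ok, second_ok, row_count)
```

Readers of the table never see the half-written second batch, which is the behavior a failed job in a lakehouse is expected to have as well.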
They also enforce schema standards and apply governance to ensure data is properly organized, governed, and consistent in a data lakehouse.
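Schema enforcement on write can be sketched in a few lines. This is an illustrative example in plain Python, not the API of any particular lakehouse product; `EXPECTED_SCHEMA`, `validate_record` and `append_rows` are hypothetical names.

```python
# Hypothetical table schema: column name -> required Python type.
EXPECTED_SCHEMA = {"customer_id": int, "email": str, "balance": float}

def validate_record(record, schema=EXPECTED_SCHEMA):
    """Return a list of schema violations for one record (empty if valid)."""
    errors = []
    for column, expected_type in schema.items():
        if column not in record:
            errors.append(f"missing column: {column}")
        elif not isinstance(record[column], expected_type):
            actual = type(record[column]).__name__
            errors.append(f"{column}: expected {expected_type.__name__}, got {actual}")
    for column in record:
        if column not in schema:
            errors.append(f"unexpected column: {column}")
    return errors

def append_rows(table, records):
    """Append the batch only if every record matches the schema."""
    problems = {}
    for index, record in enumerate(records):
        errors = validate_record(record)
        if errors:
            problems[index] = errors
    if problems:
        raise ValueError(f"schema violations, nothing written: {problems}")
    table.extend(records)

table = []
append_rows(table, [{"customer_id": 1, "email": "a@example.com", "balance": 12.5}])
```

A data lake would accept any of these records as-is; rejecting the bad batch at write time is what keeps the table from drifting into a swamp.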
And lakehouses give BI (business intelligence) tools direct access to data, improving the freshness of the data used for real-time reporting.
A robust, modern enterprise data architecture combines a data lake and a data warehouse in a lakehouse, alongside other purpose-built capabilities, for unified data management and seamless data movement.
While core data processing systems have remained more or less the same over the past few years, the supporting tools and platforms have rapidly proliferated. The right combination of individual tools and technologies should give you the ability to build the right modern data platform for your business.
Modern cloud reference architecture
A modern enterprise data architecture, enabled by a lakehouse, provides accessibility, speed, flexibility, and reliability, allowing organizations to optimize any data source and use it to make better business decisions. Now that you have a data lakehouse, you still need a variety of supporting services.
Think of an actual lake house (e.g. an Airbnb vacation home). Lake houses offer visitors great views but still need supporting services such as garbage collection, cleaning and maintenance, landscaping, security services, etc. For data lakehouses, the current platform ecosystem offers a number of supporting capabilities, and the main ones are listed below.
Automated workflow and orchestration
With automated workflow and orchestration in the cloud, organizations can ensure data flows smoothly and unimpeded to all parts of the organization while maintaining data quality and governance.
Vendors like Airflow/Astronomer, Prefect, and Dagster provide tools to orchestrate analytical and operational workflows.
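At their core, these orchestrators run tasks in an order that respects their dependencies. The sketch below shows that idea using only Python's standard library (`graphlib`, available from Python 3.9); it does not use the API of Airflow, Prefect or Dagster, and the task names and `run_pipeline` helper are made up for illustration.

```python
from graphlib import TopologicalSorter

def run_pipeline(tasks, dependencies):
    """Run each task after all of the tasks it depends on.

    tasks: mapping of task name -> zero-argument callable
    dependencies: mapping of task name -> set of upstream task names
    """
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        tasks[name]()
    return order

log = []
task_names = ("extract", "transform", "quality_check", "publish")
# Each hypothetical task just records that it ran.
tasks = {name: (lambda n=name: log.append(n)) for name in task_names}

dependencies = {
    "transform": {"extract"},
    "quality_check": {"transform"},
    "publish": {"quality_check"},
}

order = run_pipeline(tasks, dependencies)
print(order)
```

Production orchestrators add scheduling, retries, alerting and distributed execution on top of this dependency ordering.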
Data pipelines and ELT processing
This is the core of the cloud data platform, which guarantees that data arrives at its destination accurately, on time and in the right format.
It has evolved from traditional ETL (extract, transform and load) providers to a new class of cloud-native players (e.g. Fivetran, dbt and Matillion) capable of handling more complex dependencies across different data environments.
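The difference between ETL and ELT is mainly where the transformation happens: in ELT, raw data is loaded first and transformed afterwards inside the platform, typically with SQL. The following is a minimal sketch of that pattern, using sqlite3 as a stand-in for the cloud platform; the table names and sample rows are invented for the example.

```python
import sqlite3

# Extract: raw records as they might arrive from a source system (all strings).
raw_events = [
    ("2024-01-01", "eu", "12.50"),
    ("2024-01-01", "us", "7.25"),
    ("2024-01-02", "eu", "3.00"),
]

conn = sqlite3.connect(":memory:")

# Load: land the data untouched in a raw table.
conn.execute("CREATE TABLE raw_sales (sale_date TEXT, region TEXT, amount TEXT)")
conn.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)", raw_events)

# Transform: cast and aggregate inside the store, after loading.
conn.execute(
    """
    CREATE TABLE sales_by_region AS
    SELECT region, SUM(CAST(amount AS REAL)) AS total
    FROM raw_sales
    GROUP BY region
    """
)

totals = dict(conn.execute("SELECT region, total FROM sales_by_region"))
print(totals)
```

Because the raw table is kept, the transformation can be changed and re-run later without extracting the data again.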
AI/ML (Artificial Intelligence and Machine Learning)
This includes advanced analytics that apply ML and algorithmic modeling to optimize business decisions. This space is thriving, as evidenced by the increasing number of vendors supporting it.
Libraries like PyTorch, TensorFlow, and Rasa provide AI/ML algorithms for data scientists to train models, and notebooks like Jupyter and Zeppelin help them experiment with and customize those algorithms.
Data Management and Security
As the data stack becomes more complex, data governance and security have become increasingly important to secure data and maintain compliance throughout the data lifecycle. In the face of strict data security policies, access to data requires a controlled data access and authorization mechanism.
Governments enforce laws and regulations such as GDPR (General Data Protection Regulation) and HIPAA (Health Insurance Portability and Accountability Act) to protect PII (Personally Identifiable Information).
Vendors like OneTrust, Collibra, Privacera and Immuta help companies meet some of the security requirements that fall under these regulations.
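A controlled access mechanism often comes down to deciding, per role, which columns a user may see and which must be masked. The sketch below illustrates that idea in plain Python; the roles, column names and `read_row` helper are hypothetical and do not reflect how any of the vendors above implement their policies.

```python
# Hypothetical policies: which columns each role may read, and which are masked.
POLICIES = {
    "analyst": {"allowed": {"customer_id", "country", "email"}, "masked": {"email"}},
    "admin": {"allowed": {"customer_id", "country", "email"}, "masked": set()},
}

def read_row(row, role):
    """Return only the columns the role may see, masking PII where required."""
    policy = POLICIES.get(role)
    if policy is None:
        raise PermissionError(f"role {role!r} has no access to this table")
    visible = {}
    for column, value in row.items():
        if column not in policy["allowed"]:
            continue  # column is hidden entirely from this role
        visible[column] = "***" if column in policy["masked"] else value
    return visible

row = {"customer_id": 1, "country": "DE", "email": "a@example.com", "ssn": "000-00-0000"}
analyst_view = read_row(row, "analyst")
admin_view = read_row(row, "admin")
print(analyst_view)
print(admin_view)
```

Denying by default, as here for unknown roles and unlisted columns, is the safer design when the data includes PII.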
Data observability
Data observability is a new addition to the list of data platform capabilities and refers to the provision of monitoring and diagnostic capabilities for the entire flow of data.
Tools such as Monte Carlo, Great Expectations, and Bigeye provide automated monitoring, alerting, and triage to identify and assess data quality and discoverability issues.
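The kind of checks these tools automate can be illustrated with a small sketch covering volume, null rate and freshness. It is written in plain Python under assumed thresholds; `check_table` and its parameters are hypothetical and not taken from any of the tools named above.

```python
from datetime import datetime, timedelta, timezone

def check_table(rows, loaded_at, now, column,
                min_rows=1, max_null_rate=0.1, max_age=timedelta(hours=24)):
    """Return a list of alert messages for one table (empty if healthy)."""
    alerts = []
    # Volume: did we receive roughly as much data as expected?
    if len(rows) < min_rows:
        alerts.append(f"volume: {len(rows)} rows, expected at least {min_rows}")
    # Quality: how many values in the monitored column are missing?
    if rows:
        null_rate = sum(1 for r in rows if r.get(column) is None) / len(rows)
        if null_rate > max_null_rate:
            alerts.append(f"quality: {null_rate:.0%} of '{column}' values are null")
    # Freshness: how long ago was the table last loaded?
    if now - loaded_at > max_age:
        alerts.append(f"freshness: last load was {now - loaded_at} ago")
    return alerts

now = datetime(2024, 1, 3, 12, 0, tzinfo=timezone.utc)
rows = [{"amount": 10.0}, {"amount": None}, {"amount": 5.0}, {"amount": None}]

alerts = check_table(rows, loaded_at=now - timedelta(hours=30), now=now, column="amount")
for alert in alerts:
    print(alert)
```

In practice the thresholds are usually learned from historical behavior rather than hard-coded, and alerts are routed to the team that owns the table.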
In summary, there is growing interest in, and clarity about, a modern data architecture built on the foundations of the lakehouse and supported by a variety of vendors (including Amazon Web Services, Google, Azure, Starburst, Databricks, etc.) as well as data warehouse vendors.
Cloud data warehouses like Snowflake have grown rapidly, primarily focusing on SQL users and BI use cases. However, organizations have seen accelerated adoption of other technologies such as Databricks’ Data Lakehouse.
Persistent Systems offers services that support companies in modernizing their data landscape in the cloud.
Of course, choosing the Lakehouse approach is only a fraction of the work. A good data model with optimized data processing flow is a must for data platform performance and cost optimization.