How to Design and Implement a Data Lake?

Companies are continuously envisioning new and innovative ways to use data for operational reporting and advanced analytics. The Data Lake, a next-generation data storage and management solution, was developed to meet the ever-evolving needs of increasingly savvy users.

The white paper published by Knowledgent (a data and analytics firm) explores the challenges of the enterprise data warehouse and other current data management and analytic solutions. It describes the necessary features of the Data Lake architecture and the capabilities required to leverage a Data and Analytics as a Service (DAaaS) model. It also covers the characteristics of a successful Data Lake implementation and critical considerations for designing a Data Lake.

Current Enterprise Data Warehouse Challenges

Business users are continuously envisioning new and innovative ways to use data for operational reporting and advanced analytics. With the evolution of users’ needs coupled with advances in data storage technologies, the inadequacies of current enterprise data warehousing solutions have become more apparent. The following challenges with today’s data warehouses can impede usage and prevent users from maximizing their analytic capabilities:

  • Timeliness. Introducing new content to the enterprise data warehouse can be a time-consuming and cumbersome process. When users need immediate access to data, even short processing delays can be frustrating and cause users to bypass the proper processes in favor of getting the data quickly themselves. Users also may waste valuable time and resources to pull the data from operational systems, store and manage it themselves, and then analyze it.
  • Flexibility. Users lack not only on-demand access to any data they may need at any time, but also the ability to use the tools of their choice to analyze the data and derive critical insights. Additionally, current data warehousing solutions often store one type of data, while today’s users need to be able to analyze and aggregate data across many different formats.
  • Quality. Users may view the current data warehouse with suspicion. If it is unclear where the data originated and how it has been transformed, users may not trust it. Also, if users worry that the data in the data warehouse is missing or inaccurate, they may circumvent the warehouse in favor of getting the data themselves directly from other internal or external sources, potentially leading to multiple, conflicting instances of the same data.
  • Findability. With many current data warehousing solutions, users do not have a function to rapidly and easily search for and find the data they need when they need it. Inability to find data also limits the users’ ability to leverage and build on existing data analyses.

Advanced analytics users require a data storage solution based on an IT “push” model (not driven by specific analytics projects). Existing solutions are specific to one use case or a small family of use cases; what is needed is a storage solution that enables multiple, varied use cases across the enterprise.

This new solution needs to support multiple reporting tools in a self-serve capacity, to allow rapid ingestion of new datasets without extensive modeling, and to scale to large datasets while maintaining performance. It should support advanced analytics, such as machine learning and text analytics, and allow users to cleanse and process the data iteratively and to track the lineage of data for compliance. Users should be able to easily search and explore structured, unstructured, internal, and external data from multiple sources in one secure place.

The solution that fits all of these criteria is the Data Lake.

The Data Lake Blueprint

Data Lake Architecture

The Data Lake is a data-centered architecture featuring a repository capable of storing vast quantities of data in various formats. Data from web server logs, databases, social media, and third-party sources is ingested into the Data Lake. Curation takes place by capturing metadata and lineage and making them available in the data catalog (Datapedia). Security policies, including entitlements, are also applied.
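
To make the curation step more concrete, here is a minimal sketch of a catalog that records metadata, lineage, and entitlements for each dataset. It is an illustration only, not the white paper's design: the class names (CatalogEntry, DataCatalog), the field names, the dataset names, and the roles are all assumptions made for this example, and a real catalog would be a persistent, searchable service rather than an in-memory dictionary.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class CatalogEntry:
    """Metadata captured for one dataset at ingestion time (hypothetical fields)."""
    name: str
    source: str                      # e.g. "webserver logs", "social media"
    data_format: str                 # e.g. "json", "parquet"
    derived_from: List[str] = field(default_factory=list)  # lineage: parent datasets
    allowed_roles: Set[str] = field(default_factory=set)   # entitlements


class DataCatalog:
    """Toy in-memory stand-in for the catalog the article calls a Datapedia."""

    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.name] = entry

    def lineage(self, name: str) -> List[str]:
        """Return every upstream dataset of `name`, nearest parents first."""
        seen: List[str] = []
        queue = list(self._entries[name].derived_from)
        while queue:
            parent = queue.pop(0)
            if parent in seen:
                continue
            seen.append(parent)
            if parent in self._entries:
                queue.extend(self._entries[parent].derived_from)
        return seen

    def can_read(self, name: str, role: str) -> bool:
        """Entitlement check: may `role` read dataset `name`?"""
        return role in self._entries[name].allowed_roles


# Hypothetical datasets: a raw feed, a curated copy, and an aggregate.
catalog = DataCatalog()
catalog.register(CatalogEntry("raw_clicks", "webserver logs", "json",
                              allowed_roles={"data_engineer"}))
catalog.register(CatalogEntry("clean_clicks", "curation job", "parquet",
                              derived_from=["raw_clicks"],
                              allowed_roles={"data_engineer", "analyst"}))
catalog.register(CatalogEntry("daily_click_counts", "aggregation job", "parquet",
                              derived_from=["clean_clicks"],
                              allowed_roles={"analyst"}))

print(catalog.lineage("daily_click_counts"))
print(catalog.can_read("raw_clicks", "analyst"))
```

Recording the parent datasets on every entry is what lets a user trace an aggregate back to its raw source, which is the trust problem raised under "Quality" earlier in the article.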

Data can flow into the Data Lake through either batch processing or real-time processing of streaming data. Additionally, the data itself is no longer constrained by initial schema decisions and can be exploited more freely by the enterprise. Layered on top of this repository is a set of capabilities that allows IT to provide Data and Analytics as a Service (DAaaS) in a supply-demand model. IT takes the role of the data provider (supplier), while business users (data scientists, business analysts) are consumers.
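
The idea that data is not bound by initial schema decisions is commonly described as schema-on-read: raw records are stored as they arrive, and each consumer applies its own structure when reading. The sketch below illustrates that idea under simple assumptions. The sample records, the function name read_with_schema, and the two "views" are invented for the example and do not come from the white paper.

```python
import json
from typing import Any, Callable, Dict, Iterator, List

# Raw records are kept exactly as they arrived, including fields nobody
# planned for and records with fields missing (hypothetical sample data).
RAW_EVENTS: List[str] = [
    '{"user": "a17", "amount": "19.90", "channel": "web"}',
    '{"user": "b42", "amount": "5", "coupon": "SPRING"}',
    '{"user": "c03"}',
]

# A schema here is just: column name -> function that casts the raw value.
Schema = Dict[str, Callable[[Any], Any]]


def read_with_schema(lines: List[str], schema: Schema) -> Iterator[Dict[str, Any]]:
    """Project each raw record onto `schema`; missing fields become None."""
    for line in lines:
        record = json.loads(line)
        yield {
            column: (cast(record[column]) if column in record else None)
            for column, cast in schema.items()
        }


# Two consumers read the same raw data with different schemas.
finance_view = list(read_with_schema(RAW_EVENTS, {"user": str, "amount": float}))
marketing_view = list(read_with_schema(RAW_EVENTS, {"user": str, "coupon": str}))

print(finance_view)
print(marketing_view)
```

Neither view required the raw records to be remodeled or reloaded, which is the flexibility the paragraph above refers to.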

The DAaaS model enables users to self-serve their data and analytic needs. Users browse the lake’s data catalog (a Datapedia) to find and select the available data and fill a metaphorical “shopping cart” (effectively an analytics sandbox) with data to work with. Once access is provisioned, users can use the analytics tools of their choice to develop models and gain insights. Subsequently, users can publish analytical models or push refined or transformed data back into the Data Lake to share with the larger community.
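
The self-service flow described above (browse the catalog, fill a cart, get access provisioned) can be sketched roughly as follows. Everything in this example is hypothetical: the CATALOG contents, the tags, the roles, and the names search, Cart, and provision_sandbox are assumptions chosen for illustration, and a real provisioning step would involve an approval workflow and actual data movement.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

# Hypothetical catalog: dataset name -> descriptive tags and roles allowed to use it.
CATALOG: Dict[str, Dict[str, Set[str]]] = {
    "web_logs": {"tags": {"clickstream", "raw"}, "roles": {"data_scientist"}},
    "crm_customers": {"tags": {"customer", "curated"},
                      "roles": {"data_scientist", "analyst"}},
    "payroll": {"tags": {"hr", "restricted"}, "roles": {"hr_admin"}},
}


def search(tag: str) -> List[str]:
    """Browse the catalog: return dataset names carrying `tag`, sorted."""
    return sorted(name for name, meta in CATALOG.items() if tag in meta["tags"])


@dataclass
class Cart:
    """The metaphorical shopping cart a user fills before requesting access."""
    user: str
    role: str
    items: List[str] = field(default_factory=list)

    def add(self, dataset: str) -> None:
        if dataset not in CATALOG:
            raise KeyError(f"unknown dataset: {dataset}")
        if dataset not in self.items:
            self.items.append(dataset)


def provision_sandbox(cart: Cart) -> Dict[str, List[str]]:
    """Split the cart into datasets the user's role may use and those it may not."""
    granted = [d for d in cart.items if cart.role in CATALOG[d]["roles"]]
    denied = [d for d in cart.items if d not in granted]
    return {"granted": granted, "denied": denied}


cart = Cart(user="dana", role="analyst")
for name in search("customer"):
    cart.add(name)
cart.add("payroll")  # a request the entitlements should refuse

result = provision_sandbox(cart)
print(result)
```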

Although provisioning an analytic sandbox is a primary use, the Data Lake has other applications as well. For example, it can be used to ingest raw data, curate the data, and apply ETL. This data can then be loaded into an Enterprise Data Warehouse. To take advantage of the flexibility provided by the Data Lake, organizations need to customize and configure the Data Lake to their specific requirements and domains.
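
As a rough illustration of the ingest, curate, and load sequence, the sketch below reads raw CSV text, drops unusable rows, normalizes the rest, and loads the result into a table. It uses an in-memory SQLite database purely as a stand-in for an Enterprise Data Warehouse; the sample data, the cleaning rules, the sales table, and the function names are assumptions made for this example.

```python
import csv
import io
import sqlite3
from typing import Dict, Iterable, List, Tuple

# Raw data as landed in the lake: inconsistent casing, a blank name, a bad amount.
RAW_CSV = """customer,country,amount
Alice,us,10.50
bob,US,4.25
,DE,7.00
Carol,de,not_a_number
"""


def ingest(raw: str) -> List[Dict[str, str]]:
    """Parse the raw text into rows without changing any values."""
    return list(csv.DictReader(io.StringIO(raw)))


def curate(rows: Iterable[Dict[str, str]]) -> List[Tuple[str, str, float]]:
    """Drop rows with no customer or a non-numeric amount; normalize the rest."""
    clean: List[Tuple[str, str, float]] = []
    for row in rows:
        name = row["customer"].strip()
        if not name:
            continue
        try:
            amount = float(row["amount"])
        except ValueError:
            continue
        clean.append((name.title(), row["country"].strip().upper(), amount))
    return clean


def load(rows: List[Tuple[str, str, float]], conn: sqlite3.Connection) -> None:
    """Load curated rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (customer TEXT, country TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")  # stand-in for the warehouse
load(curate(ingest(RAW_CSV)), warehouse)

totals = warehouse.execute(
    "SELECT country, SUM(amount) FROM sales GROUP BY country ORDER BY country"
).fetchall()
print(totals)
```

The raw text is left untouched in this flow, so a different curation rule can be applied later without re-ingesting the source.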