How to implement Data Lake on GCP


Introduction

In this blog we are going to cover how to implement Data Lake on GCP.

Google Cloud Platform (GCP) enables us to build a modern data lake platform that can store petabytes of structured and unstructured data, run data ingestion and processing pipelines that ingest, clean, and transform the data, and provide tools to query and analyze it. Depending on the nature of the data, its size, and the analysis requirements, we can use various GCP services to build our data analytics architecture. Below is a sample architecture that gives an idea of the layers and components a typical data lake architecture comprises.

[Figure: Data lake architecture on GCP]

The diagram shows a layered architecture comprising six layers:


Ingest – This layer is responsible for ingesting data into the various storage targets that form the data lake, typically called the raw zone. The storage targets can be object stores holding raw files, or databases. The ingestion layer can connect to diverse data sources and ingest batch or streaming data into the lake. Ingestion pipelines can be triggered on a predefined schedule, in response to an event, or explicitly via REST APIs.
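
To make this concrete, below is a minimal batch-ingestion sketch in Python using the google-cloud-storage client; the bucket and object names are hypothetical placeholders. The same function could be invoked from a scheduled job or an event-triggered Cloud Function.

```python
from google.cloud import storage

def ingest_file_to_raw_zone(local_path: str, bucket_name: str, blob_name: str) -> None:
    """Upload a local file into the raw zone of the data lake."""
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(blob_name)
    blob.upload_from_filename(local_path)

# Hypothetical bucket and path names for illustration.
ingest_file_to_raw_zone("orders.csv", "my-datalake-raw", "sales/2024/01/orders.csv")
```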

Store – This layer is responsible for storing the raw data (raw zone) as well as the processed data (curated zone). It should be scalable and cost-effective so we can store vast quantities of data, and it also provides security, high availability, data archival, and backup capabilities.
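
As an illustration of the archival capability, the sketch below creates a hypothetical raw-zone bucket with a lifecycle rule that transitions objects to the Archive storage class after a year; the bucket name and location are placeholders.

```python
from google.cloud import storage

client = storage.Client()
bucket = storage.Bucket(client, name="my-datalake-raw")  # hypothetical name
bucket.storage_class = "STANDARD"
# Move objects to the low-cost Archive class once they are a year old.
bucket.add_lifecycle_set_storage_class_rule("ARCHIVE", age=365)
client.create_bucket(bucket, location="us-central1")
```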

Process and Enrich – This layer validates, transforms, and moves datasets into the curated zone of the data lake. It contains the data processing pipelines and the means to orchestrate the data flow, along with mechanisms for auditing and reconciling data at the various stages of a pipeline. At this stage, machine learning models can also be invoked to enrich the datasets and generate further business insights.
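
A minimal validate-and-transform sketch is shown below, assuming a hypothetical orders dataset; in practice this logic would run inside a Dataflow or Dataproc pipeline (both covered later). Note that reading and writing gs:// paths from pandas requires the gcsfs package.

```python
import pandas as pd

def curate_orders(raw_path: str, curated_path: str) -> None:
    df = pd.read_csv(raw_path)
    # Validate: drop rows missing the primary key and report the count
    # so it can be reconciled against the source.
    invalid = df["order_id"].isna().sum()
    print(f"Dropping {invalid} rows with missing order_id")
    df = df.dropna(subset=["order_id"])
    # Transform: normalize types and derive a partition column.
    df["order_date"] = pd.to_datetime(df["order_date"])
    df["order_month"] = df["order_date"].dt.to_period("M").astype(str)
    # Write columnar output to the curated zone.
    df.to_parquet(curated_path, partition_cols=["order_month"])

# Hypothetical raw- and curated-zone paths.
curate_orders("gs://my-datalake-raw/sales/orders.csv",
              "gs://my-datalake-curated/sales/orders/")
```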

Serve – This layer provides the tools that support analysis, including SQL queries, batch analytics, BI dashboards, reporting, and ML.

Security, Monitoring – This layer secures the components of all layers in the data analytics architecture. It provides authentication, authorization, encryption, monitoring, logging, and alerting.

Discover and Govern – This layer registers all your data sources, organizes them, and supports auto-discovery of new datasets, cataloging, and metadata updates. It stores data classification, data sensitivity, and data lineage information. It can also help maintain a business glossary with the specific business terminology users need to understand what datasets mean and how they are meant to be used across the organization.

Google Cloud Services Used

Below are some of the services used by the above architecture in its various layers as can be seen in the diagram.

Cloud Storage – Cloud Storage is well suited to serve as the central storage repository for the data lake. With Cloud Storage, we can eventually grow the data lake to exabytes in size. Cloud Storage supports high-volume ingestion of new data and high-volume consumption of stored data in combination with other services such as Pub/Sub. It is scalable, durable, cost-efficient, and secure.

Cloud Pub/Sub – Cloud Pub/Sub can be used to ingest streaming events into BigQuery, data lakes, or operational databases.
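
Below is a minimal publishing sketch with the Pub/Sub Python client; the project ID, topic name, and event payload are hypothetical.

```python
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "clickstream-events")  # hypothetical

event = {"user_id": "u123", "page": "/checkout", "ts": "2024-01-15T10:00:00Z"}
future = publisher.publish(topic_path, json.dumps(event).encode("utf-8"))
print(f"Published message {future.result()}")  # result() blocks until the server acks
```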

Cloud Dataflow – A fully managed data processing service that can be used for streaming as well as batch ingestion into the data lake, and for developing scalable data processing pipelines. Its flexible scheduling and pricing for batch processing help save costs.
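
Dataflow pipelines are written with Apache Beam. The sketch below is a minimal Beam pipeline with hypothetical paths; it runs locally with the DirectRunner, and can be submitted to Dataflow by passing --runner=DataflowRunner along with project, region, and temp_location options.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run() -> None:
    options = PipelineOptions()  # picks up --runner, --project, etc. from argv
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadRaw" >> beam.io.ReadFromText("gs://my-datalake-raw/logs/*.txt")
            | "KeepErrors" >> beam.Filter(lambda line: "ERROR" in line)
            | "WriteCurated" >> beam.io.WriteToText("gs://my-datalake-curated/errors/part")
        )

if __name__ == "__main__":
    run()
```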

Transfer Appliance / Transfer Service / gsutil – For moving large amounts of data to the cloud, or for one-time transfers, you can consider these options. Storage Transfer Service helps move data quickly and securely between object and file storage across Google Cloud, Amazon, Azure, and private data centers. Transfer Appliance is a high-capacity storage device that lets you transfer and securely ship your data to a Google upload facility, where Google uploads it to Cloud Storage. gsutil is an open-source command-line tool available for Windows, Linux, and Mac. It supports multi-threaded and multi-process transfers, parallel composite uploads, retries, and resumability.

Cloud Dataprep – Dataprep is a serverless, scalable, intelligent data service for visually exploring, cleaning, and preparing structured and unstructured data for analysis, reporting, and machine learning.

Cloud Dataproc – Dataproc is a fully managed and highly scalable service for running Apache Spark, Apache Flink, Presto, and 30+ other open-source tools and frameworks. We can choose our favorite big data processing tools and set up clusters in the cloud to run our data processing workloads.
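
For example, here is a minimal PySpark job of the sort you could submit to a Dataproc cluster with `gcloud dataproc jobs submit pyspark`; the paths and column names are hypothetical, and Dataproc clusters read gs:// paths natively via the GCS connector.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("curate-orders").getOrCreate()

# Aggregate curated orders into a daily revenue table (hypothetical schema).
orders = spark.read.parquet("gs://my-datalake-curated/sales/orders/")
daily = (
    orders.groupBy(F.to_date("order_date").alias("day"))
    .agg(F.sum("amount").alias("revenue"))
)
daily.write.mode("overwrite").parquet("gs://my-datalake-curated/sales/daily_revenue/")
```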

Cloud Bigtable – Bigtable is a fully managed, scalable NoSQL database service for large analytical and operational workloads. For time-series analytics, you can store ingested data in Bigtable to facilitate rapid analysis. This is essentially the "hot path" of the Lambda architecture pattern for low-latency analytic requirements.
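
A minimal write/read sketch with the Bigtable Python client is shown below; the instance, table, and column-family names are hypothetical, and the table is assumed to already exist with a "metrics" column family.

```python
from google.cloud import bigtable

client = bigtable.Client(project="my-project")  # hypothetical project
instance = client.instance("datalake-hot")
table = instance.table("sensor_readings")

# Row keys that embed the series ID and timestamp keep related points adjacent.
row_key = b"sensor-42#2024-01-15T10:00:00Z"
row = table.direct_row(row_key)
row.set_cell("metrics", b"temperature", b"21.7")
row.commit()

fetched = table.read_row(row_key)
print(fetched.cells["metrics"][b"temperature"][0].value)
```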

BigQuery – A serverless, highly scalable, and cost-effective multicloud data warehouse service for storing our processed data for analytics.
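
A minimal query sketch with the BigQuery Python client; the project, dataset, and table names are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT day, revenue
    FROM `my-project.sales.daily_revenue`
    ORDER BY day DESC
    LIMIT 7
"""
# query() starts the job; result() waits for it and returns the rows.
for row in client.query(sql).result():
    print(f"{row.day}: {row.revenue}")
```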

Cloud ML Engine – Google Cloud includes pretrained models for speech, vision, video intelligence, and natural language processing. For these cases, we pass the appropriate inputs, such as audio, images, or video, to the respective Google Cloud service, then extract valuable metadata and store it in a service such as BigQuery for further querying and analysis. We can also build our own custom ML models with the various ML services.
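
For instance, the sketch below uses the pretrained Vision API to extract labels from an image already stored in the lake; the file path is hypothetical.

```python
from google.cloud import vision

client = vision.ImageAnnotatorClient()
# The Vision API can read images directly from Cloud Storage (hypothetical path).
image = vision.Image(
    source=vision.ImageSource(image_uri="gs://my-datalake-raw/images/product.jpg")
)
response = client.label_detection(image=image)
labels = [(label.description, label.score) for label in response.label_annotations]
print(labels)  # these label rows could then be stored in BigQuery for analysis
```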

Cloud Data Catalog – Data Catalog is a fully managed, scalable metadata management service within Dataplex. It provides a centralized place to gain a unified view of data and to search for the right datasets. It allows enriching data with technical and business metadata, ownership, and quality attributes, and it has an automatic cataloging feature.
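
A minimal search sketch with the Data Catalog Python client, assuming a hypothetical project ID and search query:

```python
from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()
scope = datacatalog_v1.SearchCatalogRequest.Scope(include_project_ids=["my-project"])
# Find table entries whose metadata matches "orders".
results = client.search_catalog(request={"scope": scope, "query": "orders type=table"})
for result in results:
    print(result.relative_resource_name)
```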

Cloud Logging / Cloud Monitoring – Cloud Logging is a fully managed, real-time log management service with storage, search, analysis, and alerting. Cloud Monitoring helps you gain visibility into the performance, availability, and health of your applications and infrastructure.
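
The sketch below writes a structured log entry from a pipeline step using the Cloud Logging Python client; the logger name and payload fields are hypothetical. Cloud Monitoring alerts can then be built on such entries.

```python
from google.cloud import logging

client = logging.Client()
logger = client.logger("ingestion-pipeline")  # hypothetical logger name
# Structured payloads make it easy to filter and alert on specific fields.
logger.log_struct(
    {"stage": "ingest", "file": "orders.csv", "rows": 10432, "status": "ok"},
    severity="INFO",
)
```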

Cloud IAM – Cloud IAM provides centralized, fine-grained access control and visibility for all of our cloud resources.
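
For example, the sketch below grants a hypothetical analyst group read-only access to the curated-zone bucket using the Cloud Storage IAM API.

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-datalake-curated")  # hypothetical bucket
policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append(
    {"role": "roles/storage.objectViewer", "members": {"group:analysts@example.com"}}
)
bucket.set_iam_policy(policy)
```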

If you are looking for a Data Lake implementation on top of GCP (or other vendors), please reach out to us. We can show you some of our past work on Data Lake implementations, explain the pros and cons of various architectures (from a technology as well as a costing point of view), and help you with the implementation.

Reach out to us on nikhilesh@helicaltech.com for more information.
