AWS Glue

Posted on April 23, 2019 by admin, in AWS

Introduction to AWS Glue

AWS Glue is a fully managed, cloud-based extract, transform and load (ETL) service that prepares data for analysis by automating ETL processes. It helps to discover, organize, move and transform data sets so that they can be queried and put to use.

Glue differs from other ETL products in a few ways:

It is serverless, so there is no infrastructure to provision or manage.
It provides crawlers that automatically infer schemas for many kinds of data sets.
It automatically generates scripts to extract, transform and load the data.

The service can automatically find an enterprise’s structured or unstructured data when it is stored within data lakes in Amazon Simple Storage Service (S3), data warehouses in Amazon Redshift and other databases that are part of the Amazon Relational Database Service. Glue also supports MySQL, Oracle, Microsoft SQL Server and PostgreSQL databases that run on Amazon Elastic Compute Cloud (EC2) instances in an Amazon Virtual Private Cloud.

The service then profiles the data in its centralized Data Catalog, a metadata repository for all data assets that holds details such as table definitions, locations and other attributes.

To pull metadata into the Data Catalog, the service uses Glue crawlers, which scan data stores and extract schema and other attributes for later querying and analysis.
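As a rough illustration of the crawler step, the sketch below builds the request for the boto3 `create_crawler` call and then starts the crawler. The crawler name, IAM role ARN, database name and S3 path are all made-up placeholders, and the AWS calls are kept in a separate function because they need boto3 and valid credentials.

```python
def build_crawler_request(name, role_arn, database, s3_path):
    """Return keyword arguments for the Glue create_crawler API call."""
    return {
        "Name": name,
        "Role": role_arn,              # IAM role the crawler assumes
        "DatabaseName": database,      # Data Catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start_crawler(request):
    """Create the crawler and run it once (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the request builder works without it

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# All values below are hypothetical and only serve the example.
request = build_crawler_request(
    name="sales-crawler",
    role_arn="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    database="sales_db",
    s3_path="s3://example-bucket/sales/",
)
print(request)
```

Calling `create_and_start_crawler(request)` in an account with suitable permissions should leave the inferred tables in the `sales_db` database once the crawl finishes.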

ETL Engine

After data is cataloged, it is accessible and ready for ETL jobs. AWS Glue includes an ETL script recommendation system that generates PySpark (Python on Apache Spark) code, as well as an ETL library to execute jobs. ETL code can be written with the Glue custom library, or as plain PySpark code in the AWS Glue Console script editor.

PySpark code or libraries can also be imported.
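To give a feel for what such a script looks like, here is a minimal sketch of a Glue job that reads a cataloged table, renames and casts a few columns, and writes Parquet to S3. It is not the code Glue would generate for any particular job: the database, table, column names and output path are assumptions, and `run_job` only works inside the Glue job environment, where the `awsglue` library is available.

```python
import sys

# Each entry: (source column, source type, target column, target type).
# The column names are hypothetical.
MAPPINGS = [
    ("order_id", "string", "order_id", "long"),
    ("amount", "string", "amount", "double"),
    ("order_ts", "string", "ordered_at", "timestamp"),
]


def target_columns(mappings):
    """List the column names the mapped output will have."""
    return [target for _, _, target, _ in mappings]


def run_job(argv):
    """Body of the ETL job; runs only where awsglue and pyspark are installed."""
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.transforms import ApplyMapping
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Extract: read the table that a crawler registered in the Data Catalog.
    source = glue_context.create_dynamic_frame.from_catalog(
        database="sales_db", table_name="orders_raw"
    )
    # Transform: rename and cast columns according to MAPPINGS.
    mapped = ApplyMapping.apply(frame=source, mappings=MAPPINGS)
    # Load: write the result to S3 as Parquet.
    glue_context.write_dynamic_frame.from_options(
        frame=mapped,
        connection_type="s3",
        connection_options={"path": "s3://example-bucket/orders_clean/"},
        format="parquet",
    )
    job.commit()


# Glue passes --JOB_NAME when it launches the script.
if __name__ == "__main__" and "--JOB_NAME" in sys.argv:
    run_job(sys.argv)
```

Keeping the mappings in a plain list makes the transformation easy to review and adjust before the job is deployed.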

Schedule and orchestrate ETL jobs

AWS Glue provides scheduled, on-demand and job completion triggers. A scheduled trigger executes jobs at specified intervals, while an on-demand trigger executes when prompted by the user. With a job completion trigger, one or more jobs can execute when another job finishes. These jobs can be started at the same time or sequentially, and they can also be started from an outside service, such as AWS Lambda.
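The trigger types described above can be sketched with boto3 as follows. The helper functions only build the arguments for `create_trigger`; the trigger names, job names and cron expression are illustrative assumptions. In the API, a job completion trigger is expressed as a `CONDITIONAL` trigger with a predicate on the watched job's state.

```python
def scheduled_trigger(name, job_name, cron):
    """Arguments for a trigger that starts job_name on a cron schedule."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": cron,  # Glue cron syntax, e.g. "cron(0 2 * * ? *)"
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def job_completion_trigger(name, watched_job, next_jobs):
    """Arguments for a trigger that starts next_jobs after watched_job succeeds."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "JobName": watched_job,
                    "State": "SUCCEEDED",
                }
            ]
        },
        "Actions": [{"JobName": job} for job in next_jobs],
        "StartOnCreation": True,
    }


def lambda_handler(event, context):
    """Hypothetical AWS Lambda handler that starts a Glue job on demand."""
    import boto3  # available in the Lambda Python runtime

    run = boto3.client("glue").start_job_run(JobName="load-orders")
    return run["JobRunId"]


# Hypothetical names; pass each dict to boto3.client("glue").create_trigger(**kwargs).
nightly = scheduled_trigger("nightly-load", "load-orders", "cron(0 2 * * ? *)")
chained = job_completion_trigger(
    "after-load", "load-orders", ["build-report", "refresh-marts"]
)
print(nightly["Type"], [action["JobName"] for action in chained["Actions"]])
```

Both follow-up jobs in `chained` are started when `load-orders` succeeds; to run them one after another instead, each job would need its own conditional trigger.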

AWS Glue Platform and Components

AWS Glue uses Apache Spark as an underlying engine to process data records and scale to provide high throughput, all of which is transparent to AWS Glue users.

AWS Glue has three main components:

Data Catalog: The Data Catalog stores, exposes and manages metadata such as databases, tables, schemas and partitions. Crawlers connect to the data sources, infer the schema of the objects they find, and create the corresponding tables in the AWS Glue Data Catalog.

ETL Engine: Once the metadata is available in the catalog, data analysts can create an ETL job by selecting the source and target data stores from the AWS Glue Data Catalog. AWS Glue then generates the required PySpark code for the job, and the code can be customized based on transformation requirements.

Scheduler: Once the ETL job is created, it can be run on demand, at a specific time or upon completion of another job. AWS Glue provides a flexible scheduler that can also retry failed jobs.
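As one possible way to set up the retry behaviour mentioned above, the sketch below defines a job with the boto3 `create_job` call and its `MaxRetries` setting. The job name, role ARN and script location are placeholders, and the retry count of 2 is an arbitrary choice for the example.

```python
def build_job_request(name, role_arn, script_location, max_retries=2):
    """Return keyword arguments for the Glue create_job API call."""
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",                  # Spark ETL job type
            "ScriptLocation": script_location,  # S3 path of the PySpark script
        },
        "MaxRetries": max_retries,  # how many times Glue retries a failed run
    }


def create_and_run_job(request):
    """Register the job and start one run (needs boto3 and AWS credentials)."""
    import boto3

    glue = boto3.client("glue")
    glue.create_job(**request)
    return glue.start_job_run(JobName=request["Name"])["JobRunId"]


# Hypothetical values for illustration only.
job_request = build_job_request(
    name="load-orders",
    role_arn="arn:aws:iam::123456789012:role/GlueJobRole",
    script_location="s3://example-bucket/scripts/load_orders.py",
)
print(job_request["MaxRetries"])
```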

If you have any queries, please contact us at support@helicaltech.com

Thank You
Rajitha
Helical IT Solutions Pvt Ltd
