HOW TO ENRICH AND FLATTEN THE DATA IN AWS GLUE

INTRODUCTION

In this article, we will look at what AWS Glue is, the resources it provides, and how to submit a job using PySpark.

REQUIREMENTS

To use AWS Glue, we should have the following set up in our system:

1. An AWS account with the data present in S3.
2. Apache Spark (can be downloaded by following this guide:
https://phoenixnap.com/kb/install-spark-on-windows-10)

WHAT IS AWS GLUE?

AWS Glue is a fully managed ETL service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores and data streams.

ETL stands for Extract, Transform, Load: AWS Glue extracts the data from JDBC sources, S3, Redshift, or other databases, transforms it into the required format, and then loads it into the data warehouse.

AWS Glue ETL scripts can be coded in Python or Scala. Python scripts use a language that is an extension of the PySpark Python dialect for extract, transform, and load (ETL) jobs.
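
For illustration, here is a minimal sketch of what a Glue ETL script in Python typically looks like; the database and table names ("example_db", "example_table") are hypothetical placeholders:

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

# Resolve the job name passed in by AWS Glue at run time.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])

# Wrap the Spark context in a GlueContext to get Glue-specific APIs.
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Initialize the job so Glue can track its state, and commit at the end.
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read a Data Catalog table into a DynamicFrame (Glue's schema-flexible
# counterpart of a Spark DataFrame).
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="example_db", table_name="example_table"
)

job.commit()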

AWS GLUE RESOURCES

Data Catalog: It stores the metadata of the actual data. If we have data in a database, we can access it through JDBC connections. The information in the Data Catalog is used to create and monitor your ETL jobs.

Database: It is a logical grouping of the tables we will be working with. We have to create the databases in AWS Glue and upload the files in CSV format into the database.
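
Besides the console, a database can be created programmatically with boto3; a minimal sketch, where the name "example_db" is a hypothetical placeholder:

import boto3

# Create a Glue Data Catalog database to hold table definitions.
glue = boto3.client("glue")
glue.create_database(
    DatabaseInput={
        "Name": "example_db",
        "Description": "Database for CSV files uploaded to S3",
    }
)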

Connections: Connections are Data Catalog objects that store connection information for a particular data store. We can add the connections here.
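
A connection can likewise be registered through the API; a sketch assuming a hypothetical MySQL endpoint and credentials:

import boto3

# Register a JDBC connection in the Data Catalog.
glue = boto3.client("glue")
glue.create_connection(
    ConnectionInput={
        "Name": "example-jdbc-connection",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:mysql://example-host:3306/example_db",
            "USERNAME": "example_user",
            "PASSWORD": "example_password",
        },
    }
)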

Crawlers: A crawler connects to a data store through the configured connection, infers the schema of the data, and creates or updates one or more tables in your Data Catalog. ETL jobs that are defined in AWS Glue use these Data Catalog tables as sources and targets.
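
A crawler can also be created and started with boto3; a sketch with hypothetical names, assuming an IAM role that has the required Glue and S3 permissions:

import boto3

glue = boto3.client("glue")

# Define a crawler that scans an S3 path and writes table
# definitions into the "example_db" database.
glue.create_crawler(
    Name="example-crawler",
    Role="arn:aws:iam::123456789012:role/example-glue-role",
    DatabaseName="example_db",
    Targets={"S3Targets": [{"Path": "s3://example-bucket/data/"}]},
)

# Run the crawler; it populates or updates the Data Catalog tables.
glue.start_crawler(Name="example-crawler")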

Job: A job is the business logic required to perform the ETL work. There are three types of jobs: Spark, Streaming ETL, and Python shell. A Spark job runs in an Apache Spark environment managed by AWS Glue.

It processes data in batches. A streaming ETL job is similar to a Spark job, except that it performs ETL on data streams. We can submit a job using spark-submit or in the console.
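
Jobs can additionally be created and run through the API; a sketch with hypothetical names, assuming the ETL script has already been uploaded to S3:

import boto3

glue = boto3.client("glue")

# Define a Spark ("glueetl") job that points at the script in S3.
glue.create_job(
    Name="example-job",
    Role="arn:aws:iam::123456789012:role/example-glue-role",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://example-bucket/scripts/example.py",
    },
)

# Start a run of the job and print its run id for tracking.
run = glue.start_job_run(JobName="example-job")
print(run["JobRunId"])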

SPARK-SUBMIT USING PYSPARK

Once we have AWS Glue ready with all the requirements mentioned above, we can start submitting jobs.

Steps to submit a job using Spark:

Step 1: Create a Python file based on the operation you want to perform on the file present in the database.

For example, let's create a file called example.py which contains code to read the contents of the file.

The code below loads the data of the CSV file into a DataFrame, after which we can perform the operations as required.
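
A minimal sketch of example.py, assuming a sample file named example.csv (a hypothetical placeholder for your own file):

# example.py
from pyspark.sql import SparkSession

# Create (or reuse) a Spark session for this script.
spark = SparkSession.builder.appName("example").getOrCreate()

# Load the CSV file into a DataFrame; header=True treats the first
# row as column names, and inferSchema=True guesses column types.
df = spark.read.csv("example.csv", header=True, inferSchema=True)

# Print the inferred schema and a few rows to verify the load.
df.printSchema()
df.show(5)

spark.stop()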

Step 2: Now open the command prompt and run the command:

spark-submit "location of the python file"
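
For instance, assuming example.py is in the current directory:

spark-submit example.py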

The output yields the schema of the CSV file, as shown below.
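
With hypothetical columns id, name, and salary, df.printSchema() would print something along these lines:

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- salary: double (nullable = true)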

Step 3: Now that we are able to access the file in the database, we can perform operations such as filtering, reading, and writing data using Spark, and upload the results.
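
A short sketch of such operations on the DataFrame from Step 1 (the column name and output path are hypothetical):

# Keep only the rows where the salary exceeds 50000.
filtered = df.filter(df["salary"] > 50000)

# Write the filtered result back out, e.g. to an S3 location.
filtered.write.mode("overwrite").csv("s3://example-bucket/output/", header=True)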

CONCLUSION

AWS Glue provides a set of resources that work together to let us upload files into it and make changes as per our requirements. The data present in AWS Glue can be accessed using PySpark by submitting jobs (creating ETL jobs) as explained above.

Thank You
Vani Bolle
Helical IT Solutions
