Data Engineer ETL Project Using Spark with AWS Glue

Hello everyone, this is an AWS ETL project using Spark and AWS Glue. In this project I perform three steps, Extract, Transform, and Load, which is why we call it an ETL project. I perform these steps with various AWS services like AWS S3, AWS Glue, and AWS IAM. The image below shows the overall structure of the project.

We are going to work through this project step by step so that everyone can follow along.

Step 1: Create an IAM role for the whole project.

Step 2: Create an S3 bucket and upload data to the bucket from our local machine.

Step 3: Create an AWS Glue database and table.

Step 4: Create an ETL job in AWS Glue.

Step 5: Write the transformation code.

Let’s take a closer look at each step to understand them more thoroughly.

Step 1: Create IAM Role

What is an IAM role – An IAM (Identity and Access Management) role in AWS defines permissions for services and resources, allowing them to access specific resources without needing long-term credentials. It enhances security and simplifies access management.

  • Go to AWS IAM → Roles → Create Role
  • Use cases for other AWS services: Select Glue
  • Add permissions → Search and select ‘AdministratorAccess’
  • Role name: ‘AWSGlueServiceRole-me’
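
If you would rather script this step, here is a minimal boto3 sketch of the same role setup (this assumes boto3 is installed and your AWS credentials have IAM permissions; in a real project you would normally attach a narrower policy than ‘AdministratorAccess’):

import json
import boto3

iam = boto3.client('iam')

# Trust policy that lets the AWS Glue service assume this role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole"
    }]
}

iam.create_role(
    RoleName='AWSGlueServiceRole-me',
    AssumeRolePolicyDocument=json.dumps(trust_policy)
)

# Attach the same broad policy used in the console walkthrough above
iam.attach_role_policy(
    RoleName='AWSGlueServiceRole-me',
    PolicyArn='arn:aws:iam::aws:policy/AdministratorAccess'
)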

Step 2: We create a bucket in AWS S3

AWS S3 – Amazon S3 is an object storage service offered by AWS. It’s renowned for its high availability, durability, and scalability. In Amazon S3, you can store objects such as images, code files, video files, and even host websites. All data in Amazon S3 is organized into containers called “buckets.”

Step 2.1: Go to AWS S3 -> Buckets -> Create bucket

  • Bucket name: ‘new-york-city-airbnb-2023’
[Screenshot: creating the bucket in AWS S3]
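
If you prefer code over the console, the bucket can also be created with boto3 (a sketch; the region here is an assumption, and bucket names are global, so this exact name may already be taken):

import boto3

s3 = boto3.client('s3', region_name='us-east-1')

# In us-east-1 no CreateBucketConfiguration is needed; other regions require
# CreateBucketConfiguration={'LocationConstraint': '<your-region>'}
s3.create_bucket(Bucket='new-york-city-airbnb-2023')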

Step 2.2: We create a folder for the dataset

  • Go to AWS S3 –> Buckets –> ‘new-york-city-airbnb-2023’ –> Create folder
  • Folder name: ‘NYC-airbnb-2023’
[Screenshot: the NYC data folder in AWS S3]

Step 2.3: We create two folders, one for raw data and one for transformed data.

  • Go to AWS S3 –> Buckets –> ‘new-york-city-airbnb-2023’ –> ‘NYC-airbnb-2023’ –> create two folders
  • Folder name: ‘row-data’
  • Folder name: ‘transformed-data’
[Screenshot: the row-data and transformed-data folders in AWS S3]
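
Since S3 “folders” are really just key prefixes, all three folders from steps 2.2 and 2.3 can also be created by putting zero-byte objects whose keys end with ‘/’ (a boto3 sketch):

import boto3

s3 = boto3.client('s3')
bucket = 'new-york-city-airbnb-2023'

# A folder in S3 is just an empty object whose key ends with '/'
for prefix in ['NYC-airbnb-2023/',
               'NYC-airbnb-2023/row-data/',
               'NYC-airbnb-2023/transformed-data/']:
    s3.put_object(Bucket=bucket, Key=prefix)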

Step 2.4: Now we upload our local data to the ‘row-data’ folder in S3.

  • Go to AWS S3 –> Buckets –> ‘new-york-city-airbnb-2023’ –> ‘NYC-airbnb-2023’ –> ‘row-data’ –> click on Upload –> Add Files –> ‘NYC-Airbnb-2023.csv’
[Screenshot: the CSV file uploaded to the AWS S3 bucket]
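
The upload in this step can also be scripted (a sketch; it assumes the CSV sits in your current working directory):

import boto3

s3 = boto3.client('s3')

# Upload the local CSV into the 'row-data' folder
s3.upload_file(
    Filename='NYC-Airbnb-2023.csv',
    Bucket='new-york-city-airbnb-2023',
    Key='NYC-airbnb-2023/row-data/NYC-Airbnb-2023.csv'
)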

Here we have completed our second step: we have successfully uploaded our local data to the S3 bucket. Next we create the AWS Glue database and table.

Step 3: Set up the AWS Glue database and table

AWS Glue – AWS Glue is a fully managed ETL (Extract, Transform, Load) service that makes it easier to prepare and move data for analysis. It automates data discovery, transformation, and loading tasks, helping you to create data pipelines and work with data efficiently in the AWS cloud.

Step 3.1: Create Database in AWS Glue

  • Go to AWS Glue –> Data Catalog –> Database –> Add database
  • Database name: ‘etl-pipeline-project’
[Screenshot: the database in AWS Glue]
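
The equivalent boto3 call, if you want to create the database from code (the description text is just an illustration):

import boto3

glue = boto3.client('glue')

glue.create_database(
    DatabaseInput={
        'Name': 'etl-pipeline-project',
        'Description': 'Catalog database for the NYC Airbnb ETL project'
    }
)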

Step 3.2: Create Table in Database

  • Go to AWS Glue –> Data Catalog –> Tables –> click on ‘Add tables using crawler’ 

After clicking, you have to fill in a few details so the crawler can fetch the schema from AWS S3. Those details are given below.

  • Crawler name: ‘elt-airbnb-data-pipeline’
  • Data source S3 path: Choose the ‘row-data’ folder
  • IAM role –> Choose IAM role –> ‘AWSGlueServiceRole-me’
  • Target database –> Choose the ‘etl-pipeline-project’
  • Schedule: On demand
[Screenshot: the table in the AWS Glue database]
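
Roughly the same crawler definition can be created with boto3 (a sketch; Glue accepts either the role name or its full ARN here, and leaving out the Schedule means the crawler runs on demand):

import boto3

glue = boto3.client('glue')

glue.create_crawler(
    Name='elt-airbnb-data-pipeline',
    Role='AWSGlueServiceRole-me',
    DatabaseName='etl-pipeline-project',
    Targets={
        'S3Targets': [
            {'Path': 's3://new-york-city-airbnb-2023/NYC-airbnb-2023/row-data/'}
        ]
    }
)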

Step 3.3: Run the Crawler

After the crawler has been created successfully, click on Run crawler to fetch the schema of the file. Once the run finishes, we can check the table and its schema.

[Screenshot: the AWS Glue crawler]
[Screenshot: the schema of the CSV file]
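
If you want to trigger the crawler and wait for it from code instead of the console, a simple polling sketch looks like this:

import time
import boto3

glue = boto3.client('glue')

glue.start_crawler(Name='elt-airbnb-data-pipeline')

# Poll until the crawler returns to the READY state
while glue.get_crawler(Name='elt-airbnb-data-pipeline')['Crawler']['State'] != 'READY':
    time.sleep(30)

# List the tables the crawler created in our database
tables = glue.get_tables(DatabaseName='etl-pipeline-project')
print([t['Name'] for t in tables['TableList']])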

Step 4: We create an ETL job in AWS Glue

  • Go to AWS Glue –> Spark script editor –> click on Create
  • Job name: ‘myGlueETLjob’
  • IAM Role –> Choose the role that we created –> ‘AWSGlueServiceRole-me’
  • Type: Spark
[Screenshot: the ETL job in AWS Glue]
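
For reference, creating the same kind of job with boto3 looks roughly like this (a sketch; the ScriptLocation path is a hypothetical S3 key where you would upload the script from step 5, and the worker settings are just example values):

import boto3

glue = boto3.client('glue')

glue.create_job(
    Name='myGlueETLjob',
    Role='AWSGlueServiceRole-me',
    Command={
        'Name': 'glueetl',  # 'glueetl' means a Spark ETL job
        'ScriptLocation': 's3://new-york-city-airbnb-2023/scripts/myGlueETLjob.py',  # hypothetical path
        'PythonVersion': '3'
    },
    GlueVersion='4.0',
    WorkerType='G.1X',
    NumberOfWorkers=2
)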

Step 5: Let's write some PySpark code to transform the data.

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
 
S3_INPUT_DATA = 's3://new-york-city-airbnb-2023/NYC-airbnb-2023/row-data/'
S3_OUTPUT_DATA = 's3://new-york-city-airbnb-2023/NYC-airbnb-2023/transformed-data/'

# Creating the SparkSession (Glue provides the Spark runtime, so no master URL is needed)
spark = SparkSession.builder\
    .appName('techycrispy.com')\
    .getOrCreate()
                    
# Spark DataFrame (Raw) - Transformation
df = spark.read.format("csv")\
    .option("header", True)\
    .option("inferSchema", True)\
    .load(S3_INPUT_DATA)
          
# Convert the columns' datatypes (reviews_per_month holds fractional values, so it is cast to float)
df = df.withColumn("id", df['id'].cast(IntegerType()))\
       .withColumn("host_id", df['host_id'].cast(IntegerType()))\
       .withColumn("latitude", df['latitude'].cast(FloatType()))\
       .withColumn("longitude", df['longitude'].cast(FloatType()))\
       .withColumn("minimum_nights", df['minimum_nights'].cast(IntegerType()))\
       .withColumn("number_of_reviews", df['number_of_reviews'].cast(IntegerType()))\
       .withColumn("last_review", df['last_review'].cast(DateType()))\
       .withColumn("reviews_per_month", df['reviews_per_month'].cast(FloatType()))\
       .withColumn("calculated_host_listings_count", df['calculated_host_listings_count'].cast(IntegerType()))\
       .withColumn("availability_365", df['availability_365'].cast(IntegerType()))\
       .withColumn("number_of_reviews_ltm", df['number_of_reviews_ltm'].cast(IntegerType()))\
       .withColumn("license", df['license'].cast(IntegerType()))\
       .withColumn("price", df['price'].cast(IntegerType()))
                    
# Remove columns from the dataset that contain a high number of null (missing) values.
null_counts = df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns]).collect()[0].asDict()
cols_to_drop = [k for k, v in null_counts.items() if v > 400]
df_dropped = df.drop(*cols_to_drop)

# Count the rows left in the DataFrame
df_dropped.count()

# counting room types available
df1 = df.groupBy("room_type").agg(count("room_type").alias("counted"))
count_room_type = df1.sort(df1["counted"].desc()).limit(4)

#total earning by room types
df2 = df.groupBy("room_type").agg(sum("price").alias("total"))
total_earn_room_type = df2.sort(df2.total.desc()).limit(4)

# Save the DataFrames under different folders within the S3_OUTPUT_DATA bucket
df.write.mode('overwrite').csv(S3_OUTPUT_DATA + "airbnb-dataframe")
count_room_type.write.mode('overwrite').csv(S3_OUTPUT_DATA + "count-room-type")
total_earn_room_type.write.mode('overwrite').csv(S3_OUTPUT_DATA + "total-earn-room-type")
df_dropped.write.mode('overwrite').csv(S3_OUTPUT_DATA + "dropped-columns")

The transformation code writes its output to the ‘transformed-data’ folder in AWS S3. Let's check our ‘transformed-data’ folder.

[Screenshot: the transformed files in the S3 bucket]

We can see the transformed data in the AWS S3 bucket.
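
As a final touch, the job can be started and the output folder inspected from code as well (a boto3 sketch; in practice you would wait for the job run to finish before listing the output):

import boto3

glue = boto3.client('glue')
s3 = boto3.client('s3')

# Start the ETL job we created in step 4
run = glue.start_job_run(JobName='myGlueETLjob')
print('Started job run:', run['JobRunId'])

# After the run finishes, list the transformed output files
resp = s3.list_objects_v2(
    Bucket='new-york-city-airbnb-2023',
    Prefix='NYC-airbnb-2023/transformed-data/'
)
for obj in resp.get('Contents', []):
    print(obj['Key'])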
