An open API service providing repository metadata for many open source software ecosystems.

GitHub / khushal2405 / ETL-pipeline-using-Airflow-and-AWS-EMR

We build an ETL pipeline using Airflow that accomplishes the following: downloads data from an AWS S3 bucket, runs a Spark/Spark SQL job on the downloaded data to produce a cleaned-up dataset of orders that missed their delivery deadline, and then uploads the cleaned-up dataset back to the same S3 bucket, in a folder primed for higher-level analytics.
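The three stages described above can be illustrated with a minimal plain-Python sketch. It is not the repository's code: an in-memory dict stands in for the S3 bucket, a list filter stands in for the Spark/Spark SQL job, and the column names (`order_id`, `deadline`, `delivered_on`, `days_late`) and object keys are hypothetical. In the actual project each function would presumably correspond to an Airflow task.

```python
import csv
import io
from datetime import date

# Hypothetical stand-in for the S3 bucket: object key -> file contents.
bucket = {
    "raw/orders.csv": (
        "order_id,deadline,delivered_on\n"
        "1,2023-01-10,2023-01-09\n"
        "2,2023-01-10,2023-01-12\n"
        "3,2023-01-15,\n"
    )
}


def extract(bucket, key):
    """Stage 1: read the raw CSV object and parse it into row dicts."""
    return list(csv.DictReader(io.StringIO(bucket[key])))


def transform(rows):
    """Stage 2: keep only orders delivered after their deadline."""
    late = []
    for row in rows:
        if not row["delivered_on"]:
            continue  # cleaning step: skip rows with no delivery date
        deadline = date.fromisoformat(row["deadline"])
        delivered = date.fromisoformat(row["delivered_on"])
        if delivered > deadline:
            late.append({**row, "days_late": (delivered - deadline).days})
    return late


def load(bucket, key, rows):
    """Stage 3: write the cleaned rows back to the same bucket under a new key."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=["order_id", "deadline", "delivered_on", "days_late"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    bucket[key] = out.getvalue()


def run_pipeline(bucket, src_key, dst_key):
    """Run extract -> transform -> load in order, as a DAG would schedule them."""
    late = transform(extract(bucket, src_key))
    load(bucket, dst_key, late)
    return late


late_orders = run_pipeline(bucket, "raw/orders.csv", "analytics/late_orders.csv")
```

Keeping each stage as a separate function mirrors how an orchestrator treats them as independent, retryable tasks; only the transform stage would need to change to submit a real Spark job.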

JSON API: http://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/khushal2405%2FETL-pipeline-using-Airflow-and-AWS-EMR

Stars: 1
Forks: 0
Open issues: 0

License: None
Language: Python
Size: 15.4 MB
Dependencies parsed at: Pending

Created at: over 2 years ago
Updated at: about 2 years ago
Pushed at: about 2 years ago
Last synced at: about 2 years ago

Topics: airflow, apache-spark, aws, aws-emr, data-engineering, etl, pyspark, python, s3, s3-bucket, scala, spark
