aws-emr | Topic | Ecosyste.ms: Repos

Topic: "aws-emr"

adornes/spark_python_ml_examples

Spark 2.0 Python Machine Learning examples

Language: Python - Size: 13.7 KB - Last synced at: almost 2 years ago - Pushed at: over 5 years ago - Stars: 93 - Forks: 42

adornes/spark_scala_ml_examples

Spark 2.0 Scala Machine Learning examples

Language: Scala - Size: 1.32 MB - Last synced at: almost 2 years ago - Pushed at: over 5 years ago - Stars: 77 - Forks: 52

jwplayer/sparksteps

⭐ CLI tool to launch Spark jobs on AWS EMR

Language: Python - Size: 216 KB - Last synced at: 4 days ago - Pushed at: over 1 year ago - Stars: 67 - Forks: 12

dacort/demo-code

Bits of code I use during live demos

Language: Jupyter Notebook - Size: 774 KB - Last synced at: 20 days ago - Pushed at: 4 months ago - Stars: 31 - Forks: 24

Wittline/pyspark-on-aws-emr

The goal of this project is to offer an AWS EMR template using Spot Fleet and On-Demand Instances that you can use quickly. Just focus on writing pyspark code.

Language: Python - Size: 3.61 MB - Last synced at: 14 days ago - Pushed at: almost 3 years ago - Stars: 27 - Forks: 13

abdullahkhawer/aws-auto-terminate-idle-emr

An AWS-based solution that uses AWS CloudWatch and a Python-based AWS Lambda function to automatically terminate AWS EMR clusters that have been idle for a specified period of time.

Language: Python - Size: 22.5 KB - Last synced at: 25 days ago - Pushed at: 11 months ago - Stars: 26 - Forks: 16

terraform-aws-modules/terraform-aws-emr

Terraform module to create AWS EMR resources 🇺🇦

Language: HCL - Size: 94.7 KB - Last synced at: 27 days ago - Pushed at: 27 days ago - Stars: 24 - Forks: 23

amzn/rheoceros

Cloud-based AI / ML workflow and data application development framework

Language: Python - Size: 2.49 MB - Last synced at: 21 days ago - Pushed at: 8 months ago - Stars: 17 - Forks: 9

ismaildawoodjee/aws-data-pipeline

A batch processing data pipeline, using AWS resources (S3, EMR, Redshift, EC2, IAM), provisioned via Terraform, and orchestrated from locally hosted Airflow containers. The end product is a Superset dashboard and a Postgres database, hosted on an EC2 instance at this address (powered down):

Language: Python - Size: 4.77 MB - Last synced at: about 1 year ago - Pushed at: almost 3 years ago - Stars: 17 - Forks: 6

memosstilvi/emr-cost-calculator

EMR Cost Calculator

Language: Python - Size: 13.7 KB - Last synced at: 3 months ago - Pushed at: about 7 years ago - Stars: 17 - Forks: 27

xonai-computing/xonai-dashboard

A Grafana-based application to assist Big Data infrastructure optimization initiatives where Spark applications are a dominant cost driver

Language: Python - Size: 6.28 MB - Last synced at: 17 days ago - Pushed at: 11 months ago - Stars: 14 - Forks: 1

AWS-Big-Data-Projects/Analysing-Census-Data-using-aws

Use aws-emr and aws-redshift to analyse the US adult census dataset

Size: 638 KB - Last synced at: 1 day ago - Pushed at: over 4 years ago - Stars: 13 - Forks: 0

AWS-Big-Data-Projects/AWS-EMR

Analyzing Big Data with Amazon EMR

Size: 9.77 KB - Last synced at: 1 day ago - Pushed at: over 4 years ago - Stars: 12 - Forks: 0

AWS-Big-Data-Projects/Run-a-Spark-job-within-Amazon-EMR

Run a Spark job within Amazon EMR

Language: Java - Size: 8.79 KB - Last synced at: 1 day ago - Pushed at: over 4 years ago - Stars: 12 - Forks: 1

mauropelucchi/aws-emr-docker-integration

AWS EMR Docker integration

Language: Dockerfile - Size: 13.7 KB - Last synced at: 11 days ago - Pushed at: over 4 years ago - Stars: 11 - Forks: 2

linghaol/CommunityDetection-Spark-AWS

A Spark application, written in Python, that finds strongly connected components with a bi-directional label propagation algorithm. The project processed a 1.3 GB Twitter network dataset on an AWS EMR cluster.

Language: Python - Size: 298 KB - Last synced at: about 1 year ago - Pushed at: almost 6 years ago - Stars: 9 - Forks: 4

daniel-cortez-stevenson/cookiecutter-pyspark-cloud

A cookiecutter template for working with PySpark on AWS EMR

Language: Python - Size: 305 KB - Last synced at: about 1 year ago - Pushed at: over 4 years ago - Stars: 8 - Forks: 2

sjmiller8182/Warehousing-Stock-Tweet-Data

A large-scale data framework that will enable us to store and analyze financial market data and drive future predictions for investment.

Language: TSQL - Size: 8.43 MB - Last synced at: over 1 year ago - Pushed at: about 5 years ago - Stars: 7 - Forks: 3

wingkwong/aws-playground

My AWS Playground

Language: Python - Size: 348 KB - Last synced at: 19 days ago - Pushed at: 10 months ago - Stars: 6 - Forks: 0

Mathews-Tom/MSc-in-Machine-Learning-and-Artificial-Intelligence

Master of Science in Machine Learning & Artificial Intelligence - Indian Institute of Technology Madras & Liverpool John Moores University

Language: Jupyter Notebook - Size: 2.12 GB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 6 - Forks: 7

adornes/spark_r_ml_examples

Spark 2.0 R/SparkR Machine Learning examples

Language: R - Size: 6.84 KB - Last synced at: almost 2 years ago - Pushed at: over 8 years ago - Stars: 6 - Forks: 6

Nerdward/batch_gh_archive

Data Engineering Project with Terraform, Spark, AWS, Docker, Airflow and other tools

Language: Python - Size: 250 KB - Last synced at: about 2 years ago - Pushed at: almost 3 years ago - Stars: 4 - Forks: 0

pratikbarjatya/spark-walmart-data-analysis-exercise

Data Analysis Exercise over Walmart Stock

Language: Jupyter Notebook - Size: 42 KB - Last synced at: about 1 year ago - Pushed at: almost 6 years ago - Stars: 4 - Forks: 13

HarshadRanganathan/aws-emr-launcher

Generic Python library for provisioning EMR clusters with YAML config files (Configuration as Code)

Language: Python - Size: 128 KB - Last synced at: 7 days ago - Pushed at: over 2 years ago - Stars: 3 - Forks: 0

abhibalani/emr_lambda

Lambda to start EMR and run a MapReduce job

Language: Python - Size: 2.93 KB - Last synced at: about 2 years ago - Pushed at: over 5 years ago - Stars: 3 - Forks: 1

bajaj-varun/aws-test

Use-Case: Airline on-time performance

Language: Java - Size: 76.2 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 2 - Forks: 5

giulic3/data-engineering-nanodegree

Projects completed for the Data Engineering Nanodegree offered by Udacity https://www.udacity.com/course/data-engineer-nanodegree--nd027

Language: Jupyter Notebook - Size: 6.43 MB - Last synced at: about 1 month ago - Pushed at: about 2 years ago - Stars: 2 - Forks: 1

dhruv007patel/Impact-of-Covid-19-on-Aviation-Industry

This project analyzes the correlation between COVID-19 and the US aviation industry. By studying data on passenger/freight traffic and delays alongside COVID-19 trends, it provides insights into airline and passenger responses. The findings help airlines adapt to the pandemic's impact.

Language: Python - Size: 504 KB - Last synced at: almost 2 years ago - Pushed at: over 3 years ago - Stars: 2 - Forks: 0

jonathanAmancioSales/BigData_AWS_EMR_MRJob_DIO

Distributed data processing project using Python, MRJob, and AWS EMR

Language: Python - Size: 305 KB - Last synced at: about 2 years ago - Pushed at: over 3 years ago - Stars: 2 - Forks: 0

AuFeld/AWS_MWAA_With_Step_Functions

Build modern workflows with AWS MWAA, AWS Step Functions, AWS Glue, and AWS EMR

Language: Python - Size: 437 KB - Last synced at: about 1 year ago - Pushed at: almost 4 years ago - Stars: 2 - Forks: 1

ricardo-farias/CovidDataProduct

This repository will be used to understand data science and data engineering concepts

Language: Scala - Size: 641 KB - Last synced at: 11 months ago - Pushed at: over 4 years ago - Stars: 2 - Forks: 1

silviomori/covid19-datalake

Language: Python - Size: 13.7 KB - Last synced at: about 2 years ago - Pushed at: almost 5 years ago - Stars: 2 - Forks: 0

dvu4/udacity-data-engineering

Data Engineering Projects including Data Modeling, Data Warehouse, Data Lake Development

Language: Jupyter Notebook - Size: 2.09 MB - Last synced at: about 2 years ago - Pushed at: almost 5 years ago - Stars: 2 - Forks: 2

NitinSPatil15/Project-4-Data-Lake-with-AWS-EMR

An ETL pipeline that extracts data from S3, processes it using Spark, and loads the data back into S3 as a set of dimensional tables

Language: Python - Size: 601 KB - Last synced at: about 1 year ago - Pushed at: almost 5 years ago - Stars: 2 - Forks: 4

seahrh/bad-renter

Working examples of Spark ML Pipeline and SMOTE algorithm for synthetic data augmentation

Language: Scala - Size: 59.6 KB - Last synced at: 29 days ago - Pushed at: about 5 years ago - Stars: 2 - Forks: 0

mayankrastogi/faculty-page-rank

A Spark application to process the DBLP dataset to find out the Page Rank of faculty at the UIC CS department based on their co-authorships on publications.

Language: Scala - Size: 214 KB - Last synced at: over 1 year ago - Pushed at: about 6 years ago - Stars: 2 - Forks: 0

mayankrastogi/faculty-collaboration

A Hadoop Map-Reduce job to process the DBLP dataset to produce a graph depicting which professors at the CS department of UIC have co-authored publications.

Language: Scala - Size: 62.5 KB - Last synced at: over 1 year ago - Pushed at: about 6 years ago - Stars: 2 - Forks: 0

krishnan-mani/emr-access-bucket-cross-account

Illustrates access to S3 bucket owned by a different account from instances in an EMR cluster

Size: 4.88 KB - Last synced at: about 2 years ago - Pushed at: about 7 years ago - Stars: 2 - Forks: 1

trantuanngoc/us_immigration_data_engineering

US immigration data engineering: ETL pipeline, data modeling and warehousing of US immigration data

Language: HCL - Size: 3.58 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 1 - Forks: 0

samchenghowing/COMP4442

Analysis and monitoring system using AWS... Also the comp4442 project

Language: Python - Size: 38.5 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 1 - Forks: 0

markoshlima/crimes-map

This Big Data project collects data on vehicle theft in the city of São Paulo and consolidates it into counts and a heat map to show areas with a higher rate of this type of crime. Everything runs on AWS resources.

Language: Scala - Size: 13.7 MB - Last synced at: over 1 year ago - Pushed at: about 2 years ago - Stars: 1 - Forks: 0

khushal2405/ETL-pipeline-using-Airflow-and-AWS-EMR

An ETL pipeline built with Airflow that downloads data from an AWS S3 bucket, runs a Spark/Spark SQL job on the downloaded data to produce a cleaned-up dataset of orders that missed their delivery deadline, and then uploads the cleaned-up dataset back to the same S3 bucket in a folder primed for higher-level analytics

Language: Python - Size: 15.4 MB - Last synced at: about 2 years ago - Pushed at: about 2 years ago - Stars: 1 - Forks: 0

MDS-BD/aws-emr-local-dev-env-with-docker

Companion repository related to an AWS tech blog article.

Language: Dockerfile - Size: 9.77 KB - Last synced at: over 1 year ago - Pushed at: over 2 years ago - Stars: 1 - Forks: 1

johnnyiller/cluster_funk

An opinionated framework for running big data jobs

Language: Python - Size: 83 KB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 1 - Forks: 0

PannagaS/ETL-Logic-orchestration-using-Spark-and-AWS

ETL logic written in Spark transforms the dataset stored in S3, and queries on the transformed data are run using AWS Redshift. The datasets are in JSON format. All the raw JSON data must first be uploaded to an S3 source bucket. Using EMR, a Spark job is executed that fetches the source data from the S3 source bucket and performs the necessary transformations on it as per the problem statement. Finally, the transformed data is partitioned and stored in Parquet format in an S3 destination bucket. These files are then accessed using AWS Redshift by running SQL queries on the transformed data.

Language: Python - Size: 1.6 MB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 1 - Forks: 0

GiladAbudi/CollocationExtraction

MapReduce program that produces a list of the top-100 collocations from the Google 2-grams. Built with Java, AWS, Hadoop, and Amazon Elastic MapReduce

Language: Java - Size: 38.8 MB - Last synced at: about 2 years ago - Pushed at: about 3 years ago - Stars: 1 - Forks: 0

cevoaustralia/data-lake-demo

Data lake demo using change data capture (CDC) on AWS

Language: PLpgSQL - Size: 215 KB - Last synced at: about 1 year ago - Pushed at: about 3 years ago - Stars: 1 - Forks: 3

carlostomeh/Predict_Marketing_Campaign_Success

Goal: Develop a machine learning application in a distributed environment using AWS services with Spark.

Language: Jupyter Notebook - Size: 469 KB - Last synced at: about 2 years ago - Pushed at: over 3 years ago - Stars: 1 - Forks: 0

kunjmehta/ny-taxi-prediction-pyspark-emr

Repo containing the notebook for my PySpark big data EDA and ML project of New York taxi fare prices built using AWS EMR clusters

Language: Jupyter Notebook - Size: 502 KB - Last synced at: about 1 year ago - Pushed at: over 3 years ago - Stars: 1 - Forks: 0

najuzilu/DL-Spark

Building a Data Lake with Spark

Language: Python - Size: 894 KB - Last synced at: about 1 month ago - Pushed at: over 3 years ago - Stars: 1 - Forks: 0

motua16/Sparkify-Churn-Prediction---Pyspark

Machine Learning on a large 12 GB dataset with PySpark on AWS EMR

Language: Jupyter Notebook - Size: 2.35 MB - Last synced at: over 1 year ago - Pushed at: almost 4 years ago - Stars: 1 - Forks: 0

SmellyArmure/OC_DS_Project8

Design and deploy a Big Data architecture on AWS (OpenClassrooms | Data Scientist | Project 8)

Language: Jupyter Notebook - Size: 21.7 MB - Last synced at: about 2 years ago - Pushed at: about 4 years ago - Stars: 1 - Forks: 0

jomavera/dataPipelineEMR

ETL pipeline with PySpark on EMR orchestrated with Airflow

Language: Python - Size: 87.9 KB - Last synced at: about 2 years ago - Pushed at: about 4 years ago - Stars: 1 - Forks: 0

Mark-McAdam/Data-Engineering-Batch

Takes product reviews and performs natural language processing to provide sentiment analysis. The new insight gets combined with matching product information in the central database to provide a clearer picture of user behavior.

Language: Python - Size: 963 KB - Last synced at: about 2 years ago - Pushed at: over 4 years ago - Stars: 1 - Forks: 0

andre-marcos-perez/ifood-arch-readme

The application is the documentation of my solution for the iFood data architect test.

Size: 454 KB - Last synced at: 23 days ago - Pushed at: over 4 years ago - Stars: 1 - Forks: 0

rupeshtr78/awsiot

AWS IoT Integration Using EMR Spark Kinesis

Language: Jupyter Notebook - Size: 117 KB - Last synced at: about 2 months ago - Pushed at: over 4 years ago - Stars: 1 - Forks: 0

harshkavdikar1/Udacity-DataEngineering-NanoDegree

Language: Jupyter Notebook - Size: 3.58 MB - Last synced at: about 2 years ago - Pushed at: almost 5 years ago - Stars: 1 - Forks: 0

m1theus/aws-emr-terraform

Example for provisioning AWS EMR service with Terraform

Language: HCL - Size: 4.88 KB - Last synced at: 12 months ago - Pushed at: almost 5 years ago - Stars: 1 - Forks: 1

JiajunSong629/AWS_EMR_Spark_Workflow

Spark jobs workflow on AWS EMR

Language: Python - Size: 5.86 KB - Last synced at: over 1 year ago - Pushed at: about 5 years ago - Stars: 1 - Forks: 0

tansudasli/spark-sandbox

Apache Spark sandbox on GCP and Amazon EMR.

Language: Jupyter Notebook - Size: 3.89 MB - Last synced at: about 2 months ago - Pushed at: about 5 years ago - Stars: 1 - Forks: 0

pradeepbhadani/tf-examples

Terraform Examples

Language: HCL - Size: 46.9 KB - Last synced at: about 2 months ago - Pushed at: about 5 years ago - Stars: 1 - Forks: 7

ahuber1/Hubble-Simulator-AWS-Version

A version of the "Hubble Simulator" project that uses as many AWS services as possible. (Original project at https://github.com/ahuber1/Project5)

Language: Java - Size: 6.39 MB - Last synced at: over 1 year ago - Pushed at: over 5 years ago - Stars: 1 - Forks: 0

Huizerd/ET4310_SBD

Assignments belonging to the course Supercomputing for Big Data (ET4310) at TU Delft

Language: Scala - Size: 40.8 MB - Last synced at: over 1 year ago - Pushed at: over 5 years ago - Stars: 1 - Forks: 0

HAOYU-LI/SparkML-Churn-Prediction

Language: Jupyter Notebook - Size: 353 KB - Last synced at: over 1 year ago - Pushed at: about 6 years ago - Stars: 1 - Forks: 0

AditModi/realtime-bushfire-alert-with-apache-flink-cep (fork of aws-samples/realtime-bushfire-alert-with-apache-flink-cep)

Code and documentation for the demonstration of real-time bushfire alerting with Complex Event Processing (CEP) in Apache Flink on Amazon EMR and a simulated IoT sensor network, as described on the AWS Big Data Blog: Real-time bushfire alerting with Complex Event Processing in Apache Flink on Amazon EMR and IoT sensor network.

Size: 19.2 MB - Last synced at: 12 months ago - Pushed at: over 6 years ago - Stars: 1 - Forks: 0

matchilling/kata-mapreduce

Language: Jupyter Notebook - Size: 9.69 MB - Last synced at: 25 days ago - Pushed at: over 7 years ago - Stars: 1 - Forks: 0

smart-storm/storm-emr

Spins up an EMR (AWS) cluster for merge conversion purposes. Uses AWS CloudFormation.

Size: 0 Bytes - Last synced at: about 2 years ago - Pushed at: over 7 years ago - Stars: 1 - Forks: 0

shanmuga-sudan/Big-Data-Systems

This repo contains all the assignments and project work for the Engineering Big Data Systems coursework

Language: C# - Size: 299 MB - Last synced at: about 2 years ago - Pushed at: over 7 years ago - Stars: 1 - Forks: 0

jamespaultg/AWS_EMR

Language: Python - Size: 11.7 KB - Last synced at: over 1 year ago - Pushed at: about 8 years ago - Stars: 1 - Forks: 2

branesh2k/AWS-emr-project

AWS EMR-based ETL pipeline using PySpark and S3. Executed using SSH spark-submit.

Language: Python - Size: 1.29 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

kgelli/NYC-Taxi-Analytics---Spark-ETL-Pipeline-on-AWS-EMR

NYC Taxi Analytics: Spark ETL Pipeline on AWS EMR for processing and analyzing NYC taxi trip data using Apache Spark and Amazon Elastic MapReduce.

Language: Python - Size: 637 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

w-k-s/Celebrity-Soundbites-MapReduce-Experiment

Using MapReduce to build a dictionary of YouTube celebrity video clips

Language: Python - Size: 60.5 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

dmrhimali/terraform

Tutorial on how to create and run terraform scripts for providers aws and newrelic

Language: HCL - Size: 20.6 MB - Last synced at: about 2 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

sowrabh-m/Data_Processing_using_Spark_Flink

This project demonstrates data cleaning and processing with Apache Spark and Apache Flink, both locally and on AWS EMR.

Language: Python - Size: 1.46 MB - Last synced at: about 2 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

datawaver/emre-airflow

Use Airflow to create and run Spark Jobs with an EMRE Spark cluster

Language: Python - Size: 17.6 KB - Last synced at: about 2 months ago - Pushed at: 10 months ago - Stars: 0 - Forks: 0

gakas14/Batch-Data-Pipeline-using-Airflow-Spark-EMR-Snowflake

The project will utilize Airflow to orchestrate and manage the data pipeline as it creates and terminates an EMR transient cluster to save on cost. Apache Spark will transform data, and the final dataset will be loaded into Snowflake.

Language: Python - Size: 13.7 KB - Last synced at: about 2 months ago - Pushed at: 11 months ago - Stars: 0 - Forks: 0

epap011/Spark-EMR-HiBench-Performance-Testing

Analyzing Spark Cluster Performance in Amazon EMR

Language: Python - Size: 1.06 MB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 0 - Forks: 0

raghadkibrahim/google-ngrams-big-data

Language: Jupyter Notebook - Size: 0 Bytes - Last synced at: 12 months ago - Pushed at: 12 months ago - Stars: 0 - Forks: 0

BiGHeaDMaX/Traitement-Big-Data-avec-Spark

This project aims to process large volumes of data using Spark in the cloud.

Language: Jupyter Notebook - Size: 2.64 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

TimKong21/AWS-Batch-Processing

Big data analysis with AWS services, filtering the Wikiticker dataset with Apache Spark on Amazon EMR, storing data in S3, cataloging with AWS Glue, and querying with Amazon Athena. This end-to-end pipeline exemplifies handling and analyzing big data in the cloud.

Language: Python - Size: 8.01 MB - Last synced at: about 2 months ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

seyfal/SparkMitMAttackSim

Scalable simulation of MitM attacks using parallel random walks and graph analytics on Spark.

Language: Scala - Size: 76.2 KB - Last synced at: 5 days ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

eljandoubi/Sparkify

Utilize Apache Spark for ETL processes to prepare data, followed by the construction of a Machine Learning model for Natural Language Processing (NLP) classification. Subsequently, deploy the model within a Gradio web application for seamless interaction.

Language: Jupyter Notebook - Size: 805 KB - Last synced at: about 1 month ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0