GitHub topics: apachespark

Repositories

apache/hudi

Upserts, Deletes And Incremental Processing on Big Data.

Language: Java - Size: 1.74 GB - Last synced at: about 5 hours ago - Pushed at: 3 days ago - Stars: 5,750 - Forks: 2,396

tspannhw/FLiPStackWeekly

FLaNK AI Weekly covering Apache NiFi, Apache Flink, Apache Kafka, Apache Spark, Apache Iceberg, Apache Ozone, Apache Pulsar, and more...

Size: 767 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 20 - Forks: 0

josephmachado/docker_for_data_engineers

Code for blog at: https://www.startdataengineering.com/post/docker-for-de/

Language: C - Size: 561 KB - Last synced at: 13 days ago - Pushed at: 12 months ago - Stars: 36 - Forks: 15

DataExpert-io/data-engineer-handbook

This is a repo with links to everything you'd ever want to learn about data engineering

Language: Jupyter Notebook - Size: 55.6 MB - Last synced at: 17 days ago - Pushed at: 21 days ago - Stars: 27,428 - Forks: 5,607

datumbrain/gossub

Trigger spark-submit in Golang. A Go implementation of the famous SparkLauncher.java.

Language: Go - Size: 4.88 KB - Last synced at: 4 days ago - Pushed at: over 4 years ago - Stars: 7 - Forks: 2

hassonor/kafka-spark-data-engineering

Language: Java - Size: 3.2 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

Nkeramov/spark_java_benchmark

Example of working with Apache Spark over Java API

Language: Java - Size: 12.7 KB - Last synced at: 2 months ago - Pushed at: about 2 years ago - Stars: 0 - Forks: 0

MarioInf-Master-CompuerScience-UCM/Gestion_datosInformacion

Working repository for the course "Sistemas de gestión de datos y de la información" (Data and Information Management Systems), academic year 22-23, part of the Master's in Computer Engineering at the Universidad Complutense de Madrid (UCM)

Language: Jupyter Notebook - Size: 313 MB - Last synced at: 23 days ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

Nkeramov/spark_scala_benchmark

Language: Scala - Size: 4.88 KB - Last synced at: 2 months ago - Pushed at: about 2 years ago - Stars: 0 - Forks: 0

Punam918/Real_Time_Voting_Analysis

Real-time voting analysis system using big data technologies such as Apache Spark and Apache Kafka.

Language: Python - Size: 1.16 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

ravishankar324/Washington-state-electric-vehicles-ETL-pipeline

ETL data pipeline that processes Washington's EV data using Apache Spark, Docker, Snowflake, Airflow, and AWS services, and visualizes the transformed Parquet data in Tableau dashboards.

Language: Python - Size: 1.85 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

geazi-anc/dracula

A brief analysis of the most common words in Dracula, by Bram Stoker

Language: Jupyter Notebook - Size: 19.5 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

divithraju/divith-raju-Immigration-Data-Engineering

A Capstone Project that covers several aspects of Data Engineering (Data Exploration, Cleaning, Modeling, Pipelining, Processing)

Language: Jupyter Notebook - Size: 2.5 MB - Last synced at: 2 months ago - Pushed at: over 2 years ago - Stars: 3 - Forks: 0

holdenk/sparkProjectTemplate.g8

Template for Spark Projects

Language: Scala - Size: 37.1 KB - Last synced at: 21 days ago - Pushed at: 11 months ago - Stars: 101 - Forks: 41

Cyang18/MusicProducer (fork of KathiraveluLab/Dragonfly)

This is a distributed system that utilizes Apache Spark through Dataproc. We use the Spotify API to send song data to Apache Spark, which then forwards the information to Google Cloud Services. The system processes this data to recommend songs based on the extracted information.

Language: Python - Size: 179 KB - Last synced at: 5 months ago - Pushed at: 7 months ago - Stars: 0 - Forks: 1

ayyankalu/Ipl_Data_analysis_Spark

This repository showcases IPL data analysis using Apache Spark. The project demonstrates the power of Spark for data transformation, cleaning, SQL queries, and visualization, all performed with PySpark to handle large-scale data efficiently.

Language: Jupyter Notebook - Size: 2.65 MB - Last synced at: 8 months ago - Pushed at: 8 months ago - Stars: 0 - Forks: 0

SmartDataAnalytics/MA-INF-4223-DBDA-Lab

Repository for Lab “Distributed Big Data Analytics” (MA-INF 4223), University of Bonn

Language: Jupyter Notebook - Size: 177 MB - Last synced at: 22 days ago - Pushed at: over 2 years ago - Stars: 11 - Forks: 34

SwethaJoseph/Crime-Pattern-Analysis-Project

Analysis and visualization of open-source police data from two areas, Leicestershire Street and Northumbria Street, to derive data-driven insights

Size: 2.78 MB - Last synced at: about 1 month ago - Pushed at: 10 months ago - Stars: 0 - Forks: 0

meghna-cse/CloudComputingAndBigData-CSE6332

Projects completed as part of the CSE 6332 CCBD course at UTA, covering distributed computing, data processing frameworks, and cloud platforms.

Language: Java - Size: 4.88 KB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 0 - Forks: 0

ashkrit/sparkmicroservices

Microservices for Spark application

Language: Java - Size: 42 KB - Last synced at: 12 months ago - Pushed at: almost 2 years ago - Stars: 5 - Forks: 8

saikumarsuvanam/BigData

Hadoop, MachineLearningAlgos, Spark, Pig, Hive

Language: Java - Size: 4.37 MB - Last synced at: 12 months ago - Pushed at: over 7 years ago - Stars: 0 - Forks: 1

Az1m04/Advance-Data-Science-with-IBM-Watson-Studio

A Python notebook that shows how to calculate covariance and correlations using PySpark.

Language: Jupyter Notebook - Size: 1000 Bytes - Last synced at: about 1 year ago - Pushed at: over 4 years ago - Stars: 0 - Forks: 0

AbdelmajidLh/spark-functionality-repo

This GitHub repository contains a detailed document on the basics of the Scala language.

Size: 847 KB - Last synced at: 28 days ago - Pushed at: about 1 year ago - Stars: 3 - Forks: 0

propelledanalytics/SparkSQL.jl

SparkSQL.jl enables Julia programs to work with Apache Spark data using just SQL.

Language: Julia - Size: 146 KB - Last synced at: 26 days ago - Pushed at: about 1 year ago - Stars: 25 - Forks: 0

ZeroTwoDataRW/DE-Stream-Project-Random-Generated-User-Data

An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All components are containerized with Docker for easy deployment and scalability.

Language: Python - Size: 393 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 1 - Forks: 0

mehrdadalmasi2020/ApacheSpark_ApacheZeppelin_SQL_Shell

Run your first analysis project on Apache Zeppelin using Scala (Spark), Shell, and SQL

Language: Scala - Size: 1.4 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

gilga001/HPCandBigDataPipeline

A published paper in PEARC18: Combining HPC and Big Data Infrastructures in Large-Scale Post-Processing of Simulation Data: A Case Study

Language: Python - Size: 33.2 KB - Last synced at: over 1 year ago - Pushed at: almost 7 years ago - Stars: 0 - Forks: 0

Thapep/ApacheSpark

Apache Spark project for Advanced Topics on Databases course

Language: Python - Size: 438 KB - Last synced at: over 1 year ago - Pushed at: about 4 years ago - Stars: 1 - Forks: 1

sfrechette/spark-jdbc-mssql

Connect to SQL Server using Apache Spark

Language: Scala - Size: 1.95 KB - Last synced at: over 1 year ago - Pushed at: over 8 years ago - Stars: 7 - Forks: 2

Cloudslab/dSpark

Source code for the work "dSpark: Deadline-Based Resource Allocation for Big Data Applications in Apache Spark" published in IEEE e-Science 2017

Language: Java - Size: 15.6 KB - Last synced at: over 1 year ago - Pushed at: about 7 years ago - Stars: 1 - Forks: 2

sahith/Link-Prediction-for-Citation-Networks-using-Apache-Spark

Link prediction is about predicting the future connections in a graph. In this project, it means predicting whether two authors will collaborate on a future paper, given the graph of authors who have collaborated on at least one paper together.

Language: Scala - Size: 6.41 MB - Last synced at: over 1 year ago - Pushed at: over 5 years ago - Stars: 5 - Forks: 1

sarathchandrikak/ETL-Bank-Transcation

Data Analysis of bank transaction data

Language: Jupyter Notebook - Size: 9.34 MB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 1 - Forks: 0

funkyminds/cleanframes

Type-class-based data cleansing library for Apache Spark SQL

Language: Scala - Size: 67.4 KB - Last synced at: over 1 year ago - Pushed at: almost 6 years ago - Stars: 81 - Forks: 8

YFC-ophey/big-data-group-project

US superstore opening analysis

Size: 3.5 MB - Last synced at: 2 months ago - Pushed at: about 2 years ago - Stars: 0 - Forks: 0

aravinthsci/Spark_Delta_Lake

Delta Lake Examples

Language: Jupyter Notebook - Size: 285 KB - Last synced at: about 1 year ago - Pushed at: about 5 years ago - Stars: 12 - Forks: 12

martandsingh/ApacheSpark

This repository will help you learn Databricks concepts with the help of examples. It includes all the important topics needed in real-life work as a data engineer. We use PySpark and Spark SQL for development. At the end of the course we also cover a few case studies.

Language: Python - Size: 141 KB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 71 - Forks: 47

lensesio/lenses-jdbc-spark 📦

Apache Spark with Kafka via JDBC !!!

Language: Java - Size: 3.91 KB - Last synced at: 12 months ago - Pushed at: almost 7 years ago - Stars: 6 - Forks: 2

MRYingLEE/Apache_Spark_in_Google_Colab 📦

This is a Jupyter Notebook to practice Apache Spark in Google Colab, especially for the CCA Spark and Hadoop Developer Exam (CCA175).

Language: Jupyter Notebook - Size: 35.2 KB - Last synced at: about 1 year ago - Pushed at: almost 5 years ago - Stars: 2 - Forks: 3

bartosz25/data-ai-summit-2020

Here you will find the demo code for my Data+AI 2020 talk about customizing the Apache Spark state store.

Size: 194 KB - Last synced at: about 1 year ago - Pushed at: over 2 years ago - Stars: 2 - Forks: 2

funkyminds/cleanframes-examples

Example usages of the cleanframes library

Language: Scala - Size: 8.79 KB - Last synced at: over 1 year ago - Pushed at: almost 6 years ago - Stars: 5 - Forks: 1

syedsaadahmed/Hands-on-with-Apache-Spark

Working with Apache Spark: creating some small tutorials and, finally, implementing a small project

Size: 1000 Bytes - Last synced at: 2 months ago - Pushed at: almost 5 years ago - Stars: 1 - Forks: 0

amit2014/IBM-AI-Engineering-Professional-Certificate

The rapid pace of innovation in Artificial Intelligence (AI) is creating enormous opportunity for transforming entire industries and our very existence. After completing this comprehensive 6-course Professional Certificate, you will get a practical understanding of Machine Learning and Deep Learning. You will master fundamental concepts of Machine Learning and Deep Learning, including supervised and unsupervised learning. You will utilize popular Machine Learning and Deep Learning libraries such as SciPy, ScikitLearn, Keras, PyTorch, and Tensorflow applied to industry problems involving object recognition and Computer Vision, image and video processing, text analytics, Natural Language Processing, recommender systems, and other types of classifiers. You will be able to scale Machine Learning on Big Data using Apache Spark. You will build, train, and deploy different types of Deep Architectures, including Convolutional Networks, Recurrent Networks, and Autoencoders. By the end of this Professional Certificate, you will have completed several projects showcasing your proficiency in Machine Learning and Deep Learning, and become armed with skills for a career as an AI Engineer.

Size: 14.6 KB - Last synced at: about 1 year ago - Pushed at: about 5 years ago - Stars: 0 - Forks: 0

amit2014/Advanced-Data-Science-with-IBM-Specialization

As a Coursera-certified specialization completer, you will have a proven deep understanding of massively parallel data processing, data exploration and visualization, and advanced machine learning & deep learning. You'll understand the mathematical foundations behind all machine learning & deep learning algorithms. You can apply knowledge in practical use cases, justify architectural decisions, understand the characteristics of different algorithms, frameworks & technologies & how they impact model performance & scalability. If you choose to take this specialization and earn the Coursera specialization certificate, you will also earn an IBM digital badge. To find out more about IBM digital badges, follow the link ibm.biz/badging.

Size: 13.7 KB - Last synced at: about 1 year ago - Pushed at: over 5 years ago - Stars: 0 - Forks: 0

mayankrawat/CSVJoin

Use this project to join data from multiple CSV files. The project currently supports one-to-one and one-to-many joins. It also shows how to use a Kafka producer efficiently with Spark.

Language: Java - Size: 10.7 KB - Last synced at: about 2 years ago - Pushed at: almost 3 years ago - Stars: 2 - Forks: 0

payamrastogi/SparkCourse

Language: Python - Size: 3.62 MB - Last synced at: about 1 year ago - Pushed at: over 7 years ago - Stars: 0 - Forks: 0

Related Keywords

apachespark (45), spark (15), bigdata (11), python (10), pyspark (10), scala (8), sql (6), sparksql (5), kafka (4), docker (4), hive (4), datalake (4), apachekafka (3), databricks (3), datavisualization (3), java (3), spark-sql (3), hadoop (3), dataengineering (3), big-data (3), apache (2), deeplearning (2), deltalake (2), datascience (2), data-engineering (2), airflow (2), shapeless (2), emr (2), python3 (2), s3 (2), tableau (2), machine-learning (2), apacheflink (2), zeppelin-notebook (2), jdbc-driver (2), apache-spark (2), data (2), docker-compose (2), linkprediction (1), apachespark-rdd (1), redshift (1), s3bucket (1), simulation (1), sqoop (1), postprocessing (1), mdtraj (1), hpc (1), apachehudi (1), sparkscala (1), businessanalysis (1), link-prediction (1), databases (1), dataframes-api (1), dataframes (1), ntua (1), big-data-analytics (1), awsemr (1), aws (1), resource-allocation (1), sql-server (1), sqlserver (1), performance-modeling (1), deadline (1), keras (1), ml (1), internetofthings (1), integrate-kafka-spark (1), join-apache-spark (1), kafka-producer (1), kafka-producer-spark (1), kafka-spark (1), kafka-with-spark (1), one-to-many (1), one-to-many-join (1), one-to-many-joins-spark (1), one-to-one-join (1), spark-csv (1), spark-dataframes (1), spark-java (1), spark-kafka (1), spark-kafka-integration (1), hdfs (1), delta-lake (1), data-analysis (1), database (1), etl (1), etl-pipeline (1), spark-streaming (1), timetravel (1), colab (1), googlecolab (1), streaming-data (1), structured-streaming (1), joins (1), machine-learning-algorithms (1), ai (1), ann (1), coursera (1), ibm (1), mongodb (1)