GitHub topics: spark-sql | Ecosyste.ms: Repos

abailey81/Big-data-spark-analytics

End-to-end PySpark analysis of NYC Taxi, Ethereum, and NASA logs (batch, graph, streaming)

Language: Python - Size: 102 KB - Last synced at: about 14 hours ago - Pushed at: about 15 hours ago - Stars: 1 - Forks: 0

dotnet/spark

.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.

Language: C# - Size: 4.88 MB - Last synced at: about 5 hours ago - Pushed at: about 2 months ago - Stars: 2,084 - Forks: 330

vim89/flowforge

Let's be honest - most data pipeline frameworks treat types as suggestions. Config files are strings. Schemas are "validated" at runtime. Data quality is an afterthought. So let's do it differently.

Language: Scala - Size: 9.23 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 0 - Forks: 0

almond-sh/almond

A Scala kernel for Jupyter

Language: Scala - Size: 13.3 MB - Last synced at: 1 day ago - Pushed at: 2 days ago - Stars: 1,618 - Forks: 253

japila-books/spark-sql-internals

The Internals of Spark SQL

Size: 1.58 GB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 477 - Forks: 135

A beginner-friendly PySpark learning repository that covers essential functions, transformations, and actions to get started with big data processing and distributed computing. Perfect for those new to PySpark and Apache Spark.

Language: Jupyter Notebook - Size: 102 KB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 0 - Forks: 0

apache/kyuubi

Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.

Language: Scala - Size: 60.5 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 2,269 - Forks: 970

apache/incubator-gluten

Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines.

Language: Scala - Size: 203 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 1,469 - Forks: 540

Sudharsanan098/PySpark

📚 Master PySpark in 18 days with structured lessons, hands-on tasks, and an end-to-end project, covering essential concepts and ML model training.

Size: 1.3 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 0 - Forks: 0

Bilal-iqbal01/Voltmart-Orders-Data-Cleaning

🧹 Clean and preprocess Voltmart's orders dataset for accurate demand forecasting using PySpark, ensuring high data quality and consistency.

Language: Python - Size: 9.67 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 0 - Forks: 0

imran5570/pyNotes

📝 Simplify your note-taking with pyNotes, a user-friendly desktop app for adding, editing, and organizing notes using Python and PyQt6.

Language: Python - Size: 1.33 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 1 - Forks: 0

cladsanityPH/Geospatial-Analysis-With-Spark

🚀 Streamline real-time geospatial analytics for MBTA using Kafka, Spark, and MongoDB, delivering live vehicle status and insights via a dynamic web interface.

Language: Python - Size: 13.3 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 0 - Forks: 0

jackstanmma/Customer-Sentiment-Prediction-PySpark

🔍 Predict customer sentiment using PySpark and Databricks, analyzing 49,000+ e-commerce reviews to drive insights for improved business strategies.

Language: Python - Size: 2.45 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 0 - Forks: 0

Nonppk/pyspark-etl-automation

A containerized automated ETL pipeline built with PySpark, PostgreSQL, and Docker.

Language: Python - Size: 110 KB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 1 - Forks: 0

shivp436/taming-big-data-with-spark-udemy

Course material & practice scripts from the Spark4 course by Frank Kane

Language: Python - Size: 15.3 MB - Last synced at: 13 days ago - Pushed at: 14 days ago - Stars: 0 - Forks: 0

streamnative/awesome-pulsar

A curated list of Pulsar tools, integrations and resources.

Size: 11.7 KB - Last synced at: 1 day ago - Pushed at: almost 5 years ago - Stars: 85 - Forks: 9

AlexRogalskiy/spark-patterns

🏆 Spark4You Design patterns

Language: Shell - Size: 21.9 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 3 - Forks: 0

microsoft/data-accelerator

Data Accelerator for Apache Spark simplifies onboarding to streaming of big data. It offers a rich, easy-to-use experience for creating, editing, and managing Spark jobs on Azure HDInsight or Databricks while enabling the full power of the Spark engine.

Language: C# - Size: 401 MB - Last synced at: 17 days ago - Pushed at: 8 months ago - Stars: 306 - Forks: 91

riccardotommasini/spark-training

This repository is for training on Spark and Spark Structured Streaming.

Language: Jupyter Notebook - Size: 147 MB - Last synced at: 19 days ago - Pushed at: 19 days ago - Stars: 2 - Forks: 2

fuadonates/sql-migration-modernization

SQL Server to Databricks migration expertise - T-SQL translation, stored procedure modernization, Delta Lake patterns for financial services

Language: Python - Size: 12.7 KB - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 0 - Forks: 0

fuadonates/enterprise-storage-migration-analytics

Data analytics platform tracking global cloud storage migration for 130M+ users - Tableau, Power BI, MySQL, Hive, Databricks

Language: Python - Size: 0 Bytes - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 0 - Forks: 0

AMRprojects/EDA-practice

Contains practice Exploratory Data Analysis (EDA) and Machine Learning Analysis for Kaggle Data Sets

Language: Jupyter Notebook - Size: 6.17 MB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 0 - Forks: 0

OKDP/spark-images

Collection of Apache Spark docker images for OKDP

Language: Dockerfile - Size: 146 KB - Last synced at: 22 days ago - Pushed at: 22 days ago - Stars: 2 - Forks: 4

databricks/LearningSparkV2

This is the GitHub repo for Learning Spark: Lightning-Fast Data Analytics [2nd Edition]

Language: Scala - Size: 75.2 MB - Last synced at: 22 days ago - Pushed at: 10 months ago - Stars: 1,349 - Forks: 783

mixaisealx/DevOps-n-DataOps

Hands-on project demos covering infrastructure automation (Ansible, Docker), big-data processing & streaming (Hive, Spark, Kafka), and network experiments (MitM, TCP-over-UDP).

Language: Python - Size: 61.5 KB - Last synced at: 24 days ago - Pushed at: 27 days ago - Stars: 0 - Forks: 0

getredash/redash

Make Your Company Data Driven. Connect to any data source, easily visualize, dashboard and share your data.

Language: Python - Size: 27.8 MB - Last synced at: 27 days ago - Pushed at: 28 days ago - Stars: 27,874 - Forks: 4,527

bumfo/spark-protobuf-backport

The spark‑protobuf‑backport project brings the from_protobuf and to_protobuf functions introduced in Spark 3.4 to Spark 3.2.1.

Language: Scala - Size: 709 KB - Last synced at: 28 days ago - Pushed at: 28 days ago - Stars: 1 - Forks: 0

Wh1isper/sparglim

Sparglim✨ makes PySpark apps configurable and Spark Connect Server deployment easier!

Language: Python - Size: 159 KB - Last synced at: 18 days ago - Pushed at: 29 days ago - Stars: 38 - Forks: 4

harryprince/awesome-sparklyr

An awesome collection of sparklyr-related packages

Size: 47.9 KB - Last synced at: 8 days ago - Pushed at: over 5 years ago - Stars: 42 - Forks: 7

ZuhairBhati/healthcare_dlt_pipeline

This project demonstrates a comprehensive Delta Live Tables (DLT) pipeline for healthcare data processing, implementing the Bronze-Silver-Gold medallion architecture. The pipeline processes patient admission data with real-time streaming capabilities, data quality enforcement, and multi-dimensional analytics.

Language: Jupyter Notebook - Size: 626 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

JBris/time-series-airflow-kafka-spark

A simple demonstration of an Airflow-Kafka-Spark (AKS) stack for online time series forecasting.

Language: Python - Size: 699 KB - Last synced at: 13 days ago - Pushed at: over 1 year ago - Stars: 1 - Forks: 0

groda/big_data

Big Data essentials: Hadoop, MapReduce, Spark. Explore tutorials and demos in Jupyter notebooks—most are self-contained and live, ready to run with a click.

Language: Jupyter Notebook - Size: 62.5 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 85 - Forks: 27

SamuelBarbosaDev/Justweb_Technical_Test

This is a technical test for the Mid-level Python Developer position.

Language: Jupyter Notebook - Size: 3.66 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

smusab9152/Pyspark_programs_and_projects

Collection of PySpark programs and projects demonstrating the use of Apache Spark's Python API for big data processing and analysis. It includes practical implementations such as logistic regression classification, data analysis on the Iris dataset, and basic PySpark operations like temperature conversion.

Language: Jupyter Notebook - Size: 588 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

kevinkyhalim/SparkCourse

Learnings of PySpark from Udemy

Language: Python - Size: 92.5 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

bbenz/bbenz-azure-aca-rapids

This project demonstrates how to run GPU-accelerated XGBoost machine learning models using NVIDIA RAPIDS on Azure Container Apps. The implementation uses Apache Spark with RAPIDS acceleration to process the Agaricus mushroom dataset, train an XGBoost classification model to identify edible vs. poisonous mushrooms, and store vector embeddings

Language: Java - Size: 51.8 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 1

OmarAlhaz/Voltmart-Orders-Data-Cleaning

Data cleaning and preprocessing of Voltmart e-commerce orders using PySpark (DataFrame API & SQL). Includes feature engineering (time_of_day, purchase_state), data validation, and output in Parquet format for demand forecasting.

Language: Python - Size: 8.59 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

ploomber/jupysql (fork of catherinedevlin/ipython-sql)

Better SQL in Jupyter. 📊

Language: Python - Size: 12.9 MB - Last synced at: about 1 month ago - Pushed at: 2 months ago - Stars: 813 - Forks: 79

martandsingh/ApacheSpark

This repository helps you learn Databricks concepts through examples. It covers the important topics a data engineer needs in real-world work. Development uses PySpark and Spark SQL. The course ends with a few case studies.

Language: Python - Size: 143 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 102 - Forks: 66

newfront/delta-connect-playground

Learn to use delta-connect without any headaches.

Language: Jupyter Notebook - Size: 114 KB - Last synced at: about 1 month ago - Pushed at: about 2 months ago - Stars: 3 - Forks: 0

Orkhan-1/Full-Course-Apache-Spark

This comprehensive course is designed for beginners and experienced developers alike, providing an in-depth exploration of Apache Spark

Language: Java - Size: 7.81 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 1 - Forks: 0

letsiki/airflow-orchestrated-cleaning-and-hdfs-loading

[data engineer assessment] Airflow-orchestrated cleaning of transactional data and loading into Hadoop HDFS

Language: Python - Size: 13.7 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

bumfo/spark-datasources

This project provides a DataSource V2 implementation for reading 4mc‐compressed text files into Spark DataFrames. It extends Spark’s FileDataSourceV2 and FileTable to leverage Spark’s distributed file discovery and partition pruning.

Language: Scala - Size: 139 KB - Last synced at: about 1 month ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

dejarol/azure-search-spark

Azure AI Search connector for Spark

Language: Scala - Size: 1.63 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

kriss024/Spark

Spark for Data Science and ETL process.

Language: Jupyter Notebook - Size: 78.1 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 2 - Forks: 0

camilesing/Hive-Spark-SQL-Helper-VSCode

Hive & Spark SQL extension for Visual Studio Code

Language: TypeScript - Size: 7.35 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 5 - Forks: 0

taylorteixeira/APACHE-SPARK-COM-DELTA-LAKE-E-APACHE-ICEBERG

Project developed to demonstrate local Apache Spark (pyspark) writing files in Delta Lake format, also locally, creating an ER model, images, and DDL code, and documenting and explaining the data source used (public data).

Language: Jupyter Notebook - Size: 742 KB - Last synced at: 2 months ago - Pushed at: 7 months ago - Stars: 2 - Forks: 1

Alex-Nettekoven/Climate-Data-Analysis

Climate Data Analysis on BigQuery with Spark and Data Visualization

Language: Jupyter Notebook - Size: 2.29 MB - Last synced at: 2 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

streamnative/pulsar-spark

Spark Connector to read and write with Pulsar

Language: Scala - Size: 722 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 116 - Forks: 53

LearningJournal/Spark-Streaming-In-Python

Apache Spark 3 - Structured Streaming Course Material

Language: Python - Size: 19.4 MB - Last synced at: 2 months ago - Pushed at: about 2 years ago - Stars: 122 - Forks: 164

samuelmurguia/big-data-analytics-on-stocks-data

Enhanced profitability and research of stocks historical data using distributed system analytics.

Language: Jupyter Notebook - Size: 7.06 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

hazecodeio/spark-sandbox

Language: Scala - Size: 13.2 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

sunujh6/spark_practice

Language: Jupyter Notebook - Size: 1.62 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 2 - Forks: 0

Qbeast-io/qbeast-spark

Qbeast-spark: DataSource enabling multi-dimensional indexing and efficient data sampling. Big Data, free from the unnecessary!

Language: Scala - Size: 37.3 MB - Last synced at: 3 months ago - Pushed at: 10 months ago - Stars: 232 - Forks: 23

fealt/databricks_incremental_data_project

Databricks project showcasing incremental data ingestion with industry best practices.

Language: Jupyter Notebook - Size: 11.5 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

BayoAdejare/lightning-streams

Batch/stream ETL pipeline of NOAA GLM dataset, using Python frameworks: Dagster, PySpark and Parquet storage.

Language: Python - Size: 63.4 MB - Last synced at: about 1 month ago - Pushed at: about 2 years ago - Stars: 5 - Forks: 0

anaregdesign/openaivec

Pandas extension, Tabular calculation with LLM, Spark UDF Builder

Language: Python - Size: 1.31 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 13 - Forks: 1

royungar/BigData_Employee_Insights_Project

This project simulates a mini data engineering pipeline using Apache Spark. It showcases structured data ingestion from CSV, schema enforcement, SQL interface creation, column transformations, aggregations, and join operations using PySpark — all designed to reflect real-world data processing workflows.

Language: Jupyter Notebook - Size: 35.2 KB - Last synced at: 3 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

e1washere/weather-spark-pipeline

Scalable pipeline using Apache Spark to process and analyze weather data.

Language: Python - Size: 140 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

thiagoneye/course-databricks

Databricks studies.

Size: 13 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 1 - Forks: 0

Smars-Bin-Hu/azure-cloud-datapipeline-EDA

A cloud-native data pipeline and visualization project analyzing Formula 1 racing data using Azure, Databricks, Delta Lake, Tableau, and Python for insightful EDA and interactive dashboards.

Language: Jupyter Notebook - Size: 26.4 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 3 - Forks: 0

izhangzhihao/Real-time-Data-Warehouse

Real-time Data Warehouse with Apache Flink & Apache Kafka & Apache Hudi

Language: Dockerfile - Size: 106 KB - Last synced at: 2 months ago - Pushed at: almost 2 years ago - Stars: 117 - Forks: 44

Dhaanesh26/spotify-analysis

Spotify listening trends analyzed and visualized using AWS cloud services

Language: Python - Size: 555 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

nsphung/pyspark-template

A Python PySpark Project with Poetry

Language: Jupyter Notebook - Size: 98.6 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 23 - Forks: 4

Bashwan/data-engineer-notes

Explore my Data Engineer Bootcamp notes, featuring summaries, key concepts, and lab solutions. Perfect for mastering data engineering! 🚀📊

Language: Makefile - Size: 22.5 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

donat-konan33/EtltAirbyteGcsBigQueryDbtAirflow

Deployment of a containerized ELT/ETLT architecture integrating Airbyte for data extraction, python for data processing, and bigquery-dbt for analytical modeling and transformation. This project enables experimentation with new data extraction and processing use cases using Airbyte and python, while ensuring portability through Docker

Language: Python - Size: 2.56 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

josephmachado/advanced_spark_sql_for_data_engineers

Advanced Spark SQL for Data Engineers

Language: Jupyter Notebook - Size: 6.45 MB - Last synced at: 2 months ago - Pushed at: 5 months ago - Stars: 12 - Forks: 4

canluca09/real-estate-spark-analytics

Analyze real estate trends with our distributed big data framework using Apache Spark. Predict property prices and market dynamics effectively. 🏠✨

Language: Python - Size: 2.14 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

debanjan06/real-estate-spark-analytics

Production-grade Apache Spark application for real estate analytics. Implements distributed ML (Linear Regression, Random Forest, GBT) with PySpark 3.5.5, processing 100K+ properties across major US cities.

Language: Python - Size: 2.14 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

Sabal999/end-to-end-data-pipeline-acs

This repository showcases a robust end-to-end data pipeline for the American Community Survey dataset, utilizing tools like Python, SparkSQL, and Docker. 🚀 Explore the architecture that transforms raw data into valuable insights through a Bronze / Silver / Gold framework. 🐙

Language: Python - Size: 1.17 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

GADES-DATAENG/mod3-spark

A repository containing all the necessary code and resources for an Apache Spark demo, showcasing data processing and transformation workflows with practical examples.

Language: Jupyter Notebook - Size: 425 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

LXY-yq/UserActionAnalyze

E-commerce user behavior analysis data platform

Language: Java - Size: 4.88 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

shreyasaxena29103/tmdb-movie-analysis-and-optimization

This project explores the TMDB movie dataset (1M+ records) using PySpark and Spark SQL in Databricks. It focuses on performance optimization and analysis

Language: HTML - Size: 164 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

Sushmi08B/NYC-YELLOWTAXI-DATA-ETL

Production-scale ETL pipeline using PySpark and PostgreSQL to process 100M+ NYC Yellow Taxi trip records with full data validation, enrichment, and partitioned storage

Language: Python - Size: 1.04 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

snigdhasv/Music_Mood_Analyzer

This Real-Time Music Mood Analyzer system processes streaming music data to classify songs by mood using audio features. It combines Apache Kafka for real-time streaming, Spark Streaming for mood classification, and MySQL for storage. A dashboard visualizes results, while batch processing reanalyzes historical data for accuracy.

Language: Python - Size: 441 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0