GitHub topics: spark-sql
abailey81/Big-data-spark-analytics
End-to-end PySpark analysis of NYC Taxi, Ethereum, and NASA logs (batch, graph, streaming)
Language: Python - Size: 102 KB - Last synced at: about 14 hours ago - Pushed at: about 15 hours ago - Stars: 1 - Forks: 0
dotnet/spark
.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.
Language: C# - Size: 4.88 MB - Last synced at: about 5 hours ago - Pushed at: about 2 months ago - Stars: 2,084 - Forks: 330
vim89/flowforge
Let's be honest - most data pipeline frameworks treat types as suggestions. Config files are strings. Schemas are "validated" at runtime. Data quality is an afterthought. So let's do it differently.
Language: Scala - Size: 9.23 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 0 - Forks: 0
almond-sh/almond
A Scala kernel for Jupyter
Language: Scala - Size: 13.3 MB - Last synced at: 1 day ago - Pushed at: 2 days ago - Stars: 1,618 - Forks: 253
japila-books/spark-sql-internals
The Internals of Spark SQL
Size: 1.58 GB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 477 - Forks: 135
deypadma2020/PySpark
A beginner-friendly PySpark learning repository that covers essential functions, transformations, and actions to get started with big data processing and distributed computing. Perfect for those new to PySpark and Apache Spark.
Language: Jupyter Notebook - Size: 102 KB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 0 - Forks: 0
apache/kyuubi
Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
Language: Scala - Size: 60.5 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 2,269 - Forks: 970
apache/incubator-gluten
Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines.
Language: Scala - Size: 203 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 1,469 - Forks: 540
Sudharsanan098/PySpark
📚 Master PySpark in 18 days with structured lessons, hands-on tasks, and an end-to-end project, covering essential concepts and ML model training.
Size: 1.3 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 0 - Forks: 0
Bilal-iqbal01/Voltmart-Orders-Data-Cleaning
🧹 Clean and preprocess Voltmart's orders dataset for accurate demand forecasting using PySpark, ensuring high data quality and consistency.
Language: Python - Size: 9.67 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 0 - Forks: 0
imran5570/pyNotes
📝 Simplify your note-taking with pyNotes, a user-friendly desktop app for adding, editing, and organizing notes using Python and PyQt6.
Language: Python - Size: 1.33 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 1 - Forks: 0
cladsanityPH/Geospatial-Analysis-With-Spark
🚀 Streamline real-time geospatial analytics for MBTA using Kafka, Spark, and MongoDB, delivering live vehicle status and insights via a dynamic web interface.
Language: Python - Size: 13.3 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 0 - Forks: 0
jackstanmma/Customer-Sentiment-Prediction-PySpark
🔍 Predict customer sentiment using PySpark and Databricks, analyzing 49,000+ e-commerce reviews to drive insights for improved business strategies.
Language: Python - Size: 2.45 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 0 - Forks: 0
Nonppk/pyspark-etl-automation
A containerized automated ETL pipeline built with PySpark, PostgreSQL, and Docker.
Language: Python - Size: 110 KB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 1 - Forks: 0
shivp436/taming-big-data-with-spark-udemy
Course material & practice scripts from the Spark4 course by Frank Kane
Language: Python - Size: 15.3 MB - Last synced at: 13 days ago - Pushed at: 14 days ago - Stars: 0 - Forks: 0
streamnative/awesome-pulsar
A curated list of Pulsar tools, integrations and resources.
Size: 11.7 KB - Last synced at: 1 day ago - Pushed at: almost 5 years ago - Stars: 85 - Forks: 9
AlexRogalskiy/spark-patterns
🏆 Spark4You Design patterns
Language: Shell - Size: 21.9 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 3 - Forks: 0
microsoft/data-accelerator
Data Accelerator for Apache Spark simplifies onboarding to Streaming of Big Data. It offers a rich, easy to use experience to help with creation, editing and management of Spark jobs on Azure HDInsights or Databricks while enabling the full power of the Spark engine.
Language: C# - Size: 401 MB - Last synced at: 17 days ago - Pushed at: 8 months ago - Stars: 306 - Forks: 91
riccardotommasini/spark-training
This repository is for the training on Spark and Spark Structured Streaming
Language: Jupyter Notebook - Size: 147 MB - Last synced at: 19 days ago - Pushed at: 19 days ago - Stars: 2 - Forks: 2
fuadonates/sql-migration-modernization
SQL Server to Databricks migration expertise - T-SQL translation, stored procedure modernization, Delta Lake patterns for financial services
Language: Python - Size: 12.7 KB - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 0 - Forks: 0
fuadonates/enterprise-storage-migration-analytics
Data analytics platform tracking global cloud storage migration for 130M+ users - Tableau, Power BI, MySQL, Hive, Databricks
Language: Python - Size: 0 Bytes - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 0 - Forks: 0
AMRprojects/EDA-practice
Contains practice Exploratory Data Analysis (EDA) and Machine Learning Analysis for Kaggle Data Sets
Language: Jupyter Notebook - Size: 6.17 MB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 0 - Forks: 0
OKDP/spark-images
Collection of Apache Spark docker images for OKDP
Language: Dockerfile - Size: 146 KB - Last synced at: 22 days ago - Pushed at: 22 days ago - Stars: 2 - Forks: 4
databricks/LearningSparkV2
This is the github repo for Learning Spark: Lightning-Fast Data Analytics [2nd Edition]
Language: Scala - Size: 75.2 MB - Last synced at: 22 days ago - Pushed at: 10 months ago - Stars: 1,349 - Forks: 783
mixaisealx/DevOps-n-DataOps
Hands-on project demos covering infrastructure automation (Ansible, Docker), big-data processing & streaming (Hive, Spark, Kafka), and network experiments (MitM, TCP-over-UDP).
Language: Python - Size: 61.5 KB - Last synced at: 24 days ago - Pushed at: 27 days ago - Stars: 0 - Forks: 0
getredash/redash
Make Your Company Data Driven. Connect to any data source, easily visualize, dashboard and share your data.
Language: Python - Size: 27.8 MB - Last synced at: 27 days ago - Pushed at: 28 days ago - Stars: 27,874 - Forks: 4,527
bumfo/spark-protobuf-backport
The spark‑protobuf‑backport project brings the from_protobuf and to_protobuf functions introduced in Spark 3.4 to Spark 3.2.1.
Language: Scala - Size: 709 KB - Last synced at: 28 days ago - Pushed at: 28 days ago - Stars: 1 - Forks: 0
Wh1isper/sparglim
Sparglim✨ makes PySpark apps configurable and deploying a Spark Connect server easier!
Language: Python - Size: 159 KB - Last synced at: 18 days ago - Pushed at: 29 days ago - Stars: 38 - Forks: 4
harryprince/awesome-sparklyr
An awesome sparklyr related package collection
Size: 47.9 KB - Last synced at: 8 days ago - Pushed at: over 5 years ago - Stars: 42 - Forks: 7
ZuhairBhati/healthcare_dlt_pipeline
This project demonstrates a comprehensive Delta Live Tables (DLT) pipeline for healthcare data processing, implementing the Bronze-Silver-Gold medallion architecture. The pipeline processes patient admission data with real-time streaming capabilities, data quality enforcement, and multi-dimensional analytics.
Language: Jupyter Notebook - Size: 626 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0
JBris/time-series-airflow-kafka-spark
A simple demonstration of an Airflow-Kafka-Spark (AKS) stack for online time series forecasting.
Language: Python - Size: 699 KB - Last synced at: 13 days ago - Pushed at: over 1 year ago - Stars: 1 - Forks: 0
groda/big_data
Big Data essentials: Hadoop, MapReduce, Spark. Explore tutorials and demos in Jupyter notebooks—most are self-contained and live, ready to run with a click.
Language: Jupyter Notebook - Size: 62.5 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 85 - Forks: 27
SamuelBarbosaDev/Justweb_Technical_Test
This is a technical test for the mid-level Python Developer position.
Language: Jupyter Notebook - Size: 3.66 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0
smusab9152/Pyspark_programs_and_projects
Collection of PySpark programs and projects demonstrating the use of Apache Spark's Python API for big data processing and analysis. It includes practical implementations such as logistic regression classification, data analysis on the Iris dataset, and basic PySpark operations like temperature conversion.
Language: Jupyter Notebook - Size: 588 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0
kevinkyhalim/SparkCourse
Learnings of PySpark from Udemy
Language: Python - Size: 92.5 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0
bbenz/bbenz-azure-aca-rapids
This project demonstrates how to run GPU-accelerated XGBoost machine learning models using NVIDIA RAPIDS on Azure Container Apps. The implementation uses Apache Spark with RAPIDS acceleration to process the Agaricus mushroom dataset, train an XGBoost classification model to identify edible vs. poisonous mushrooms, and store vector embeddings
Language: Java - Size: 51.8 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 1
OmarAlhaz/Voltmart-Orders-Data-Cleaning
Data cleaning and preprocessing of Voltmart e-commerce orders using PySpark (DataFrame API & SQL). Includes feature engineering (time_of_day, purchase_state), data validation, and output in Parquet format for demand forecasting.
Language: Python - Size: 8.59 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0
ploomber/jupysql (fork of catherinedevlin/ipython-sql)
Better SQL in Jupyter. 📊
Language: Python - Size: 12.9 MB - Last synced at: about 1 month ago - Pushed at: 2 months ago - Stars: 813 - Forks: 79
martandsingh/ApacheSpark
This repository helps you learn Databricks concepts through examples. It covers the important topics a data engineer needs in real-life work, using PySpark and Spark SQL for development. The course ends with a few case studies.
Language: Python - Size: 143 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 102 - Forks: 66
newfront/delta-connect-playground
Learn to use delta-connect without any headaches.
Language: Jupyter Notebook - Size: 114 KB - Last synced at: about 1 month ago - Pushed at: about 2 months ago - Stars: 3 - Forks: 0
Orkhan-1/Full-Course-Apache-Spark
This comprehensive course is designed for beginners and experienced developers alike, providing an in-depth exploration of Apache Spark
Language: Java - Size: 7.81 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 1 - Forks: 0
letsiki/airflow-orchestrated-cleaning-and-hdfs-loading
[data engineer assessment] Airflow-orchestrated cleaning and loading of transactional data into hadoop-hdfs
Language: Python - Size: 13.7 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0
bumfo/spark-datasources
This project provides a DataSource V2 implementation for reading 4mc‐compressed text files into Spark DataFrames. It extends Spark’s FileDataSourceV2 and FileTable to leverage Spark’s distributed file discovery and partition pruning.
Language: Scala - Size: 139 KB - Last synced at: about 1 month ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0
dejarol/azure-search-spark
Azure AI Search connector for Spark
Language: Scala - Size: 1.63 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0
kriss024/Spark
Spark for Data Science and ETL process.
Language: Jupyter Notebook - Size: 78.1 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 2 - Forks: 0
camilesing/Hive-Spark-SQL-Helper-VSCode
Hive & Spark SQL extension for Visual Studio Code
Language: TypeScript - Size: 7.35 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 5 - Forks: 0
taylorteixeira/APACHE-SPARK-COM-DELTA-LAKE-E-APACHE-ICEBERG
Project developed to demonstrate local Apache Spark (pyspark) writing files in Delta Lake format, also locally, creating an ER model, images, and DDL code, and documenting and explaining the data source used (public data).
Language: Jupyter Notebook - Size: 742 KB - Last synced at: 2 months ago - Pushed at: 7 months ago - Stars: 2 - Forks: 1
Alex-Nettekoven/Climate-Data-Analysis
Climate Data Analysis on BigQuery with Spark and Data Visualization
Language: Jupyter Notebook - Size: 2.29 MB - Last synced at: 2 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0
streamnative/pulsar-spark
Spark Connector to read and write with Pulsar
Language: Scala - Size: 722 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 116 - Forks: 53
LearningJournal/Spark-Streaming-In-Python
Apache Spark 3 - Structured Streaming Course Material
Language: Python - Size: 19.4 MB - Last synced at: 2 months ago - Pushed at: about 2 years ago - Stars: 122 - Forks: 164
samuelmurguia/big-data-analytics-on-stocks-data
Enhanced profitability and research of stocks historical data using distributed system analytics.
Language: Jupyter Notebook - Size: 7.06 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0
hazecodeio/spark-sandbox
Language: Scala - Size: 13.2 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0
sunujh6/spark_practice
Language: Jupyter Notebook - Size: 1.62 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 2 - Forks: 0
Qbeast-io/qbeast-spark
Qbeast-spark: DataSource enabling multi-dimensional indexing and efficient data sampling. Big Data, free from the unnecessary!
Language: Scala - Size: 37.3 MB - Last synced at: 3 months ago - Pushed at: 10 months ago - Stars: 232 - Forks: 23
fealt/databricks_incremental_data_project
Databricks project showcasing incremental data ingestion with industry best practices.
Language: Jupyter Notebook - Size: 11.5 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0
BayoAdejare/lightning-streams
Batch/stream ETL pipeline of NOAA GLM dataset, using Python frameworks: Dagster, PySpark and Parquet storage.
Language: Python - Size: 63.4 MB - Last synced at: about 1 month ago - Pushed at: about 2 years ago - Stars: 5 - Forks: 0
anaregdesign/openaivec
Pandas extension, Tabular calculation with LLM, Spark UDF Builder
Language: Python - Size: 1.31 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 13 - Forks: 1
royungar/BigData_Employee_Insights_Project
This project simulates a mini data engineering pipeline using Apache Spark. It showcases structured data ingestion from CSV, schema enforcement, SQL interface creation, column transformations, aggregations, and join operations using PySpark — all designed to reflect real-world data processing workflows.
Language: Jupyter Notebook - Size: 35.2 KB - Last synced at: 3 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0
e1washere/weather-spark-pipeline
Scalable pipeline using Apache Spark to process and analyze weather data.
Language: Python - Size: 140 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0
thiagoneye/course-databricks
Databricks studies.
Size: 13 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 1 - Forks: 0
Smars-Bin-Hu/azure-cloud-datapipeline-EDA
A cloud-native data pipeline and visualization project analyzing Formula 1 racing data using Azure, Databricks, Delta Lake, Tableau, and Python for insightful EDA and interactive dashboards.
Language: Jupyter Notebook - Size: 26.4 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 3 - Forks: 0
izhangzhihao/Real-time-Data-Warehouse
Real-time Data Warehouse with Apache Flink & Apache Kafka & Apache Hudi
Language: Dockerfile - Size: 106 KB - Last synced at: 2 months ago - Pushed at: almost 2 years ago - Stars: 117 - Forks: 44
Dhaanesh26/spotify-analysis
Spotify listening trends analyzed and visualized using AWS cloud services
Language: Python - Size: 555 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0
nsphung/pyspark-template
A Python PySpark Project with Poetry
Language: Jupyter Notebook - Size: 98.6 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 23 - Forks: 4
Bashwan/data-engineer-notes
Explore my Data Engineer Bootcamp notes, featuring summaries, key concepts, and lab solutions. Perfect for mastering data engineering! 🚀📊
Language: Makefile - Size: 22.5 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0
donat-konan33/EtltAirbyteGcsBigQueryDbtAirflow
Deployment of a containerized ELT/ETLT architecture integrating Airbyte for data extraction, python for data processing, and bigquery-dbt for analytical modeling and transformation. This project enables experimentation with new data extraction and processing use cases using Airbyte and python, while ensuring portability through Docker
Language: Python - Size: 2.56 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0
josephmachado/advanced_spark_sql_for_data_engineers
Advanced Spark SQL for Data Engineers
Language: Jupyter Notebook - Size: 6.45 MB - Last synced at: 2 months ago - Pushed at: 5 months ago - Stars: 12 - Forks: 4
canluca09/real-estate-spark-analytics
Analyze real estate trends with our distributed big data framework using Apache Spark. Predict property prices and market dynamics effectively. 🏠✨
Language: Python - Size: 2.14 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0
debanjan06/real-estate-spark-analytics
Production-grade Apache Spark application for real estate analytics. Implements distributed ML (Linear Regression, Random Forest, GBT) with PySpark 3.5.5, processing 100K+ properties across major US cities.
Language: Python - Size: 2.14 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0
Sabal999/end-to-end-data-pipeline-acs
This repository showcases a robust end-to-end data pipeline for the American Community Survey dataset, utilizing tools like Python, SparkSQL, and Docker. 🚀 Explore the architecture that transforms raw data into valuable insights through a Bronze / Silver / Gold framework. 🐙
Language: Python - Size: 1.17 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0
GADES-DATAENG/mod3-spark
A repository containing all the necessary code and resources for an Apache Spark demo, showcasing data processing and transformation workflows with practical examples.
Language: Jupyter Notebook - Size: 425 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0
LXY-yq/UserActionAnalyze
E-commerce user behavior analysis data platform
Language: Java - Size: 4.88 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0
shreyasaxena29103/tmdb-movie-analysis-and-optimization
This project explores the TMDB movie dataset (1M+ records) using PySpark and Spark SQL in Databricks. It focuses on performance optimization and analysis
Language: HTML - Size: 164 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0
Sushmi08B/NYC-YELLOWTAXI-DATA-ETL
Production-scale ETL pipeline using PySpark and PostgreSQL to process 100M+ NYC Yellow Taxi trip records with full data validation, enrichment, and partitioned storage
Language: Python - Size: 1.04 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0
snigdhasv/Music_Mood_Analyzer
This Real-Time Music Mood Analyzer system processes streaming music data to classify songs by mood using audio features. It combines Apache Kafka for real-time streaming, Spark Streaming for mood classification, and MySQL for storage. A dashboard visualizes results, while batch processing reanalyzes historical data for accuracy.
Language: Python - Size: 441 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0
asuiu/SparkORM
ORM for Apache Spark and DataFrames schema manager
Language: Python - Size: 482 KB - Last synced at: about 1 month ago - Pushed at: over 1 year ago - Stars: 15 - Forks: 3
apache/kyuubi-docker
Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
Language: Dockerfile - Size: 30.3 KB - Last synced at: 15 days ago - Pushed at: 4 months ago - Stars: 13 - Forks: 7
s-yazhini/Hexa-DE-Main-Project
Data engineering main project 1
Language: Jupyter Notebook - Size: 15.5 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0
zekeriyyaa/PySpark-Structured-Streaming-ROS-Kafka-ApacheSpark-Cassandra
Structured streaming applied to robot data from a ROS-Gazebo simulation environment using Apache Spark. Data is collected in Kafka, analyzed by Apache Spark, and stored in Cassandra.
Language: Python - Size: 652 KB - Last synced at: 4 months ago - Pushed at: almost 4 years ago - Stars: 19 - Forks: 6
rvilla87/ETL-PySpark
ETL (Extract, Transform and Load) with the Spark Python API (PySpark) and Hadoop Distributed File System (HDFS)
Language: Jupyter Notebook - Size: 6.84 KB - Last synced at: 3 months ago - Pushed at: almost 7 years ago - Stars: 17 - Forks: 9
syedhassaanahmed/databricks-notebooks
Collection of Databricks and Jupyter Notebooks
Language: Jupyter Notebook - Size: 742 KB - Last synced at: about 2 months ago - Pushed at: over 1 year ago - Stars: 22 - Forks: 15
xReniar/US-used-cars-analysis
US used cars analysis with Map-Reduce, Hive, Spark core and Spark SQL
Language: Python - Size: 2.25 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0
minio/spark-select
A library for Spark DataFrame using MinIO Select API
Language: Scala - Size: 65.4 KB - Last synced at: 5 months ago - Pushed at: about 6 years ago - Stars: 98 - Forks: 19
sjrusso8/spark-connect-rs
Apache Spark Connect Client for Rust
Language: Rust - Size: 3.88 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 109 - Forks: 18
letsiki/end-to-end-data-pipeline-acs
End-to-end data pipeline for the ACS dataset using Python, PySpark, PostgreSQL, and Kubernetes (Bronze / Silver / Gold architecture).
Language: Python - Size: 1.17 MB - Last synced at: 4 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0
lmouhib/auto-register-spark-ui-k8s
A lightweight operator to automatically expose the Spark UI and manage its ingress when running Spark on Kubernetes
Language: Go - Size: 3.67 MB - Last synced at: 3 months ago - Pushed at: about 1 year ago - Stars: 4 - Forks: 0
Ren294/Log-Analysis-Project
This project builds a scalable log analytics pipeline using the Lambda architecture for real-time and batch processing of NASA server logs.
Language: Python - Size: 2.88 MB - Last synced at: 4 months ago - Pushed at: about 1 year ago - Stars: 6 - Forks: 1
Thanaraklee/Real-Time-PySpark
This project introduces PySpark, a powerful open-source framework for distributed data processing. We explore its architecture, components, and applications for real-time data analysis.
Language: Python - Size: 329 MB - Last synced at: 4 months ago - Pushed at: about 1 year ago - Stars: 33 - Forks: 17
leowheeler1/HomeSales
All work for the Module 22 Challenge, UofM Data Analytics Bootcamp
Language: Jupyter Notebook - Size: 1.33 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0
oeljeklaus-you/UserActionAnalyzePlatform
E-commerce user behavior analysis big data platform
Language: Java - Size: 1.26 MB - Last synced at: 6 months ago - Pushed at: almost 3 years ago - Stars: 1,029 - Forks: 386
zsvoboda/ngods-stocks
New Generation Opensource Data Stack Demo
Language: Jupyter Notebook - Size: 22.1 MB - Last synced at: 6 months ago - Pushed at: almost 3 years ago - Stars: 432 - Forks: 101
qubole/sparklens
Qubole Sparklens tool for performance tuning Apache Spark
Language: Scala - Size: 175 KB - Last synced at: 6 months ago - Pushed at: over 1 year ago - Stars: 575 - Forks: 142
majchandra/covid19-data-analysis-spark
Big Data & Machine Learning project with PySpark: analysis and clustering of COVID-19 cases worldwide (2020–2023).
Language: Jupyter Notebook - Size: 5.35 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0
rizkipragustono/data_analysis_spark
Exploration: Data Analysis using Spark
Language: Jupyter Notebook - Size: 8.79 KB - Last synced at: 2 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0
ludreinsalvador/gadgets-product-6850-model
Created a machine learning model that predicts whether a customer will purchase Product 6850 in January 2019 with an accuracy above 85% and a recall of at least 70%.
Language: Jupyter Notebook - Size: 4.88 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0
indix/sparkplug
Spark package to "plug" holes in data using SQL based rules ⚡️ 🔌
Language: Scala - Size: 503 KB - Last synced at: 5 months ago - Pushed at: over 5 years ago - Stars: 29 - Forks: 2
cuebook/cuelake
Use SQL to build ELT pipelines on a data lakehouse.
Language: JavaScript - Size: 28 MB - Last synced at: 6 months ago - Pushed at: over 3 years ago - Stars: 288 - Forks: 28
Neelka96/Home_Sales
DataViz Module 22 Big Data Challenge - Spark
Language: Jupyter Notebook - Size: 28.3 KB - Last synced at: 4 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0
igopalakrishna/nyc-subway-foot-traffic-prediction-and-forecasting
Designed and implemented a scalable real-time analytics pipeline using Apache Kafka, Spark Structured Streaming, and MongoDB to simulate NYC MTA turnstile data and forecast real-time subway foot traffic using SparkML Random Forest models.
Language: Jupyter Notebook - Size: 1.27 MB - Last synced at: 2 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 1
mananghetia/Healthcare-RCM
Language: Jupyter Notebook - Size: 0 Bytes - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0