GitHub topics: pyspark
shreyamalogi/Retail-Pipeline
Retail insights at cloud scale — 5M+ records analyzed with PySpark on GCP
Language: Python - Size: 12.7 KB - Last synced at: about 2 hours ago - Pushed at: about 11 hours ago - Stars: 12 - Forks: 0

ethanwebber123/Airflow-ETL-ELT
Airflow-ETL-ELT is a robust data pipeline tool that enables efficient extraction, transformation, loading, and orchestration of data workflows. It offers a scalable and customizable solution for managing complex ETL and ELT processes with ease.
Size: 1000 Bytes - Last synced at: about 12 hours ago - Pushed at: about 14 hours ago - Stars: 0 - Forks: 0

zyra121/advertising-sales-prediction
This repository showcases a linear regression analysis using the Advertising dataset, demonstrating both simple and multiple regression techniques in Python. It also features a custom implementation of Gradient Descent for a deeper understanding of the concepts. 🐱💻📊
Language: Python - Size: 1.33 MB - Last synced at: about 18 hours ago - Pushed at: about 19 hours ago - Stars: 0 - Forks: 0

muhd-minhaz/AWS-Glue--Data-Copy
A function for copying data (CSV, Parquet, Avro, etc.) from a source S3 bucket to a destination S3 bucket using AWS Glue. It includes the necessary setup for the Glue job, logging, reading data from the source bucket, and writing it to the destination bucket.
Language: Python - Size: 4.88 KB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 0 - Forks: 0

dashmug/glue-utils
Python library designed to enhance the developer experience when working with AWS Glue ETL and Python Shell jobs by reducing boilerplate code, increasing type safety, and improving IDE auto-completion.
Language: Python - Size: 762 KB - Last synced at: about 24 hours ago - Pushed at: 1 day ago - Stars: 8 - Forks: 2

Will436851/Raspberry-pi-OS-Documentation
Raspberry Pi OS installation and system call demonstration
Language: Shell - Size: 1.32 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 0 - Forks: 0

surajvm1/learningSWE
An end-to-end software engineering project built for learning purposes. Tech: Mobile dev, Web dev, Backend dev, Data Engineering, Docker. It's a skeleton project touching every domain to briefly understand the world of SWE. Will increase complexity as I find bandwidth.
Language: Python - Size: 28.3 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 1 - Forks: 1

apache/linkis
Apache Linkis builds a computation middleware layer to facilitate connection, governance and orchestration between the upper applications and the underlying data engines.
Language: Java - Size: 87.2 MB - Last synced at: about 19 hours ago - Pushed at: 4 days ago - Stars: 3,371 - Forks: 1,164

Ahm-rgb/Alpha-SQL
Official repository for the paper "Alpha-SQL: Zero-Shot Text-to-SQL using Monte Carlo Tree Search" [ICML'25]
Language: Python - Size: 787 KB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 0 - Forks: 1

ibis-project/ibis
the portable Python dataframe library
Language: Python - Size: 175 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 5,798 - Forks: 642

alexott/databricks-playground
Code samples, etc. for Databricks
Language: Python - Size: 202 KB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 64 - Forks: 18

microsoft/SynapseML
Simple and Distributed Machine Learning
Language: Scala - Size: 156 MB - Last synced at: 2 days ago - Pushed at: 3 days ago - Stars: 5,134 - Forks: 846

ivinnyaraujo/dataengineer-datascience-python
Data Engineer | Data Science | Azure | Fabric | Notebooks | Python | PySpark | R | More
Language: Jupyter Notebook - Size: 7.67 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 1 - Forks: 0

VladimirZelenokor1/Big-Data-Project---Predicting-Trip-Fares-with-Spark-Hive
A CRISP-DM–based big data pipeline for predicting NYC ride-sharing trip fares: ingesting 2024 TLC data via Sqoop into HDFS/Hive, performing ETL and feature engineering with Spark & PySpark, training and tuning Linear Regression & Gradient Boosted Tree models, and outlining end-to-end deployment.
Language: Java - Size: 906 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 1 - Forks: 0

opendp/tumult-analytics
Tumult Analytics is a Python library for privately computing aggregate queries on tabular data. It is built atop the Tumult Core library.
Language: Python - Size: 4.86 MB - Last synced at: 2 days ago - Pushed at: 3 days ago - Stars: 3 - Forks: 0

opentargets/gentropy
Open Targets python framework for post-GWAS analysis
Language: Jupyter Notebook - Size: 31.6 MB - Last synced at: 2 days ago - Pushed at: 3 days ago - Stars: 39 - Forks: 10

ego-creator/hepmassClassification
PySpark pipeline for particle classification in high-energy physics (HEPMASS dataset). Includes distributed preprocessing, model training (logistic regression, decision trees), evaluation, and key visualizations. Optimized for Hadoop/Spark.
Size: 1.95 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0

Zain970/ETL-Data-Pipline-Using-Apache-Airflow
Utilize Sqoop to import data from relational databases and ingest files from S3 buckets into HDFS. Apply complex transformations using Apache Spark to prepare data for analysis and reporting. Create and manage Hive tables for structured data storage and query optimization. Load processed data into HBase, making it accessible for various teams and app
Language: Python - Size: 3.91 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0

lakehq/sail
LakeSail's computation framework with a mission to unify batch processing, stream processing, and compute-intensive AI workloads.
Language: Rust - Size: 4.01 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 765 - Forks: 34

EmmanuelleTOCS/steam-game-recommender
A powerful recommendation system for Steam games, combining Content-Based and Collaborative Filtering techniques. Built with Python, Scikit-learn, and Streamlit to deliver accurate, real-time game recommendations. Perfect for gamers and data scientists interested in building intelligent recommendation engines.
Language: Jupyter Notebook - Size: 3.55 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0

mitchelllisle/sparkdantic
✨ A Pydantic to PySpark schema library
Language: Python - Size: 3.4 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 91 - Forks: 15

apache/incubator-graphar
An open source, standard data file format for graph data storage and retrieval.
Language: C++ - Size: 17.1 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 276 - Forks: 69

vitalibo/norma
Data validation via JSON schema
Language: Python - Size: 154 KB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 1 - Forks: 0

ndomah1/AWS-YouTube-Data-Analysis
Automated an AWS-native pipeline that ingests YouTube trending data and surfaces 30 B+ cross-regional views in a sub-second QuickSight dashboard for data-driven ad targeting.
Language: Python - Size: 420 KB - Last synced at: 3 days ago - Pushed at: 18 days ago - Stars: 0 - Forks: 0

AlexIoannides/pyspark-example-project
Implementing best practices for PySpark ETL jobs and applications.
Language: Python - Size: 769 KB - Last synced at: 3 days ago - Pushed at: over 2 years ago - Stars: 1,923 - Forks: 763

sparkdq-community/sparkdq
A declarative PySpark framework for row- and aggregate-level data quality validation.
Language: Python - Size: 7.43 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 46 - Forks: 5

Anirudh-rao/Machine-Learning-Pyspark
This repository covers the basics of using Spark for machine learning.
Language: Jupyter Notebook - Size: 4.66 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 0 - Forks: 0

vgrcontreras/mobility-app-anomalies-detector
A data engineering project that simulates and processes ride-hailing app data to detect potential fraud and operational anomalies using Pydantic for validation, MongoDB for storage, and Apache Airflow for orchestration. Ride data is generated in JSONL format and uploaded to AWS S3 as part of the ingestion pipeline.
Language: Python - Size: 139 KB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 0 - Forks: 0

Ayan007JBond/Sensor-Data-Analytics
This notebook showcases a complete machine learning workflow—from data preprocessing to model evaluation—for a binary classification task. It includes key techniques like feature scaling, handling class imbalance, and threshold tuning to improve prediction accuracy.
Language: Jupyter Notebook - Size: 10.2 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 0 - Forks: 0

Kvj-Harsha/earthquakepipeline
Azure df + databricks + synapse + pipeline + powerbi
Language: Jupyter Notebook - Size: 98.6 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 0 - Forks: 0

jupyter-incubator/sparkmagic
Jupyter magics and kernels for working with remote Spark clusters
Language: Python - Size: 2.68 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 1,355 - Forks: 452

jelmerk/hnswlib
Java library for approximate nearest neighbors search using Hierarchical Navigable Small World graphs
Language: Java - Size: 1.23 MB - Last synced at: 4 days ago - Pushed at: 2 months ago - Stars: 274 - Forks: 56

GoogleCloudPlatform/dataproc-templates
Dataproc templates and pipelines for solving in-cloud data tasks
Language: Python - Size: 18.6 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 128 - Forks: 99

Subhajit-Chowdhury/RAW-Spotify-Data-into-Insights-with-AWS
Unlocking Spotify insights with an AWS data pipeline: S3 data lake -> Glue ETL-> Athena queries -> QuickSight Dashboard
Language: Python - Size: 253 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 1 - Forks: 0

HuuVuong0912/rag-llm-based-recommender
Explore a smarter way to shop online with this full-stack project, built on Google Cloud Platform (GCP) infrastructure, for RAG-based e-commerce with an LLM.
Language: TypeScript - Size: 4.2 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 1 - Forks: 0

JuanLara18/Distributed-Translation-System
High-performance distributed translation system for large multilingual datasets using PySpark and OpenAI. Supports caching, checkpointing, and metadata-preserving Stata translation.
Language: Python - Size: 276 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 1 - Forks: 0

mafew98/StreamAnalyzer
The aim of this project is to query a source of streaming data and run analytics on it. To achieve this, Kafka handles the streaming data, PySpark analyzes it, and Logstash/Kibana/OpenSearch visualize it.
Language: Python - Size: 2.78 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 0 - Forks: 0

ev2900/Glue_Examples
PySpark code samples designed for AWS Glue
Language: Python - Size: 53.7 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 1 - Forks: 0

slevine/pyspark-pandas-vs-pandas
Dataframe Performance Comparison - Polars, Pandas on Spark, and Pandas
Language: Jupyter Notebook - Size: 351 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 3 - Forks: 0

mmooney512/kvk
Loading files into delta table warehouse
Language: Python - Size: 1.24 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 0 - Forks: 0

astrolabsoftware/fink-filters
Define your filters to create your alert stream in Fink!
Language: Python - Size: 63.7 MB - Last synced at: 1 day ago - Pushed at: 6 days ago - Stars: 1 - Forks: 9

Flixteu356/BigData-Architecture
Big Data system predicts pandemic risk (COVID-19) via data analysis, ML modeling, and real-time dashboard.
Language: Python - Size: 35.2 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 0 - Forks: 0

jadianes/spark-py-notebooks
Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks
Language: Jupyter Notebook - Size: 2.2 MB - Last synced at: 3 days ago - Pushed at: about 1 year ago - Stars: 1,652 - Forks: 917

PAFOULEH5/Loan-Default-Prediction
A machine learning project to predict loan default risk using financial and credit history data. Built as a team capstone project for a master's degree at Deakin University.
Language: HTML - Size: 1.6 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 0 - Forks: 0

Impesud/ai-mlops-project
Production-ready MLOps pipeline with PySpark, MLflow, and LLM-based data reporting.
Language: Python - Size: 65.5 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 0 - Forks: 0

sb-ai-lab/RePlay
A Comprehensive Framework for Building End-to-End Recommendation Systems with State-of-the-Art Models
Language: Python - Size: 36.1 MB - Last synced at: 4 days ago - Pushed at: 5 days ago - Stars: 337 - Forks: 34

databrickslabs/dbldatagen
Generate relevant synthetic data quickly for your projects. The Databricks Labs synthetic data generator (aka `dbldatagen`) may be used to generate large simulated / synthetic data sets for test, POCs, and other uses in Databricks environments including in Delta Live Tables pipelines
Language: Python - Size: 11.1 MB - Last synced at: about 17 hours ago - Pushed at: 20 days ago - Stars: 406 - Forks: 73

mrpowers-io/quinn
pyspark methods to enhance developer productivity 📣 👯 🎉
Language: Python - Size: 1.98 MB - Last synced at: 4 days ago - Pushed at: 3 months ago - Stars: 672 - Forks: 98

MauricioFBL/mktg-capstones
This repository contains end-to-end data analytics pipelines, leveraging AWS and Airflow for orchestration. The project enables data integration, storage, transformation, and visualization.
Language: Jupyter Notebook - Size: 1.46 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 2 - Forks: 0

mahmoudparsian/pyspark-tutorial
PySpark-Tutorial provides basic algorithms using PySpark
Language: Jupyter Notebook - Size: 8.96 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 1,223 - Forks: 478

anonymo2239/Big-Data-Churn-Analyzer
Scalable customer churn prediction using PySpark. Includes EDA, feature engineering, modeling, and real-time inference on new data.
Language: Jupyter Notebook - Size: 1.47 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 0 - Forks: 0

JohnSnowLabs/spark-nlp
State of the Art Natural Language Processing
Language: Scala - Size: 3.45 GB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 3,982 - Forks: 726

logicalclocks/hopsworks
Hopsworks - Data-Intensive AI platform with a Feature Store
Language: Java - Size: 152 MB - Last synced at: 4 days ago - Pushed at: 4 months ago - Stars: 1,225 - Forks: 151

alexott/cyber-spark-data-connectors
Cybersecurity-related custom data connectors for Spark
Language: Python - Size: 589 KB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 1 - Forks: 0

narwhals-dev/narwhals
Lightweight and extensible compatibility layer between dataframe libraries!
Language: Python - Size: 9.71 MB - Last synced at: 7 days ago - Pushed at: 8 days ago - Stars: 1,025 - Forks: 143

arverma/TowardsDataEngineering
This repo contains commands that data engineers use in day to day work.
Language: Python - Size: 81.1 KB - Last synced at: 1 day ago - Pushed at: over 2 years ago - Stars: 61 - Forks: 43

ceyhun262613/thesis-sentiment-analysis
Greek Twitter Sentiment Analysis Using BERT and Machine Learning. This project applies BERT (Greek uncased model) alongside traditional machine learning models (Logistic Regression, Naive Bayes) to perform sentiment classification on Greek Twitter data.
Language: Python - Size: 1.32 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 1 - Forks: 0

raphaelauv/docker-pyspark
slim pyspark docker image
Language: Dockerfile - Size: 12.7 KB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 2 - Forks: 0

jelmerk/hnswlib-spark
Spark library for approximate nearest neighbors search using Hierarchical Navigable Small World graphs
Language: Scala - Size: 1.34 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 12 - Forks: 2

BassBoulder/PythonPySpark
Central location for nifty Python & PySpark variant scripts
Language: Jupyter Notebook - Size: 1.15 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 0 - Forks: 0

guidok91/spark-structured-streaming-kafka
Spark Structured Streaming data pipeline that processes movie ratings data in real-time.
Language: Python - Size: 192 KB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 13 - Forks: 4

Hazim-HF/Data-Management
This repository covers data management and big data technologies, including databases, querying, and big data processing. Topics include Hadoop (MapReduce, HDFS), Apache Spark, data security, and optimization techniques. Students will learn Spark’s architecture, data distribution, parallel computing, and memory caching to enhance big data solutions
Language: Jupyter Notebook - Size: 42.8 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 0 - Forks: 0

commoncrawl/cc-pyspark
Process Common Crawl data with Python and Spark
Language: Python - Size: 157 KB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 430 - Forks: 89

guidok91/spark-movies-etl
Spark data pipeline that processes movie ratings data.
Language: Python - Size: 3.88 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 28 - Forks: 12

woonstadrotterdam/pyspark-testframework
Quickly create data quality tests for PySpark DataFrames.
Language: Python - Size: 149 KB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 11 - Forks: 3

mihirchhiber/Network-Intrusion-Detector
Network Intrusion Detector is a distributed intrusion detection system built with PySpark. It preprocesses, encodes, and models network traffic data to detect anomalies using a Random Forest classifier, achieving high accuracy and efficiency through feature selection and scalable data processing. The system is suitable for large-scale environments
Language: Jupyter Notebook - Size: 860 KB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 0 - Forks: 0

AdityaSreevatsaK/PySpark-Pipeline
A collection of PySpark projects showcasing scalable data processing, transformation pipelines, and big data analytics using Apache Spark.
Language: Jupyter Notebook - Size: 1.66 MB - Last synced at: 7 days ago - Pushed at: 10 days ago - Stars: 0 - Forks: 0

ptyadana/SQL-Data-Analysis-and-Visualization-Projects
SQL data analysis & visualization projects using MySQL, PostgreSQL, SQLite, Tableau, Apache Spark and pySpark.
Language: Jupyter Notebook - Size: 35.1 MB - Last synced at: 10 days ago - Pushed at: almost 3 years ago - Stars: 1,436 - Forks: 553

canimus/cuallee
Possibly the fastest DataFrame-agnostic quality check library in town.
Language: Python - Size: 2.29 MB - Last synced at: 8 days ago - Pushed at: 13 days ago - Stars: 190 - Forks: 21

isabekov/pyspark-cookbook
A collection of useful copy-pasteable PySpark code snippets with output
Language: CSS - Size: 538 KB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 0 - Forks: 0

iobruno/data-engineering-examples
Data Engineering examples for Airflow, Prefect; dbt for BigQuery, Redshift, ClickHouse, Postgres, DuckDB; PySpark for Batch processing; Kafka for Stream processing
Language: Python - Size: 5.02 MB - Last synced at: 9 days ago - Pushed at: 3 months ago - Stars: 65 - Forks: 2

Wb-az/MLib-PySpark-SoundLevel-Prediction
Creates an ML pipeline leveraging PySpark SQL and PySpark MLlib to predict sound level
Language: Jupyter Notebook - Size: 972 KB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 0 - Forks: 0

majchandra/covid19-data-analysis-spark
Big Data & Machine Learning project with PySpark: analysis and clustering of COVID-19 cases worldwide (2020–2023).
Language: Jupyter Notebook - Size: 5.35 MB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 0 - Forks: 0

G-Research/spark-extension
A library that provides useful extensions to Apache Spark and PySpark.
Language: Scala - Size: 1.1 MB - Last synced at: 9 days ago - Pushed at: 2 months ago - Stars: 224 - Forks: 28

maltzsama/sumeh
Sumeh is a unified data quality validation framework supporting multiple backends (PySpark, Dask, Polars, DuckDB, Pandas) with centralized rule configuration.
Language: Python - Size: 1.61 MB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 2 - Forks: 0

josephmachado/efficient_data_processing_spark
Code for "Efficient Data Processing in Spark" Course
Language: Python - Size: 23.9 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 304 - Forks: 63

shauryashaurya/learn-data-munging
Notes on Data Engineering with Pandas, PySpark, Dask, Ray, Arrow DataFusion, Polars etc.
Language: Jupyter Notebook - Size: 627 MB - Last synced at: 9 days ago - Pushed at: about 2 months ago - Stars: 47 - Forks: 21

astrolabsoftware/fink-science
Define your science modules to add values to Fink alerts!
Language: Jupyter Notebook - Size: 747 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 12 - Forks: 17

rizkipragustono/data_analysis_spark
Exploration: Data Analysis using Spark
Language: Jupyter Notebook - Size: 8.79 KB - Last synced at: 2 days ago - Pushed at: 12 days ago - Stars: 0 - Forks: 0

lykmapipo/Python-Spark-Log-Analysis
Python scripts to process, and analyze log files using PySpark.
Language: Python - Size: 131 KB - Last synced at: about 6 hours ago - Pushed at: 11 months ago - Stars: 6 - Forks: 0

ayshanbagir/pyspark_Movies
PySpark project using DataFrames
Language: Jupyter Notebook - Size: 1.11 MB - Last synced at: 11 days ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 2

capitalone/datacompy
Pandas, Polars, Spark, and Snowpark DataFrame comparison for humans and more!
Language: Python - Size: 11.5 MB - Last synced at: 12 days ago - Pushed at: 16 days ago - Stars: 571 - Forks: 141

SathyaV99/hadoop-spark-traffic-predictor-toronto
🚦 Toronto traffic prediction with Apache Spark, Hadoop, and SparkML, using a Random Forest model for prediction.
Language: Jupyter Notebook - Size: 31.9 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 0 - Forks: 0

davidzajac1/zillacode
Open Source LeetCode for PySpark, Spark, Pandas and DBT/Snowflake
Language: Python - Size: 2.68 MB - Last synced at: 8 days ago - Pushed at: 4 months ago - Stars: 177 - Forks: 38

datamole-ai/pysparkdt
An open-source Python library for simplifying local testing of Databricks workflows that use PySpark and Delta tables.
Language: Python - Size: 57.6 KB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 28 - Forks: 0

Nike-Inc/koheesio
Python framework for building efficient data pipelines. It promotes modularity and collaboration, enabling the creation of complex pipelines from simple, reusable components.
Language: Python - Size: 7.96 MB - Last synced at: 12 days ago - Pushed at: 16 days ago - Stars: 638 - Forks: 36

fabricks-framework/fabricks
Language: Python - Size: 1.25 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 5 - Forks: 0

quintoandar/butterfree
A tool for building feature stores.
Language: Python - Size: 1.43 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 302 - Forks: 37

lyhue1991/eat_pyspark_in_10_days
pyspark🍒🥭 is delicious, just eat it!😋😋
Language: Python - Size: 4.66 MB - Last synced at: 9 days ago - Pushed at: over 2 years ago - Stars: 802 - Forks: 217

MrPowers/mack
Delta Lake helper methods in PySpark
Language: Python - Size: 2.81 MB - Last synced at: 10 days ago - Pushed at: 9 months ago - Stars: 325 - Forks: 42

MrPowers/chispa
PySpark test helper methods with beautiful error messages
Language: Python - Size: 3.6 MB - Last synced at: 12 days ago - Pushed at: about 2 months ago - Stars: 692 - Forks: 72

microsoft/Azure-Databricks-NYC-Taxi-Workshop
An Azure Databricks workshop leveraging the New York Taxi and Limousine Commission Trip Records dataset
Language: Scala - Size: 42.3 MB - Last synced at: 4 days ago - Pushed at: about 2 years ago - Stars: 109 - Forks: 108

ev2900/Glue_Aggregate_Small_Files
PySpark script to aggregate small parquet files in a prefix into larger files. Designed to be run on AWS Glue
Language: Python - Size: 133 KB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 1 - Forks: 0

leoaugustosv/brl-prod-prices
An ETL for Brazilian e-commerce sites.
Language: Python - Size: 65.4 KB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 1 - Forks: 0

tirthajyoti/Spark-with-Python
Fundamentals of Spark with Python (using PySpark), code examples
Language: Jupyter Notebook - Size: 8.97 MB - Last synced at: 8 days ago - Pushed at: over 2 years ago - Stars: 347 - Forks: 271

Kaushik-Puttaswamy/Real-Time_Data_Streaming_Project
The project aims to create a real-time data application using Apache Kafka and Spark for stream processing, with data sourced from Kaggle and stored in Cassandra, prioritizing reliability, scalability, and security.
Language: Python - Size: 227 KB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 1 - Forks: 0

nessi-dev/nessi
A Python-based data processing and analysis tool built with PySpark and Delta Lake
Language: Go - Size: 12.7 MB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 0 - Forks: 0

hi-primus/optimus
🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark
Language: Python - Size: 110 MB - Last synced at: 5 days ago - Pushed at: 6 months ago - Stars: 1,512 - Forks: 233

Yi-Pin-123/Pyspark-Notes (fork of LinkedInLearning/high-performance-pyspark-advanced-strategies-for-optimal-data-processing-3919191)
This repository contains some PySpark code.
Language: Python - Size: 421 MB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 0 - Forks: 0

sbl-sdsc/mmtf-pyspark
Methods for the parallel and distributed analysis and mining of the Protein Data Bank using MMTF and Apache Spark.
Language: Python - Size: 524 MB - Last synced at: 15 days ago - Pushed at: about 2 years ago - Stars: 67 - Forks: 27
