An open API service providing repository metadata for many open source software ecosystems.

GitHub topics: pyspark

shreyamalogi/Retail-Pipeline

Retail insights at cloud scale — 5M+ records analyzed with PySpark on GCP

Language: Python - Size: 12.7 KB - Last synced at: about 2 hours ago - Pushed at: about 11 hours ago - Stars: 12 - Forks: 0

ethanwebber123/Airflow-ETL-ELT

Airflow-ETL-ELT is a robust data pipeline tool that enables efficient extraction, transformation, loading, and orchestration of data workflows. It offers a scalable and customizable solution for managing complex ETL and ELT processes with ease.

Size: 1000 Bytes - Last synced at: about 12 hours ago - Pushed at: about 14 hours ago - Stars: 0 - Forks: 0

zyra121/advertising-sales-prediction

This repository showcases a linear regression analysis using the Advertising dataset, demonstrating both simple and multiple regression techniques in Python. It also features a custom implementation of Gradient Descent for a deeper understanding of the concepts. 🐱💻📊

Language: Python - Size: 1.33 MB - Last synced at: about 18 hours ago - Pushed at: about 19 hours ago - Stars: 0 - Forks: 0

muhd-minhaz/AWS-Glue--Data-Copy

A function for copying data (CSV, Parquet, Avro, etc.) from a source S3 bucket to a destination S3 bucket using AWS Glue. It includes the necessary setup for the Glue job, logging, reading data from the source bucket, and writing it to the destination bucket.

Language: Python - Size: 4.88 KB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 0 - Forks: 0

dashmug/glue-utils

Python library designed to enhance the developer experience when working with AWS Glue ETL and Python Shell jobs by reducing boilerplate code, increasing type safety, and improving IDE auto-completion.

Language: Python - Size: 762 KB - Last synced at: about 24 hours ago - Pushed at: 1 day ago - Stars: 8 - Forks: 2

Will436851/Raspberry-pi-OS-Documentation

Raspberry Pi OS installation and system call demonstration

Language: Shell - Size: 1.32 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 0 - Forks: 0

surajvm1/learningSWE

An end-to-end software engineering project built for learning purposes. Tech: Mobile dev, Web dev, Backend dev, Data Engineering, Docker. It's a skeleton project touching every domain to briefly understand the world of SWE. Will increase complexity as I find bandwidth.

Language: Python - Size: 28.3 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 1 - Forks: 1

apache/linkis

Apache Linkis builds a computation middleware layer to facilitate connection, governance and orchestration between the upper applications and the underlying data engines.

Language: Java - Size: 87.2 MB - Last synced at: about 19 hours ago - Pushed at: 4 days ago - Stars: 3,371 - Forks: 1,164

Ahm-rgb/Alpha-SQL

Official repository for the paper "Alpha-SQL: Zero-Shot Text-to-SQL using Monte Carlo Tree Search" [ICML'25]

Language: Python - Size: 787 KB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 0 - Forks: 1

ibis-project/ibis

the portable Python dataframe library

Language: Python - Size: 175 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 5,798 - Forks: 642

alexott/databricks-playground

Code samples, etc. for Databricks

Language: Python - Size: 202 KB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 64 - Forks: 18

microsoft/SynapseML

Simple and Distributed Machine Learning

Language: Scala - Size: 156 MB - Last synced at: 2 days ago - Pushed at: 3 days ago - Stars: 5,134 - Forks: 846

ivinnyaraujo/dataengineer-datascience-python

Data Engineer | Data Science | Azure | Fabric | Notebooks | Python | PySpark | R | More

Language: Jupyter Notebook - Size: 7.67 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 1 - Forks: 0

VladimirZelenokor1/Big-Data-Project---Predicting-Trip-Fares-with-Spark-Hive

A CRISP-DM–based big data pipeline for predicting NYC ride-sharing trip fares: ingesting 2024 TLC data via Sqoop into HDFS/Hive, performing ETL and feature engineering with Spark & PySpark, training and tuning Linear Regression & Gradient Boosted Tree models, and outlining end-to-end deployment.

Language: Java - Size: 906 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 1 - Forks: 0

opendp/tumult-analytics

Tumult Analytics is a Python library for privately computing aggregate queries on tabular data. It is built atop the Tumult Core library.

Language: Python - Size: 4.86 MB - Last synced at: 2 days ago - Pushed at: 3 days ago - Stars: 3 - Forks: 0

opentargets/gentropy

Open Targets python framework for post-GWAS analysis

Language: Jupyter Notebook - Size: 31.6 MB - Last synced at: 2 days ago - Pushed at: 3 days ago - Stars: 39 - Forks: 10

ego-creator/hepmassClassification

PySpark pipeline for particle classification in high-energy physics (HEPMASS dataset). Includes distributed preprocessing, model training (logistic regression, decision trees), evaluation, and key visualizations. Optimized for Hadoop/Spark.

Size: 1.95 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0

Zain970/ETL-Data-Pipline-Using-Apache-Airflow

Utilize Sqoop to import data from relational databases and ingest files from S3 buckets into HDFS. Apply complex transformations using Apache Spark to prepare data for analysis and reporting. Create and manage Hive tables for structured data storage and query optimization. Load processed data into HBase, making it accessible for various teams and app

Language: Python - Size: 3.91 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0

lakehq/sail

LakeSail's computation framework with a mission to unify batch processing, stream processing, and compute-intensive AI workloads.

Language: Rust - Size: 4.01 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 765 - Forks: 34

EmmanuelleTOCS/steam-game-recommender

A powerful recommendation system for Steam games, combining Content-Based and Collaborative Filtering techniques. Built with Python, Scikit-learn, and Streamlit to deliver accurate, real-time game recommendations. Perfect for gamers and data scientists interested in building intelligent recommendation engines.

Language: Jupyter Notebook - Size: 3.55 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0

mitchelllisle/sparkdantic

✨ A Pydantic to PySpark schema library

Language: Python - Size: 3.4 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 91 - Forks: 15

apache/incubator-graphar

An open source, standard data file format for graph data storage and retrieval.

Language: C++ - Size: 17.1 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 276 - Forks: 69

vitalibo/norma

Data validation via JSON schema

Language: Python - Size: 154 KB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 1 - Forks: 0

ndomah1/AWS-YouTube-Data-Analysis

Automated an AWS-native pipeline that ingests YouTube trending data and surfaces 30 B+ cross-regional views in a sub-second QuickSight dashboard for data-driven ad targeting.

Language: Python - Size: 420 KB - Last synced at: 3 days ago - Pushed at: 18 days ago - Stars: 0 - Forks: 0

AlexIoannides/pyspark-example-project

Implementing best practices for PySpark ETL jobs and applications.

Language: Python - Size: 769 KB - Last synced at: 3 days ago - Pushed at: over 2 years ago - Stars: 1,923 - Forks: 763

sparkdq-community/sparkdq

A declarative PySpark framework for row- and aggregate-level data quality validation.

Language: Python - Size: 7.43 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 46 - Forks: 5

Anirudh-rao/Machine-Learning-Pyspark

This repository covers all the basics of using Spark for machine learning.

Language: Jupyter Notebook - Size: 4.66 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 0 - Forks: 0

vgrcontreras/mobility-app-anomalies-detector

A data engineering project that simulates and processes ride-hailing app data to detect potential fraud and operational anomalies using Pydantic for validation, MongoDB for storage, and Apache Airflow for orchestration. Ride data is generated in JSONL format and uploaded to AWS S3 as part of the ingestion pipeline.

Language: Python - Size: 139 KB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 0 - Forks: 0

Ayan007JBond/Sensor-Data-Analytics

This notebook showcases a complete machine learning workflow—from data preprocessing to model evaluation—for a binary classification task. It includes key techniques like feature scaling, handling class imbalance, and threshold tuning to improve prediction accuracy.

Language: Jupyter Notebook - Size: 10.2 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 0 - Forks: 0

Kvj-Harsha/earthquakepipeline

Azure df + databricks + synapse + pipeline + powerbi

Language: Jupyter Notebook - Size: 98.6 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 0 - Forks: 0

jupyter-incubator/sparkmagic

Jupyter magics and kernels for working with remote Spark clusters

Language: Python - Size: 2.68 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 1,355 - Forks: 452

jelmerk/hnswlib

Java library for approximate nearest neighbors search using Hierarchical Navigable Small World graphs

Language: Java - Size: 1.23 MB - Last synced at: 4 days ago - Pushed at: 2 months ago - Stars: 274 - Forks: 56

GoogleCloudPlatform/dataproc-templates

Dataproc templates and pipelines for solving in-cloud data tasks

Language: Python - Size: 18.6 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 128 - Forks: 99

Subhajit-Chowdhury/RAW-Spotify-Data-into-Insights-with-AWS

Unlocking Spotify insights with an AWS data pipeline: S3 data lake -> Glue ETL-> Athena queries -> QuickSight Dashboard

Language: Python - Size: 253 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 1 - Forks: 0

HuuVuong0912/rag-llm-based-recommender

Explore a smarter way to shop online with this full-stack project built on the infrastructure of Google Cloud Platform (GCP) for RAG based e-commerce with LLM.

Language: TypeScript - Size: 4.2 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 1 - Forks: 0

JuanLara18/Distributed-Translation-System

High-performance distributed translation system for large multilingual datasets using PySpark and OpenAI. Supports caching, checkpointing, and metadata-preserving Stata translation.

Language: Python - Size: 276 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 1 - Forks: 0

mafew98/StreamAnalyzer

The aim of this project is to query a source of streaming data and run analytics on it. To achieve this, Kafka handles the streaming data, PySpark analyzes it, and Logstash/Kibana/OpenSearch visualize it.

Language: Python - Size: 2.78 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 0 - Forks: 0

ev2900/Glue_Examples

PySpark code samples designed for AWS Glue

Language: Python - Size: 53.7 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 1 - Forks: 0

slevine/pyspark-pandas-vs-pandas

Dataframe Performance Comparison - Polars, Pandas on Spark, and Pandas

Language: Jupyter Notebook - Size: 351 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 3 - Forks: 0

mmooney512/kvk

Loading files into delta table warehouse

Language: Python - Size: 1.24 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 0 - Forks: 0

astrolabsoftware/fink-filters

Define your filters to create your alert stream in Fink!

Language: Python - Size: 63.7 MB - Last synced at: 1 day ago - Pushed at: 6 days ago - Stars: 1 - Forks: 9

Flixteu356/BigData-Architecture

Big Data system predicts pandemic risk (COVID-19) via data analysis, ML modeling, and real-time dashboard.

Language: Python - Size: 35.2 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 0 - Forks: 0

jadianes/spark-py-notebooks

Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks

Language: Jupyter Notebook - Size: 2.2 MB - Last synced at: 3 days ago - Pushed at: about 1 year ago - Stars: 1,652 - Forks: 917

PAFOULEH5/Loan-Default-Prediction

A machine learning project to predict loan default risk using financial and credit history data. Built as part of a team capstone project for a master's degree at Deakin University.

Language: HTML - Size: 1.6 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 0 - Forks: 0

Impesud/ai-mlops-project

Production-ready MLOps pipeline with PySpark, MLflow, and LLM-based data reporting.

Language: Python - Size: 65.5 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 0 - Forks: 0

sb-ai-lab/RePlay

A Comprehensive Framework for Building End-to-End Recommendation Systems with State-of-the-Art Models

Language: Python - Size: 36.1 MB - Last synced at: 4 days ago - Pushed at: 5 days ago - Stars: 337 - Forks: 34

databrickslabs/dbldatagen

Generate relevant synthetic data quickly for your projects. The Databricks Labs synthetic data generator (aka `dbldatagen`) may be used to generate large simulated / synthetic data sets for test, POCs, and other uses in Databricks environments including in Delta Live Tables pipelines

Language: Python - Size: 11.1 MB - Last synced at: about 17 hours ago - Pushed at: 20 days ago - Stars: 406 - Forks: 73

mrpowers-io/quinn

pyspark methods to enhance developer productivity 📣 👯 🎉

Language: Python - Size: 1.98 MB - Last synced at: 4 days ago - Pushed at: 3 months ago - Stars: 672 - Forks: 98

MauricioFBL/mktg-capstones

This repository contains end-to-end data analytics pipelines, leveraging AWS and Airflow for orchestration. The project enables data integration, storage, transformation, and visualization.

Language: Jupyter Notebook - Size: 1.46 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 2 - Forks: 0

mahmoudparsian/pyspark-tutorial

PySpark-Tutorial provides basic algorithms using PySpark

Language: Jupyter Notebook - Size: 8.96 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 1,223 - Forks: 478

anonymo2239/Big-Data-Churn-Analyzer

Scalable customer churn prediction using PySpark. Includes EDA, feature engineering, modeling, and real-time inference on new data.

Language: Jupyter Notebook - Size: 1.47 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 0 - Forks: 0

JohnSnowLabs/spark-nlp

State of the Art Natural Language Processing

Language: Scala - Size: 3.45 GB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 3,982 - Forks: 726

logicalclocks/hopsworks

Hopsworks - Data-Intensive AI platform with a Feature Store

Language: Java - Size: 152 MB - Last synced at: 4 days ago - Pushed at: 4 months ago - Stars: 1,225 - Forks: 151

alexott/cyber-spark-data-connectors

Cybersecurity-related custom data connectors for Spark

Language: Python - Size: 589 KB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 1 - Forks: 0

narwhals-dev/narwhals

Lightweight and extensible compatibility layer between dataframe libraries!

Language: Python - Size: 9.71 MB - Last synced at: 7 days ago - Pushed at: 8 days ago - Stars: 1,025 - Forks: 143

arverma/TowardsDataEngineering

This repo contains commands that data engineers use in day to day work.

Language: Python - Size: 81.1 KB - Last synced at: 1 day ago - Pushed at: over 2 years ago - Stars: 61 - Forks: 43

ceyhun262613/thesis-sentiment-analysis

Greek Twitter Sentiment Analysis Using BERT and Machine Learning. This project applies BERT (Greek uncased model) alongside traditional machine learning models (Logistic Regression, Naive Bayes) to perform sentiment classification on Greek Twitter data.

Language: Python - Size: 1.32 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 1 - Forks: 0

raphaelauv/docker-pyspark

slim pyspark docker image

Language: Dockerfile - Size: 12.7 KB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 2 - Forks: 0

jelmerk/hnswlib-spark

Spark library for approximate nearest neighbors search using Hierarchical Navigable Small World graphs

Language: Scala - Size: 1.34 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 12 - Forks: 2

BassBoulder/PythonPySpark

Central location for nifty Python & PySpark variant scripts

Language: Jupyter Notebook - Size: 1.15 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 0 - Forks: 0

guidok91/spark-structured-streaming-kafka

Spark Structured Streaming data pipeline that processes movie ratings data in real-time.

Language: Python - Size: 192 KB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 13 - Forks: 4

Hazim-HF/Data-Management

This repository covers data management and big data technologies, including databases, querying, and big data processing. Topics include Hadoop (MapReduce, HDFS), Apache Spark, data security, and optimization techniques. Students will learn Spark’s architecture, data distribution, parallel computing, and memory caching to enhance big data solutions

Language: Jupyter Notebook - Size: 42.8 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 0 - Forks: 0

commoncrawl/cc-pyspark

Process Common Crawl data with Python and Spark

Language: Python - Size: 157 KB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 430 - Forks: 89

guidok91/spark-movies-etl

Spark data pipeline that processes movie ratings data.

Language: Python - Size: 3.88 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 28 - Forks: 12

woonstadrotterdam/pyspark-testframework

Quickly create data quality tests for PySpark DataFrames.

Language: Python - Size: 149 KB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 11 - Forks: 3

mihirchhiber/Network-Intrusion-Detector

Network Intrusion Detector is a distributed intrusion detection system built with PySpark. It preprocesses, encodes, and models network traffic data to detect anomalies using a Random Forest classifier, achieving high accuracy and efficiency through feature selection and scalable data processing. The system is suitable for large-scale environments

Language: Jupyter Notebook - Size: 860 KB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 0 - Forks: 0

AdityaSreevatsaK/PySpark-Pipeline

A collection of PySpark projects showcasing scalable data processing, transformation pipelines, and big data analytics using Apache Spark.

Language: Jupyter Notebook - Size: 1.66 MB - Last synced at: 7 days ago - Pushed at: 10 days ago - Stars: 0 - Forks: 0

ptyadana/SQL-Data-Analysis-and-Visualization-Projects

SQL data analysis & visualization projects using MySQL, PostgreSQL, SQLite, Tableau, Apache Spark and pySpark.

Language: Jupyter Notebook - Size: 35.1 MB - Last synced at: 10 days ago - Pushed at: almost 3 years ago - Stars: 1,436 - Forks: 553

canimus/cuallee

Possibly the fastest DataFrame-agnostic quality check library in town.

Language: Python - Size: 2.29 MB - Last synced at: 8 days ago - Pushed at: 13 days ago - Stars: 190 - Forks: 21

isabekov/pyspark-cookbook

A collection of useful copy-pasteable PySpark code snippets with output

Language: CSS - Size: 538 KB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 0 - Forks: 0

iobruno/data-engineering-examples

Data Engineering examples for Airflow, Prefect; dbt for BigQuery, Redshift, ClickHouse, Postgres, DuckDB; PySpark for Batch processing; Kafka for Stream processing

Language: Python - Size: 5.02 MB - Last synced at: 9 days ago - Pushed at: 3 months ago - Stars: 65 - Forks: 2

Wb-az/MLib-PySpark-SoundLevel-Prediction

Creates an ML pipeline leveraging PySpark SQL and PySpark MLlib to predict sound level

Language: Jupyter Notebook - Size: 972 KB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 0 - Forks: 0

majchandra/covid19-data-analysis-spark

Big Data & Machine Learning project with PySpark: analysis and clustering of COVID-19 cases worldwide (2020–2023).

Language: Jupyter Notebook - Size: 5.35 MB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 0 - Forks: 0

G-Research/spark-extension

A library that provides useful extensions to Apache Spark and PySpark.

Language: Scala - Size: 1.1 MB - Last synced at: 9 days ago - Pushed at: 2 months ago - Stars: 224 - Forks: 28

maltzsama/sumeh

Sumeh: Unified Data Quality Framework. Sumeh is a unified data quality validation framework supporting multiple backends (PySpark, Dask, Polars, DuckDB, Pandas) with centralized rule configuration.

Language: Python - Size: 1.61 MB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 2 - Forks: 0

josephmachado/efficient_data_processing_spark

Code for "Efficient Data Processing in Spark" Course

Language: Python - Size: 23.9 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 304 - Forks: 63

shauryashaurya/learn-data-munging

Notes on Data Engineering with Pandas, PySpark, Dask, Ray, Arrow DataFusion, Polars etc.

Language: Jupyter Notebook - Size: 627 MB - Last synced at: 9 days ago - Pushed at: about 2 months ago - Stars: 47 - Forks: 21

astrolabsoftware/fink-science

Define your science modules to add values to Fink alerts!

Language: Jupyter Notebook - Size: 747 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 12 - Forks: 17

rizkipragustono/data_analysis_spark

Exploration: Data Analysis using Spark

Language: Jupyter Notebook - Size: 8.79 KB - Last synced at: 2 days ago - Pushed at: 12 days ago - Stars: 0 - Forks: 0

lykmapipo/Python-Spark-Log-Analysis

Python scripts to process, and analyze log files using PySpark.

Language: Python - Size: 131 KB - Last synced at: about 6 hours ago - Pushed at: 11 months ago - Stars: 6 - Forks: 0

ayshanbagir/pyspark_Movies

PySpark project using DataFrames

Language: Jupyter Notebook - Size: 1.11 MB - Last synced at: 11 days ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 2

capitalone/datacompy

Pandas, Polars, Spark, and Snowpark DataFrame comparison for humans and more!

Language: Python - Size: 11.5 MB - Last synced at: 12 days ago - Pushed at: 16 days ago - Stars: 571 - Forks: 141

SathyaV99/hadoop-spark-traffic-predictor-toronto

🚦 Toronto Traffic Prediction with Apache Spark, Hadoop and SparkML. Used Random Forest as the model for prediction

Language: Jupyter Notebook - Size: 31.9 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 0 - Forks: 0

davidzajac1/zillacode

Open Source LeetCode for PySpark, Spark, Pandas and DBT/Snowflake

Language: Python - Size: 2.68 MB - Last synced at: 8 days ago - Pushed at: 4 months ago - Stars: 177 - Forks: 38

datamole-ai/pysparkdt

An open-source Python library for simplifying local testing of Databricks workflows that use PySpark and Delta tables.

Language: Python - Size: 57.6 KB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 28 - Forks: 0

Nike-Inc/koheesio

Python framework for building efficient data pipelines. It promotes modularity and collaboration, enabling the creation of complex pipelines from simple, reusable components.

Language: Python - Size: 7.96 MB - Last synced at: 12 days ago - Pushed at: 16 days ago - Stars: 638 - Forks: 36

fabricks-framework/fabricks

Language: Python - Size: 1.25 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 5 - Forks: 0

quintoandar/butterfree

A tool for building feature stores.

Language: Python - Size: 1.43 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 302 - Forks: 37

lyhue1991/eat_pyspark_in_10_days

pyspark🍒🥭 is delicious, just eat it!😋😋

Language: Python - Size: 4.66 MB - Last synced at: 9 days ago - Pushed at: over 2 years ago - Stars: 802 - Forks: 217

MrPowers/mack

Delta Lake helper methods in PySpark

Language: Python - Size: 2.81 MB - Last synced at: 10 days ago - Pushed at: 9 months ago - Stars: 325 - Forks: 42

MrPowers/chispa

PySpark test helper methods with beautiful error messages

Language: Python - Size: 3.6 MB - Last synced at: 12 days ago - Pushed at: about 2 months ago - Stars: 692 - Forks: 72

microsoft/Azure-Databricks-NYC-Taxi-Workshop

An Azure Databricks workshop leveraging the New York Taxi and Limousine Commission Trip Records dataset

Language: Scala - Size: 42.3 MB - Last synced at: 4 days ago - Pushed at: about 2 years ago - Stars: 109 - Forks: 108

ev2900/Glue_Aggregate_Small_Files

PySpark script to aggregate small parquet files in a prefix into larger files. Designed to be run on AWS Glue

Language: Python - Size: 133 KB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 1 - Forks: 0

leoaugustosv/brl-prod-prices

An ETL for Brazilian e-commerce sites.

Language: Python - Size: 65.4 KB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 1 - Forks: 0

tirthajyoti/Spark-with-Python

Fundamentals of Spark with Python (using PySpark), code examples

Language: Jupyter Notebook - Size: 8.97 MB - Last synced at: 8 days ago - Pushed at: over 2 years ago - Stars: 347 - Forks: 271

Kaushik-Puttaswamy/Real-Time_Data_Streaming_Project

The project aims to create a real-time data application using Apache Kafka and Spark for stream processing, with data sourced from Kaggle and stored in Cassandra, prioritizing reliability, scalability, and security.

Language: Python - Size: 227 KB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 1 - Forks: 0

nessi-dev/nessi

A Python-based data processing and analysis tool built with PySpark and Delta Lake

Language: Go - Size: 12.7 MB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 0 - Forks: 0

hi-primus/optimus

🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark

Language: Python - Size: 110 MB - Last synced at: 5 days ago - Pushed at: 6 months ago - Stars: 1,512 - Forks: 233

Yi-Pin-123/Pyspark-Notes (fork of LinkedInLearning/high-performance-pyspark-advanced-strategies-for-optimal-data-processing-3919191)

This repository contains some PySpark code.

Language: Python - Size: 421 MB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 0 - Forks: 0

sbl-sdsc/mmtf-pyspark

Methods for the parallel and distributed analysis and mining of the Protein Data Bank using MMTF and Apache Spark.

Language: Python - Size: 524 MB - Last synced at: 15 days ago - Pushed at: about 2 years ago - Stars: 67 - Forks: 27