An open API service providing repository metadata for many open source software ecosystems.

GitHub topics: pyspark

dmatrix/spark-declarative-pipelines

A collection of modern Spark Declarative Pipeline examples and implementations demonstrating different data processing paradigms and frameworks.

Language: Python - Size: 4.4 MB - Last synced at: about 9 hours ago - Pushed at: about 10 hours ago - Stars: 0 - Forks: 0

tatianaespinola/conjunto-de-dados-de-transacoes-financeiras-limpeza-e-breve-analise

I used a Kaggle dataset for my first data engineering project, aiming to learn more about the financial world, but mainly to apply PySpark and good coding practices.

Language: Jupyter Notebook - Size: 420 KB - Last synced at: about 9 hours ago - Pushed at: about 11 hours ago - Stars: 0 - Forks: 0

ev2900/Glue_Aggregate_Small_Files

PySpark script to aggregate small Parquet files in a prefix into larger files. Designed to be run on AWS Glue.

Language: Python - Size: 134 KB - Last synced at: about 13 hours ago - Pushed at: about 15 hours ago - Stars: 1 - Forks: 0

ev2900/Glue_Examples

PySpark code samples designed for AWS Glue

Language: Python - Size: 67.4 KB - Last synced at: about 13 hours ago - Pushed at: about 15 hours ago - Stars: 1 - Forks: 0

matteogrechi/pyspark-template-devcontainer

A minimal, opinionated template for building and running Apache Spark Python (PySpark) applications in a development container. The devcontainer includes a 3‑node Spark standalone cluster (1 master + 2 workers) for local development and testing.

Language: Jupyter Notebook - Size: 10.7 KB - Last synced at: about 14 hours ago - Pushed at: about 16 hours ago - Stars: 0 - Forks: 0

dan1elt0m/sqlfluff-pyspark

Lints and fixes PySpark SQL strings using sqlfluff. Comes with pre-commit hooks.

Language: Python - Size: 68.4 KB - Last synced at: about 18 hours ago - Pushed at: about 19 hours ago - Stars: 0 - Forks: 0

kanitvural/aws-data-science-data-engineering-mlops-infra

Ultimate AWS Data & AI Platform: Real-time flight delay predictions with complete DE, DS, MLOps, Web App & Multi-Agent LLM - All deployed via CDK self-mutating pipelines

Language: Jupyter Notebook - Size: 352 MB - Last synced at: about 23 hours ago - Pushed at: 1 day ago - Stars: 3 - Forks: 0

Bellamy0719/aws-pyspark-data-lakehouse-pipeline

Portfolio project demonstrating a cloud data lakehouse architecture on AWS with Databricks and PySpark. Includes raw, processed, and curated layers for scalable analytics.

Language: Jupyter Notebook - Size: 2.78 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 0 - Forks: 0

lacharydev/databricks-ai-segmentation

AI segmentation POC showing how Databricks PySpark pipelines can enrich customer data for activation in Salesforce Data Cloud and Agentforce. Demonstrates data wrangling with DataFrames and CRM Analytics visualization for AI-driven engagements.

Language: Jupyter Notebook - Size: 2.56 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 0 - Forks: 0

Ivanissimo/databricks-projects

A portfolio of data engineering projects using Databricks, PySpark, and Delta Lake

Language: Jupyter Notebook - Size: 497 KB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 0 - Forks: 0

kasrakhaksar/data-analysis

Projects To Analyze Various Data 📊

Language: Python - Size: 21.5 KB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0

spindle-health/spindle-token

PySpark implementation of the Open Privacy Preserving Record Linkage (OPPRL) specification.

Language: Python - Size: 1.85 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 20 - Forks: 1

denisecase/pro-analytics-apache-starter

Try Apache Spark with Python

Language: Python - Size: 93.8 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 6

microsoft/SynapseML

Simple and Distributed Machine Learning

Language: Scala - Size: 166 MB - Last synced at: about 19 hours ago - Pushed at: 3 days ago - Stars: 5,174 - Forks: 852

Psingh12354/Pyspark

Language: Jupyter Notebook - Size: 1.74 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 4 - Forks: 2

ciawes/spark-analytics-starter

📊 Streamline your data workflow with this PySpark starter for Telco Customer Churn, from data cleaning to feature engineering in one command.

Language: Java - Size: 1.51 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0

opendp/tumult-analytics

Tumult Analytics is a Python library for privately computing aggregate queries on tabular data. It is built atop the Tumult Core library.

Language: Python - Size: 5.54 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 12 - Forks: 3

SalimM21/Football-Data-Engineering-avec-PySpark-Databricks-et-Power-BI

Building a PySpark pipeline for season-by-season analysis of football performance, with KPI calculation, team ranking, and optimized storage in partitioned Parquet.

Language: Python - Size: 10.2 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 1 - Forks: 0

shauryashaurya/The-Silmaril

Practice #ontology engineering with Python (and other languages).

Language: Jupyter Notebook - Size: 268 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 18 - Forks: 6

AsgharAZ/Data-Science

This repository includes projects and resources from my Data Science course in Python. It covers data analysis techniques like cleaning, EDA, statistical modeling, machine learning, and visualization. I use libraries like Pandas, NumPy, Matplotlib, and MongoDB to analyze datasets and build predictive models.

Language: Jupyter Notebook - Size: 396 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 0 - Forks: 0

lakehq/sail

LakeSail's computation framework with a mission to unify batch processing, stream processing, and compute-intensive AI workloads.

Language: Rust - Size: 6.47 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 1,021 - Forks: 60

ethanwebber123/Airflow-ETL-ELT

Airflow-ETL-ELT is a robust data pipeline tool that enables efficient extraction, transformation, loading, and orchestration of data workflows. It offers a scalable and customizable solution for managing complex ETL and ELT processes with ease.

Size: 1000 Bytes - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 0 - Forks: 0

astrolabsoftware/fink-science

Define your science modules to add values to Fink alerts!

Language: Jupyter Notebook - Size: 863 MB - Last synced at: 4 days ago - Pushed at: 5 days ago - Stars: 12 - Forks: 17

Will436851/Raspberry-pi-OS-Documentation

Raspberry Pi OS installation and system call demonstration

Language: Shell - Size: 1.32 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 0 - Forks: 0

commoncrawl/cc-pyspark

Process Common Crawl data with Python and Spark

Language: Python - Size: 255 KB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 441 - Forks: 90

jgarciadiaz10/Vaga_Sr_Auditoria_Continua_Metodologia_e_IA

📊 Explore simulated financial transactions and AI logs for the Sr. Auditor Analytics challenge, enhancing continuous auditing through data analysis and risk indicators.

Language: C - Size: 1.48 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 0 - Forks: 0

imran5570/pyNotes

📝 Simplify your note-taking with pyNotes, a user-friendly desktop app for adding, editing, and organizing notes using Python and PyQt6.

Language: Python - Size: 1.33 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 0 - Forks: 0

PAFOULEH5/Loan-Default-Prediction

A machine learning project to predict loan default risk using financial and credit history data. Built as part of a team capstone project for a master's degree at Deakin University.

Language: HTML - Size: 1.6 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 0 - Forks: 1

opentargets/gentropy

Open Targets python framework for post-GWAS analysis

Language: Jupyter Notebook - Size: 36.6 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 46 - Forks: 15

Flixteu356/BigData-Architecture

A Big Data system that predicts pandemic risk (COVID-19) via data analysis, ML modeling, and a real-time dashboard.

Language: Python - Size: 35.2 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 0 - Forks: 0

Bilal-iqbal01/Voltmart-Orders-Data-Cleaning

🧹 Clean and preprocess Voltmart's orders dataset for accurate demand forecasting using PySpark, ensuring high data quality and consistency.

Language: Python - Size: 9.67 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 0 - Forks: 0

jy1212686/eta-etl-spark

🚖 Ingest and analyze NYC yellow taxi data with a streamlined ETL pipeline, featuring data cleaning, analytics, and business-ready outputs.

Language: Jupyter Notebook - Size: 1.34 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 0 - Forks: 0

all-in-one67890/credit-recommendation-engine

💳 Generate personalized credit product recommendations using machine learning to match users with optimal financial options based on their unique profiles.

Size: 1.3 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 1 - Forks: 0

danuah/prices-predictor-system

📈 Predict future prices using this efficient system, enabling informed decisions in finance and investment sectors.

Size: 1.29 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 0 - Forks: 0

narwhals-dev/narwhals

Lightweight and extensible compatibility layer between dataframe libraries!

Language: Python - Size: 13.5 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 1,350 - Forks: 170

hq969/AWSFaultTracker

AWSFaultTracker is an end-to-end AWS Data Engineering and Monitoring System built to track and analyze service disruptions across AWS in real time. The project automates the process of fetching AWS service health updates, transforming them into structured datasets, and visualizing trends to identify reliability patterns.

Language: Python - Size: 12.7 KB - Last synced at: 2 days ago - Pushed at: 5 days ago - Stars: 0 - Forks: 0

akrem-sl/bank-marketing-sql-analysis

🏦 Analyze bank marketing campaigns to enhance customer engagement and ROI using SQL, Power BI, and Excel for data-driven insights and strategies.

Size: 306 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 0 - Forks: 0

Ahm-rgb/Alpha-SQL

Official repository for the paper "Alpha-SQL: Zero-Shot Text-to-SQL using Monte Carlo Tree Search" [ICML'25]

Language: Python - Size: 787 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 4 - Forks: 1

ego-creator/hepmassClassification

PySpark pipeline for particle classification in high-energy physics (HEPMASS dataset). Includes distributed preprocessing, model training (logistic regression, decision trees), evaluation, and key visualizations. Optimized for Hadoop/Spark.

Size: 1.95 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 0 - Forks: 0

jackstanmma/Customer-Sentiment-Prediction-PySpark

🔍 Predict customer sentiment using PySpark and Databricks, analyzing 49,000+ e-commerce reviews to drive insights for improved business strategies.

Language: Python - Size: 2.45 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 0 - Forks: 0

EmmanuelleTOCS/steam-game-recommender

A powerful recommendation system for Steam games, combining Content-Based and Collaborative Filtering techniques. Built with Python, Scikit-learn, and Streamlit to deliver accurate, real-time game recommendations. Perfect for gamers and data scientists interested in building intelligent recommendation engines.

Language: Jupyter Notebook - Size: 3.55 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 1 - Forks: 0

muhd-minhaz/AWS-Glue--Data-Copy

A function for copying data such as CSV, Parquet, and Avro files from a source S3 bucket to a destination S3 bucket using AWS Glue. It includes the necessary setup for the Glue job, logging, reading data from the source bucket, and writing it to the destination bucket.

Language: Python - Size: 4.88 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 0 - Forks: 0

Ayan007JBond/Sensor-Data-Analytics

This notebook showcases a complete machine learning workflow—from data preprocessing to model evaluation—for a binary classification task. It includes key techniques like feature scaling, handling class imbalance, and threshold tuning to improve prediction accuracy.

Language: Jupyter Notebook - Size: 10.2 MB - Last synced at: 5 days ago - Pushed at: 6 days ago - Stars: 1 - Forks: 0

mahmoudparsian/big-data-mapreduce-course

Big Data Modeling, MapReduce, Spark, PySpark @ Santa Clara University

Language: HTML - Size: 614 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 161 - Forks: 143

BadBoy0170/Log-Processing-Automation

Automated log analysis pipeline using PySpark to simulate Azure Databricks workflows. This project parses raw Apache logs, identifies operational patterns and anomalies using SQL queries, and exports results for visualization in Tableau — reducing manual review time in log analysis.

Language: Python - Size: 231 KB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 0 - Forks: 0

Yussufkadir/Currency-Forecast-Webpage

A full-stack financial application featuring real-time forex rates and machine learning-powered price predictions across multiple currency pairs. Built with automated data pipelines and Prophet time series forecasting.

Language: Jupyter Notebook - Size: 124 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 2 - Forks: 0

aymane-maghouti/Big-Data-Project

This project aims to predict smartphone prices using a combination of batch and stream processing techniques in a Big Data environment. The architecture follows the Lambda Architecture pattern, providing both real-time and batch processing capabilities to users.

Language: Python - Size: 960 KB - Last synced at: 5 days ago - Pushed at: over 1 year ago - Stars: 18 - Forks: 2

mlverse/pysparklyr

Extension to {sparklyr} that allows you to interact with Spark & Databricks Connect

Language: R - Size: 865 KB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 17 - Forks: 4

apache/incubator-graphar

An open source, standard data file format for graph data storage and retrieval.

Language: C++ - Size: 17.5 MB - Last synced at: 5 days ago - Pushed at: 17 days ago - Stars: 309 - Forks: 77

apache/linkis

Apache Linkis builds a computation middleware layer to facilitate connection, governance and orchestration between the upper applications and the underlying data engines.

Language: Java - Size: 87.2 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 3,392 - Forks: 1,170

opendp/tumult-core

Tumult Core is a collection of composable components for implementing algorithms to perform differentially private computations.

Language: Python - Size: 12.5 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 3 - Forks: 1

alisonpezzott/useful-snippets

A collection of useful code snippets for common development tasks of an analytics data engineer working with Microsoft-based tools.

Language: Jupyter Notebook - Size: 74.2 KB - Last synced at: 6 days ago - Pushed at: 7 days ago - Stars: 3 - Forks: 3

HuuVuong0912/rag-llm-based-recommender

Explore a smarter way to shop online with this full-stack project built on the infrastructure of Google Cloud Platform (GCP) for RAG based e-commerce with LLM.

Language: TypeScript - Size: 4.2 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 3 - Forks: 1

TsungTseTu122/CloudComputing--MovieLens-Big-Data-Analytics-on-the-Cloud

This project analyzes the MovieLens dataset using PySpark, Hadoop HDFS, and Docker to perform clustering, classification, and association rule mining on user-movie interactions. The system runs in a containerized cloud environment with Spark clusters, enabling scalable big data processing.

Language: Jupyter Notebook - Size: 1.54 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 1 - Forks: 0

graphframes/graphframes

GraphFrames is a package for Apache Spark which provides DataFrame-based Graphs

Language: Scala - Size: 5.4 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 1,086 - Forks: 254

ibis-project/ibis

the portable Python dataframe library

Language: Python - Size: 180 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 6,178 - Forks: 680

srimantapal205/Subject-Wise-Question---Answer

This branch focuses on building a set of data engineering interview questions and answers.

Language: Python - Size: 829 KB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 0 - Forks: 0

archivesunleashed/aut

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.

Language: Scala - Size: 39.5 MB - Last synced at: 2 days ago - Pushed at: over 1 year ago - Stars: 147 - Forks: 34

hendhamdi/Sentiment-Analysis-Spark-NLP

This project uses a Spark pipeline (PySpark) to analyze the sentiment of user reviews.

Language: HTML - Size: 1.35 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 0 - Forks: 0

alexott/cyber-spark-data-connectors

Cybersecurity-related custom data connectors for Spark

Language: Python - Size: 443 KB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 1 - Forks: 1

Patricia7sp/vaga_linkedin

Project for collecting and processing LinkedIn job postings using AI agents

Language: Python - Size: 646 KB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 0 - Forks: 0

typedef-ai/fenic

Build reliable AI and agentic applications with DataFrames

Language: Python - Size: 21.5 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 373 - Forks: 23

puran2003gupta/Real-Time-Job-Market-Data-Pipeline

Built a cloud-native ETL pipeline using Python, AWS, and Snowflake to analyze public job listings. Automated data ingestion with AWS Lambda and transformation with AWS Glue (PySpark). Stored cleaned data in Amazon S3 and loaded it into Snowflake for advanced analytics. Visualized job trends, skill demand, and salaries using an interactive Tableau dashboard.

Language: Python - Size: 281 KB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 0 - Forks: 0

uber/petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.

Language: Python - Size: 2.56 MB - Last synced at: 10 days ago - Pushed at: about 2 months ago - Stars: 1,865 - Forks: 285

dashmug/glue-utils

Python library designed to enhance the developer experience when working with AWS Glue ETL and Python Shell jobs by reducing boilerplate code, increasing type safety, and improving IDE auto-completion.

Language: Python - Size: 895 KB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 8 - Forks: 1

sb-ai-lab/RePlay

A Comprehensive Framework for Building End-to-End Recommendation Systems with State-of-the-Art Models

Language: Python - Size: 41.2 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 368 - Forks: 35

GoogleCloudPlatform/dataproc-templates

Dataproc templates and pipelines for solving in-cloud data tasks

Language: Python - Size: 18.8 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 135 - Forks: 106

AMRprojects/RecommendationSys

Practice building a recommendation system using PySpark in Databricks

Language: Jupyter Notebook - Size: 863 KB - Last synced at: 10 days ago - Pushed at: 11 days ago - Stars: 0 - Forks: 0

h2oai/sparkling-water

Sparkling Water provides H2O functionality inside Spark cluster

Language: Scala - Size: 46.8 MB - Last synced at: 5 days ago - Pushed at: 2 months ago - Stars: 977 - Forks: 360

databrickslabs/dbldatagen

Generate relevant synthetic data quickly for your projects. The Databricks Labs synthetic data generator (aka `dbldatagen`) may be used to generate large simulated / synthetic data sets for test, POCs, and other uses in Databricks environments including in Delta Live Tables pipelines

Language: Python - Size: 11 MB - Last synced at: 4 days ago - Pushed at: 13 days ago - Stars: 430 - Forks: 83

abshek7/Big-data

A repository documenting theory and practical notes from learning big data computing.

Language: Python - Size: 1.03 MB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 0 - Forks: 0

Kesara03/PySpark-DBT_Data_Engineering_Project

A Data Engineering pipeline using PySpark and DBT (data build tool) following a Medallion Architecture (Bronze, Silver, Gold layers) on the Databricks platform. Uses PySpark Structured Streaming for incremental loading and real-time data processing from a Data Lake source (simulated Uber dataset in CSV files).

Language: Jupyter Notebook - Size: 291 KB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 1 - Forks: 0

microsoft/unified-data-foundation-with-fabric-solution-accelerator

Unified Data Foundation with Microsoft Fabric with Options to Integrate with Azure Databricks and Microsoft Purview

Language: Jupyter Notebook - Size: 29.4 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 3 - Forks: 6

lougail/pyspark-data-processing-tutorial

Language: Jupyter Notebook - Size: 280 KB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 0 - Forks: 0

canimus/cuallee

Possibly the fastest DataFrame-agnostic quality check library in town.

Language: Python - Size: 2.38 MB - Last synced at: 9 days ago - Pushed at: 11 days ago - Stars: 223 - Forks: 21

fuadonates/sql-migration-modernization

SQL Server to Databricks migration expertise - T-SQL translation, stored procedure modernization, Delta Lake patterns for financial services

Language: Python - Size: 12.7 KB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 0 - Forks: 0

fuadonates/data-engineering

A collection of data engineering projects, proofs-of-concept (POCs), and proofs-of-knowledge (POKs) using technologies like Python, Spark, SQL, and cloud platforms.

Size: 5.86 KB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 0 - Forks: 0

GregoryKogan/crypto-trading-data-pipeline

Real-time crypto trading data pipeline using Apache Spark, Kafka, and Airflow. Containerized microservices architecture for streaming analytics.

Language: Python - Size: 21.5 KB - Last synced at: 6 days ago - Pushed at: 13 days ago - Stars: 2 - Forks: 0

kevinsames/spark-fuse

spark-fuse is an open-source toolkit for PySpark — providing utilities, connectors, and tools to fuse your data workflows together.

Language: Python - Size: 139 KB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 1 - Forks: 0

srwi/PyCharm-SparkDataFrameViewer

View Spark DataFrames from the PyCharm debugger with just two clicks

Language: Kotlin - Size: 171 KB - Last synced at: 6 days ago - Pushed at: 7 days ago - Stars: 1 - Forks: 0

Kesara03/Taxi-Data-Engineering-Azure

Industry-level DE project automating NYC Taxi data ingestion (API to ADLS Gen2) using parameterized ADF and structured ETL with PySpark, Delta Lake, and Medallion Architecture.

Language: Jupyter Notebook - Size: 26.5 MB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 0 - Forks: 0

jupyter-incubator/sparkmagic

Jupyter magics and kernels for working with remote Spark clusters

Language: Python - Size: 2.36 MB - Last synced at: 9 days ago - Pushed at: about 2 months ago - Stars: 1,363 - Forks: 455

Morphl-AI/MorphL-Community-Edition

MorphL Community Edition uses big data and machine learning to predict user behaviors in digital products and services with the end goal of increasing KPIs (click-through rates, conversion rates, etc.) through personalization

Language: Python - Size: 143 KB - Last synced at: 13 days ago - Pushed at: about 6 years ago - Stars: 260 - Forks: 29

opentargets/timeseries

Assess the evolution over time of evidence supporting disease-target associations in the Open Targets Platform.

Language: Python - Size: 1.84 MB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 2 - Forks: 1

mojkej/football_es

Data engineering project on the Spanish La Liga (2012–2021). Automated ETL pipeline and performance analysis with PySpark. Extraction, transformation, and visualization of football data.

Language: Jupyter Notebook - Size: 252 KB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 0 - Forks: 0

Nike-Inc/koheesio

Python framework for building efficient data pipelines. It promotes modularity and collaboration, enabling the creation of complex pipelines from simple, reusable components.

Language: Python - Size: 7.97 MB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 644 - Forks: 37

KevinShindel/MachineLearning

Pandas, Sci-kit, SparkML

Language: Jupyter Notebook - Size: 153 MB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 0 - Forks: 0

Daniel-jcVv/insurance-analytics-medallion

End-to-end insurance analytics platform using Medallion Architecture (Bronze/Silver/Gold) with Microsoft Fabric, PySpark, and Power BI.

Language: Jupyter Notebook - Size: 1.97 MB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 0 - Forks: 0

joaovnovais/pipeline_AWSGlue_PySpark

This project showcases a complete data engineering pipeline on AWS, following best practices in data ingestion, transformation, and analytics — ready for real-world production use or integration with BI tools such as QuickSight or Power BI.

Size: 3.73 MB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 0 - Forks: 0

Ling-Yun-Huang/spark-gcp-data-optimization

PySpark/GCP Dataproc pipeline achieving a 40x speedup in ML data preprocessing via TFRecord optimisation.

Language: Jupyter Notebook - Size: 10.4 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 1 - Forks: 0

Navyam-04/Python-ETL-Pipeline-on-Databricks

An end-to-end ETL pipeline built on Databricks to ingest, clean, and enrich retail sales data with weather information using Python and Pandas.

Language: Jupyter Notebook - Size: 1.06 MB - Last synced at: 15 days ago - Pushed at: 16 days ago - Stars: 2 - Forks: 0

tech-aakash/Spark-Based-Alzheimer-s-Diagnosis-from-Clinical-Cognitive-Tabular-Data

This project implements a Spark-based pipeline for early-stage diagnosis of Alzheimer’s disease (AD) using clinical and cognitive tabular data. The pipeline demonstrates distributed data preprocessing, statistical analysis, and model training across single-node and multi-node Spark clusters.

Language: Python - Size: 2.52 MB - Last synced at: 15 days ago - Pushed at: 16 days ago - Stars: 0 - Forks: 0

jzmnd/pants-pyspark-mve

Minimum viable example repo for running Pants with PySpark libraries

Language: Python - Size: 58.6 KB - Last synced at: 15 days ago - Pushed at: 16 days ago - Stars: 0 - Forks: 0

hi-primus/optimus

🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark

Language: Python - Size: 110 MB - Last synced at: 14 days ago - Pushed at: 11 months ago - Stars: 1,522 - Forks: 234

astrolabsoftware/fink-filters

Define your filters to create your alert stream in Fink!

Language: Python - Size: 67.7 MB - Last synced at: 7 days ago - Pushed at: 8 days ago - Stars: 1 - Forks: 11

NHSDigital/data-validation-engine

Data Validation Engine source code

Language: Python - Size: 403 KB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 5 - Forks: 0

jafeerr/spark-data-test

Spark Data Test - A PySpark-based automation testing utility to compare Spark DataFrames

Language: Python - Size: 34.2 KB - Last synced at: 16 days ago - Pushed at: 17 days ago - Stars: 5 - Forks: 0

jrlasak/databricks_apparel_streaming

Databricks DLT Apparel Pipeline Project: Learn medallion architecture, streaming, and data engineering with Delta Live Tables. Includes synthetic data, step-by-step guide, and certification prep.

Language: Python - Size: 1.18 MB - Last synced at: 16 days ago - Pushed at: 17 days ago - Stars: 22 - Forks: 14

alishazaveri-dev/Amazon-Best-Selling-Books-Analysis

This Python project analyzes real-time data from Amazon's bestseller lists using Databricks and PySpark.

Language: Jupyter Notebook - Size: 164 KB - Last synced at: 16 days ago - Pushed at: 18 days ago - Stars: 0 - Forks: 0

OmarAlhaz/E-Commerce-Sales-Forecasting-with-PySpark

Forecasting e-commerce product demand using PySpark MLlib. Includes data preprocessing, feature engineering, Random Forest modeling, and evaluation via Mean Absolute Error.

Language: Python - Size: 6.17 MB - Last synced at: 17 days ago - Pushed at: 18 days ago - Stars: 0 - Forks: 0