GitHub topics: pyspark
dmatrix/spark-declarative-pipelines
A collection of modern Spark Declarative Pipeline examples and implementations demonstrating different data processing paradigms and frameworks.
Language: Python - Size: 4.4 MB - Last synced at: about 9 hours ago - Pushed at: about 10 hours ago - Stars: 0 - Forks: 0
tatianaespinola/conjunto-de-dados-de-transacoes-financeiras-limpeza-e-breve-analise
I used a Kaggle dataset for my first data engineering project, aiming to learn more about the financial world, but mainly to apply PySpark and good coding practices.
Language: Jupyter Notebook - Size: 420 KB - Last synced at: about 9 hours ago - Pushed at: about 11 hours ago - Stars: 0 - Forks: 0
ev2900/Glue_Aggregate_Small_Files
PySpark script to aggregate small parquet files in a prefix into larger files. Designed to be run on AWS Glue
Language: Python - Size: 134 KB - Last synced at: about 13 hours ago - Pushed at: about 15 hours ago - Stars: 1 - Forks: 0
ev2900/Glue_Examples
PySpark code samples designed for AWS Glue
Language: Python - Size: 67.4 KB - Last synced at: about 13 hours ago - Pushed at: about 15 hours ago - Stars: 1 - Forks: 0
matteogrechi/pyspark-template-devcontainer
A minimal, opinionated template for building and running Apache Spark Python (PySpark) applications in a development container. The devcontainer includes a 3‑node Spark standalone cluster (1 master + 2 workers) for local development and testing.
Language: Jupyter Notebook - Size: 10.7 KB - Last synced at: about 14 hours ago - Pushed at: about 16 hours ago - Stars: 0 - Forks: 0
dan1elt0m/sqlfluff-pyspark
Lints and fixes PySpark SQL strings using sqlfluff. Comes with pre-commit hooks.
Language: Python - Size: 68.4 KB - Last synced at: about 18 hours ago - Pushed at: about 19 hours ago - Stars: 0 - Forks: 0
kanitvural/aws-data-science-data-engineering-mlops-infra
Ultimate AWS Data & AI Platform: Real-time flight delay predictions with complete DE, DS, MLOps, Web App & Multi-Agent LLM - All deployed via CDK self-mutating pipelines
Language: Jupyter Notebook - Size: 352 MB - Last synced at: about 23 hours ago - Pushed at: 1 day ago - Stars: 3 - Forks: 0
Bellamy0719/aws-pyspark-data-lakehouse-pipeline
Portfolio project demonstrating a cloud data lakehouse architecture on AWS with Databricks and PySpark. Includes raw, processed, and curated layers for scalable analytics.
Language: Jupyter Notebook - Size: 2.78 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 0 - Forks: 0
lacharydev/databricks-ai-segmentation
AI segmentation POC showing how Databricks PySpark pipelines can enrich customer data for activation in Salesforce Data Cloud and Agentforce. Demonstrates data wrangling with DataFrames and CRM Analytics visualization for AI-driven engagements.
Language: Jupyter Notebook - Size: 2.56 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 0 - Forks: 0
Ivanissimo/databricks-projects
A portfolio of data engineering projects using Databricks, PySpark, and Delta Lake
Language: Jupyter Notebook - Size: 497 KB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 0 - Forks: 0
kasrakhaksar/data-analysis
Projects To Analyze Various Data 📊
Language: Python - Size: 21.5 KB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0
spindle-health/spindle-token
PySpark implementation of the Open Privacy Preserving Record Linkage (OPPRL) specification.
Language: Python - Size: 1.85 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 20 - Forks: 1
denisecase/pro-analytics-apache-starter
Try Apache Spark with Python
Language: Python - Size: 93.8 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 6
microsoft/SynapseML
Simple and Distributed Machine Learning
Language: Scala - Size: 166 MB - Last synced at: about 19 hours ago - Pushed at: 3 days ago - Stars: 5,174 - Forks: 852
Psingh12354/Pyspark
Language: Jupyter Notebook - Size: 1.74 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 4 - Forks: 2
ciawes/spark-analytics-starter
📊 Streamline your data workflow with this PySpark starter for Telco Customer Churn, from data cleaning to feature engineering in one command.
Language: Java - Size: 1.51 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0
opendp/tumult-analytics
Tumult Analytics is a Python library for privately computing aggregate queries on tabular data. It is built atop the Tumult Core library.
Language: Python - Size: 5.54 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 12 - Forks: 3
SalimM21/Football-Data-Engineering-avec-PySpark-Databricks-et-Power-BI
Building a PySpark pipeline for season-by-season analysis of football performance, with KPI calculation, team ranking, and optimized storage in partitioned Parquet.
Language: Python - Size: 10.2 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 1 - Forks: 0
shauryashaurya/The-Silmaril
Practice #ontology engineering with Python (and other languages).
Language: Jupyter Notebook - Size: 268 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 18 - Forks: 6
AsgharAZ/Data-Science
This repository includes projects and resources from my Data Science course in Python. It covers data analysis techniques like cleaning, EDA, statistical modeling, machine learning, and visualization. I use libraries like Pandas, NumPy, Matplotlib, and MongoDB to analyze datasets and build predictive models.
Language: Jupyter Notebook - Size: 396 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 0 - Forks: 0
lakehq/sail
LakeSail's computation framework with a mission to unify batch processing, stream processing, and compute-intensive AI workloads.
Language: Rust - Size: 6.47 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 1,021 - Forks: 60
ethanwebber123/Airflow-ETL-ELT
Airflow-ETL-ELT is a robust data pipeline tool that enables efficient extraction, transformation, loading, and orchestration of data workflows. It offers a scalable and customizable solution for managing complex ETL and ELT processes with ease.
Size: 1000 Bytes - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 0 - Forks: 0
astrolabsoftware/fink-science
Define your science modules to add values to Fink alerts!
Language: Jupyter Notebook - Size: 863 MB - Last synced at: 4 days ago - Pushed at: 5 days ago - Stars: 12 - Forks: 17
Will436851/Raspberry-pi-OS-Documentation
Raspberry Pi OS installation and system call demonstration
Language: Shell - Size: 1.32 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 0 - Forks: 0
commoncrawl/cc-pyspark
Process Common Crawl data with Python and Spark
Language: Python - Size: 255 KB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 441 - Forks: 90
jgarciadiaz10/Vaga_Sr_Auditoria_Continua_Metodologia_e_IA
📊 Explore simulated financial transactions and AI logs for the Sr. Auditor Analytics challenge, enhancing continuous auditing through data analysis and risk indicators.
Language: C - Size: 1.48 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 0 - Forks: 0
imran5570/pyNotes
📝 Simplify your note-taking with pyNotes, a user-friendly desktop app for adding, editing, and organizing notes using Python and PyQt6.
Language: Python - Size: 1.33 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 0 - Forks: 0
PAFOULEH5/Loan-Default-Prediction
A machine learning project to predict loan default risk using financial and credit history data. Built as a team capstone project for a master's degree at Deakin University.
Language: HTML - Size: 1.6 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 0 - Forks: 1
opentargets/gentropy
Open Targets python framework for post-GWAS analysis
Language: Jupyter Notebook - Size: 36.6 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 46 - Forks: 15
Flixteu356/BigData-Architecture
Big Data system predicts pandemic risk (COVID-19) via data analysis, ML modeling, and real-time dashboard.
Language: Python - Size: 35.2 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 0 - Forks: 0
Bilal-iqbal01/Voltmart-Orders-Data-Cleaning
🧹 Clean and preprocess Voltmart's orders dataset for accurate demand forecasting using PySpark, ensuring high data quality and consistency.
Language: Python - Size: 9.67 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 0 - Forks: 0
jy1212686/eta-etl-spark
🚖 Ingest and analyze NYC yellow taxi data with a streamlined ETL pipeline, featuring data cleaning, analytics, and business-ready outputs.
Language: Jupyter Notebook - Size: 1.34 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 0 - Forks: 0
all-in-one67890/credit-recommendation-engine
💳 Generate personalized credit product recommendations using machine learning to match users with optimal financial options based on their unique profiles.
Size: 1.3 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 1 - Forks: 0
danuah/prices-predictor-system
📈 Predict future prices using this efficient system, enabling informed decisions in finance and investment sectors.
Size: 1.29 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 0 - Forks: 0
narwhals-dev/narwhals
Lightweight and extensible compatibility layer between dataframe libraries!
Language: Python - Size: 13.5 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 1,350 - Forks: 170
hq969/AWSFaultTracker
AWSFaultTracker is an end-to-end AWS Data Engineering and Monitoring System built to track and analyze service disruptions across AWS in real time. The project automates the process of fetching AWS service health updates, transforming them into structured datasets, and visualizing trends to identify reliability patterns.
Language: Python - Size: 12.7 KB - Last synced at: 2 days ago - Pushed at: 5 days ago - Stars: 0 - Forks: 0
akrem-sl/bank-marketing-sql-analysis
🏦 Analyze bank marketing campaigns to enhance customer engagement and ROI using SQL, Power BI, and Excel for data-driven insights and strategies.
Size: 306 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 0 - Forks: 0
Ahm-rgb/Alpha-SQL
Official repository for the paper "Alpha-SQL: Zero-Shot Text-to-SQL using Monte Carlo Tree Search" [ICML'25]
Language: Python - Size: 787 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 4 - Forks: 1
ego-creator/hepmassClassification
PySpark pipeline for particle classification in high-energy physics (HEPMASS dataset). Includes distributed preprocessing, model training (logistic regression, decision trees), evaluation, and key visualizations. Optimized for Hadoop/Spark.
Size: 1.95 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 0 - Forks: 0
jackstanmma/Customer-Sentiment-Prediction-PySpark
🔍 Predict customer sentiment using PySpark and Databricks, analyzing 49,000+ e-commerce reviews to drive insights for improved business strategies.
Language: Python - Size: 2.45 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 0 - Forks: 0
EmmanuelleTOCS/steam-game-recommender
A powerful recommendation system for Steam games, combining Content-Based and Collaborative Filtering techniques. Built with Python, Scikit-learn, and Streamlit to deliver accurate, real-time game recommendations. Perfect for gamers and data scientists interested in building intelligent recommendation engines.
Language: Jupyter Notebook - Size: 3.55 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 1 - Forks: 0
muhd-minhaz/AWS-Glue--Data-Copy
A function for copying data such as CSV, Parquet, and Avro files from a source S3 bucket to a destination S3 bucket using AWS Glue. It includes the necessary setup for the Glue job, logging, reading data from the source bucket, and writing it to the destination bucket.
Language: Python - Size: 4.88 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 0 - Forks: 0
Ayan007JBond/Sensor-Data-Analytics
This notebook showcases a complete machine learning workflow—from data preprocessing to model evaluation—for a binary classification task. It includes key techniques like feature scaling, handling class imbalance, and threshold tuning to improve prediction accuracy.
Language: Jupyter Notebook - Size: 10.2 MB - Last synced at: 5 days ago - Pushed at: 6 days ago - Stars: 1 - Forks: 0
mahmoudparsian/big-data-mapreduce-course
Big Data Modeling, MapReduce, Spark, PySpark @ Santa Clara University
Language: HTML - Size: 614 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 161 - Forks: 143
BadBoy0170/Log-Processing-Automation
Automated log analysis pipeline using PySpark to simulate Azure Databricks workflows. This project parses raw Apache logs, identifies operational patterns and anomalies using SQL queries, and exports results for visualization in Tableau — reducing manual review time in log analysis.
Language: Python - Size: 231 KB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 0 - Forks: 0
Yussufkadir/Currency-Forecast-Webpage
A full-stack financial application featuring real-time forex rates and machine learning-powered price predictions across multiple currency pairs. Built with automated data pipelines and Prophet time series forecasting.
Language: Jupyter Notebook - Size: 124 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 2 - Forks: 0
aymane-maghouti/Big-Data-Project
This project aims to predict smartphone prices using a combination of batch and stream processing techniques in a Big Data environment. The architecture follows the Lambda Architecture pattern, providing both real-time and batch processing capabilities to users.
Language: Python - Size: 960 KB - Last synced at: 5 days ago - Pushed at: over 1 year ago - Stars: 18 - Forks: 2
mlverse/pysparklyr
Extension to {sparklyr} that allows you to interact with Spark & Databricks Connect
Language: R - Size: 865 KB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 17 - Forks: 4
apache/incubator-graphar
An open source, standard data file format for graph data storage and retrieval.
Language: C++ - Size: 17.5 MB - Last synced at: 5 days ago - Pushed at: 17 days ago - Stars: 309 - Forks: 77
apache/linkis
Apache Linkis builds a computation middleware layer to facilitate connection, governance and orchestration between the upper applications and the underlying data engines.
Language: Java - Size: 87.2 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 3,392 - Forks: 1,170
opendp/tumult-core
Tumult Core is a collection of composable components for implementing algorithms to perform differentially private computations.
Language: Python - Size: 12.5 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 3 - Forks: 1
alisonpezzott/useful-snippets
A collection of useful code snippets for common development tasks of an analytics data engineer working with Microsoft technologies.
Language: Jupyter Notebook - Size: 74.2 KB - Last synced at: 6 days ago - Pushed at: 7 days ago - Stars: 3 - Forks: 3
HuuVuong0912/rag-llm-based-recommender
Explore a smarter way to shop online with this full-stack project built on the infrastructure of Google Cloud Platform (GCP) for RAG based e-commerce with LLM.
Language: TypeScript - Size: 4.2 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 3 - Forks: 1
TsungTseTu122/CloudComputing--MovieLens-Big-Data-Analytics-on-the-Cloud
This project analyzes the MovieLens dataset using PySpark, Hadoop HDFS, and Docker to perform clustering, classification, and association rule mining on user-movie interactions. The system runs in a containerized cloud environment with Spark clusters, enabling scalable big data processing.
Language: Jupyter Notebook - Size: 1.54 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 1 - Forks: 0
graphframes/graphframes
GraphFrames is a package for Apache Spark which provides DataFrame-based Graphs
Language: Scala - Size: 5.4 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 1,086 - Forks: 254
ibis-project/ibis
the portable Python dataframe library
Language: Python - Size: 180 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 6,178 - Forks: 680
srimantapal205/Subject-Wise-Question---Answer
This branch focuses on building a set of data engineering interview questions and answers.
Language: Python - Size: 829 KB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 0 - Forks: 0
archivesunleashed/aut
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
Language: Scala - Size: 39.5 MB - Last synced at: 2 days ago - Pushed at: over 1 year ago - Stars: 147 - Forks: 34
hendhamdi/Sentiment-Analysis-Spark-NLP
This project uses a Spark pipeline (PySpark) to analyze the sentiment of user reviews.
Language: HTML - Size: 1.35 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 0 - Forks: 0
alexott/cyber-spark-data-connectors
Cybersecurity-related custom data connectors for Spark
Language: Python - Size: 443 KB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 1 - Forks: 1
Patricia7sp/vaga_linkedin
Project for collecting and processing LinkedIn job postings using AI agents
Language: Python - Size: 646 KB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 0 - Forks: 0
typedef-ai/fenic
Build reliable AI and agentic applications with DataFrames
Language: Python - Size: 21.5 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 373 - Forks: 23
puran2003gupta/Real-Time-Job-Market-Data-Pipeline
Built a cloud-native ETL pipeline using Python, AWS, and Snowflake to analyze public job listings. Automated data ingestion with AWS Lambda and transformation with AWS Glue (PySpark). Stored cleaned data in Amazon S3 and loaded it into Snowflake for advanced analytics. Visualized job trends, skill demand, and salaries using an interactive Tableau dashboard.
Language: Python - Size: 281 KB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 0 - Forks: 0
uber/petastorm
The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
Language: Python - Size: 2.56 MB - Last synced at: 10 days ago - Pushed at: about 2 months ago - Stars: 1,865 - Forks: 285
dashmug/glue-utils
Python library designed to enhance the developer experience when working with AWS Glue ETL and Python Shell jobs by reducing boilerplate code, increasing type safety, and improving IDE auto-completion.
Language: Python - Size: 895 KB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 8 - Forks: 1
sb-ai-lab/RePlay
A Comprehensive Framework for Building End-to-End Recommendation Systems with State-of-the-Art Models
Language: Python - Size: 41.2 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 368 - Forks: 35
GoogleCloudPlatform/dataproc-templates
Dataproc templates and pipelines for solving in-cloud data tasks
Language: Python - Size: 18.8 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 135 - Forks: 106
AMRprojects/RecommendationSys
Practice building a recommendation system using PySpark in Databricks
Language: Jupyter Notebook - Size: 863 KB - Last synced at: 10 days ago - Pushed at: 11 days ago - Stars: 0 - Forks: 0
h2oai/sparkling-water
Sparkling Water provides H2O functionality inside Spark cluster
Language: Scala - Size: 46.8 MB - Last synced at: 5 days ago - Pushed at: 2 months ago - Stars: 977 - Forks: 360
databrickslabs/dbldatagen
Generate relevant synthetic data quickly for your projects. The Databricks Labs synthetic data generator (aka `dbldatagen`) may be used to generate large simulated / synthetic data sets for test, POCs, and other uses in Databricks environments including in Delta Live Tables pipelines
Language: Python - Size: 11 MB - Last synced at: 4 days ago - Pushed at: 13 days ago - Stars: 430 - Forks: 83
abshek7/Big-data
A repository documenting theory and practical notes from learning big data computing.
Language: Python - Size: 1.03 MB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 0 - Forks: 0
Kesara03/PySpark-DBT_Data_Engineering_Project
A Data Engineering pipeline using PySpark and DBT (data build tool) following a Medallion Architecture (Bronze, Silver, Gold layers) on the Databricks platform. Uses PySpark Structured Streaming for incremental loading and real-time data processing from a Data Lake source (simulated Uber dataset in CSV files).
Language: Jupyter Notebook - Size: 291 KB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 1 - Forks: 0
microsoft/unified-data-foundation-with-fabric-solution-accelerator
Unified Data Foundation with Microsoft Fabric with Options to Integrate with Azure Databricks and Microsoft Purview
Language: Jupyter Notebook - Size: 29.4 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 3 - Forks: 6
lougail/pyspark-data-processing-tutorial
Language: Jupyter Notebook - Size: 280 KB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 0 - Forks: 0
canimus/cuallee
Possibly the fastest DataFrame-agnostic quality check library in town.
Language: Python - Size: 2.38 MB - Last synced at: 9 days ago - Pushed at: 11 days ago - Stars: 223 - Forks: 21
fuadonates/sql-migration-modernization
SQL Server to Databricks migration expertise - T-SQL translation, stored procedure modernization, Delta Lake patterns for financial services
Language: Python - Size: 12.7 KB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 0 - Forks: 0
fuadonates/data-engineering
A collection of data engineering projects, proofs-of-concept (POCs), and proofs-of-knowledge (POKs) using technologies like Python, Spark, SQL, and cloud platforms.
Size: 5.86 KB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 0 - Forks: 0
GregoryKogan/crypto-trading-data-pipeline
Real-time crypto trading data pipeline using Apache Spark, Kafka, and Airflow. Containerized microservices architecture for streaming analytics.
Language: Python - Size: 21.5 KB - Last synced at: 6 days ago - Pushed at: 13 days ago - Stars: 2 - Forks: 0
kevinsames/spark-fuse
spark-fuse is an open-source toolkit for PySpark — providing utilities, connectors, and tools to fuse your data workflows together.
Language: Python - Size: 139 KB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 1 - Forks: 0
srwi/PyCharm-SparkDataFrameViewer
View Spark DataFrames from the PyCharm debugger with just two clicks
Language: Kotlin - Size: 171 KB - Last synced at: 6 days ago - Pushed at: 7 days ago - Stars: 1 - Forks: 0
Kesara03/Taxi-Data-Engineering-Azure
Industry-level DE project automating NYC Taxi data ingestion (API to ADLS Gen2) using parameterized ADF and structured ETL with PySpark, Delta Lake, and Medallion Architecture.
Language: Jupyter Notebook - Size: 26.5 MB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 0 - Forks: 0
jupyter-incubator/sparkmagic
Jupyter magics and kernels for working with remote Spark clusters
Language: Python - Size: 2.36 MB - Last synced at: 9 days ago - Pushed at: about 2 months ago - Stars: 1,363 - Forks: 455
Morphl-AI/MorphL-Community-Edition
MorphL Community Edition uses big data and machine learning to predict user behaviors in digital products and services with the end goal of increasing KPIs (click-through rates, conversion rates, etc.) through personalization
Language: Python - Size: 143 KB - Last synced at: 13 days ago - Pushed at: about 6 years ago - Stars: 260 - Forks: 29
opentargets/timeseries
Assess the evolution over time of evidence supporting disease-target associations in the Open Targets Platform.
Language: Python - Size: 1.84 MB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 2 - Forks: 1
mojkej/football_es
Data engineering project on the Spanish La Liga (2012–2021). Automated ETL pipeline and performance analysis with PySpark. Extraction, transformation, and visualization of football data.
Language: Jupyter Notebook - Size: 252 KB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 0 - Forks: 0
Nike-Inc/koheesio
Python framework for building efficient data pipelines. It promotes modularity and collaboration, enabling the creation of complex pipelines from simple, reusable components.
Language: Python - Size: 7.97 MB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 644 - Forks: 37
KevinShindel/MachineLearning
Pandas, Sci-kit, SparkML
Language: Jupyter Notebook - Size: 153 MB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 0 - Forks: 0
Daniel-jcVv/insurance-analytics-medallion
End-to-end insurance analytics platform using Medallion Architecture (Bronze/Silver/Gold) with Microsoft Fabric, PySpark, and Power BI.
Language: Jupyter Notebook - Size: 1.97 MB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 0 - Forks: 0
joaovnovais/pipeline_AWSGlue_PySpark
This project showcases a complete data engineering pipeline on AWS, following best practices in data ingestion, transformation, and analytics — ready for real-world production use or integration with BI tools such as QuickSight or Power BI.
Size: 3.73 MB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 0 - Forks: 0
Ling-Yun-Huang/spark-gcp-data-optimization
PySpark/GCP Dataproc pipeline achieving a 40x speedup in ML data preprocessing via TFRecord optimisation.
Language: Jupyter Notebook - Size: 10.4 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 1 - Forks: 0
Navyam-04/Python-ETL-Pipeline-on-Databricks
An end-to-end ETL pipeline built on Databricks to ingest, clean, and enrich retail sales data with weather information using Python and Pandas.
Language: Jupyter Notebook - Size: 1.06 MB - Last synced at: 15 days ago - Pushed at: 16 days ago - Stars: 2 - Forks: 0
tech-aakash/Spark-Based-Alzheimer-s-Diagnosis-from-Clinical-Cognitive-Tabular-Data
This project implements a Spark-based pipeline for early-stage diagnosis of Alzheimer’s disease (AD) using clinical and cognitive tabular data. The pipeline demonstrates distributed data preprocessing, statistical analysis, and model training across single-node and multi-node Spark clusters.
Language: Python - Size: 2.52 MB - Last synced at: 15 days ago - Pushed at: 16 days ago - Stars: 0 - Forks: 0
jzmnd/pants-pyspark-mve
Minimum viable example repo for running Pants with PySpark libraries
Language: Python - Size: 58.6 KB - Last synced at: 15 days ago - Pushed at: 16 days ago - Stars: 0 - Forks: 0
hi-primus/optimus
🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark
Language: Python - Size: 110 MB - Last synced at: 14 days ago - Pushed at: 11 months ago - Stars: 1,522 - Forks: 234
astrolabsoftware/fink-filters
Define your filters to create your alert stream in Fink!
Language: Python - Size: 67.7 MB - Last synced at: 7 days ago - Pushed at: 8 days ago - Stars: 1 - Forks: 11
NHSDigital/data-validation-engine
Data Validation Engine source code
Language: Python - Size: 403 KB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 5 - Forks: 0
jafeerr/spark-data-test
Spark Data Test - A PySpark-based automation testing utility to compare Spark DataFrames
Language: Python - Size: 34.2 KB - Last synced at: 16 days ago - Pushed at: 17 days ago - Stars: 5 - Forks: 0
jrlasak/databricks_apparel_streaming
Databricks DLT Apparel Pipeline Project: Learn medallion architecture, streaming, and data engineering with Delta Live Tables. Includes synthetic data, step-by-step guide, and certification prep.
Language: Python - Size: 1.18 MB - Last synced at: 16 days ago - Pushed at: 17 days ago - Stars: 22 - Forks: 14
alishazaveri-dev/Amazon-Best-Selling-Books-Analysis
This Python project analyzes real-time data from Amazon's bestseller lists using Databricks and PySpark.
Language: Jupyter Notebook - Size: 164 KB - Last synced at: 16 days ago - Pushed at: 18 days ago - Stars: 0 - Forks: 0
OmarAlhaz/E-Commerce-Sales-Forecasting-with-PySpark
Forecasting e-commerce product demand using PySpark MLlib. Includes data preprocessing, feature engineering, Random Forest modeling, and evaluation via Mean Absolute Error.
Language: Python - Size: 6.17 MB - Last synced at: 17 days ago - Pushed at: 18 days ago - Stars: 0 - Forks: 0