pyspark | Topic | Ecosyste.ms: Repos

Topic: "pyspark"

sbl-sdsc/mmtf-workshop-2018

Structural Bioinformatics Training Workshop & Hackathon 2018

Language: Jupyter Notebook - Size: 369 MB - Last synced at: 2 months ago - Pushed at: about 3 years ago - Stars: 60 - Forks: 31

tomaztk/Spark-for-data-engineers

Apache Spark for data engineers

Language: Jupyter Notebook - Size: 6.9 MB - Last synced at: 3 months ago - Pushed at: almost 3 years ago - Stars: 55 - Forks: 22

kaantas/spark-twitter-sentiment-analysis

Sentiment Analysis of a Twitter Topic with Spark Structured Streaming

Language: Python - Size: 2.93 KB - Last synced at: about 22 hours ago - Pushed at: over 6 years ago - Stars: 55 - Forks: 34

chenxingqiang/DNN-YouTube-RecSys

This repository provides a comprehensive implementation of a deep neural network-based recommendation system similar to YouTube's. The repo is organized to include the core Python implementation of the model and a Spark-based Scala solution for data generation and model serving.

Language: Python - Size: 1.86 MB - Last synced at: 3 months ago - Pushed at: 9 months ago - Stars: 54 - Forks: 32

allisonwang-db/pyspark-data-sources

Custom PySpark Data Sources

Language: Python - Size: 665 KB - Last synced at: 25 days ago - Pushed at: about 2 months ago - Stars: 53 - Forks: 9

vim89/datapipelines-essentials-python

Simplified ETL process in Hadoop using Apache Spark. Includes a complete ETL pipeline for a data lake, with SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations.

Language: Python - Size: 1.76 MB - Last synced at: about 1 year ago - Pushed at: about 2 years ago - Stars: 53 - Forks: 34

dimajix/spark-training

Repository used for Spark Trainings

Language: Jupyter Notebook - Size: 9 MB - Last synced at: 2 months ago - Pushed at: about 2 years ago - Stars: 53 - Forks: 66

Spratiher9/Sparkora

Powerful rapid automatic EDA and feature engineering library with a very easy to use API 🌟

Language: HTML - Size: 1.23 MB - Last synced at: about 1 month ago - Pushed at: over 3 years ago - Stars: 53 - Forks: 7

luisbelloch/data_processing_course

Some class materials for a data processing course using PySpark

Language: Python - Size: 563 KB - Last synced at: almost 2 years ago - Pushed at: over 2 years ago - Stars: 51 - Forks: 24

NHSDigital/data-analytics-services

This repo collects the open-source work of the Analytics Service within NHS Digital Data Services

Size: 81.1 KB - Last synced at: 2 months ago - Pushed at: over 1 year ago - Stars: 50 - Forks: 2

jplane/pyspark-devcontainer

A simple VS Code devcontainer setup for local PySpark development

Language: Jupyter Notebook - Size: 318 KB - Last synced at: 2 months ago - Pushed at: almost 2 years ago - Stars: 50 - Forks: 28

jmcmt87/spark_app_twitter

A data engineering project (Twitter monitor app)

Language: Python - Size: 1.76 MB - Last synced at: over 2 years ago - Pushed at: about 3 years ago - Stars: 50 - Forks: 8

shauryashaurya/learn-data-munging

Notes on Data Engineering with Pandas, PySpark, Dask, Ray, Arrow DataFusion, Polars etc.

Language: Jupyter Notebook - Size: 631 MB - Last synced at: 7 days ago - Pushed at: about 1 month ago - Stars: 47 - Forks: 21

idealo/terraform-emr-pyspark 📦

Quickstart PySpark with Anaconda on AWS/EMR using Terraform

Language: HCL - Size: 11.7 KB - Last synced at: 4 months ago - Pushed at: 6 months ago - Stars: 47 - Forks: 33

sllynn/spark-xgboost

A Python wrapper for XGBoost4J-Spark classes.

Language: Python - Size: 456 KB - Last synced at: 7 months ago - Pushed at: about 1 year ago - Stars: 47 - Forks: 13

zero323/pyspark-asyncactions

Asynchronous actions for PySpark

Language: Python - Size: 80.1 KB - Last synced at: 11 days ago - Pushed at: over 3 years ago - Stars: 47 - Forks: 2

sparkdq-community/sparkdq

A declarative PySpark framework for row- and aggregate-level data quality validation.

Language: Python - Size: 7.43 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 46 - Forks: 5

criteo/cluster-pack

A library on top of either pex or conda-pack to make your Python code easily available on a cluster

Language: Python - Size: 436 KB - Last synced at: about 1 month ago - Pushed at: 7 months ago - Stars: 45 - Forks: 21

dogukannulu/streaming_data_processing

Create streaming data, send it to Kafka, transform it with PySpark, and load it into Elasticsearch and MinIO

Language: Python - Size: 1.81 MB - Last synced at: over 1 year ago - Pushed at: almost 2 years ago - Stars: 44 - Forks: 17

rafaelpierre/pyjaws

PyJaws: A Pythonic Way to Define Databricks Jobs and Workflows

Language: Python - Size: 3.46 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 43 - Forks: 4

G-Research/spark-dgraph-connector

A connector for Apache Spark and PySpark to Dgraph databases.

Language: Scala - Size: 1.44 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 43 - Forks: 12

ManuelGuerra1987/data-engineering-zoomcamp-notes

Detailed notes and homework from the 2025 Data Engineering Zoomcamp by DataTalks.Club

Language: Python - Size: 14.8 MB - Last synced at: 3 months ago - Pushed at: 4 months ago - Stars: 43 - Forks: 0

Wittline/apache-spark-docker

Dockerizing an Apache Spark Standalone Cluster

Language: VBA - Size: 63.7 MB - Last synced at: 3 months ago - Pushed at: almost 3 years ago - Stars: 43 - Forks: 27

airscholar/SparkingFlow

This project demonstrates how to use Apache Airflow to submit jobs to an Apache Spark cluster in different programming languages, using Python, Scala and Java as examples.

Language: Java - Size: 94.7 KB - Last synced at: 3 months ago - Pushed at: over 1 year ago - Stars: 42 - Forks: 27

kaantas/kafka-twitter-spark-streaming

Counting Tweets Per User in Real-Time

Language: Python - Size: 2.93 KB - Last synced at: about 22 hours ago - Pushed at: almost 8 years ago - Stars: 42 - Forks: 15

mrugankray/Big-Data-Cluster

The goal of this project is to build a docker cluster that gives access to Hadoop, HDFS, Hive, PySpark, Sqoop, Airflow, Kafka, Flume, Postgres, Cassandra, Hue, Zeppelin, Kadmin, Kafka Control Center and pgAdmin. This cluster is solely intended for usage in a development environment. Do not use it to run any production workloads.

Language: Shell - Size: 118 KB - Last synced at: about 1 year ago - Pushed at: over 2 years ago - Stars: 41 - Forks: 15

opentargets/gentropy

Open Targets Python framework for post-GWAS analysis

Language: Jupyter Notebook - Size: 31.9 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 39 - Forks: 11

MrPowers/ceja

PySpark phonetic and string matching algorithms

Language: Python - Size: 32.2 KB - Last synced at: 5 days ago - Pushed at: over 1 year ago - Stars: 39 - Forks: 6

godfanmiao/ML-Kaggle-Github-2022

Complete data and open-source code for the book "Python Machine Learning and Practice: The Road to Kaggle Competitions from Scratch (2022 Edition)"

Language: Jupyter Notebook - Size: 168 MB - Last synced at: 2 months ago - Pushed at: over 2 years ago - Stars: 39 - Forks: 11

jubins/Spark-And-MLlib-Projects

This repository contains Spark, MLlib, PySpark and Dataframes projects

Language: Jupyter Notebook - Size: 101 KB - Last synced at: over 1 year ago - Pushed at: over 7 years ago - Stars: 39 - Forks: 97

Wh1isper/sparglim

Sparglim✨ makes PySpark App Configurable and Deploy Spark Connect Server Easier!

Language: Python - Size: 151 KB - Last synced at: 5 days ago - Pushed at: 4 months ago - Stars: 37 - Forks: 4

awslabs/amazon-emr-vscode-toolkit

A VS Code Extension to make it easier to manage and develop Spark jobs on EMR

Language: TypeScript - Size: 907 KB - Last synced at: about 1 month ago - Pushed at: 4 months ago - Stars: 37 - Forks: 5

garystafford/emr-demo

Project files for the post: Running PySpark Applications on Amazon EMR: Methods for Interacting with PySpark on Amazon Elastic MapReduce.

Language: Python - Size: 691 KB - Last synced at: almost 2 years ago - Pushed at: almost 3 years ago - Stars: 37 - Forks: 17

san089/Cloudera_Material

Cloudera_Material: Study Material to help people preparing for Cloudera CCA Spark and Hadoop Developer Exam (CCA175). Feel free to collaborate.

Size: 9.02 MB - Last synced at: 3 months ago - Pushed at: about 5 years ago - Stars: 37 - Forks: 30

gmrqs/lasagna

A Docker Compose template that builds an interactive development environment for PySpark with Jupyter Lab, MinIO as object storage, Hive Metastore, Trino and Kafka

Language: Jupyter Notebook - Size: 11.7 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 36 - Forks: 11

josephmachado/docker_for_data_engineers

Code for blog at: https://www.startdataengineering.com/post/docker-for-de/

Language: C - Size: 561 KB - Last synced at: 2 months ago - Pushed at: about 1 year ago - Stars: 36 - Forks: 15

asifahmed90/pyspark-ML-in-Colab

PySpark in Google Colab: A simple machine learning (Linear Regression) model

Language: Jupyter Notebook - Size: 93.8 KB - Last synced at: 19 days ago - Pushed at: about 6 years ago - Stars: 36 - Forks: 98

basin-etl/basin

Basin is a visual programming editor for building Spark and PySpark pipelines. Easily build, debug, and deploy complex ETL pipelines from your browser

Language: TypeScript - Size: 7.08 MB - Last synced at: 2 months ago - Pushed at: over 2 years ago - Stars: 35 - Forks: 3

mjhea0/flask-spark-docker

Just a boilerplate for PySpark and Flask

Language: Python - Size: 206 KB - Last synced at: 3 months ago - Pushed at: almost 7 years ago - Stars: 35 - Forks: 18

Spratiher9/SparkDataset

Instant search for and access to many datasets in PySpark.

Language: Jupyter Notebook - Size: 14.9 MB - Last synced at: 18 days ago - Pushed at: over 2 years ago - Stars: 34 - Forks: 8

AaronYang2333/DSCI_553

USC 2020 Spring DSCI 553 (Foundations and Applications of Data Mining). Score: 94

Language: ReScript - Size: 265 MB - Last synced at: over 2 years ago - Pushed at: over 4 years ago - Stars: 34 - Forks: 21

razevedo1994/razv-data-engineering

Portfolio of projects and studies conducted in data engineering.

Language: Jupyter Notebook - Size: 70.2 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 33 - Forks: 3

DataEngineering-LATAM/Spark-StudyClub

Apache Spark study group organized by the Data Engineering Latam community

Language: Jupyter Notebook - Size: 4.83 MB - Last synced at: about 1 year ago - Pushed at: over 1 year ago - Stars: 33 - Forks: 24

feng-li/dlsa

Distributed least squares approximation (dlsa) implemented with Apache Spark

Language: Python - Size: 276 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 33 - Forks: 46

garystafford/pyspark-setup-demo

Demo of PySpark and Jupyter Notebook with the Jupyter Docker Stacks

Language: Jupyter Notebook - Size: 3.07 MB - Last synced at: almost 2 years ago - Pushed at: over 4 years ago - Stars: 33 - Forks: 30

mikulskibartosz/check-engine 📦

Data validation library for PySpark 3.0.0

Language: Python - Size: 94.7 KB - Last synced at: almost 2 years ago - Pushed at: over 2 years ago - Stars: 32 - Forks: 4

lbdeoliveira/song-playlist-recommendation

This project was a joint effort by Lucas De Oliveira, Chandrish Ambati, and Anish Mukherjee to create song and playlist embeddings for recommendations in a distributed fashion using a 1M playlist dataset by Spotify.

Language: HTML - Size: 225 KB - Last synced at: about 2 years ago - Pushed at: about 3 years ago - Stars: 32 - Forks: 12

ksindi/kafka-compose

Docker Compose files for various Kafka stacks

Language: Python - Size: 24.4 KB - Last synced at: 7 days ago - Pushed at: over 7 years ago - Stars: 32 - Forks: 10

mikeroyal/Apache-Spark-Guide

Apache Spark Guide

Language: Python - Size: 237 KB - Last synced at: 8 days ago - Pushed at: over 3 years ago - Stars: 31 - Forks: 11

datamole-ai/pysparkdt

An open-source Python library for simplifying local testing of Databricks workflows that use PySpark and Delta tables.

Language: Python - Size: 64.5 KB - Last synced at: 17 days ago - Pushed at: 17 days ago - Stars: 30 - Forks: 0

astrolabsoftware/spark3D

Spark extension for processing large-scale 3D data sets: Astrophysics, High Energy Physics, Meteorology, …

Language: Scala - Size: 24.8 MB - Last synced at: 13 days ago - Pushed at: about 2 years ago - Stars: 30 - Forks: 16

ykursadkaya/pyspark-Docker

PySpark in Docker Containers

Language: Dockerfile - Size: 16.6 KB - Last synced at: about 1 year ago - Pushed at: about 3 years ago - Stars: 30 - Forks: 7

abhishek-ch/streamlit-healthcare-ML-App

Streamlit example showing scikit-learn and PySpark ML on healthcare data. It's simple!

Language: Python - Size: 3.05 MB - Last synced at: 24 days ago - Pushed at: over 4 years ago - Stars: 30 - Forks: 11

cerndb/SparkDLTrigger

Code and links to the data for the article "Machine Learning Pipelines with Modern Big Data Tools for High Energy Physics"

Language: Jupyter Notebook - Size: 11.9 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 29 - Forks: 12

isarn/isarn-sketches-spark

Routines and data structures for using isarn-sketches idiomatically in Apache Spark

Language: Scala - Size: 1.33 MB - Last synced at: 3 months ago - Pushed at: about 1 year ago - Stars: 29 - Forks: 12

Graphlet-AI/graphlet

PyPi module for Graphlet AI Knowledge Graph Factory

Language: Python - Size: 20.4 MB - Last synced at: 7 days ago - Pushed at: about 2 years ago - Stars: 29 - Forks: 1

urigoren/decorators4DS

Useful decorators every Data Scientist should know

Language: Python - Size: 118 KB - Last synced at: 3 months ago - Pushed at: over 2 years ago - Stars: 29 - Forks: 5

roshankoirala/pySpark_tutorial

Implementation of Spark code in Jupyter notebook. Topics include: RDDs and DataFrame, exploratory data analysis (EDA), handling multiple DataFrames, visualization, Machine Learning

Language: Jupyter Notebook - Size: 202 KB - Last synced at: 3 months ago - Pushed at: almost 5 years ago - Stars: 29 - Forks: 26

nikhitmago/lookalike-modelling

Finding customer lookalikes using Machine Learning in PySpark

Language: Python - Size: 17.6 KB - Last synced at: over 2 years ago - Pushed at: almost 7 years ago - Stars: 29 - Forks: 18

kitestring/DataQuest

Data Science Massive Open Online Course: All the code, notes and supplementary materials generated during the course of my data scientific learning.

Language: Jupyter Notebook - Size: 71.5 MB - Last synced at: 14 days ago - Pushed at: about 7 years ago - Stars: 29 - Forks: 15

jldbc/gutenberg

A content-based recommender system for books using the Project Gutenberg text corpus

Language: Python - Size: 13.2 MB - Last synced at: 22 days ago - Pushed at: over 8 years ago - Stars: 29 - Forks: 12

guidok91/spark-movies-etl

Spark data pipeline that processes movie ratings data.

Language: Python - Size: 3.81 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 28 - Forks: 12

mozilla/python_mozetl

ETL jobs for Firefox Telemetry

Language: Python - Size: 2.69 MB - Last synced at: 7 days ago - Pushed at: 2 months ago - Stars: 28 - Forks: 28

jihyeonseong/ESG-AI-investment-by-streamlit

ESG-investment AI

Language: Jupyter Notebook - Size: 32.6 MB - Last synced at: 3 months ago - Pushed at: 8 months ago - Stars: 28 - Forks: 7

alexott/spark-playground

Playing with different Apache Spark packages

Language: Scala - Size: 48.8 KB - Last synced at: 2 months ago - Pushed at: about 1 year ago - Stars: 28 - Forks: 13

sbl-sdsc/mmtf-genomics

Methods for mapping genomic data onto 3D protein structure.

Language: Jupyter Notebook - Size: 6.17 MB - Last synced at: 2 months ago - Pushed at: about 3 years ago - Stars: 28 - Forks: 10

datayoga-io/lineage

Generate beautiful documentation for your data pipelines in markdown format

Language: TypeScript - Size: 461 KB - Last synced at: about 2 hours ago - Pushed at: over 3 years ago - Stars: 28 - Forks: 0

SemyonSinchenko/flake8-pyspark-with-column

A flake8 plugin that detects usage of withColumn in a loop or inside reduce

Language: Python - Size: 169 KB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 27 - Forks: 1

godfanmiao/PyAI-Github-2024

Complete data and open-source code for the book "Python Artificial Intelligence Programming Practice (2024 Edition)"

Language: Jupyter Notebook - Size: 182 MB - Last synced at: 2 months ago - Pushed at: 7 months ago - Stars: 27 - Forks: 6

Thanaraklee/Real-Time-PySpark

This project introduces PySpark, a powerful open-source framework for distributed data processing. We explore its architecture, components, and applications for real-time data analysis.

Language: Python - Size: 329 MB - Last synced at: 3 months ago - Pushed at: 9 months ago - Stars: 27 - Forks: 13

jmcurbelo/pyspark-ingenieria-de-datos

This repository contains the material for the Udemy course "Big Data y Spark: ingeniería de datos con Python y pyspark". In this course, you will learn to use the tools and techniques needed to work with large datasets using the pyspark library.

Language: Python - Size: 1.56 MB - Last synced at: 3 months ago - Pushed at: about 1 year ago - Stars: 27 - Forks: 46