pyspark | Topic | Ecosyste.ms: Repos

Topic: "pyspark"

ibis-project/ibis

the portable Python dataframe library

Language: Python - Size: 176 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 5,847 - Forks: 648

microsoft/SynapseML

Simple and Distributed Machine Learning

Language: Scala - Size: 157 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 5,139 - Forks: 845

JohnSnowLabs/spark-nlp

State of the Art Natural Language Processing

Language: Scala - Size: 3.4 GB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 3,993 - Forks: 725

apache/linkis

Apache Linkis builds a computation middleware layer to facilitate connection, governance and orchestration between the upper applications and the underlying data engines.

Language: Java - Size: 87.2 MB - Last synced at: 4 days ago - Pushed at: 25 days ago - Stars: 3,374 - Forks: 1,166

AlexIoannides/pyspark-example-project

Implementing best practices for PySpark ETL jobs and applications.

Language: Python - Size: 769 KB - Last synced at: 24 days ago - Pushed at: over 2 years ago - Stars: 1,923 - Forks: 763

Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as Tensorflow, Pytorch, and PySpark and can be used from pure Python code.

Language: Python - Size: 2.69 MB - Last synced at: 13 days ago - Pushed at: over 1 year ago - Stars: 1,840 - Forks: 280

awesome-spark/awesome-spark

A curated list of awesome Apache Spark packages and resources.

Language: Shell - Size: 231 KB - Last synced at: 1 day ago - Pushed at: 8 months ago - Stars: 1,802 - Forks: 340

jadianes/spark-py-notebooks

Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks

Language: Jupyter Notebook - Size: 2.2 MB - Last synced at: 24 days ago - Pushed at: over 1 year ago - Stars: 1,652 - Forks: 917

hi-primus/optimus

🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark

Language: Python - Size: 110 MB - Last synced at: 2 days ago - Pushed at: 7 months ago - Stars: 1,513 - Forks: 232

ptyadana/SQL-Data-Analysis-and-Visualization-Projects

SQL data analysis & visualization projects using MySQL, PostgreSQL, SQLite, Tableau, Apache Spark and pySpark.

Language: Jupyter Notebook - Size: 35.1 MB - Last synced at: about 1 month ago - Pushed at: almost 3 years ago - Stars: 1,436 - Forks: 553

jupyter-incubator/sparkmagic

Jupyter magics and kernels for working with remote Spark clusters

Language: Python - Size: 2.68 MB - Last synced at: 25 days ago - Pushed at: 25 days ago - Stars: 1,355 - Forks: 452

logicalclocks/hopsworks

Hopsworks - Data-Intensive AI platform with a Feature Store

Language: Java - Size: 152 MB - Last synced at: 25 days ago - Pushed at: 4 months ago - Stars: 1,225 - Forks: 151

mahmoudparsian/pyspark-tutorial

PySpark-Tutorial provides basic algorithms using PySpark

Language: Jupyter Notebook - Size: 8.96 MB - Last synced at: 27 days ago - Pushed at: 27 days ago - Stars: 1,223 - Forks: 478

narwhals-dev/narwhals

Lightweight and extensible compatibility layer between dataframe libraries!

Language: Python - Size: 10.6 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 1,125 - Forks: 149

mahmoudparsian/data-algorithms-book

MapReduce, Spark, Java, and Scala for Data Algorithms Book

Language: Java - Size: 397 MB - Last synced at: about 1 month ago - Pushed at: 8 months ago - Stars: 1,075 - Forks: 661

h2oai/sparkling-water

Sparkling Water provides H2O functionality inside Spark cluster

Language: Scala - Size: 46.8 MB - Last synced at: 1 day ago - Pushed at: 7 months ago - Stars: 973 - Forks: 360

WeBankFinTech/Scriptis

Scriptis is for interactive data analysis with script development (SQL, PySpark, HiveQL), task submission (Spark, Hive), UDF, function, resource management and intelligent diagnosis.

Language: Vue - Size: 50.3 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 808 - Forks: 266

lyhue1991/eat_pyspark_in_10_days

pyspark 🍒🥭 is delicious, just eat it! 😋😋

Language: Python - Size: 4.66 MB - Last synced at: 29 days ago - Pushed at: over 2 years ago - Stars: 802 - Forks: 217

HariSekhon/DevOps-Python-tools

80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Functions, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.

Language: Python - Size: 3.11 MB - Last synced at: 9 days ago - Pushed at: about 2 months ago - Stars: 798 - Forks: 348

kuwala-io/kuwala

Kuwala is the no-code data platform for BI analysts and engineers, enabling you to build powerful analytics workflows. We set out to bring the state-of-the-art data engineering tools you love, such as Airbyte, dbt, or Great Expectations, together in one intuitive interface built with React Flow. In addition, we provide third-party data for data science models and products, with a focus on geospatial data. Currently, the following data connectors are available worldwide: a) high-resolution demographics data, b) points of interest from OpenStreetMap, c) Google Popular Times.

Language: JavaScript - Size: 7.79 MB - Last synced at: 3 months ago - Pushed at: almost 3 years ago - Stars: 792 - Forks: 54

lakehq/sail

LakeSail's computation framework with a mission to unify batch processing, stream processing, and compute-intensive AI workloads.

Language: Rust - Size: 3.38 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 789 - Forks: 38

MrPowers/chispa

PySpark test helper methods with beautiful error messages

Language: Python - Size: 3.63 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 697 - Forks: 72

ankurchavda/SparkLearning

A comprehensive Spark guide collated from multiple sources that can be used to learn more about Spark or as an interview refresher.

Size: 1.31 MB - Last synced at: 3 days ago - Pushed at: about 3 years ago - Stars: 673 - Forks: 78

mrpowers-io/quinn

pyspark methods to enhance developer productivity 📣 👯 🎉

Language: Python - Size: 1.98 MB - Last synced at: 4 days ago - Pushed at: 4 months ago - Stars: 672 - Forks: 98

Nike-Inc/koheesio

Python framework for building efficient data pipelines. It promotes modularity and collaboration, enabling the creation of complex pipelines from simple, reusable components.

Language: Python - Size: 7.96 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 638 - Forks: 36

capitalone/datacompy

Pandas, Polars, Spark, and Snowpark DataFrame comparison for humans and more!

Language: Python - Size: 11.6 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 576 - Forks: 143

kevinschaich/pyspark-cheatsheet

🐍 Quick reference guide to common patterns & functions in PySpark.

Size: 49.8 KB - Last synced at: 2 months ago - Pushed at: over 2 years ago - Stars: 519 - Forks: 167

ericxiao251/spark-syntax

This is a repo documenting the best practices in PySpark.

Language: Jupyter Notebook - Size: 4.7 MB - Last synced at: 3 months ago - Pushed at: over 2 years ago - Stars: 462 - Forks: 77

cartershanklin/pyspark-cheatsheet

PySpark Cheat Sheet - example code to help you learn PySpark and develop apps faster

Language: Python - Size: 16.8 MB - Last synced at: 3 months ago - Pushed at: 8 months ago - Stars: 451 - Forks: 220

commoncrawl/cc-pyspark

Process Common Crawl data with Python and Spark

Language: Python - Size: 169 KB - Last synced at: 10 days ago - Pushed at: 26 days ago - Stars: 432 - Forks: 89

databrickslabs/dbldatagen

Generate relevant synthetic data quickly for your projects. The Databricks Labs synthetic data generator (aka `dbldatagen`) may be used to generate large simulated / synthetic data sets for test, POCs, and other uses in Databricks environments including in Delta Live Tables pipelines

Language: Python - Size: 11.1 MB - Last synced at: 13 days ago - Pushed at: about 1 month ago - Stars: 407 - Forks: 74

ekampf/PySpark-Boilerplate

A boilerplate for writing PySpark Jobs

Language: Python - Size: 10.7 KB - Last synced at: 28 days ago - Pushed at: over 1 year ago - Stars: 396 - Forks: 155

CamDavidsonPilon/tdigest

t-Digest data structure in Python. Useful for percentiles and quantiles, including distributed environments like PySpark

Language: Python - Size: 91.8 KB - Last synced at: about 1 month ago - Pushed at: about 2 years ago - Stars: 395 - Forks: 54

cluster-apps-on-docker/spark-standalone-cluster-on-docker

Learn Apache Spark in Scala, Python (PySpark) and R (SparkR) by building your own cluster with a JupyterLab interface on Docker. ⚡

Language: Jupyter Notebook - Size: 419 KB - Last synced at: over 1 year ago - Pushed at: over 2 years ago - Stars: 381 - Forks: 181

awesome-spark/spark-gotchas 📦

Spark Gotchas. A subjective compilation of Apache Spark tips and tricks

Size: 188 KB - Last synced at: about 1 month ago - Pushed at: about 8 years ago - Stars: 363 - Forks: 80

huseinzol05/Gather-Deployment 📦

Gathers Python deployment, infrastructure and practices.

Language: Jupyter Notebook - Size: 146 MB - Last synced at: 3 months ago - Pushed at: almost 2 years ago - Stars: 352 - Forks: 99

tirthajyoti/Spark-with-Python

Fundamentals of Spark with Python (using PySpark), code examples

Language: Jupyter Notebook - Size: 8.97 MB - Last synced at: 28 days ago - Pushed at: over 2 years ago - Stars: 347 - Forks: 271

sb-ai-lab/RePlay

A Comprehensive Framework for Building End-to-End Recommendation Systems with State-of-the-Art Models

Language: Python - Size: 36.1 MB - Last synced at: 24 days ago - Pushed at: 26 days ago - Stars: 337 - Forks: 34

MrPowers/mack

Delta Lake helper methods in PySpark

Language: Python - Size: 2.81 MB - Last synced at: 30 days ago - Pushed at: 10 months ago - Stars: 325 - Forks: 42

josephmachado/efficient_data_processing_spark

Code for "Efficient Data Processing in Spark" Course

Language: Python - Size: 23.9 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 304 - Forks: 63

quintoandar/butterfree

A tool for building feature stores.

Language: Python - Size: 1.49 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 303 - Forks: 37

apache/incubator-graphar

An open source, standard data file format for graph data storage and retrieval.

Language: C++ - Size: 17.1 MB - Last synced at: 6 days ago - Pushed at: 12 days ago - Stars: 281 - Forks: 69

jelmerk/hnswlib

Java library for approximate nearest neighbors search using Hierarchical Navigable Small World graphs

Language: Java - Size: 1.23 MB - Last synced at: 24 days ago - Pushed at: 3 months ago - Stars: 274 - Forks: 56

XD-DENG/Spark-practice

Apache Spark (PySpark) Practice on Real Data

Language: Jupyter Notebook - Size: 13 MB - Last synced at: about 1 month ago - Pushed at: over 5 years ago - Stars: 274 - Forks: 136

Morphl-AI/MorphL-Community-Edition

MorphL Community Edition uses big data and machine learning to predict user behaviors in digital products and services with the end goal of increasing KPIs (click-through rates, conversion rates, etc.) through personalization

Language: Python - Size: 143 KB - Last synced at: about 12 hours ago - Pushed at: over 5 years ago - Stars: 262 - Forks: 29

paypal/gimel

Big Data Processing Framework - Unified Data API or SQL on Any Storage

Language: Scala - Size: 62.1 MB - Last synced at: 30 days ago - Pushed at: 6 months ago - Stars: 245 - Forks: 82

runawayhorse001/LearningApacheSpark

LearningApacheSpark

Language: Python - Size: 82.5 MB - Last synced at: over 1 year ago - Pushed at: over 2 years ago - Stars: 231 - Forks: 112

G-Research/spark-extension

A library that provides useful extensions to Apache Spark and PySpark.

Language: Scala - Size: 1.1 MB - Last synced at: 14 days ago - Pushed at: 3 months ago - Stars: 224 - Forks: 28

titicaca/spark-iforest

Isolation Forest on Spark

Language: Scala - Size: 74.2 KB - Last synced at: over 1 year ago - Pushed at: over 2 years ago - Stars: 221 - Forks: 91

mahmoudparsian/data-algorithms-with-spark

O'Reilly Book: [Data Algorithms with Spark] by Mahmoud Parsian

Language: Python - Size: 44.9 MB - Last synced at: about 1 month ago - Pushed at: almost 2 years ago - Stars: 215 - Forks: 93

Azure/azure-cosmosdb-spark 📦

Apache Spark Connector for Azure Cosmos DB

Size: 192 MB - Last synced at: 6 days ago - Pushed at: 3 months ago - Stars: 203 - Forks: 120

jonesberg/DataAnalysisWithPythonAndPySpark

Code repository for the "PySpark in Action" book

Language: Python - Size: 9.77 KB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 201 - Forks: 121

canimus/cuallee

Possibly the fastest DataFrame-agnostic quality check library in town.

Language: Python - Size: 2.29 MB - Last synced at: 6 days ago - Pushed at: 13 days ago - Stars: 195 - Forks: 20

dvgodoy/handyspark

HandySpark - bringing pandas-like capabilities to Spark dataframes

Language: Jupyter Notebook - Size: 1.68 MB - Last synced at: 28 days ago - Pushed at: about 6 years ago - Stars: 193 - Forks: 26

davidzajac1/zillacode

Open Source LeetCode for PySpark, Spark, Pandas and DBT/Snowflake

Language: Python - Size: 2.68 MB - Last synced at: 2 days ago - Pushed at: 4 months ago - Stars: 191 - Forks: 41

databrickslabs/automl-toolkit 📦

Toolkit for Apache Spark ML for Feature clean-up, Feature Importance calculation suite, Information Gain selection, Distributed SMOTE, Model selection and training, Hyperparameter optimization and selection, Model interpretability.

Language: HTML - Size: 158 MB - Last synced at: 3 months ago - Pushed at: about 4 years ago - Stars: 191 - Forks: 44

josephmachado/data_engineering_best_practices

Sample project to demonstrate data engineering best practices

Language: Python - Size: 644 KB - Last synced at: 2 months ago - Pushed at: over 1 year ago - Stars: 185 - Forks: 32

epam/OSCI

Open Source Contributor Index

Language: Python - Size: 1.05 MB - Last synced at: 13 days ago - Pushed at: 4 months ago - Stars: 169 - Forks: 107

mahmoudparsian/big-data-mapreduce-course

Big Data Modeling, MapReduce, Spark, PySpark @ Santa Clara University

Language: HTML - Size: 601 MB - Last synced at: 22 days ago - Pushed at: 7 months ago - Stars: 158 - Forks: 143

Fryingpannn/WallStreetBets_BigDataAnalysis

Research project aiming to classify the best stock research posts from r/WallStreetBets for you. 😏

Language: Jupyter Notebook - Size: 34.6 MB - Last synced at: over 1 year ago - Pushed at: almost 2 years ago - Stars: 156 - Forks: 39

awantik/pyspark-learning

Updated repository

Language: Jupyter Notebook - Size: 67.2 MB - Last synced at: about 1 year ago - Pushed at: over 3 years ago - Stars: 156 - Forks: 125

archivesunleashed/aut

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.

Language: Scala - Size: 39.5 MB - Last synced at: 11 days ago - Pushed at: over 1 year ago - Stars: 144 - Forks: 33

RubensZimbres/Repo-2019

BERT, AWS RDS, AWS Forecast, EMR Spark Cluster, Hive, Serverless, Google Assistant + Raspberry Pi, Infrared, Google Cloud Platform Natural Language, Anomaly detection, Tensorflow, Mathematics

Language: Jupyter Notebook - Size: 57.8 MB - Last synced at: 3 months ago - Pushed at: almost 4 years ago - Stars: 138 - Forks: 73

alanchn31/Movalytics-Data-Warehouse

Data pipeline performing ETL to AWS Redshift using Spark, orchestrated with Apache Airflow

Language: Python - Size: 717 KB - Last synced at: 7 months ago - Pushed at: about 5 years ago - Stars: 133 - Forks: 31

GoogleCloudPlatform/dataproc-templates

Dataproc templates and pipelines for solving in-cloud data tasks

Language: Python - Size: 18.6 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 129 - Forks: 100

kavgan/phrase-at-scale

Detect common phrases in large amounts of text using a data-driven approach. Size of discovered phrases can be arbitrary. Can be used in languages other than English

Language: Python - Size: 80.6 MB - Last synced at: 3 months ago - Pushed at: almost 6 years ago - Stars: 128 - Forks: 45

LearningJournal/Spark-Streaming-In-Python

Apache Spark 3 - Structured Streaming Course Material

Language: Python - Size: 19.4 MB - Last synced at: 2 months ago - Pushed at: almost 2 years ago - Stars: 121 - Forks: 159

zero323/pyspark-stubs 📦

Apache (Py)Spark type annotations (stub files).

Language: Python - Size: 1.3 MB - Last synced at: 23 days ago - Pushed at: almost 3 years ago - Stars: 117 - Forks: 37

edyoda/pyspark-tutorial

PySpark Code for Hands-on Learners

Language: Jupyter Notebook - Size: 53.3 MB - Last synced at: 11 days ago - Pushed at: over 5 years ago - Stars: 116 - Forks: 120

harishpuvvada/BitCoin-Value-Predictor 📦

[NOT MAINTAINED] Predicting Bitcoin price using time series analysis and sentiment analysis of tweets about Bitcoin

Language: Jupyter Notebook - Size: 3.07 MB - Last synced at: over 1 year ago - Pushed at: over 5 years ago - Stars: 113 - Forks: 29

microsoft/Azure-Databricks-NYC-Taxi-Workshop

An Azure Databricks workshop leveraging the New York Taxi and Limousine Commission Trip Records dataset

Language: Scala - Size: 42.3 MB - Last synced at: 2 days ago - Pushed at: about 2 years ago - Stars: 109 - Forks: 109

andfanilo/pyspark-tutorial

Jupyter notebooks for pyspark tutorials given at University

Language: Jupyter Notebook - Size: 4.52 MB - Last synced at: 21 days ago - Pushed at: 7 months ago - Stars: 107 - Forks: 75

wadhwasahil/Relation_Extraction

Relation Extraction using deep learning (CNN)

Language: Python - Size: 4.89 MB - Last synced at: over 1 year ago - Pushed at: over 8 years ago - Stars: 99 - Forks: 53

minio/spark-select

A library for Spark DataFrame using MinIO Select API

Language: Scala - Size: 65.4 KB - Last synced at: 2 days ago - Pushed at: over 5 years ago - Stars: 98 - Forks: 19

mitchelllisle/sparkdantic

✨ A Pydantic to PySpark schema library

Language: Python - Size: 3.26 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 94 - Forks: 15

adornes/spark_python_ml_examples

Spark 2.0 Python Machine Learning examples

Language: Python - Size: 13.7 KB - Last synced at: about 2 years ago - Pushed at: over 5 years ago - Stars: 93 - Forks: 42

itsjafer/jupyterlab-sparkmonitor

JupyterLab extension that enables monitoring launched Apache Spark jobs from within a notebook

Language: JavaScript - Size: 4.08 MB - Last synced at: 7 days ago - Pushed at: over 2 years ago - Stars: 92 - Forks: 23

hyunjoonbok/PySpark

PySpark functions and utilities with examples. Assists ETL process of data modeling

Language: Jupyter Notebook - Size: 3.79 MB - Last synced at: about 1 year ago - Pushed at: over 4 years ago - Stars: 89 - Forks: 73

mahmoudparsian/pyspark-algorithms

PySpark Algorithms Book: https://www.amazon.com/dp/B07X4B2218/ref=sr_1_2

Language: Python - Size: 40.5 MB - Last synced at: 3 months ago - Pushed at: over 5 years ago - Stars: 84 - Forks: 44

awesome-spark/learn-by-examples 📦

Real-world Spark pipeline examples

Language: Scala - Size: 1.1 MB - Last synced at: 10 days ago - Pushed at: over 7 years ago - Stars: 83 - Forks: 30

sabman/PySparkGeoAnalysis

🌐 Interactive Workshop on GeoAnalysis using PySpark

Language: Jupyter Notebook - Size: 18.5 MB - Last synced at: 6 days ago - Pushed at: over 8 years ago - Stars: 79 - Forks: 34

groda/big_data

Big Data essentials: Hadoop, MapReduce, Spark. Explore tutorials and demos in Jupyter notebooks—most are self-contained and live, ready to run with a click.

Language: Jupyter Notebook - Size: 51.9 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 78 - Forks: 27

anovos/anovos

Anovos - An Open Source Library for Scalable Feature Engineering Using Apache Spark

Language: Jupyter Notebook - Size: 88.5 MB - Last synced at: 21 days ago - Pushed at: about 2 years ago - Stars: 76 - Forks: 25

myamafuj/hadoop-hive-spark-docker

Hadoop-Hive-Spark cluster + Jupyter on Docker

Language: Jupyter Notebook - Size: 75.2 KB - Last synced at: about 2 months ago - Pushed at: 6 months ago - Stars: 73 - Forks: 52

martandsingh/ApacheSpark

This repository helps you learn Databricks concepts through examples. It covers the important topics a data engineer needs in real-world work. We use PySpark and Spark SQL for development. At the end of the course, we also cover a few case studies.

Language: Python - Size: 141 KB - Last synced at: over 2 years ago - Pushed at: over 2 years ago - Stars: 71 - Forks: 47

src-d/jgit-spark-connector

jgit-spark-connector is a library for running scalable data retrieval pipelines that process any number of Git repositories for source code analysis.

Language: Scala - Size: 8.55 MB - Last synced at: 24 days ago - Pushed at: over 6 years ago - Stars: 71 - Forks: 32

yugokato/Spark-and-Kafka_IoT-Data-Processing-and-Analytics

Final Project for IoT: Big Data Processing and Analytics class. Analyzing U.S. nationwide temperature from IoT sensors in real time

Language: Python - Size: 159 KB - Last synced at: 3 months ago - Pushed at: over 8 years ago - Stars: 70 - Forks: 27

anguenot/pyspark-cassandra

pyspark-cassandra is a Python port of the awesome @datastax Spark Cassandra connector. Compatible w/ Spark 2.0, 2.1, 2.2, 2.3 and 2.4

Language: Python - Size: 8.25 MB - Last synced at: almost 2 years ago - Pushed at: over 2 years ago - Stars: 69 - Forks: 28

zaratsian/Spark

Apache Spark (Scala, PySpark, SparkR) Code, Tricks, and References

Language: Jupyter Notebook - Size: 8.37 MB - Last synced at: 2 months ago - Pushed at: over 6 years ago - Stars: 69 - Forks: 37

sbl-sdsc/mmtf-pyspark

Methods for the parallel and distributed analysis and mining of the Protein Data Bank using MMTF and Apache Spark.

Language: Python - Size: 524 MB - Last synced at: 4 days ago - Pushed at: about 2 years ago - Stars: 68 - Forks: 27

sberbank-ai-lab/RePlay 📦

RecSys Library

Language: Python - Size: 12.7 MB - Last synced at: 7 months ago - Pushed at: about 3 years ago - Stars: 68 - Forks: 6