Topic: "pyspark"
ibis-project/ibis
the portable Python dataframe library
Language: Python - Size: 176 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 5,847 - Forks: 648

microsoft/SynapseML
Simple and Distributed Machine Learning
Language: Scala - Size: 157 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 5,139 - Forks: 845

JohnSnowLabs/spark-nlp
State of the Art Natural Language Processing
Language: Scala - Size: 3.4 GB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 3,993 - Forks: 725

apache/linkis
Apache Linkis builds a computation middleware layer to facilitate connection, governance and orchestration between the upper applications and the underlying data engines.
Language: Java - Size: 87.2 MB - Last synced at: 4 days ago - Pushed at: 25 days ago - Stars: 3,374 - Forks: 1,166

AlexIoannides/pyspark-example-project
Implementing best practices for PySpark ETL jobs and applications.
Language: Python - Size: 769 KB - Last synced at: 24 days ago - Pushed at: over 2 years ago - Stars: 1,923 - Forks: 763

uber/petastorm
The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
Language: Python - Size: 2.69 MB - Last synced at: 13 days ago - Pushed at: over 1 year ago - Stars: 1,840 - Forks: 280

awesome-spark/awesome-spark
A curated list of awesome Apache Spark packages and resources.
Language: Shell - Size: 231 KB - Last synced at: 1 day ago - Pushed at: 8 months ago - Stars: 1,802 - Forks: 340

jadianes/spark-py-notebooks
Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks
Language: Jupyter Notebook - Size: 2.2 MB - Last synced at: 24 days ago - Pushed at: over 1 year ago - Stars: 1,652 - Forks: 917

hi-primus/optimus
🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark
Language: Python - Size: 110 MB - Last synced at: 2 days ago - Pushed at: 7 months ago - Stars: 1,513 - Forks: 232

ptyadana/SQL-Data-Analysis-and-Visualization-Projects
SQL data analysis & visualization projects using MySQL, PostgreSQL, SQLite, Tableau, Apache Spark and pySpark.
Language: Jupyter Notebook - Size: 35.1 MB - Last synced at: about 1 month ago - Pushed at: almost 3 years ago - Stars: 1,436 - Forks: 553

jupyter-incubator/sparkmagic
Jupyter magics and kernels for working with remote Spark clusters
Language: Python - Size: 2.68 MB - Last synced at: 25 days ago - Pushed at: 25 days ago - Stars: 1,355 - Forks: 452

logicalclocks/hopsworks
Hopsworks - Data-Intensive AI platform with a Feature Store
Language: Java - Size: 152 MB - Last synced at: 25 days ago - Pushed at: 4 months ago - Stars: 1,225 - Forks: 151

mahmoudparsian/pyspark-tutorial
PySpark-Tutorial provides basic algorithms using PySpark
Language: Jupyter Notebook - Size: 8.96 MB - Last synced at: 27 days ago - Pushed at: 27 days ago - Stars: 1,223 - Forks: 478

narwhals-dev/narwhals
Lightweight and extensible compatibility layer between dataframe libraries!
Language: Python - Size: 10.6 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 1,125 - Forks: 149

mahmoudparsian/data-algorithms-book
MapReduce, Spark, Java, and Scala for Data Algorithms Book
Language: Java - Size: 397 MB - Last synced at: about 1 month ago - Pushed at: 8 months ago - Stars: 1,075 - Forks: 661

h2oai/sparkling-water
Sparkling Water provides H2O functionality inside a Spark cluster
Language: Scala - Size: 46.8 MB - Last synced at: 1 day ago - Pushed at: 7 months ago - Stars: 973 - Forks: 360

WeBankFinTech/Scriptis
Scriptis is for interactive data analysis with script development (SQL, PySpark, HiveQL), task submission (Spark, Hive), UDF, function, resource management and intelligent diagnosis.
Language: Vue - Size: 50.3 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 808 - Forks: 266

lyhue1991/eat_pyspark_in_10_days
pyspark 🍒🥭 is delicious, just eat it! 😋😋
Language: Python - Size: 4.66 MB - Last synced at: 29 days ago - Pushed at: over 2 years ago - Stars: 802 - Forks: 217

HariSekhon/DevOps-Python-tools
80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Functions, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.
Language: Python - Size: 3.11 MB - Last synced at: 9 days ago - Pushed at: about 2 months ago - Stars: 798 - Forks: 348

kuwala-io/kuwala
Kuwala is the no-code data platform for BI analysts and engineers enabling you to build powerful analytics workflows. We set out to bring the state-of-the-art data engineering tools you love, such as Airbyte, dbt, or Great Expectations, together in one intuitive interface built with React Flow. In addition, we provide third-party data for data science models and products, with a focus on geospatial data. Currently, the following data connectors are available worldwide: a) High-resolution demographics data b) Points of Interest from OpenStreetMap c) Google Popular Times
Language: JavaScript - Size: 7.79 MB - Last synced at: 3 months ago - Pushed at: almost 3 years ago - Stars: 792 - Forks: 54

lakehq/sail
LakeSail's computation framework with a mission to unify batch processing, stream processing, and compute-intensive AI workloads.
Language: Rust - Size: 3.38 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 789 - Forks: 38

MrPowers/chispa
PySpark test helper methods with beautiful error messages
Language: Python - Size: 3.63 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 697 - Forks: 72

ankurchavda/SparkLearning
A comprehensive Spark guide collated from multiple sources that can be used to learn more about Spark or as an interview refresher.
Size: 1.31 MB - Last synced at: 3 days ago - Pushed at: about 3 years ago - Stars: 673 - Forks: 78

mrpowers-io/quinn
pyspark methods to enhance developer productivity 📣 👯 🎉
Language: Python - Size: 1.98 MB - Last synced at: 4 days ago - Pushed at: 4 months ago - Stars: 672 - Forks: 98

Nike-Inc/koheesio
Python framework for building efficient data pipelines. It promotes modularity and collaboration, enabling the creation of complex pipelines from simple, reusable components.
Language: Python - Size: 7.96 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 638 - Forks: 36

capitalone/datacompy
Pandas, Polars, Spark, and Snowpark DataFrame comparison for humans and more!
Language: Python - Size: 11.6 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 576 - Forks: 143

kevinschaich/pyspark-cheatsheet
🐍 Quick reference guide to common patterns & functions in PySpark.
Size: 49.8 KB - Last synced at: 2 months ago - Pushed at: over 2 years ago - Stars: 519 - Forks: 167

ericxiao251/spark-syntax
This is a repo documenting the best practices in PySpark.
Language: Jupyter Notebook - Size: 4.7 MB - Last synced at: 3 months ago - Pushed at: over 2 years ago - Stars: 462 - Forks: 77

cartershanklin/pyspark-cheatsheet
PySpark Cheat Sheet - example code to help you learn PySpark and develop apps faster
Language: Python - Size: 16.8 MB - Last synced at: 3 months ago - Pushed at: 8 months ago - Stars: 451 - Forks: 220

commoncrawl/cc-pyspark
Process Common Crawl data with Python and Spark
Language: Python - Size: 169 KB - Last synced at: 10 days ago - Pushed at: 26 days ago - Stars: 432 - Forks: 89

databrickslabs/dbldatagen
Generate relevant synthetic data quickly for your projects. The Databricks Labs synthetic data generator (aka `dbldatagen`) may be used to generate large simulated / synthetic data sets for test, POCs, and other uses in Databricks environments including in Delta Live Tables pipelines
Language: Python - Size: 11.1 MB - Last synced at: 13 days ago - Pushed at: about 1 month ago - Stars: 407 - Forks: 74

ekampf/PySpark-Boilerplate
A boilerplate for writing PySpark Jobs
Language: Python - Size: 10.7 KB - Last synced at: 28 days ago - Pushed at: over 1 year ago - Stars: 396 - Forks: 155

CamDavidsonPilon/tdigest
t-Digest data structure in Python. Useful for percentiles and quantiles, including in distributed environments like PySpark
Language: Python - Size: 91.8 KB - Last synced at: about 1 month ago - Pushed at: about 2 years ago - Stars: 395 - Forks: 54

cluster-apps-on-docker/spark-standalone-cluster-on-docker
Learn Apache Spark in Scala, Python (PySpark) and R (SparkR) by building your own cluster with a JupyterLab interface on Docker. ⚡
Language: Jupyter Notebook - Size: 419 KB - Last synced at: over 1 year ago - Pushed at: over 2 years ago - Stars: 381 - Forks: 181

awesome-spark/spark-gotchas 📦
Spark Gotchas. A subjective compilation of the Apache Spark tips and tricks
Size: 188 KB - Last synced at: about 1 month ago - Pushed at: about 8 years ago - Stars: 363 - Forks: 80

huseinzol05/Gather-Deployment 📦
Gathers Python deployment, infrastructure and practices.
Language: Jupyter Notebook - Size: 146 MB - Last synced at: 3 months ago - Pushed at: almost 2 years ago - Stars: 352 - Forks: 99

tirthajyoti/Spark-with-Python
Fundamentals of Spark with Python (using PySpark), code examples
Language: Jupyter Notebook - Size: 8.97 MB - Last synced at: 28 days ago - Pushed at: over 2 years ago - Stars: 347 - Forks: 271

sb-ai-lab/RePlay
A Comprehensive Framework for Building End-to-End Recommendation Systems with State-of-the-Art Models
Language: Python - Size: 36.1 MB - Last synced at: 24 days ago - Pushed at: 26 days ago - Stars: 337 - Forks: 34

MrPowers/mack
Delta Lake helper methods in PySpark
Language: Python - Size: 2.81 MB - Last synced at: 30 days ago - Pushed at: 10 months ago - Stars: 325 - Forks: 42

josephmachado/efficient_data_processing_spark
Code for "Efficient Data Processing in Spark" Course
Language: Python - Size: 23.9 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 304 - Forks: 63

quintoandar/butterfree
A tool for building feature stores.
Language: Python - Size: 1.49 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 303 - Forks: 37

apache/incubator-graphar
An open source, standard data file format for graph data storage and retrieval.
Language: C++ - Size: 17.1 MB - Last synced at: 6 days ago - Pushed at: 12 days ago - Stars: 281 - Forks: 69

jelmerk/hnswlib
Java library for approximate nearest neighbors search using Hierarchical Navigable Small World graphs
Language: Java - Size: 1.23 MB - Last synced at: 24 days ago - Pushed at: 3 months ago - Stars: 274 - Forks: 56

XD-DENG/Spark-practice
Apache Spark (PySpark) Practice on Real Data
Language: Jupyter Notebook - Size: 13 MB - Last synced at: about 1 month ago - Pushed at: over 5 years ago - Stars: 274 - Forks: 136

Morphl-AI/MorphL-Community-Edition
MorphL Community Edition uses big data and machine learning to predict user behaviors in digital products and services with the end goal of increasing KPIs (click-through rates, conversion rates, etc.) through personalization
Language: Python - Size: 143 KB - Last synced at: about 12 hours ago - Pushed at: over 5 years ago - Stars: 262 - Forks: 29

paypal/gimel
Big Data Processing Framework - Unified Data API or SQL on Any Storage
Language: Scala - Size: 62.1 MB - Last synced at: 30 days ago - Pushed at: 6 months ago - Stars: 245 - Forks: 82

runawayhorse001/LearningApacheSpark
LearningApacheSpark
Language: Python - Size: 82.5 MB - Last synced at: over 1 year ago - Pushed at: over 2 years ago - Stars: 231 - Forks: 112

G-Research/spark-extension
A library that provides useful extensions to Apache Spark and PySpark.
Language: Scala - Size: 1.1 MB - Last synced at: 14 days ago - Pushed at: 3 months ago - Stars: 224 - Forks: 28

titicaca/spark-iforest
Isolation Forest on Spark
Language: Scala - Size: 74.2 KB - Last synced at: over 1 year ago - Pushed at: over 2 years ago - Stars: 221 - Forks: 91

mahmoudparsian/data-algorithms-with-spark
O'Reilly Book: [Data Algorithms with Spark] by Mahmoud Parsian
Language: Python - Size: 44.9 MB - Last synced at: about 1 month ago - Pushed at: almost 2 years ago - Stars: 215 - Forks: 93

Azure/azure-cosmosdb-spark 📦
Apache Spark Connector for Azure Cosmos DB
Size: 192 MB - Last synced at: 6 days ago - Pushed at: 3 months ago - Stars: 203 - Forks: 120

jonesberg/DataAnalysisWithPythonAndPySpark
Code repository for the "PySpark in Action" book
Language: Python - Size: 9.77 KB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 201 - Forks: 121

canimus/cuallee
Possibly the fastest DataFrame-agnostic quality check library in town.
Language: Python - Size: 2.29 MB - Last synced at: 6 days ago - Pushed at: 13 days ago - Stars: 195 - Forks: 20

dvgodoy/handyspark
HandySpark - bringing pandas-like capabilities to Spark dataframes
Language: Jupyter Notebook - Size: 1.68 MB - Last synced at: 28 days ago - Pushed at: about 6 years ago - Stars: 193 - Forks: 26

davidzajac1/zillacode
Open Source LeetCode for PySpark, Spark, Pandas and DBT/Snowflake
Language: Python - Size: 2.68 MB - Last synced at: 2 days ago - Pushed at: 4 months ago - Stars: 191 - Forks: 41

databrickslabs/automl-toolkit 📦
Toolkit for Apache Spark ML for feature clean-up, feature importance calculation suite, information gain selection, distributed SMOTE, model selection and training, hyperparameter optimization and selection, and model interpretability.
Language: HTML - Size: 158 MB - Last synced at: 3 months ago - Pushed at: about 4 years ago - Stars: 191 - Forks: 44

josephmachado/data_engineering_best_practices
Sample project to demonstrate data engineering best practices
Language: Python - Size: 644 KB - Last synced at: 2 months ago - Pushed at: over 1 year ago - Stars: 185 - Forks: 32

epam/OSCI
Open Source Contributor Index
Language: Python - Size: 1.05 MB - Last synced at: 13 days ago - Pushed at: 4 months ago - Stars: 169 - Forks: 107

mahmoudparsian/big-data-mapreduce-course
Big Data Modeling, MapReduce, Spark, PySpark @ Santa Clara University
Language: HTML - Size: 601 MB - Last synced at: 22 days ago - Pushed at: 7 months ago - Stars: 158 - Forks: 143

Fryingpannn/WallStreetBets_BigDataAnalysis
Research project aimed to classify the best stock research posts from r/WallStreetBets for you. 😏
Language: Jupyter Notebook - Size: 34.6 MB - Last synced at: over 1 year ago - Pushed at: almost 2 years ago - Stars: 156 - Forks: 39

awantik/pyspark-learning
Updated repository
Language: Jupyter Notebook - Size: 67.2 MB - Last synced at: about 1 year ago - Pushed at: over 3 years ago - Stars: 156 - Forks: 125

archivesunleashed/aut
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
Language: Scala - Size: 39.5 MB - Last synced at: 11 days ago - Pushed at: over 1 year ago - Stars: 144 - Forks: 33

RubensZimbres/Repo-2019
BERT, AWS RDS, AWS Forecast, EMR Spark Cluster, Hive, Serverless, Google Assistant + Raspberry Pi, Infrared, Google Cloud Platform Natural Language, Anomaly detection, Tensorflow, Mathematics
Language: Jupyter Notebook - Size: 57.8 MB - Last synced at: 3 months ago - Pushed at: almost 4 years ago - Stars: 138 - Forks: 73

alanchn31/Movalytics-Data-Warehouse
Data pipeline performing ETL to AWS Redshift using Spark, orchestrated with Apache Airflow
Language: Python - Size: 717 KB - Last synced at: 7 months ago - Pushed at: about 5 years ago - Stars: 133 - Forks: 31

GoogleCloudPlatform/dataproc-templates
Dataproc templates and pipelines for solving in-cloud data tasks
Language: Python - Size: 18.6 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 129 - Forks: 100

kavgan/phrase-at-scale
Detect common phrases in large amounts of text using a data-driven approach. Size of discovered phrases can be arbitrary. Can be used in languages other than English
Language: Python - Size: 80.6 MB - Last synced at: 3 months ago - Pushed at: almost 6 years ago - Stars: 128 - Forks: 45

LearningJournal/Spark-Streaming-In-Python
Apache Spark 3 - Structured Streaming Course Material
Language: Python - Size: 19.4 MB - Last synced at: 2 months ago - Pushed at: almost 2 years ago - Stars: 121 - Forks: 159

zero323/pyspark-stubs 📦
Apache (Py)Spark type annotations (stub files).
Language: Python - Size: 1.3 MB - Last synced at: 23 days ago - Pushed at: almost 3 years ago - Stars: 117 - Forks: 37

edyoda/pyspark-tutorial
PySpark Code for Hands-on Learners
Language: Jupyter Notebook - Size: 53.3 MB - Last synced at: 11 days ago - Pushed at: over 5 years ago - Stars: 116 - Forks: 120

harishpuvvada/BitCoin-Value-Predictor 📦
[NOT MAINTAINED] Predicting Bitcoin price using time series analysis and sentiment analysis of tweets about Bitcoin
Language: Jupyter Notebook - Size: 3.07 MB - Last synced at: over 1 year ago - Pushed at: over 5 years ago - Stars: 113 - Forks: 29

microsoft/Azure-Databricks-NYC-Taxi-Workshop
An Azure Databricks workshop leveraging the New York Taxi and Limousine Commission Trip Records dataset
Language: Scala - Size: 42.3 MB - Last synced at: 2 days ago - Pushed at: about 2 years ago - Stars: 109 - Forks: 109

andfanilo/pyspark-tutorial
Jupyter notebooks for pyspark tutorials given at University
Language: Jupyter Notebook - Size: 4.52 MB - Last synced at: 21 days ago - Pushed at: 7 months ago - Stars: 107 - Forks: 75

wadhwasahil/Relation_Extraction
Relation Extraction using deep learning (CNN)
Language: Python - Size: 4.89 MB - Last synced at: over 1 year ago - Pushed at: over 8 years ago - Stars: 99 - Forks: 53

minio/spark-select
A library for Spark DataFrame using MinIO Select API
Language: Scala - Size: 65.4 KB - Last synced at: 2 days ago - Pushed at: over 5 years ago - Stars: 98 - Forks: 19

mitchelllisle/sparkdantic
✨ A Pydantic to PySpark schema library
Language: Python - Size: 3.26 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 94 - Forks: 15

adornes/spark_python_ml_examples
Spark 2.0 Python Machine Learning examples
Language: Python - Size: 13.7 KB - Last synced at: about 2 years ago - Pushed at: over 5 years ago - Stars: 93 - Forks: 42

itsjafer/jupyterlab-sparkmonitor
JupyterLab extension that enables monitoring launched Apache Spark jobs from within a notebook
Language: JavaScript - Size: 4.08 MB - Last synced at: 7 days ago - Pushed at: over 2 years ago - Stars: 92 - Forks: 23

hyunjoonbok/PySpark
PySpark functions and utilities with examples. Assists ETL process of data modeling
Language: Jupyter Notebook - Size: 3.79 MB - Last synced at: about 1 year ago - Pushed at: over 4 years ago - Stars: 89 - Forks: 73

mahmoudparsian/pyspark-algorithms
PySpark Algorithms Book: https://www.amazon.com/dp/B07X4B2218/ref=sr_1_2
Language: Python - Size: 40.5 MB - Last synced at: 3 months ago - Pushed at: over 5 years ago - Stars: 84 - Forks: 44

awesome-spark/learn-by-examples 📦
Real-world Spark pipelines examples
Language: Scala - Size: 1.1 MB - Last synced at: 10 days ago - Pushed at: over 7 years ago - Stars: 83 - Forks: 30

sabman/PySparkGeoAnalysis
🌐 Interactive Workshop on GeoAnalysis using PySpark
Language: Jupyter Notebook - Size: 18.5 MB - Last synced at: 6 days ago - Pushed at: over 8 years ago - Stars: 79 - Forks: 34

groda/big_data
Big Data essentials: Hadoop, MapReduce, Spark. Explore tutorials and demos in Jupyter notebooks—most are self-contained and live, ready to run with a click.
Language: Jupyter Notebook - Size: 51.9 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 78 - Forks: 27

anovos/anovos
Anovos - An Open Source Library for Scalable feature engineering Using Apache-Spark
Language: Jupyter Notebook - Size: 88.5 MB - Last synced at: 21 days ago - Pushed at: about 2 years ago - Stars: 76 - Forks: 25

myamafuj/hadoop-hive-spark-docker
Hadoop-Hive-Spark cluster + Jupyter on Docker
Language: Jupyter Notebook - Size: 75.2 KB - Last synced at: about 2 months ago - Pushed at: 6 months ago - Stars: 73 - Forks: 52

martandsingh/ApacheSpark
This repository will help you learn Databricks concepts through examples. It covers the important topics a data engineer needs in real-world work. We use PySpark and Spark SQL for development. At the end of the course, we also cover a few case studies.
Language: Python - Size: 141 KB - Last synced at: over 2 years ago - Pushed at: over 2 years ago - Stars: 71 - Forks: 47

src-d/jgit-spark-connector
jgit-spark-connector is a library for running scalable data retrieval pipelines that process any number of Git repositories for source code analysis.
Language: Scala - Size: 8.55 MB - Last synced at: 24 days ago - Pushed at: over 6 years ago - Stars: 71 - Forks: 32

yugokato/Spark-and-Kafka_IoT-Data-Processing-and-Analytics
Final Project for IoT: Big Data Processing and Analytics class. Analyzing U.S. nationwide temperature from IoT sensors in real time
Language: Python - Size: 159 KB - Last synced at: 3 months ago - Pushed at: over 8 years ago - Stars: 70 - Forks: 27

anguenot/pyspark-cassandra
pyspark-cassandra is a Python port of the awesome @datastax Spark Cassandra connector. Compatible w/ Spark 2.0, 2.1, 2.2, 2.3 and 2.4
Language: Python - Size: 8.25 MB - Last synced at: almost 2 years ago - Pushed at: over 2 years ago - Stars: 69 - Forks: 28

zaratsian/Spark
Apache Spark (Scala, PySpark, SparkR) Code, Tricks, and References
Language: Jupyter Notebook - Size: 8.37 MB - Last synced at: 2 months ago - Pushed at: over 6 years ago - Stars: 69 - Forks: 37

sbl-sdsc/mmtf-pyspark
Methods for the parallel and distributed analysis and mining of the Protein Data Bank using MMTF and Apache Spark.
Language: Python - Size: 524 MB - Last synced at: 4 days ago - Pushed at: about 2 years ago - Stars: 68 - Forks: 27

sberbank-ai-lab/RePlay 📦
RecSys Library
Language: Python - Size: 12.7 MB - Last synced at: 7 months ago - Pushed at: about 3 years ago - Stars: 68 - Forks: 6

iobruno/data-engineering-examples
Data Engineering examples for Airflow, Prefect; dbt for BigQuery, Redshift, ClickHouse, Postgres, DuckDB; PySpark for Batch processing; Kafka for Stream processing
Language: Python - Size: 5.07 MB - Last synced at: 1 day ago - Pushed at: 2 days ago - Stars: 66 - Forks: 2

castanan/w2v
Word2Vec models with Twitter data using Spark.
Language: Jupyter Notebook - Size: 1.4 MB - Last synced at: over 1 year ago - Pushed at: over 6 years ago - Stars: 66 - Forks: 29

alexott/databricks-playground
Code samples, etc. for Databricks
Language: Python - Size: 202 KB - Last synced at: 8 days ago - Pushed at: 23 days ago - Stars: 64 - Forks: 19

sodadata/soda-spark
Soda Spark is a PySpark library that helps you with testing your data in Spark Dataframes
Language: Python - Size: 118 KB - Last synced at: 8 days ago - Pushed at: almost 3 years ago - Stars: 64 - Forks: 8

Mooseburger1/Springboard-Data-Science-Immersive
Language: Jupyter Notebook - Size: 206 MB - Last synced at: 2 months ago - Pushed at: almost 7 years ago - Stars: 63 - Forks: 24

tubular/sparkly
Helpers & syntactic sugar for PySpark.
Language: Python - Size: 1.54 MB - Last synced at: 19 days ago - Pushed at: almost 2 years ago - Stars: 62 - Forks: 9

arverma/TowardsDataEngineering
This repo contains commands that data engineers use in day-to-day work.
Language: Python - Size: 81.1 KB - Last synced at: 6 days ago - Pushed at: over 2 years ago - Stars: 61 - Forks: 43

tomaztk/Azure-Databricks
Azure Databricks - Advent of 2020 Blogposts
Language: Jupyter Notebook - Size: 44.9 MB - Last synced at: 3 months ago - Pushed at: over 2 years ago - Stars: 60 - Forks: 49

sbl-sdsc/mmtf-workshop-2018
Structural Bioinformatics Training Workshop & Hackathon 2018
Language: Jupyter Notebook - Size: 369 MB - Last synced at: 2 months ago - Pushed at: about 3 years ago - Stars: 60 - Forks: 31
