GitHub topics: pyspark-dataframes
maltzsama/sumeh
Sumeh: Unified Data Quality Framework. Sumeh is a unified data quality validation framework supporting multiple backends (PySpark, Dask, Polars, DuckDB, Pandas) with centralized rule configuration.
Language: Python - Size: 1.69 MB - Last synced at: 16 days ago - Pushed at: about 2 months ago - Stars: 2 - Forks: 0

sbl-sdsc/df-parallel
Comparison of DataFrame libraries for parallel processing of large tabular files on CPU and GPU.
Language: Jupyter Notebook - Size: 3.33 MB - Last synced at: 3 months ago - Pushed at: about 1 year ago - Stars: 6 - Forks: 3

RJBarker/home_sales
Use PySpark and Spark SQL to run SQL queries against a temporary view created from a DataFrame. Run additional queries on cached and partitioned data to compare runtimes.
Language: Jupyter Notebook - Size: 146 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

mhaseebtariq/pyspark-helpers
Useful helper functions for PySpark dataframe operations
Language: Jupyter Notebook - Size: 94.7 KB - Last synced at: about 2 years ago - Pushed at: about 3 years ago - Stars: 0 - Forks: 1
