An open API service providing repository metadata for many open source software ecosystems.

GitHub topics: pyspark-dataframes

maltzsama/sumeh

Sumeh: Unified Data Quality Framework. Sumeh is a unified data quality validation framework supporting multiple backends (PySpark, Dask, Polars, DuckDB, Pandas) with centralized rule configuration.

Language: Python - Size: 1.69 MB - Last synced at: 16 days ago - Pushed at: about 2 months ago - Stars: 2 - Forks: 0

sbl-sdsc/df-parallel

Comparison of DataFrame libraries for parallel processing of large tabular files on CPU and GPU.

Language: Jupyter Notebook - Size: 3.33 MB - Last synced at: 3 months ago - Pushed at: about 1 year ago - Stars: 6 - Forks: 3

RJBarker/home_sales

Use PySpark and Spark SQL to run SQL queries against a temporary view created from a DataFrame. Run the same queries on cached and partitioned data to compare runtimes.

Language: Jupyter Notebook - Size: 146 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

mhaseebtariq/pyspark-helpers

Useful helper functions for PySpark DataFrame operations.

Language: Jupyter Notebook - Size: 94.7 KB - Last synced at: about 2 years ago - Pushed at: about 3 years ago - Stars: 0 - Forks: 1