GitHub topics: pyarrow
narwhals-dev/narwhals
Lightweight and extensible compatibility layer between dataframe libraries!
Language: Python - Size: 8.97 MB - Last synced at: 3 days ago - Pushed at: 4 days ago - Stars: 930 - Forks: 136

ibis-project/ibis
the portable Python dataframe library
Language: Python - Size: 173 MB - Last synced at: 3 days ago - Pushed at: 4 days ago - Stars: 5,698 - Forks: 631

vaexio/vaex
Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
Language: Python - Size: 133 MB - Last synced at: 6 days ago - Pushed at: 6 months ago - Stars: 8,365 - Forks: 599

zen-xu/pyarrow-stubs
Type annotations for pyarrow
Language: Python - Size: 541 KB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 25 - Forks: 18

milenkovicm/ballista_python
Ballista cluster pyarrow udf support
Language: Rust - Size: 152 KB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 1 - Forks: 0

DanielAvdar/pandas-pyarrow
Seamlessly switch Pandas DataFrame backend to PyArrow.
Language: Python - Size: 873 KB - Last synced at: 4 days ago - Pushed at: 9 days ago - Stars: 8 - Forks: 2

uber/petastorm
Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
Language: Python - Size: 2.69 MB - Last synced at: 10 days ago - Pushed at: over 1 year ago - Stars: 1,829 - Forks: 281

dr-saad-la/Pyarrow-Tuts
Pyarrow Tutorials
Language: Jupyter Notebook - Size: 17.6 KB - Last synced at: 11 days ago - Pushed at: 8 months ago - Stars: 4 - Forks: 0

cning112/fastflight
FastFlight is a high-performance data transfer framework using Apache Arrow Flight for efficient, modular, and pluggable data streaming with optional FastAPI integration for HTTP-based access.
Language: Python - Size: 1.58 MB - Last synced at: 24 days ago - Pushed at: 24 days ago - Stars: 1 - Forks: 0

legout/pydala2
Poor man's data lake - simple API to efficiently query your Parquet datasets using DuckDB or Polars
Language: Python - Size: 4.67 MB - Last synced at: 8 days ago - Pushed at: 26 days ago - Stars: 6 - Forks: 0

trustedshops-public/schema2pyarrow
Converts AsyncApi and JsonSchema to PyArrow schema
Language: Python - Size: 31.3 KB - Last synced at: 20 days ago - Pushed at: 2 months ago - Stars: 5 - Forks: 0

kraina-ai/overturemaestro
An open-source tool for reading OvertureMaps data with multiprocessing and additional Quality-of-Life features
Language: Python - Size: 1.25 MB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 24 - Forks: 0

RandomFractals/chicago-crimes
Exploring Chicago crimes dataset with Jupyter notebooks, DuckDB, Malloy and new Panel/PyScript data and dashboard tools.
Language: Jupyter Notebook - Size: 188 MB - Last synced at: 17 days ago - Pushed at: about 2 years ago - Stars: 38 - Forks: 4

polsm91/acero-delta-lake-streaming
Proof Of Concept to pull news from RSS feeds, and store them in a Data Lake using Delta Lake's "delta-rs" as a writer, and "PyArrow Acero" as the streaming and compute engine.
Language: Python - Size: 9.77 KB - Last synced at: 20 days ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

namansnghl/SQLify
Text (business requirements) to SQL semantic parser using LLM transfer learning. Helps analysts query a database without knowing SQL.
Language: Jupyter Notebook - Size: 2.89 MB - Last synced at: 16 days ago - Pushed at: 12 months ago - Stars: 1 - Forks: 0

edisedis777/Coffee-Shops-Analysis
This project analyzes the Foursquare Open Source Places dataset to explore the distribution of coffee shops across the United States, with a special focus on Portland, Oregon.
Language: HTML - Size: 12.6 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

HuangRicky/manylinux2014builds
manylinux2014 Python pkg builds
Size: 5.86 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

treynas/runetick-v1-etl-portfolio
An ETL pipeline for the OSRS Trading App that extracts, transforms, and loads trading data and news from RuneScape Wiki APIs and RSS feeds into structured Parquet files stored in Google Cloud Storage. Deployed on Cloud Run and orchestrated via Cloud Scheduler.
Language: Python - Size: 13.7 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

icaropires/pdf2dataset
Converts a whole subdirectory with a big (or small) volume of PDF documents to a dataset (pandas DataFrame) with error tracking and choice of features
Language: Python - Size: 301 KB - Last synced at: 8 days ago - Pushed at: 3 months ago - Stars: 20 - Forks: 4

iljavaleev/arrow_examples
Apache Arrow C++ examples
Language: Jupyter Notebook - Size: 157 KB - Last synced at: 30 days ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

d-chris/federleicht
Lightweight function decorators to cache your `pandas.DataFrame` as feather.
Language: Python - Size: 219 KB - Last synced at: 12 days ago - Pushed at: 5 months ago - Stars: 2 - Forks: 0

thread53/pqviewer
View Apache Parquet Files In Your Terminal
Language: Python - Size: 239 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 4 - Forks: 0

anto18671/arrow-datasets
A high-performance Rust utility that converts large image datasets into chunked Apache Arrow files for efficient storage and processing.
Language: Rust - Size: 13.7 KB - Last synced at: about 1 hour ago - Pushed at: 2 months ago - Stars: 1 - Forks: 0

psmyth94/biosets
A bioinformatics extension of 🤗 Datasets library, built for ML applications on biological and omics data, offering easy integration of metadata and low-code data management tools.
Language: Python - Size: 279 KB - Last synced at: 3 days ago - Pushed at: 5 months ago - Stars: 3 - Forks: 0

tdiprima/arrow-and-parquet
Exploring Apache Arrow and Apache Parquet
Language: Jupyter Notebook - Size: 63.5 KB - Last synced at: 2 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

No-Country-simulation/c22-29-ft-data-bi
🚀 Inventory control optimization for BottleFlow Logistics: a strategic, data-driven approach #Supply Chain 🚀
Language: Jupyter Notebook - Size: 2.76 MB - Last synced at: 3 months ago - Pushed at: 4 months ago - Stars: 2 - Forks: 0

kivanc57/mongodb_operations
This project demonstrates MongoDB CRUD operations, data modeling, and advanced Atlas Search & Atlas Vector Search features with Hugging Face, PyMongo, and PyArrow to process and query data efficiently.
Language: Jupyter Notebook - Size: 49.8 KB - Last synced at: about 1 month ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

lykmapipo/NYC-TLC-Trip-Data
Python scripts to download, process, and analyze the New York City Taxi and Limousine Commission (TLC) Trip Record Data dataset
Language: Jupyter Notebook - Size: 100 MB - Last synced at: 18 days ago - Pushed at: 8 months ago - Stars: 4 - Forks: 1

lykmapipo/Python-Spark-Log-Analysis
Python scripts to process and analyze log files using PySpark.
Language: Python - Size: 131 KB - Last synced at: 18 days ago - Pushed at: 9 months ago - Stars: 5 - Forks: 0

goalzz85/sql2arrow
SQL2Arrow, short for 'SQL to Arrow,' is a Python library that provides convenient and high-performance methods to parse INSERT SQL statements into Arrow arrays. It is particularly useful for analyzing data dumped by mysqldump or other tools.
Language: Rust - Size: 157 KB - Last synced at: 19 days ago - Pushed at: 4 months ago - Stars: 5 - Forks: 0

Murtaza-arif/all-you-need-to-know-for-data-engineer
This repository is designed to showcase various aspects of data engineering, including tools, frameworks, and end-to-end projects. It covers everything from data ingestion and transformation to data warehousing and cloud-based solutions.
Language: Python - Size: 168 KB - Last synced at: 14 days ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

stefur/swemaps
Maps of Sweden in GeoParquet
Language: Python - Size: 2.06 MB - Last synced at: 2 days ago - Pushed at: 4 months ago - Stars: 1 - Forks: 0

pyodide/pyodide-numpy-2.0-rebuilds
Rebuilds against NumPy v2 for Pyodide packages: LightGBM and PyArrow in light of the NumPy 2.0 migration for Pyodide 0.27
Size: 10.7 KB - Last synced at: 1 day ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

d-chris/federleicht-benchmark
small script to benchmark `federleicht`
Language: Python - Size: 323 KB - Last synced at: 29 days ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

345950647/clickhouse_types
Converting ClickHouse types into other schemas' types
Language: Python - Size: 14.6 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 1 - Forks: 0

Utlak88/NarrativeWave-App
Django app that collects data from parquet files, provides the data in json format, and then consumes and returns the data.
Language: HTML - Size: 294 MB - Last synced at: 5 months ago - Pushed at: almost 4 years ago - Stars: 0 - Forks: 0

ismailhammounou/db2ixf
db2ixf is a python package with a CLI that simplifies the parsing and processing of IBM Integration eXchange Format (IXF) files.
Language: Python - Size: 1 MB - Last synced at: 14 days ago - Pushed at: about 1 year ago - Stars: 17 - Forks: 1

dacort/faker-cli
Command-line interface to quickly generate fake CSV and JSON data
Language: Python - Size: 81.1 KB - Last synced at: 14 days ago - Pushed at: 9 months ago - Stars: 72 - Forks: 6

legout/pydala 📦
Poor man's simple Python API for creating a local or remote data lake based on several (pyarrow) datasets using DuckDB
Language: Python - Size: 14.1 MB - Last synced at: 8 days ago - Pushed at: almost 2 years ago - Stars: 9 - Forks: 1

xbrianh/xdlake
A loose implementation of the deltalake protocol, written in Python on top of pyarrow, focused on extensibility, customizability, and distributed data.
Language: Python - Size: 185 KB - Last synced at: 7 months ago - Pushed at: 7 months ago - Stars: 1 - Forks: 0

milesgranger/flaco
(PoC) A very memory-efficient way to read data from PostgreSQL
Language: Rust - Size: 146 KB - Last synced at: 18 days ago - Pushed at: over 2 years ago - Stars: 15 - Forks: 0

wheretrue/biobear
Work with bioinformatic files using Arrow, Polars, and/or DuckDB
Language: Rust - Size: 1.99 MB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 145 - Forks: 7

BenyaminZojaji/mongodb_tutorial
MongoDB tutorial repository
Language: Python - Size: 3.91 KB - Last synced at: 8 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

derak-isaack/UberEATSAnalytics
Analyze Uber Eats Menu big data for various analytics
Language: Jupyter Notebook - Size: 2.84 MB - Last synced at: about 1 month ago - Pushed at: 10 months ago - Stars: 0 - Forks: 0

rhanyele/tcg-pokemon-data-pipeline
The Pokémon TCG Data Pipeline project aims to create a data pipeline solution to collect, transform, and analyze information about Pokémon TCG (Trading Card Game) cards.
Language: Python - Size: 39.1 KB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 0 - Forks: 0

jaysnm/dremio-arrow
Dremio Arrow Flight Client
Language: Python - Size: 1.79 MB - Last synced at: 25 days ago - Pushed at: about 1 year ago - Stars: 4 - Forks: 5

agutiernc/data-eng-zoomcamp
Data Engineering Zoomcamp 2024
Language: Jupyter Notebook - Size: 1 MB - Last synced at: about 2 months ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

miraisolutions/apache-arrow-flight-python-example
Code examples / snippets for website news post
Language: Python - Size: 3.91 KB - Last synced at: about 1 year ago - Pushed at: about 3 years ago - Stars: 1 - Forks: 1

JakubPluta/flowsome
Minimal framework for building and executing data workflows on a single machine
Language: Python - Size: 226 KB - Last synced at: 6 months ago - Pushed at: 12 months ago - Stars: 0 - Forks: 0

SaelKimberly/rxls
Reading both XLSX and XLSB files, fast and memory-safe, with Python, into PyArrow
Language: Jupyter Notebook - Size: 64.5 MB - Last synced at: 8 months ago - Pushed at: about 1 year ago - Stars: 9 - Forks: 2

runsascoded/parquet-diff-test
Demonstrate differences in Parquet files generated by pyarrow on macOS vs. {Ubuntu, Windows}.
Language: Python - Size: 42 KB - Last synced at: about 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

dimdasci/howto-mongo-bulk-update-from-parquet
MongoDB Bulk Update Tutorial with AWS Glue, PyArrow, Parquet, and Infrastructure as Code (IaC)
Language: Python - Size: 21.5 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

tezzytezzy/us-flight-delay
Complete Guide to Data Munging
Language: Jupyter Notebook - Size: 619 KB - Last synced at: over 1 year ago - Pushed at: over 3 years ago - Stars: 0 - Forks: 0

k3ssdev/ParquetScriptTools
Collection of Python scripts using PyArrow and Pandas for efficient handling of Parquet files. Includes tools to view schemas, convert to CSV, check for duplicates, and merge Parquet files.
Language: Python - Size: 27.3 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 1 - Forks: 0

wanghalan/dspg22_pyarrow-example
Saving large files on GitHub
Language: Python - Size: 964 MB - Last synced at: over 1 year ago - Pushed at: almost 3 years ago - Stars: 0 - Forks: 0

hansalemaos/procmondf
Provides a convenient and efficient solution for capturing and analyzing system activity logs using Procmon and converting them to the pandas-compatible Parquet file format (2% of the original PML file size)
Language: Python - Size: 3.55 MB - Last synced at: 1 day ago - Pushed at: almost 2 years ago - Stars: 0 - Forks: 0

FCP-INDI/b2t-prototype 📦
Organize neuroimaging data derivatives into parquet tables
Language: Python - Size: 3.09 MB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 3 - Forks: 0

amoeba/pyarrow-ipc-example
An example showing how to send compressed RecordBatches over HTTP with PyArrow.
Language: Python - Size: 1000 Bytes - Last synced at: 2 days ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

leehuwuj/lake-inspector
Inspect your lakehouse data by using PyArrow
Language: Python - Size: 447 KB - Last synced at: 5 days ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

mercator-labs/oakstore
High-speed time-series pandas DataFrame database
Language: Python - Size: 21.5 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 4 - Forks: 1

adavis444/pyarrow-alpine-wheel
Dockerfile and Python 3.9 wheel for PyArrow 3.0.0 built on Alpine 3.14 (does not include Plasma or Parquet)
Language: Dockerfile - Size: 4.97 MB - Last synced at: about 2 years ago - Pushed at: almost 4 years ago - Stars: 1 - Forks: 0

evi1ghost/csv_to_avro_or_parquet
CSV converter
Language: Python - Size: 364 KB - Last synced at: almost 2 years ago - Pushed at: about 3 years ago - Stars: 0 - Forks: 0

svjack/PyArrowExpressionCastToolkit
A small cast toolkit class derived from _ParquetDatasetV2 to support casts in the filters argument
Language: Python - Size: 28.3 KB - Last synced at: about 2 years ago - Pushed at: over 4 years ago - Stars: 1 - Forks: 0

kiwi0fruit/featherhelper
Concise interface to cache numpy arrays and pandas dataframes
Language: Python - Size: 47.9 KB - Last synced at: 15 days ago - Pushed at: about 6 years ago - Stars: 3 - Forks: 0

bundgus/pydata2parquet
Convert data to the parquet format with Python dask and pyarrow.
Language: Python - Size: 2.38 MB - Last synced at: about 2 years ago - Pushed at: almost 7 years ago - Stars: 0 - Forks: 0
