An open API service providing repository metadata for many open source software ecosystems.

GitHub topics: pyarrow

narwhals-dev/narwhals

Lightweight and extensible compatibility layer between dataframe libraries!

Language: Python - Size: 8.97 MB - Last synced at: 3 days ago - Pushed at: 4 days ago - Stars: 930 - Forks: 136

ibis-project/ibis

the portable Python dataframe library

Language: Python - Size: 173 MB - Last synced at: 3 days ago - Pushed at: 4 days ago - Stars: 5,698 - Forks: 631

vaexio/vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀

Language: Python - Size: 133 MB - Last synced at: 6 days ago - Pushed at: 6 months ago - Stars: 8,365 - Forks: 599

zen-xu/pyarrow-stubs

Type annotations for pyarrow

Language: Python - Size: 541 KB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 25 - Forks: 18

milenkovicm/ballista_python

Ballista cluster pyarrow udf support

Language: Rust - Size: 152 KB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 1 - Forks: 0

DanielAvdar/pandas-pyarrow

Seamlessly switch Pandas DataFrame backend to PyArrow.

Language: Python - Size: 873 KB - Last synced at: 4 days ago - Pushed at: 9 days ago - Stars: 8 - Forks: 2

uber/petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.

Language: Python - Size: 2.69 MB - Last synced at: 10 days ago - Pushed at: over 1 year ago - Stars: 1,829 - Forks: 281

dr-saad-la/Pyarrow-Tuts

Pyarrow Tutorials

Language: Jupyter Notebook - Size: 17.6 KB - Last synced at: 11 days ago - Pushed at: 8 months ago - Stars: 4 - Forks: 0

cning112/fastflight

FastFlight is a high-performance data transfer framework using Apache Arrow Flight for efficient, modular, and pluggable data streaming with optional FastAPI integration for HTTP-based access.

Language: Python - Size: 1.58 MB - Last synced at: 24 days ago - Pushed at: 24 days ago - Stars: 1 - Forks: 0

legout/pydala2

Poor man's data lake - Simple API to efficiently query your Parquet datasets using DuckDB or Polars

Language: Python - Size: 4.67 MB - Last synced at: 8 days ago - Pushed at: 26 days ago - Stars: 6 - Forks: 0

trustedshops-public/schema2pyarrow

Converts AsyncApi and JsonSchema to PyArrow schema

Language: Python - Size: 31.3 KB - Last synced at: 20 days ago - Pushed at: 2 months ago - Stars: 5 - Forks: 0

kraina-ai/overturemaestro

An open-source tool for reading OvertureMaps data with multiprocessing and additional Quality-of-Life features

Language: Python - Size: 1.25 MB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 24 - Forks: 0

RandomFractals/chicago-crimes

Exploring Chicago crimes dataset with Jupyter notebooks, DuckDB, Malloy and new Panel/PyScript data and dashboard tools.

Language: Jupyter Notebook - Size: 188 MB - Last synced at: 17 days ago - Pushed at: about 2 years ago - Stars: 38 - Forks: 4

polsm91/acero-delta-lake-streaming

Proof Of Concept to pull news from RSS feeds, and store them in a Data Lake using Delta Lake's "delta-rs" as a writer, and "PyArrow Acero" as the streaming and compute engine.

Language: Python - Size: 9.77 KB - Last synced at: 20 days ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

namansnghl/SQLify

Text (business requirements) to SQL semantic parser using LLM transfer learning. Helps analysts query a database without knowing SQL.

Language: Jupyter Notebook - Size: 2.89 MB - Last synced at: 16 days ago - Pushed at: 12 months ago - Stars: 1 - Forks: 0

edisedis777/Coffee-Shops-Analysis

This project analyzes the Foursquare Open Source Places dataset to explore the distribution of coffee shops across the United States, with a special focus on Portland, Oregon.

Language: HTML - Size: 12.6 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

HuangRicky/manylinux2014builds

manylinux2014 Python pkg builds

Size: 5.86 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

treynas/runetick-v1-etl-portfolio

An ETL pipeline for the OSRS Trading App that extracts, transforms, and loads trading data and news from RuneScape Wiki APIs and RSS feeds into structured Parquet files stored in Google Cloud Storage. Deployed on Cloud Run and orchestrated via Cloud Scheduler.

Language: Python - Size: 13.7 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

icaropires/pdf2dataset

Converts a whole subdirectory with a big (or small) volume of PDF documents to a dataset (pandas DataFrame) with error tracking and choice of features

Language: Python - Size: 301 KB - Last synced at: 8 days ago - Pushed at: 3 months ago - Stars: 20 - Forks: 4

iljavaleev/arrow_examples

Apache Arrow C++ examples

Language: Jupyter Notebook - Size: 157 KB - Last synced at: 30 days ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

d-chris/federleicht

Lightweight function decorators to cache your `pandas.DataFrame` as Feather.

Language: Python - Size: 219 KB - Last synced at: 12 days ago - Pushed at: 5 months ago - Stars: 2 - Forks: 0

thread53/pqviewer

View Apache Parquet Files In Your Terminal

Language: Python - Size: 239 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 4 - Forks: 0

anto18671/arrow-datasets

A high-performance Rust utility that converts large image datasets into chunked Apache Arrow files for efficient storage and processing.

Language: Rust - Size: 13.7 KB - Last synced at: about 1 hour ago - Pushed at: 2 months ago - Stars: 1 - Forks: 0

psmyth94/biosets

A bioinformatics extension of 🤗 Datasets library, built for ML applications on biological and omics data, offering easy integration of metadata and low-code data management tools.

Language: Python - Size: 279 KB - Last synced at: 3 days ago - Pushed at: 5 months ago - Stars: 3 - Forks: 0

tdiprima/arrow-and-parquet

Exploring Apache Arrow and Apache Parquet

Language: Jupyter Notebook - Size: 63.5 KB - Last synced at: 2 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

No-Country-simulation/c22-29-ft-data-bi

🚀 Inventory control optimization for BottleFlow Logistics: a strategic, data-driven approach #Supply Chain 🚀

Language: Jupyter Notebook - Size: 2.76 MB - Last synced at: 3 months ago - Pushed at: 4 months ago - Stars: 2 - Forks: 0

kivanc57/mongodb_operations

This project demonstrates MongoDB CRUD operations, data modeling, and advanced Atlas Search & Atlas Vector Search features with Hugging Face, PyMongo, and PyArrow to process and query data efficiently.

Language: Jupyter Notebook - Size: 49.8 KB - Last synced at: about 1 month ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

lykmapipo/NYC-TLC-Trip-Data

Python scripts to download, process, and analyze the New York City Taxi and Limousine Commission (TLC) Trip Record Data dataset

Language: Jupyter Notebook - Size: 100 MB - Last synced at: 18 days ago - Pushed at: 8 months ago - Stars: 4 - Forks: 1

lykmapipo/Python-Spark-Log-Analysis

Python scripts to process and analyze log files using PySpark.

Language: Python - Size: 131 KB - Last synced at: 18 days ago - Pushed at: 9 months ago - Stars: 5 - Forks: 0

goalzz85/sql2arrow

SQL2Arrow, short for 'SQL to Arrow,' is a Python library that provides convenient and high-performance methods to parse INSERT SQL statements into Arrow arrays. It is particularly useful for analyzing data dumped by mysqldump or other tools.

Language: Rust - Size: 157 KB - Last synced at: 19 days ago - Pushed at: 4 months ago - Stars: 5 - Forks: 0

Murtaza-arif/all-you-need-to-know-for-data-engineer

This repository is designed to showcase various aspects of data engineering, including tools, frameworks, and end-to-end projects. It covers everything from data ingestion and transformation to data warehousing and cloud-based solutions.

Language: Python - Size: 168 KB - Last synced at: 14 days ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

stefur/swemaps

Maps of Sweden in GeoParquet

Language: Python - Size: 2.06 MB - Last synced at: 2 days ago - Pushed at: 4 months ago - Stars: 1 - Forks: 0

pyodide/pyodide-numpy-2.0-rebuilds

Rebuilds against NumPy v2 for Pyodide packages: LightGBM and PyArrow in light of the NumPy 2.0 migration for Pyodide 0.27

Size: 10.7 KB - Last synced at: 1 day ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

d-chris/federleicht-benchmark

small script to benchmark `federleicht`

Language: Python - Size: 323 KB - Last synced at: 29 days ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

345950647/clickhouse_types

Converting ClickHouse types into other schemas' types

Language: Python - Size: 14.6 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 1 - Forks: 0

Utlak88/NarrativeWave-App

Django app that collects data from parquet files, provides the data in json format, and then consumes and returns the data.

Language: HTML - Size: 294 MB - Last synced at: 5 months ago - Pushed at: almost 4 years ago - Stars: 0 - Forks: 0

ismailhammounou/db2ixf

db2ixf is a python package with a CLI that simplifies the parsing and processing of IBM Integration eXchange Format (IXF) files.

Language: Python - Size: 1 MB - Last synced at: 14 days ago - Pushed at: about 1 year ago - Stars: 17 - Forks: 1

dacort/faker-cli

Command-line interface to quickly generate fake CSV and JSON data

Language: Python - Size: 81.1 KB - Last synced at: 14 days ago - Pushed at: 9 months ago - Stars: 72 - Forks: 6

legout/pydala 📦

Poor man's simple Python API for creating a local or remote data lake based on several (pyarrow) datasets using DuckDB

Language: Python - Size: 14.1 MB - Last synced at: 8 days ago - Pushed at: almost 2 years ago - Stars: 9 - Forks: 1

xbrianh/xdlake

A loose implementation of the deltalake protocol, written in Python on top of pyarrow, focused on extensibility, customizability, and distributed data.

Language: Python - Size: 185 KB - Last synced at: 7 months ago - Pushed at: 7 months ago - Stars: 1 - Forks: 0

milesgranger/flaco

(PoC) A very memory-efficient way to read data from PostgreSQL

Language: Rust - Size: 146 KB - Last synced at: 18 days ago - Pushed at: over 2 years ago - Stars: 15 - Forks: 0

wheretrue/biobear

Work with bioinformatic files using Arrow, Polars, and/or DuckDB

Language: Rust - Size: 1.99 MB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 145 - Forks: 7

BenyaminZojaji/mongodb_tutorial

MongoDB tutorial repository

Language: Python - Size: 3.91 KB - Last synced at: 8 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

derak-isaack/UberEATSAnalytics

Analyze Uber Eats Menu big data for various analytics

Language: Jupyter Notebook - Size: 2.84 MB - Last synced at: about 1 month ago - Pushed at: 10 months ago - Stars: 0 - Forks: 0

rhanyele/tcg-pokemon-data-pipeline

The Pokémon TCG Data Pipeline project aims to create a data pipeline solution to collect, transform, and analyze information about Pokémon TCG (Trading Card Game) cards.

Language: Python - Size: 39.1 KB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 0 - Forks: 0

jaysnm/dremio-arrow

Dremio Arrow Flight Client

Language: Python - Size: 1.79 MB - Last synced at: 25 days ago - Pushed at: about 1 year ago - Stars: 4 - Forks: 5

agutiernc/data-eng-zoomcamp

Data Engineering Zoomcamp 2024

Language: Jupyter Notebook - Size: 1 MB - Last synced at: about 2 months ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

miraisolutions/apache-arrow-flight-python-example

Code examples / snippets for website news post

Language: Python - Size: 3.91 KB - Last synced at: about 1 year ago - Pushed at: about 3 years ago - Stars: 1 - Forks: 1

JakubPluta/flowsome

Minimal framework for building and executing data workflows on a single machine

Language: Python - Size: 226 KB - Last synced at: 6 months ago - Pushed at: 12 months ago - Stars: 0 - Forks: 0

SaelKimberly/rxls

Reading both XLSX and XLSB files, fast and memory-safe, with Python, into PyArrow

Language: Jupyter Notebook - Size: 64.5 MB - Last synced at: 8 months ago - Pushed at: about 1 year ago - Stars: 9 - Forks: 2

runsascoded/parquet-diff-test

Demonstrate differences in Parquet files generated by pyarrow on macOS vs. {Ubuntu, Windows}.

Language: Python - Size: 42 KB - Last synced at: about 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

dimdasci/howto-mongo-bulk-update-from-parquet

MongoDB Bulk Update Tutorial with AWS Glue, PyArrow, Parquet, and Infrastructure as Code (IaC)

Language: Python - Size: 21.5 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

tezzytezzy/us-flight-delay

Complete Guide to Data Munging

Language: Jupyter Notebook - Size: 619 KB - Last synced at: over 1 year ago - Pushed at: over 3 years ago - Stars: 0 - Forks: 0

k3ssdev/ParquetScriptTools

Collection of Python scripts using PyArrow and Pandas to facilitate efficient handling of Parquet files. Includes tools to view schemas, convert to CSV, check for duplicates, and merge Parquet files.

Language: Python - Size: 27.3 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 1 - Forks: 0

wanghalan/dspg22_pyarrow-example

Saving large files on GitHub

Language: Python - Size: 964 MB - Last synced at: over 1 year ago - Pushed at: almost 3 years ago - Stars: 0 - Forks: 0

hansalemaos/procmondf

Provides a convenient and efficient solution for capturing and analyzing system activity logs using Procmon and converting them to the pandas-compatible Parquet file format (2% of the original PML file size)

Language: Python - Size: 3.55 MB - Last synced at: 1 day ago - Pushed at: almost 2 years ago - Stars: 0 - Forks: 0

FCP-INDI/b2t-prototype 📦

Organize neuroimaging data derivatives into parquet tables

Language: Python - Size: 3.09 MB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 3 - Forks: 0

amoeba/pyarrow-ipc-example

An example showing how to send compressed RecordBatches over HTTP with PyArrow.

Language: Python - Size: 1000 Bytes - Last synced at: 2 days ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

leehuwuj/lake-inspector

Inspect your lakehouse data by using PyArrow

Language: Python - Size: 447 KB - Last synced at: 5 days ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

mercator-labs/oakstore

High-speed time-series pandas DataFrame database

Language: Python - Size: 21.5 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 4 - Forks: 1

adavis444/pyarrow-alpine-wheel

Dockerfile and Python 3.9 wheel for PyArrow 3.0.0 built on Alpine 3.14 (does not include Plasma or Parquet)

Language: Dockerfile - Size: 4.97 MB - Last synced at: about 2 years ago - Pushed at: almost 4 years ago - Stars: 1 - Forks: 0

evi1ghost/csv_to_avro_or_parquet

CSV converter

Language: Python - Size: 364 KB - Last synced at: almost 2 years ago - Pushed at: about 3 years ago - Stars: 0 - Forks: 0

svjack/PyArrowExpressionCastToolkit

A small cast toolkit class derived from _ParquetDatasetV2 to support casts in the filters argument

Language: Python - Size: 28.3 KB - Last synced at: about 2 years ago - Pushed at: over 4 years ago - Stars: 1 - Forks: 0

kiwi0fruit/featherhelper

Concise interface to cache numpy arrays and pandas dataframes

Language: Python - Size: 47.9 KB - Last synced at: 15 days ago - Pushed at: about 6 years ago - Stars: 3 - Forks: 0

bundgus/pydata2parquet

Convert data to the parquet format with Python dask and pyarrow.

Language: Python - Size: 2.38 MB - Last synced at: about 2 years ago - Pushed at: almost 7 years ago - Stars: 0 - Forks: 0