An open API service providing repository metadata for many open source software ecosystems.

GitHub topics: pyarrow

narwhals-dev/narwhals

Lightweight and extensible compatibility layer between dataframe libraries!

Language: Python - Size: 10.6 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 1,125 - Forks: 149

zen-xu/pyarrow-stubs

Type annotations for pyarrow

Language: Python - Size: 467 KB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 33 - Forks: 19

ibis-project/ibis

the portable Python dataframe library

Language: Python - Size: 176 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 5,847 - Forks: 648

gizmodata/gizmosql

A Flight SQL Server implementation - with DuckDB and SQLite back-ends.

Language: C++ - Size: 797 KB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 34 - Forks: 4

DanielAvdar/pandas-pyarrow

Seamlessly switch Pandas DataFrame backend to PyArrow.

Language: Python - Size: 1.06 MB - Last synced at: 13 days ago - Pushed at: 18 days ago - Stars: 9 - Forks: 4

cning112/fastflight

FastFlight is a high-performance data transfer framework using Apache Arrow Flight for efficient, modular, and pluggable data streaming with optional FastAPI integration for HTTP-based access.

Language: Python - Size: 2.84 MB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 4 - Forks: 0

itsbigspark/pymetagen

Metadata Generator

Language: Python - Size: 1.06 MB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 0 - Forks: 0

milenkovicm/ballista_python

Ballista cluster pyarrow udf support

Language: Rust - Size: 253 KB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 1 - Forks: 0

uber/petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.

Language: Python - Size: 2.69 MB - Last synced at: 13 days ago - Pushed at: over 1 year ago - Stars: 1,840 - Forks: 280

vaexio/vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀

Language: Python - Size: 133 MB - Last synced at: 18 days ago - Pushed at: 9 months ago - Stars: 8,387 - Forks: 598

trustedshops-public/schema2pyarrow

Converts AsyncApi and JsonSchema to PyArrow schema

Language: Python - Size: 38.1 KB - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 7 - Forks: 0

thread53/pqviewer

View Apache Parquet Files In Your Terminal

Language: Python - Size: 244 KB - Last synced at: 22 days ago - Pushed at: 3 months ago - Stars: 15 - Forks: 0

wheretrue/biobear

Work with bioinformatic files using Arrow, Polars, and/or DuckDB

Language: Rust - Size: 2.01 MB - Last synced at: 24 days ago - Pushed at: 3 months ago - Stars: 184 - Forks: 12

legout/pydala2

Poor man's data lake - Simple API to efficiently query your Parquet datasets using DuckDB or Polars

Language: Python - Size: 4.72 MB - Last synced at: 6 days ago - Pushed at: 29 days ago - Stars: 6 - Forks: 0

grillazz/fastapi-polars

The FastAPI-Polars project aims to provide a fast, simple, and cost-effective solution to gather and store large volumes of data

Language: Python - Size: 1.15 MB - Last synced at: 21 days ago - Pushed at: about 1 month ago - Stars: 8 - Forks: 0

lykmapipo/Python-Spark-Log-Analysis

Python scripts to process and analyze log files using PySpark.

Language: Python - Size: 131 KB - Last synced at: about 1 hour ago - Pushed at: 11 months ago - Stars: 6 - Forks: 0

lykmapipo/NYC-TLC-Trip-Data

Python scripts to download, process, and analyze the New York City Taxi and Limousine Commission (TLC) Trip Record Data dataset

Language: Jupyter Notebook - Size: 100 MB - Last synced at: 21 days ago - Pushed at: 10 months ago - Stars: 5 - Forks: 1

kraina-ai/overturemaestro

An open-source tool for reading OvertureMaps data with multiprocessing and additional Quality-of-Life features

Language: Python - Size: 1.39 MB - Last synced at: 5 days ago - Pushed at: 19 days ago - Stars: 27 - Forks: 0

polsm91/news-insights-deltalake-ai

AI-Powered News Analysis from RSS Feeds with PyArrow Acero and Delta Lake's "delta-rs" (POC)

Language: Python - Size: 595 KB - Last synced at: about 1 month ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

edisedis777/Coffee-Shops-Analysis

This project analyzes the Foursquare Open Source Places dataset to explore the distribution of coffee shops across the United States, with a special focus on Portland, Oregon.

Language: HTML - Size: 12.6 MB - Last synced at: about 18 hours ago - Pushed at: about 2 months ago - Stars: 1 - Forks: 0

dr-saad-la/Pyarrow-Tuts

Pyarrow Tutorials

Language: Jupyter Notebook - Size: 17.6 KB - Last synced at: 2 months ago - Pushed at: 10 months ago - Stars: 4 - Forks: 0

RandomFractals/chicago-crimes

Exploring Chicago crimes dataset with Jupyter notebooks, DuckDB, Malloy and new Panel/PyScript data and dashboard tools.

Language: Jupyter Notebook - Size: 188 MB - Last synced at: 3 months ago - Pushed at: over 2 years ago - Stars: 38 - Forks: 4

namansnghl/SQLify

Text (business requirements) to SQL semantic parser using LLM transfer learning. It helps analysts query a database without knowing SQL.

Language: Jupyter Notebook - Size: 2.89 MB - Last synced at: 3 months ago - Pushed at: about 1 year ago - Stars: 1 - Forks: 0

HuangRicky/manylinux2014builds

manylinux2014 Python pkg builds

Size: 5.86 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 1 - Forks: 0

icaropires/pdf2dataset

Converts a whole subdirectory with a big (or small) volume of PDF documents to a dataset (pandas DataFrame) with error tracking and choice of features

Language: Python - Size: 301 KB - Last synced at: 2 months ago - Pushed at: 5 months ago - Stars: 20 - Forks: 4

iljavaleev/arrow_examples

Apache Arrow C++ examples

Language: Jupyter Notebook - Size: 157 KB - Last synced at: 3 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

d-chris/federleicht

Lightweight function decorators to cache your `pandas.DataFrame` as Feather.

Language: Python - Size: 219 KB - Last synced at: 24 days ago - Pushed at: 7 months ago - Stars: 2 - Forks: 0

anto18671/arrow-datasets

A high-performance Rust utility that converts large image datasets into chunked Apache Arrow files for efficient storage and processing.

Language: Rust - Size: 13.7 KB - Last synced at: 8 days ago - Pushed at: 4 months ago - Stars: 1 - Forks: 0

psmyth94/biosets

A bioinformatics extension of 🤗 Datasets library, built for ML applications on biological and omics data, offering easy integration of metadata and low-code data management tools.

Language: Python - Size: 279 KB - Last synced at: 2 days ago - Pushed at: 7 months ago - Stars: 3 - Forks: 0

No-Country-simulation/c22-29-ft-data-bi

🚀 Inventory control optimization for BottleFlow Logistics: a strategic, data-driven approach #Supply Chain 🚀

Language: Jupyter Notebook - Size: 2.76 MB - Last synced at: 5 months ago - Pushed at: 6 months ago - Stars: 2 - Forks: 0

kivanc57/mongodb_operations

This project demonstrates MongoDB CRUD operations, data modeling, and advanced Atlas Search and Atlas Vector Search features with Hugging Face, PyMongo, and PyArrow to process and query data efficiently.

Language: Jupyter Notebook - Size: 49.8 KB - Last synced at: 3 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

goalzz85/sql2arrow

SQL2Arrow, short for 'SQL to Arrow,' is a Python library that provides convenient and high-performance methods to parse INSERT SQL statements into Arrow arrays. It is particularly useful for analyzing data dumped by mysqldump or other tools.

Language: Rust - Size: 157 KB - Last synced at: 18 days ago - Pushed at: 6 months ago - Stars: 5 - Forks: 0

Murtaza-arif/all-you-need-to-know-for-data-engineer

This repository is designed to showcase various aspects of data engineering, including tools, frameworks, and end-to-end projects. It covers everything from data ingestion and transformation to data warehousing and cloud-based solutions.

Language: Python - Size: 168 KB - Last synced at: 3 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

stefur/swemaps

Maps of Sweden in GeoParquet

Language: Python - Size: 2.06 MB - Last synced at: 1 day ago - Pushed at: 6 months ago - Stars: 1 - Forks: 0

pyodide/pyodide-numpy-2.0-rebuilds

Rebuilds against NumPy v2 for Pyodide packages: LightGBM and PyArrow in light of the NumPy 2.0 migration for Pyodide 0.27

Size: 10.7 KB - Last synced at: about 11 hours ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0

d-chris/federleicht-benchmark

small script to benchmark `federleicht`

Language: Python - Size: 323 KB - Last synced at: 3 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0

345950647/clickhouse_types

Converting ClickHouse types into other schemas' types

Language: Python - Size: 14.6 KB - Last synced at: 7 months ago - Pushed at: 7 months ago - Stars: 1 - Forks: 0

Utlak88/NarrativeWave-App

Django app that collects data from Parquet files, provides the data in JSON format, and then consumes and returns the data.

Language: HTML - Size: 294 MB - Last synced at: 7 months ago - Pushed at: about 4 years ago - Stars: 0 - Forks: 0

ismailhammounou/db2ixf

db2ixf is a python package with a CLI that simplifies the parsing and processing of IBM Integration eXchange Format (IXF) files.

Language: Python - Size: 1 MB - Last synced at: about 1 month ago - Pushed at: over 1 year ago - Stars: 17 - Forks: 1

xbrianh/xdlake

A loose implementation of the deltalake protocol, written in Python on top of pyarrow, focused on extensibility, customizability, and distributed data.

Language: Python - Size: 241 KB - Last synced at: 13 days ago - Pushed at: 8 months ago - Stars: 4 - Forks: 0

dacort/faker-cli

Command-line interface to quickly generate fake CSV and JSON data

Language: Python - Size: 81.1 KB - Last synced at: 3 months ago - Pushed at: 12 months ago - Stars: 72 - Forks: 6

legout/pydala 📦

Poor man's simple Python API for creating a local or remote data lake based on several (PyArrow) datasets using DuckDB

Language: Python - Size: 14.1 MB - Last synced at: 6 days ago - Pushed at: almost 2 years ago - Stars: 9 - Forks: 1

milesgranger/flaco

(PoC) A very memory-efficient way to read data from PostgreSQL

Language: Rust - Size: 146 KB - Last synced at: 21 days ago - Pushed at: over 2 years ago - Stars: 15 - Forks: 0

BenyaminZojaji/mongodb_tutorial

MongoDB tutorial repository

Language: Python - Size: 3.91 KB - Last synced at: 10 months ago - Pushed at: 11 months ago - Stars: 0 - Forks: 0

derak-isaack/UberEATSAnalytics

Analyze Uber Eats Menu big data for various analytics

Language: Jupyter Notebook - Size: 2.84 MB - Last synced at: 3 months ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

rhanyele/tcg-pokemon-data-pipeline

The Pokémon TCG Data Pipeline project aims to create a data pipeline solution to collect, transform, and analyze information about Pokémon TCG (Trading Card Game) cards.

Language: Python - Size: 39.1 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

jaysnm/dremio-arrow

Dremio Arrow Flight Client

Language: Python - Size: 1.79 MB - Last synced at: 23 days ago - Pushed at: over 1 year ago - Stars: 4 - Forks: 5

agutiernc/data-eng-zoomcamp

Data Engineering Zoomcamp 2024

Language: Jupyter Notebook - Size: 1 MB - Last synced at: 4 months ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

miraisolutions/apache-arrow-flight-python-example

Code examples / snippets for website news post

Language: Python - Size: 3.91 KB - Last synced at: about 1 year ago - Pushed at: over 3 years ago - Stars: 1 - Forks: 1

JakubPluta/flowsome

Minimal framework for building and executing data workflows on a single machine

Language: Python - Size: 226 KB - Last synced at: 8 months ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

SaelKimberly/rxls

Reading both XLSX and XLSB files, fast and memory-safe, with Python, into PyArrow

Language: Jupyter Notebook - Size: 64.5 MB - Last synced at: 11 months ago - Pushed at: over 1 year ago - Stars: 9 - Forks: 2

runsascoded/parquet-diff-test

Demonstrate differences in Parquet files generated by pyarrow on macOS vs. {Ubuntu, Windows}.

Language: Python - Size: 42 KB - Last synced at: about 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

dimdasci/howto-mongo-bulk-update-from-parquet

MongoDB Bulk Update Tutorial with AWS Glue, PyArrow, Parquet, and Infrastructure as Code (IaC)

Language: Python - Size: 21.5 KB - Last synced at: about 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

tezzytezzy/us-flight-delay

Complete Guide to Data Munging

Language: Jupyter Notebook - Size: 619 KB - Last synced at: over 1 year ago - Pushed at: almost 4 years ago - Stars: 0 - Forks: 0

k3ssdev/ParquetScriptTools

Collection of Python scripts using PyArrow and Pandas for efficient handling of Parquet files. Includes tools to view schemas, convert to CSV, check for duplicates, and merge Parquet files.

Language: Python - Size: 27.3 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 1 - Forks: 0

hansalemaos/procmondf

Provides a convenient and efficient solution for capturing and analyzing system activity logs using Procmon and converting them to the pandas-compatible Parquet file format (2% of the original PML file size)

Language: Python - Size: 3.55 MB - Last synced at: 18 days ago - Pushed at: almost 2 years ago - Stars: 0 - Forks: 0

FCP-INDI/b2t-prototype 📦

Organize neuroimaging data derivatives into parquet tables

Language: Python - Size: 3.09 MB - Last synced at: about 2 years ago - Pushed at: about 2 years ago - Stars: 3 - Forks: 0

amoeba/pyarrow-ipc-example

An example showing how to send compressed RecordBatches over HTTP with PyArrow.

Language: Python - Size: 1000 Bytes - Last synced at: 1 day ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

leehuwuj/lake-inspector

Inspect your lakehouse data by using PyArrow

Language: Python - Size: 447 KB - Last synced at: 3 days ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

mercator-labs/oakstore

High-speed time-series database for pandas DataFrames

Language: Python - Size: 21.5 KB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 4 - Forks: 1

adavis444/pyarrow-alpine-wheel

Dockerfile and Python 3.9 wheel for PyArrow 3.0.0 built on Alpine 3.14 (does not include Plasma or Parquet)

Language: Dockerfile - Size: 4.97 MB - Last synced at: over 2 years ago - Pushed at: almost 4 years ago - Stars: 1 - Forks: 0

1blt-archive/dspg22_pyarrow-example

Saving large files on GitHub

Language: Python - Size: 964 MB - Last synced at: 11 days ago - Pushed at: almost 3 years ago - Stars: 0 - Forks: 0

evi1ghost/csv_to_avro_or_parquet

CSV converter

Language: Python - Size: 364 KB - Last synced at: about 2 years ago - Pushed at: about 3 years ago - Stars: 0 - Forks: 0

svjack/PyArrowExpressionCastToolkit

A small cast toolkit class derived from _ParquetDatasetV2 to support casts in the filters argument

Language: Python - Size: 28.3 KB - Last synced at: over 2 years ago - Pushed at: over 4 years ago - Stars: 1 - Forks: 0

kiwi0fruit/featherhelper

Concise interface to cache numpy arrays and pandas dataframes

Language: Python - Size: 47.9 KB - Last synced at: 12 days ago - Pushed at: over 6 years ago - Stars: 3 - Forks: 0

bundgus/pydata2parquet

Convert data to the parquet format with Python dask and pyarrow.

Language: Python - Size: 2.38 MB - Last synced at: about 2 years ago - Pushed at: about 7 years ago - Stars: 0 - Forks: 0