GitHub topics: pyarrow

Repositories

uber/petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.

Language: Python - Size: 2.69 MB - Last synced at: about 1 hour ago - Pushed at: over 1 year ago - Stars: 1,848 - Forks: 284

gizmodata/gizmosql

🚀 GizmoSQL — High-Performance SQL Server for the Cloud

Language: C++ - Size: 834 KB - Last synced at: about 10 hours ago - Pushed at: 1 day ago - Stars: 133 - Forks: 13

narwhals-dev/narwhals

Lightweight and extensible compatibility layer between dataframe libraries!

Language: Python - Size: 11.4 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 1,203 - Forks: 159

vaexio/vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀

Language: Python - Size: 133 MB - Last synced at: about 6 hours ago - Pushed at: 10 months ago - Stars: 8,415 - Forks: 599

milenkovicm/ballista_python

PyArrow UDF support for Ballista clusters

Language: Rust - Size: 157 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 1 - Forks: 0

ibis-project/ibis

The portable Python dataframe library

Language: Python - Size: 176 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 5,956 - Forks: 661

CangyuanLi/checkedframe

Lightweight, engine-agnostic dataframe validation

Language: Python - Size: 2.85 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 10 - Forks: 0

cning112/fastflight

FastFlight is a high-performance data transfer framework using Apache Arrow Flight for efficient, modular, and pluggable data streaming with optional FastAPI integration for HTTP-based access.

Language: Python - Size: 2.88 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 6 - Forks: 0

zen-xu/pyarrow-stubs

Type annotations for pyarrow

Language: Python - Size: 720 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 36 - Forks: 21

aekasitt/polars-v-pandas (fork of vb100/polars_vs_pandas)

Quick dataframe performance comparison between Polars and Pandas, with and without PyArrow backend

Language: Python - Size: 543 KB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 0 - Forks: 0

kraina-ai/overturemaestro

An open-source tool for reading OvertureMaps data with multiprocessing and additional Quality-of-Life features

Language: Python - Size: 1.39 MB - Last synced at: 4 days ago - Pushed at: 24 days ago - Stars: 29 - Forks: 0

legout/pydala2

Poor man's data lake: a simple API to efficiently query your Parquet datasets using DuckDB or Polars

Language: Python - Size: 4.73 MB - Last synced at: 8 days ago - Pushed at: 17 days ago - Stars: 6 - Forks: 0

thread53/pqviewer

View Apache Parquet Files In Your Terminal

Language: Python - Size: 244 KB - Last synced at: 26 days ago - Pushed at: 4 months ago - Stars: 15 - Forks: 0

pyodide/pyodide-numpy-2.0-rebuilds 📦

Rebuilds against NumPy v2 for Pyodide packages: LightGBM and PyArrow in light of the NumPy 2.0 migration for Pyodide 0.27

Size: 11.7 KB - Last synced at: about 13 hours ago - Pushed at: 29 days ago - Stars: 0 - Forks: 0

trustedshops-public/schema2pyarrow

Converts AsyncAPI and JSON Schema definitions to PyArrow schemas

Language: Python - Size: 37.1 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 7 - Forks: 0

goalzz85/sql2arrow

SQL2Arrow, short for 'SQL to Arrow,' is a Python library that provides convenient and high-performance methods to parse INSERT SQL statements into Arrow arrays. It is particularly useful for analyzing data dumped by mysqldump or other tools.

Language: Rust - Size: 157 KB - Last synced at: 26 days ago - Pushed at: 7 months ago - Stars: 6 - Forks: 0

wheretrue/biobear

Work with bioinformatic files using Arrow, Polars, and/or DuckDB

Language: Rust - Size: 2.01 MB - Last synced at: about 1 month ago - Pushed at: 5 months ago - Stars: 185 - Forks: 12

itsbigspark/pymetagen

Metadata Generator

Language: Python - Size: 1.07 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0

DanielAvdar/pandas-pyarrow

Seamlessly switch Pandas DataFrame backend to PyArrow.

Language: Python - Size: 1.06 MB - Last synced at: 2 days ago - Pushed at: about 2 months ago - Stars: 9 - Forks: 4

grillazz/fastapi-polars

The FastAPI-Polars project aims to provide a fast, simple, and cost-effective solution to gather and store large volumes of data

Language: Python - Size: 1.15 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 8 - Forks: 0

lykmapipo/Python-Spark-Log-Analysis

Python scripts to process and analyze log files using PySpark.

Language: Python - Size: 131 KB - Last synced at: about 1 month ago - Pushed at: about 1 year ago - Stars: 6 - Forks: 0

lykmapipo/NYC-TLC-Trip-Data

Python scripts to download, process, and analyze the New York City Taxi and Limousine Commission (TLC) Trip Record Data dataset

Language: Jupyter Notebook - Size: 100 MB - Last synced at: 2 months ago - Pushed at: 11 months ago - Stars: 5 - Forks: 1

polsm91/news-insights-deltalake-ai

AI-Powered News Analysis from RSS Feeds with PyArrow Acero and Delta Lake's "delta-rs" (POC)

Language: Python - Size: 595 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

edisedis777/Coffee-Shops-Analysis

This project analyzes the Foursquare Open Source Places dataset to explore the distribution of coffee shops across the United States, with a special focus on Portland, Oregon.

Language: HTML - Size: 12.6 MB - Last synced at: about 1 month ago - Pushed at: 3 months ago - Stars: 1 - Forks: 0

dr-saad-la/Pyarrow-Tuts

Pyarrow Tutorials

Language: Jupyter Notebook - Size: 17.6 KB - Last synced at: 4 months ago - Pushed at: 11 months ago - Stars: 4 - Forks: 0

RandomFractals/chicago-crimes

Exploring Chicago crimes dataset with Jupyter notebooks, DuckDB, Malloy and new Panel/PyScript data and dashboard tools.

Language: Jupyter Notebook - Size: 188 MB - Last synced at: 4 months ago - Pushed at: over 2 years ago - Stars: 38 - Forks: 4

namansnghl/SQLify

Text (business requirements) to SQL semantic parser using LLM transfer learning. It helps analysts query a database without knowing SQL.

Language: Jupyter Notebook - Size: 2.89 MB - Last synced at: 4 months ago - Pushed at: over 1 year ago - Stars: 1 - Forks: 0

HuangRicky/manylinux2014builds

manylinux2014 Python package builds

Size: 5.86 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 1 - Forks: 0

icaropires/pdf2dataset

Converts a whole subdirectory with a big (or small) volume of PDF documents to a dataset (pandas DataFrame) with error tracking and choice of features

Language: Python - Size: 301 KB - Last synced at: 4 months ago - Pushed at: 7 months ago - Stars: 20 - Forks: 4

iljavaleev/arrow_examples

Apache Arrow C++ examples

Language: Jupyter Notebook - Size: 157 KB - Last synced at: 4 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

d-chris/federleicht

Lightweight function decorators to cache your `pandas.DataFrame` as Feather.

Language: Python - Size: 219 KB - Last synced at: 28 days ago - Pushed at: 8 months ago - Stars: 2 - Forks: 0

anto18671/arrow-datasets

A high-performance Rust utility that converts large image datasets into chunked Apache Arrow files for efficient storage and processing.

Language: Rust - Size: 13.7 KB - Last synced at: 1 day ago - Pushed at: 5 months ago - Stars: 1 - Forks: 0

psmyth94/biosets

A bioinformatics extension of 🤗 Datasets library, built for ML applications on biological and omics data, offering easy integration of metadata and low-code data management tools.

Language: Python - Size: 279 KB - Last synced at: 12 days ago - Pushed at: 9 months ago - Stars: 3 - Forks: 0

No-Country-simulation/c22-29-ft-data-bi

🚀 Inventory control optimization for BottleFlow Logistics: a strategic, data-driven approach #Supply Chain 🚀

Language: Jupyter Notebook - Size: 2.76 MB - Last synced at: 6 months ago - Pushed at: 8 months ago - Stars: 2 - Forks: 0

kivanc57/mongodb_operations

This project demonstrates MongoDB CRUD operations, data modeling, and advanced Atlas Search and Atlas Vector Search features, using Hugging Face, PyMongo, and PyArrow to process and query data efficiently.

Language: Jupyter Notebook - Size: 49.8 KB - Last synced at: 4 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0

Murtaza-arif/all-you-need-to-know-for-data-engineer

This repository is designed to showcase various aspects of data engineering, including tools, frameworks, and end-to-end projects. It covers everything from data ingestion and transformation to data warehousing and cloud-based solutions.

Language: Python - Size: 168 KB - Last synced at: 4 months ago - Pushed at: 7 months ago - Stars: 0 - Forks: 0

stefur/swemaps

Maps of Sweden in GeoParquet

Language: Python - Size: 2.06 MB - Last synced at: 6 days ago - Pushed at: 7 months ago - Stars: 1 - Forks: 0

d-chris/federleicht-benchmark

Small script to benchmark `federleicht`

Language: Python - Size: 323 KB - Last synced at: 4 months ago - Pushed at: 8 months ago - Stars: 0 - Forks: 0

345950647/clickhouse_types

Converting ClickHouse types into other schemas' types

Language: Python - Size: 14.6 KB - Last synced at: 8 months ago - Pushed at: 8 months ago - Stars: 1 - Forks: 0

Utlak88/NarrativeWave-App

Django app that collects data from Parquet files, provides the data in JSON format, and then consumes and returns the data.

Language: HTML - Size: 294 MB - Last synced at: 9 months ago - Pushed at: about 4 years ago - Stars: 0 - Forks: 0

ismailhammounou/db2ixf

db2ixf is a Python package with a CLI that simplifies the parsing and processing of IBM Integration eXchange Format (IXF) files.

Language: Python - Size: 1 MB - Last synced at: 7 days ago - Pushed at: over 1 year ago - Stars: 17 - Forks: 1

xbrianh/xdlake

A loose implementation of the Delta Lake protocol, written in Python on top of PyArrow, focused on extensibility, customizability, and distributed data.

Language: Python - Size: 241 KB - Last synced at: about 1 month ago - Pushed at: 10 months ago - Stars: 4 - Forks: 0

dacort/faker-cli

Command-line interface to quickly generate fake CSV and JSON data

Language: Python - Size: 81.1 KB - Last synced at: 4 months ago - Pushed at: about 1 year ago - Stars: 72 - Forks: 6

legout/pydala 📦

Poor man's simple Python API for creating a local or remote data lake based on several (PyArrow) datasets using DuckDB

Language: Python - Size: 14.1 MB - Last synced at: 8 days ago - Pushed at: about 2 years ago - Stars: 9 - Forks: 1

milesgranger/flaco

(PoC) A very memory-efficient way to read data from PostgreSQL

Language: Rust - Size: 146 KB - Last synced at: 2 months ago - Pushed at: almost 3 years ago - Stars: 15 - Forks: 0

BenyaminZojaji/mongodb_tutorial

MongoDB tutorial repository

Language: Python - Size: 3.91 KB - Last synced at: 11 months ago - Pushed at: 12 months ago - Stars: 0 - Forks: 0

derak-isaack/UberEATSAnalytics

Analytics on large-scale Uber Eats menu data

Language: Jupyter Notebook - Size: 2.84 MB - Last synced at: 5 months ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

rhanyele/tcg-pokemon-data-pipeline

The Pokémon TCG Data Pipeline project aims to create a data pipeline solution to collect, transform, and analyze information about Pokémon TCG (Trading Card Game) cards.

Language: Python - Size: 39.1 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

jaysnm/dremio-arrow

Dremio Arrow Flight Client

Language: Python - Size: 1.79 MB - Last synced at: 27 days ago - Pushed at: over 1 year ago - Stars: 4 - Forks: 5

agutiernc/data-eng-zoomcamp

Data Engineering Zoomcamp 2024

Language: Jupyter Notebook - Size: 1 MB - Last synced at: 5 months ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

miraisolutions/apache-arrow-flight-python-example

Code examples / snippets for website news post

Language: Python - Size: 3.91 KB - Last synced at: over 1 year ago - Pushed at: over 3 years ago - Stars: 1 - Forks: 1

JakubPluta/flowsome

Minimal framework for building and executing data workflows on a single machine

Language: Python - Size: 226 KB - Last synced at: 9 months ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

SaelKimberly/rxls

Reading both XLSX and XLSB files, fast and memory-safe, with Python, into PyArrow

Language: Jupyter Notebook - Size: 64.5 MB - Last synced at: 12 months ago - Pushed at: over 1 year ago - Stars: 9 - Forks: 2

runsascoded/parquet-diff-test

Demonstrate differences in Parquet files generated by pyarrow on macOS vs. {Ubuntu, Windows}.

Language: Python - Size: 42 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

dimdasci/howto-mongo-bulk-update-from-parquet

MongoDB Bulk Update Tutorial with AWS Glue, PyArrow, Parquet, and Infrastructure as Code (IaC)

Language: Python - Size: 21.5 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

tezzytezzy/us-flight-delay

Complete Guide to Data Munging

Language: Jupyter Notebook - Size: 619 KB - Last synced at: over 1 year ago - Pushed at: about 4 years ago - Stars: 0 - Forks: 0

k3ssdev/ParquetScriptTools

Collection of Python scripts using PyArrow and Pandas for efficient handling of Parquet files. Includes tools to view schemas, convert to CSV, check for duplicates, and merge Parquet files.

Language: Python - Size: 27.3 KB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 1 - Forks: 0

hansalemaos/procmondf

Provides a convenient and efficient way to capture and analyze system activity logs using Procmon and convert them to the pandas-compatible Parquet file format (2% of the original PML file size)

Language: Python - Size: 3.55 MB - Last synced at: 27 days ago - Pushed at: about 2 years ago - Stars: 0 - Forks: 0