An open API service providing repository metadata for many open source software ecosystems.

Topic: "deltalake"

paradedb/pg_analytics 📦

DuckDB-powered data lake analytics from Postgres

Language: Rust - Size: 814 KB - Last synced at: 2 months ago - Pushed at: 3 months ago - Stars: 522 - Forks: 21

databrickslabs/dbldatagen

Generate relevant synthetic data quickly for your projects. The Databricks Labs synthetic data generator (aka `dbldatagen`) may be used to generate large simulated / synthetic data sets for tests, POCs, and other uses in Databricks environments, including Delta Live Tables pipelines.

Language: Python - Size: 11.1 MB - Last synced at: 4 days ago - Pushed at: 24 days ago - Stars: 406 - Forks: 73

delta-io/kafka-delta-ingest

A highly efficient daemon for streaming data from Kafka into Delta Lake

Language: Rust - Size: 1.85 MB - Last synced at: 2 days ago - Pushed at: about 1 month ago - Stars: 403 - Forks: 90

MrPowers/mack

Delta Lake helper methods in PySpark

Language: Python - Size: 2.81 MB - Last synced at: 13 days ago - Pushed at: 9 months ago - Stars: 325 - Forks: 42

japila-books/delta-lake-internals

The Internals of Delta Lake

Size: 191 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 183 - Forks: 36

smart-data-lake/smart-data-lake

Smart Automation Tool for building modern Data Lakes and Data Pipelines

Language: Scala - Size: 44 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 124 - Forks: 23

uname-n/deltabase

A lightweight, comprehensive solution for managing Delta tables, built on Polars and deltalake

Language: Python - Size: 728 KB - Last synced at: about 1 month ago - Pushed at: 5 months ago - Stars: 121 - Forks: 1

izhangzhihao/Real-time-Data-Warehouse

Real-time Data Warehouse with Apache Flink & Apache Kafka & Apache Hudi

Language: Dockerfile - Size: 106 KB - Last synced at: about 2 months ago - Pushed at: over 1 year ago - Stars: 113 - Forks: 44

WeBankFinTech/Streamis

A streaming application development and management system based on Linkis and DSS, with planned support for workflow-style graphical drag-and-drop development.

Language: Java - Size: 72.2 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 107 - Forks: 44

anneglienke/101_upsert-delta

This repository exemplifies a simple ELT process using delta to perform upsert and remove data files that aren't in the latest state of the transaction log for the table.

Language: Python - Size: 1.17 MB - Last synced at: 9 days ago - Pushed at: over 3 years ago - Stars: 91 - Forks: 7

dacort/faker-cli

Command-line interface to quickly generate fake CSV and JSON data

Language: Python - Size: 81.1 KB - Last synced at: about 2 months ago - Pushed at: 11 months ago - Stars: 72 - Forks: 6

martandsingh/ApacheSpark

This repository helps you learn Databricks concepts through examples. It covers the important topics a data engineer needs in real-world work, using PySpark and Spark SQL for development. The course ends with a few case studies.

Language: Python - Size: 141 KB - Last synced at: over 2 years ago - Pushed at: over 2 years ago - Stars: 71 - Forks: 47

bhavink/databricks

Databricks Platform - Architecture, Security, Automation and much more!!

Language: Jupyter Notebook - Size: 14.4 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 50 - Forks: 27

flintml/flint

A self-contained, lightweight and OOB research platform for modern ML

Language: Jupyter Notebook - Size: 7.52 MB - Last synced at: 28 days ago - Pushed at: about 1 month ago - Stars: 45 - Forks: 2

buoyant-data/oxbow

Collection of AWS Lambdas for creating and managing Delta tables

Language: Rust - Size: 288 KB - Last synced at: 10 days ago - Pushed at: 29 days ago - Stars: 36 - Forks: 10

sankamuk/PysparkCheatsheet

PySpark Cheatsheet

Language: Python - Size: 11.2 MB - Last synced at: about 1 month ago - Pushed at: over 2 years ago - Stars: 36 - Forks: 27

leehuwuj/olh

Open source stack lakehouse

Language: Python - Size: 4.57 MB - Last synced at: 4 days ago - Pushed at: over 1 year ago - Stars: 25 - Forks: 1

DataTech-Solutions/Threat-Detection-and-Visualization

Threat Detection and Visualization

Language: TSQL - Size: 11.9 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 25 - Forks: 153

newfront/hitchhikers_guide_to_deltalake_streaming

Don't Panic. This guide will help you when it feels like the end of the world.

Language: Jupyter Notebook - Size: 230 KB - Last synced at: 2 months ago - Pushed at: 12 months ago - Stars: 23 - Forks: 9

mrjsj/msfabricutils

Spark-free Python utilities for Microsoft Fabric focused on Data Engineering using Polars and delta-rs

Language: Python - Size: 1.38 MB - Last synced at: about 1 month ago - Pushed at: 5 months ago - Stars: 22 - Forks: 4

ognis1205/delta-hub

A platform and cloud-based service for data sharing based on the Delta Sharing protocol.

Language: TypeScript - Size: 6.66 MB - Last synced at: 12 months ago - Pushed at: 12 months ago - Stars: 21 - Forks: 3

aws-samples/amazon-emr-with-delta-lake

Amazon EMR Notebook to show how to read from and write to Delta tables with Amazon EMR

Language: Jupyter Notebook - Size: 473 KB - Last synced at: about 8 hours ago - Pushed at: about 1 month ago - Stars: 18 - Forks: 14

reisdebora/awesome-databricks

A curated list of awesome Databricks resources, including Spark

Size: 27.3 KB - Last synced at: 29 days ago - Pushed at: 11 months ago - Stars: 18 - Forks: 3

ismailhammounou/db2ixf

db2ixf is a python package with a CLI that simplifies the parsing and processing of IBM Integration eXchange Format (IXF) files.

Language: Python - Size: 1 MB - Last synced at: 15 days ago - Pushed at: about 1 year ago - Stars: 17 - Forks: 1

aravinthsci/Spark_Delta_Lake

Delta Lake Examples

Language: Jupyter Notebook - Size: 285 KB - Last synced at: over 1 year ago - Pushed at: about 5 years ago - Stars: 12 - Forks: 12

goodwillpunning/nodejs-sharing-client

A Node.js connector for Delta Sharing.

Language: JavaScript - Size: 423 KB - Last synced at: 5 days ago - Pushed at: 2 months ago - Stars: 10 - Forks: 7

bmsuisse/lakeapi

API for distributing Data Lake Data

Language: Python - Size: 15.2 MB - Last synced at: 3 months ago - Pushed at: 7 months ago - Stars: 10 - Forks: 2

taka-yayoi/public_repo

Contains Databricks sample notebooks.

Language: Python - Size: 43.9 MB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 8 - Forks: 7

roeap/flight-fusion

Language: Rust - Size: 4.3 MB - Last synced at: about 2 months ago - Pushed at: over 1 year ago - Stars: 7 - Forks: 2

xbrianh/xdlake

A loose implementation of the deltalake protocol, written in Python on top of pyarrow, focused on extensibility, customizability, and distributed data.

Language: Python - Size: 241 KB - Last synced at: 4 days ago - Pushed at: 8 months ago - Stars: 4 - Forks: 0

yandex-cloud/yc-delta

Delta Lake for Yandex Data Processing

Language: Java - Size: 186 KB - Last synced at: about 1 month ago - Pushed at: 10 months ago - Stars: 4 - Forks: 1

LeoneGarage/StreamJoin

A framework for incremental streaming joins and incremental streaming aggregations over change data feeds from Databricks Delta

Language: Python - Size: 196 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 3 - Forks: 2

alberttwong/opendatalakehouse

The Data Lakehouse Readiness Score is a quantitative measure that assesses a database vendor's support for Apache Iceberg, Apache Hudi and Delta Lake.

Size: 37.1 KB - Last synced at: 5 days ago - Pushed at: 9 months ago - Stars: 3 - Forks: 0

satyakommula96/spark_benchmark

Spark Performance Benchmark suite to evaluate all TPC-DS and TPC-H query times

Language: Scala - Size: 97.7 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 3 - Forks: 2

datalpia/laketower

Oversee your lakehouse

Language: Python - Size: 1.55 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 2 - Forks: 0

cmackenzie1/deltalake-go 📦

An implementation of Delta Lake in Go

Language: Go - Size: 79.1 KB - Last synced at: 2 months ago - Pushed at: almost 2 years ago - Stars: 2 - Forks: 0

sebastianruizm/demo-data-pipeline Fork of lbodnarin/data-pipeline

Simple data pipeline (Airflow + Spark)

Language: Python - Size: 7.61 MB - Last synced at: about 2 years ago - Pushed at: about 2 years ago - Stars: 2 - Forks: 0

credimi/pandora

Relational tables from nested data

Language: Scala - Size: 32.2 KB - Last synced at: almost 2 years ago - Pushed at: over 2 years ago - Stars: 2 - Forks: 1

easonlai/databricks_delta_table_samples

This is a code sample repository for demonstrating how to perform Databricks Delta Table operations.

Language: HTML - Size: 23.9 MB - Last synced at: 3 months ago - Pushed at: about 3 years ago - Stars: 2 - Forks: 2

ev2900/EMR_Studio_Delta_Lake

Delta Lake examples designed to be run on AWS Elastic MapReduce (EMR) via EMR Studio or EMR Notebooks

Language: Jupyter Notebook - Size: 15.6 KB - Last synced at: 2 months ago - Pushed at: 10 months ago - Stars: 1 - Forks: 1

JayyShah/Databricks-AWS

Databricks provides a unified, open platform for all your data. It empowers data scientists, data engineers and data analysts with a simple collaborative environment to run interactive and scheduled data analysis workloads.

Language: Python - Size: 3.91 KB - Last synced at: almost 2 years ago - Pushed at: over 2 years ago - Stars: 1 - Forks: 0

himewel/ifood-data

Ifood data wrangling with Apache Airflow and Apache Spark running on Kubernetes

Language: Python - Size: 396 KB - Last synced at: about 2 years ago - Pushed at: over 3 years ago - Stars: 1 - Forks: 1

jasondavindev/delta-lake-dms-cdc

Example application for DMS CDC with Delta Lake and Apache Hudi

Language: Python - Size: 69.5 MB - Last synced at: about 2 years ago - Pushed at: over 3 years ago - Stars: 1 - Forks: 1

vvalcristina/treinamento-dataproc-deltalake

Training environment for Dataproc and Delta Lake

Language: Jupyter Notebook - Size: 664 KB - Last synced at: over 2 years ago - Pushed at: almost 4 years ago - Stars: 1 - Forks: 1

mrjsj/blueno

A Python ETL library for creating declarative data pipelines.

Language: Python - Size: 2.83 MB - Last synced at: 9 days ago - Pushed at: 10 days ago - Stars: 0 - Forks: 0

tansu-io/example-delta-lake

Tansu schema-backed topics, instantly accessible as Delta Lake tables

Language: Just - Size: 27.3 KB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 0 - Forks: 0

tashi-2004/Apache-Airflow-Kafka-Spark-DeltaLake-Real-Time-Stream-Pipeline

This project implements a real-time data pipeline using Apache Airflow, Kafka, Apache Spark, and Delta Lake. It supports both batch (Coldpath) and real-time (Hotpath) data ingestion, processing, and storage. Airflow is used for orchestrating the data workflows.

Language: Python - Size: 12.5 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

dde-labs/self-lake

Self sandbox for interacting with any lake storage such as Delta Lake, Iceberg, Hudi, etc.

Language: Python - Size: 665 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

jayhan94/MiniLake

A modern mini lakehouse based on Spark and Iceberg, running in Docker.

Size: 8.79 KB - Last synced at: 8 days ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

najmaelboutaheri/Patents_analysis

This repository contains code and resources for analyzing patents using Apache Spark, Python, and AWS services. The objective of this project is to extract insights and trends from patent data to inform business decisions and intellectual property strategies.

Language: Jupyter Notebook - Size: 7.79 MB - Last synced at: 3 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0

dougdss89/wideworldadventure

This repository includes all files that compose the design and unification of the databases AdventureWorks and WideWorldAdventure project.

Language: Shell - Size: 230 KB - Last synced at: 2 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0

palutz/rust_nextstep

A series of exercises to play with more advanced topics in Rust

Language: Rust - Size: 304 KB - Last synced at: 3 months ago - Pushed at: 10 months ago - Stars: 0 - Forks: 0

makism/datastack-playground

A datastack playground; includes Spark, Kafka, Airbyte, etc.

Language: Jupyter Notebook - Size: 55.7 KB - Last synced at: 8 days ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

herry13/glue-docker-image

A custom Glue Docker image

Language: Dockerfile - Size: 2.93 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

jcguidry/flight-ml-preprocess-gcp

Continuous flight event data processing using Spark Streaming and Delta Lake storage, deployed on a GCP Dataproc cluster.

Language: Python - Size: 13.7 KB - Last synced at: 18 days ago - Pushed at: almost 2 years ago - Stars: 0 - Forks: 0

data-engineer-course/taxacco

Project No. 4 for the "Data Engineer" course.

Language: Jupyter Notebook - Size: 11.5 MB - Last synced at: over 1 year ago - Pushed at: about 2 years ago - Stars: 0 - Forks: 0

buoyant-data/lambda-delta-optimize

AWS Lambda function for optimizing Delta tables

Language: HCL - Size: 64.5 KB - Last synced at: 10 days ago - Pushed at: about 2 years ago - Stars: 0 - Forks: 0

OpenTableFormat/OpenTableFormat.github.io

Website for open table format 🕸

Language: CSS - Size: 4.59 MB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

naiborhujosua/Data-Scientist-learning-path-using-databricks

This is the summary of learning Data Science using Databricks

Size: 51.8 KB - Last synced at: about 1 year ago - Pushed at: almost 4 years ago - Stars: 0 - Forks: 0