Topic: "deltalake"
paradedb/pg_analytics 📦
DuckDB-powered data lake analytics from Postgres
Language: Rust - Size: 814 KB - Last synced at: 2 months ago - Pushed at: 3 months ago - Stars: 522 - Forks: 21

databrickslabs/dbldatagen
Generate relevant synthetic data quickly for your projects. The Databricks Labs synthetic data generator (aka `dbldatagen`) may be used to generate large simulated / synthetic data sets for tests, POCs, and other uses in Databricks environments, including Delta Live Tables pipelines.
Language: Python - Size: 11.1 MB - Last synced at: 4 days ago - Pushed at: 24 days ago - Stars: 406 - Forks: 73

delta-io/kafka-delta-ingest
A highly efficient daemon for streaming data from Kafka into Delta Lake
Language: Rust - Size: 1.85 MB - Last synced at: 2 days ago - Pushed at: about 1 month ago - Stars: 403 - Forks: 90

MrPowers/mack
Delta Lake helper methods in PySpark
Language: Python - Size: 2.81 MB - Last synced at: 13 days ago - Pushed at: 9 months ago - Stars: 325 - Forks: 42

japila-books/delta-lake-internals
The Internals of Delta Lake
Size: 191 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 183 - Forks: 36

smart-data-lake/smart-data-lake
Smart Automation Tool for building modern Data Lakes and Data Pipelines
Language: Scala - Size: 44 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 124 - Forks: 23

uname-n/deltabase
A lightweight, comprehensive solution for managing Delta tables, built on polars and deltalake
Language: Python - Size: 728 KB - Last synced at: about 1 month ago - Pushed at: 5 months ago - Stars: 121 - Forks: 1

izhangzhihao/Real-time-Data-Warehouse
Real-time Data Warehouse with Apache Flink & Apache Kafka & Apache Hudi
Language: Dockerfile - Size: 106 KB - Last synced at: about 2 months ago - Pushed at: over 1 year ago - Stars: 113 - Forks: 44

WeBankFinTech/Streamis
Streaming application development and management system based on Linkis and DSS, with plans to provide workflow-like graphical drag-and-drop development.
Language: Java - Size: 72.2 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 107 - Forks: 44

anneglienke/101_upsert-delta
This repository exemplifies a simple ELT process using delta to perform upsert and remove data files that aren't in the latest state of the transaction log for the table.
Language: Python - Size: 1.17 MB - Last synced at: 9 days ago - Pushed at: over 3 years ago - Stars: 91 - Forks: 7

dacort/faker-cli
Command-line interface to quickly generate fake CSV and JSON data
Language: Python - Size: 81.1 KB - Last synced at: about 2 months ago - Pushed at: 11 months ago - Stars: 72 - Forks: 6

martandsingh/ApacheSpark
This repository helps you learn Databricks concepts through examples. It covers the important topics a data engineer needs in real-life work. We use PySpark and Spark SQL for development. At the end of the course we also cover a few case studies.
Language: Python - Size: 141 KB - Last synced at: over 2 years ago - Pushed at: over 2 years ago - Stars: 71 - Forks: 47

bhavink/databricks
Databricks Platform - Architecture, Security, Automation and much more!!
Language: Jupyter Notebook - Size: 14.4 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 50 - Forks: 27

flintml/flint
A self-contained, lightweight and OOB research platform for modern ML
Language: Jupyter Notebook - Size: 7.52 MB - Last synced at: 28 days ago - Pushed at: about 1 month ago - Stars: 45 - Forks: 2

buoyant-data/oxbow
Collection of AWS Lambdas for creating and managing Delta tables
Language: Rust - Size: 288 KB - Last synced at: 10 days ago - Pushed at: 29 days ago - Stars: 36 - Forks: 10

sankamuk/PysparkCheatsheet
PySpark Cheatsheet
Language: Python - Size: 11.2 MB - Last synced at: about 1 month ago - Pushed at: over 2 years ago - Stars: 36 - Forks: 27

leehuwuj/olh
Open source stack lakehouse
Language: Python - Size: 4.57 MB - Last synced at: 4 days ago - Pushed at: over 1 year ago - Stars: 25 - Forks: 1

DataTech-Solutions/Threat-Detection-and-Visualization
Threat Detection and Visualization
Language: TSQL - Size: 11.9 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 25 - Forks: 153

newfront/hitchhikers_guide_to_deltalake_streaming
Don't Panic. This guide will help you when it feels like the end of the world.
Language: Jupyter Notebook - Size: 230 KB - Last synced at: 2 months ago - Pushed at: 12 months ago - Stars: 23 - Forks: 9

mrjsj/msfabricutils
Spark-free Python utilities for Microsoft Fabric focused on Data Engineering using Polars and delta-rs
Language: Python - Size: 1.38 MB - Last synced at: about 1 month ago - Pushed at: 5 months ago - Stars: 22 - Forks: 4

ognis1205/delta-hub
A platform and cloud-based service for data sharing based on the Delta Sharing protocol.
Language: TypeScript - Size: 6.66 MB - Last synced at: 12 months ago - Pushed at: 12 months ago - Stars: 21 - Forks: 3

aws-samples/amazon-emr-with-delta-lake
Amazon EMR Notebook to show how to read from and write to Delta tables with Amazon EMR
Language: Jupyter Notebook - Size: 473 KB - Last synced at: about 8 hours ago - Pushed at: about 1 month ago - Stars: 18 - Forks: 14

reisdebora/awesome-databricks
A curated list of awesome Databricks resources, including Spark
Size: 27.3 KB - Last synced at: 29 days ago - Pushed at: 11 months ago - Stars: 18 - Forks: 3

ismailhammounou/db2ixf
db2ixf is a python package with a CLI that simplifies the parsing and processing of IBM Integration eXchange Format (IXF) files.
Language: Python - Size: 1 MB - Last synced at: 15 days ago - Pushed at: about 1 year ago - Stars: 17 - Forks: 1

aravinthsci/Spark_Delta_Lake
Delta Lake Examples
Language: Jupyter Notebook - Size: 285 KB - Last synced at: over 1 year ago - Pushed at: about 5 years ago - Stars: 12 - Forks: 12

goodwillpunning/nodejs-sharing-client
A Node.js connector for Delta Sharing.
Language: JavaScript - Size: 423 KB - Last synced at: 5 days ago - Pushed at: 2 months ago - Stars: 10 - Forks: 7

bmsuisse/lakeapi
API for distributing Data Lake Data
Language: Python - Size: 15.2 MB - Last synced at: 3 months ago - Pushed at: 7 months ago - Stars: 10 - Forks: 2

taka-yayoi/public_repo
Contains sample notebooks for Databricks.
Language: Python - Size: 43.9 MB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 8 - Forks: 7

roeap/flight-fusion
Language: Rust - Size: 4.3 MB - Last synced at: about 2 months ago - Pushed at: over 1 year ago - Stars: 7 - Forks: 2

xbrianh/xdlake
A loose implementation of the deltalake protocol, written in Python on top of pyarrow, focused on extensibility, customizability, and distributed data.
Language: Python - Size: 241 KB - Last synced at: 4 days ago - Pushed at: 8 months ago - Stars: 4 - Forks: 0

yandex-cloud/yc-delta
Delta Lake for Yandex Data Processing
Language: Java - Size: 186 KB - Last synced at: about 1 month ago - Pushed at: 10 months ago - Stars: 4 - Forks: 1

LeoneGarage/StreamJoin
A framework for incremental streaming joins and incremental streaming aggregations over change data feeds from Databricks Delta
Language: Python - Size: 196 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 3 - Forks: 2

alberttwong/opendatalakehouse
The Data Lakehouse Readiness Score is a quantitative measure that assesses a database's vendor support of Apache Iceberg, Apache Hudi and Delta Lake.
Size: 37.1 KB - Last synced at: 5 days ago - Pushed at: 9 months ago - Stars: 3 - Forks: 0

satyakommula96/spark_benchmark
Spark Performance Benchmark suite to evaluate all TPC-DS and TPC-H query times
Language: Scala - Size: 97.7 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 3 - Forks: 2

datalpia/laketower
Oversee your lakehouse
Language: Python - Size: 1.55 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 2 - Forks: 0

cmackenzie1/deltalake-go 📦
An implementation of Delta Lake in Go
Language: Go - Size: 79.1 KB - Last synced at: 2 months ago - Pushed at: almost 2 years ago - Stars: 2 - Forks: 0

sebastianruizm/demo-data-pipeline Fork of lbodnarin/data-pipeline
Simple data pipeline (Airflow + Spark)
Language: Python - Size: 7.61 MB - Last synced at: about 2 years ago - Pushed at: about 2 years ago - Stars: 2 - Forks: 0

credimi/pandora
Relational tables from nested data
Language: Scala - Size: 32.2 KB - Last synced at: almost 2 years ago - Pushed at: over 2 years ago - Stars: 2 - Forks: 1

easonlai/databricks_delta_table_samples
This is a code sample repository for demonstrating how to perform Databricks Delta Table operations.
Language: HTML - Size: 23.9 MB - Last synced at: 3 months ago - Pushed at: about 3 years ago - Stars: 2 - Forks: 2

ev2900/EMR_Studio_Delta_Lake
Delta Lake examples designed to be run on AWS Elastic MapReduce (EMR) via EMR Studio or EMR Notebooks
Language: Jupyter Notebook - Size: 15.6 KB - Last synced at: 2 months ago - Pushed at: 10 months ago - Stars: 1 - Forks: 1

JayyShah/Databricks-AWS
Databricks provides a unified, open platform for all your data. It empowers data scientists, data engineers and data analysts with a simple collaborative environment to run interactive and scheduled data analysis workloads.
Language: Python - Size: 3.91 KB - Last synced at: almost 2 years ago - Pushed at: over 2 years ago - Stars: 1 - Forks: 0

himewel/ifood-data
Ifood data wrangling with Apache Airflow and Apache Spark running on Kubernetes
Language: Python - Size: 396 KB - Last synced at: about 2 years ago - Pushed at: over 3 years ago - Stars: 1 - Forks: 1

jasondavindev/delta-lake-dms-cdc
Example application for DMS CDC with Delta Lake and Apache Hudi
Language: Python - Size: 69.5 MB - Last synced at: about 2 years ago - Pushed at: over 3 years ago - Stars: 1 - Forks: 1

vvalcristina/treinamento-dataproc-deltalake
Training environment for Dataproc and Delta Lake
Language: Jupyter Notebook - Size: 664 KB - Last synced at: over 2 years ago - Pushed at: almost 4 years ago - Stars: 1 - Forks: 1

mrjsj/blueno
A Python ETL library for creating declarative data pipelines.
Language: Python - Size: 2.83 MB - Last synced at: 9 days ago - Pushed at: 10 days ago - Stars: 0 - Forks: 0

tansu-io/example-delta-lake
Tansu schema-backed topics, instantly accessible as Delta Lake tables
Language: Just - Size: 27.3 KB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 0 - Forks: 0

tashi-2004/Apache-Airflow-Kafka-Spark-DeltaLake-Real-Time-Stream-Pipeline
This project implements a real-time data pipeline using Apache Airflow, Kafka, Apache Spark, and Delta Lake. It supports both batch (Coldpath) and real-time (Hotpath) data ingestion, processing, and storage. Airflow is used for orchestrating the data workflows.
Language: Python - Size: 12.5 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

dde-labs/self-lake
Self sandbox for interacting with any lake storage such as Delta Lake, Iceberg, Hudi, etc.
Language: Python - Size: 665 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

jayhan94/MiniLake
A modern mini lakehouse based on Spark and Iceberg, running in Docker.
Size: 8.79 KB - Last synced at: 8 days ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

najmaelboutaheri/Patents_analysis
This repository contains code and resources for analyzing patents using Apache Spark, Python, and AWS services. The objective of this project is to extract insights and trends from patent data to inform business decisions and intellectual property strategies.
Language: Jupyter Notebook - Size: 7.79 MB - Last synced at: 3 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0

dougdss89/wideworldadventure
This repository includes all the files for the design and unification of the AdventureWorks and WideWorldAdventure databases.
Language: Shell - Size: 230 KB - Last synced at: 2 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0

palutz/rust_nextstep
A series of exercises to play with more advanced topics in Rust
Language: Rust - Size: 304 KB - Last synced at: 3 months ago - Pushed at: 10 months ago - Stars: 0 - Forks: 0

makism/datastack-playground
A datastack playground; includes Spark, Kafka, Airbyte, etc.
Language: Jupyter Notebook - Size: 55.7 KB - Last synced at: 8 days ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

herry13/glue-docker-image
A custom Glue Docker image
Language: Dockerfile - Size: 2.93 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

jcguidry/flight-ml-preprocess-gcp
Continuous flight event data processing using Spark Streaming and Delta Lake storage, deployed on a GCP Dataproc cluster.
Language: Python - Size: 13.7 KB - Last synced at: 18 days ago - Pushed at: almost 2 years ago - Stars: 0 - Forks: 0

data-engineer-course/taxacco
Project No. 4 for the "Data Engineer" course.
Language: Jupyter Notebook - Size: 11.5 MB - Last synced at: over 1 year ago - Pushed at: about 2 years ago - Stars: 0 - Forks: 0

buoyant-data/lambda-delta-optimize
AWS Lambda function for optimizing Delta tables
Language: HCL - Size: 64.5 KB - Last synced at: 10 days ago - Pushed at: about 2 years ago - Stars: 0 - Forks: 0

OpenTableFormat/OpenTableFormat.github.io
Website for open table format 🕸
Language: CSS - Size: 4.59 MB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

naiborhujosua/Data-Scientist-learning-path-using-databricks
This is the summary of learning Data Science using Databricks
Size: 51.8 KB - Last synced at: about 1 year ago - Pushed at: almost 4 years ago - Stars: 0 - Forks: 0
