GitHub topics: data-lake

Repositories

jaimealruiz/proyecto-tfg

Design and implementation of an interconnection between LLMs and Data Spaces using the Model Context Protocol (MCP)

Language: Python - Size: 97.7 KB - Last synced at: about 19 hours ago - Pushed at: about 19 hours ago - Stars: 0 - Forks: 0

Narius2030/IMCP-Support-Blinders

This project focuses on image captioning by creating two primary models: DarkNetLM and DarkNetVG2. Both models use the CSP DarkNet53 architecture (the backbone of YOLOv8) for feature extraction from images, combined with Transformers or an LSTM to generate captions.

Language: Python - Size: 28.8 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 0 - Forks: 0

maris-development/beacon

A high-performance climate 🌍 data lake

Language: Rust - Size: 5.95 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0

ahmedd38/dataengineer-portfolio

📊 End-to-end ETL pipelines, Airflow DAGs, notebook-driven analytics & data warehousing

Size: 7.81 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 0 - Forks: 0

cds-snc/data-lake

Infrastructure for the Platform Data Lake

Language: Python - Size: 493 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 3 - Forks: 0

pixelsdb/pixels

An efficient storage and compute engine for both on-prem and cloud-native data analytics.

Language: Java - Size: 129 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 144 - Forks: 44

treeverse/lakeFS

lakeFS - Data version control for your data lake | Git for data

Language: Go - Size: 149 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 4,651 - Forks: 373

dlt-hub/dlt

data load tool (dlt) is an open source Python library that makes data loading easy 🛠️

Language: Python - Size: 89.1 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 3,558 - Forks: 261

ulbmuenster/dataasee

DatAasee - A Metadata-Lake for Libraries

Language: Makefile - Size: 3.06 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 14 - Forks: 2

apache/kyuubi

Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.

Language: Scala - Size: 59.9 MB - Last synced at: about 15 hours ago - Pushed at: 12 days ago - Stars: 2,181 - Forks: 936

mahmoudparsian/data-warehousing

This repository is a place for the Data Warehousing course at the Information Systems & Analytics department, Santa Clara University.

Language: Jupyter Notebook - Size: 502 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 8 - Forks: 2

lakekeeper/lakekeeper

Lakekeeper is an Apache-Licensed, secure, fast and easy to use Apache Iceberg REST Catalog written in Rust.

Language: Rust - Size: 12.8 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 621 - Forks: 45

Canner/vulcan-sql

Data API Framework for AI Agents and Data Apps

Language: TypeScript - Size: 70.6 MB - Last synced at: 1 day ago - Pushed at: 10 months ago - Stars: 678 - Forks: 33

Canner/wren-engine

🤖 The Semantic Engine for Model Context Protocol (MCP) Clients and AI Agents 🔥

Language: Java - Size: 20.9 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 307 - Forks: 74

hyparam/icebird

Icebird: JavaScript Iceberg Client

Language: JavaScript - Size: 224 KB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 23 - Forks: 0

linkml/linkml-store

wrapper for multiple linkml storage engines

Language: Python - Size: 10.1 MB - Last synced at: 8 days ago - Pushed at: 9 days ago - Stars: 21 - Forks: 2

smart-data-lake/smart-data-lake

Smart Automation Tool for building modern Data Lakes and Data Pipelines

Language: Scala - Size: 43.2 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 123 - Forks: 22

nodestream-proj/nodestream

A Declarative framework for Building, Maintaining, and Analyzing Graph Data

Language: Python - Size: 3.46 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 44 - Forks: 13

Northwestern-CSSI/sciscinet-v2

Language: HTML - Size: 405 KB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 0 - Forks: 0

jlsilva01/adls-azure

Procedure for creating Azure Data Lake Storage using Terraform, via an MS Learn Sandbox subscription

Language: HCL - Size: 27.3 KB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 7 - Forks: 2

aws-solutions-library-samples/data-lakes-on-aws

Enterprise-grade, production-hardened, serverless data lake on AWS

Language: Python - Size: 13.8 MB - Last synced at: 5 days ago - Pushed at: about 1 month ago - Stars: 449 - Forks: 143

Canner/vulcan-sql-examples

Curated VulcanSQL showcases

Language: Jupyter Notebook - Size: 5.61 MB - Last synced at: 8 days ago - Pushed at: over 1 year ago - Stars: 22 - Forks: 3

EdgeLake/EdgeLake

Data Lake on the Edge

Language: Python - Size: 5.55 MB - Last synced at: about 8 hours ago - Pushed at: about 9 hours ago - Stars: 18 - Forks: 1

mauropelucchi/pyconlt2025

Data warehouses meet data lakes @ PYCONLT 2025

Size: 0 Bytes - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 0 - Forks: 0

richban/opendata-stack-platform

Open Data Stack Platform: a collection of projects and pipelines built with open data stack tools for a scalable, observable data platform.

Language: Python - Size: 1.15 MB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 15 - Forks: 2

datopian/r2-bucket-uploader

Cloudflare R2 bucket File Uploader with multipart upload enabled. Tested with files up to 10 GB size.

Language: TypeScript - Size: 71.3 KB - Last synced at: 22 days ago - Pushed at: 22 days ago - Stars: 115 - Forks: 11

san089/goodreads_etl_pipeline

An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.

Language: Python - Size: 1.31 MB - Last synced at: 22 days ago - Pushed at: about 5 years ago - Stars: 1,365 - Forks: 224

swapnaxdata/Nyc-Taxi-Data-Engineering-Project

Language: Jupyter Notebook - Size: 2.94 MB - Last synced at: 22 days ago - Pushed at: 23 days ago - Stars: 0 - Forks: 0

MatsMoll/aligned

The DBT of ML: Aligned describes data dependencies in ML systems and reduces technical data debt

Language: Python - Size: 6.08 MB - Last synced at: 11 days ago - Pushed at: about 2 months ago - Stars: 58 - Forks: 2

garystafford/tickit-data-lake-demo

Resources for video demonstrations and blog posts related to DataOps on AWS

Language: Python - Size: 42.1 MB - Last synced at: 4 days ago - Pushed at: over 3 years ago - Stars: 175 - Forks: 108

Elkoumy/real_time_data_lake

🚀 Scalable near-real-time data pipeline using Apache Iceberg, Spark, Kafka, and Trino. ACID-compliant JSON ingestion, processing, and analytics. Dockerized for easy deployment. #DataEngineering #DataLake

Language: Python - Size: 233 KB - Last synced at: 25 days ago - Pushed at: 25 days ago - Stars: 0 - Forks: 0

alanchn31/Data-Engineering-Projects

Personal Data Engineering Projects

Language: Jupyter Notebook - Size: 2.92 MB - Last synced at: 28 days ago - Pushed at: over 2 years ago - Stars: 921 - Forks: 203

bytedance/bitsail

BitSail is a distributed high-performance data integration engine which supports batch, streaming and incremental scenarios. BitSail is widely used to synchronize hundreds of trillions of data every day.

Language: Java - Size: 26.4 MB - Last synced at: 27 days ago - Pushed at: over 1 year ago - Stars: 1,654 - Forks: 331

stevehoober254/dataengineer-portfolio

📊 End-to-end ETL pipelines, Airflow DAGs, notebook-driven analytics & data warehousing

Size: 6.84 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

JuanPabloDiaz/Databricks-Certified-Data-Engineer-Associate (fork of derar-alhussein/Databricks-Certified-Data-Engineer-Associate)

The resources of the preparation course for Databricks Data Engineer Associate certification exam

Language: Python - Size: 3.22 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

maxi-k/btrblocks

BtrBlocks: Efficient Columnar Compression for Data Lakes (SIGMOD 2023 Paper)

Language: C++ - Size: 247 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 238 - Forks: 20

san089/Udacity-Data-Engineering-Projects

A few projects related to Data Engineering, including Data Modeling, Infrastructure setup on cloud, Data Warehousing and Data Lake development.

Language: Python - Size: 2.03 MB - Last synced at: about 1 month ago - Pushed at: over 2 years ago - Stars: 1,596 - Forks: 510

yahwang/Awesome-Data-Engineering

📒(GitBook) A curated list of awesome Data Engineering resources

Size: 3.16 MB - Last synced at: 4 days ago - Pushed at: about 1 month ago - Stars: 35 - Forks: 4

awslabs/amazon-s3-find-and-forget

Amazon S3 Find and Forget is a solution to handle data erasure requests from data lakes stored on Amazon S3, for example, pursuant to the European General Data Protection Regulation (GDPR)

Language: Python - Size: 8.39 MB - Last synced at: 1 day ago - Pushed at: 10 days ago - Stars: 243 - Forks: 36

Teradata/kylo

Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is licensed under Apache 2.0. Contributed by Teradata Inc.

Language: Java - Size: 84.3 MB - Last synced at: 27 days ago - Pushed at: over 2 years ago - Stars: 1,111 - Forks: 577

Cesar312/retail-sales-data-lake-predictive-analytics

Language: Jupyter Notebook - Size: 509 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

utndatasystems/virtual

🗜️Compressing Parquet files using functions (TRL @NeurIPS'24, EDBT Best Demo'25)

Language: Python - Size: 693 KB - Last synced at: 13 days ago - Pushed at: 16 days ago - Stars: 0 - Forks: 1

zkan/swu-ds525

DS525

Language: Jupyter Notebook - Size: 11.6 MB - Last synced at: about 1 month ago - Pushed at: about 1 year ago - Stars: 16 - Forks: 7

KentHsu/Udacity-Data-Engineering-Nanodgree

Udacity Data Engineering Nanodegree Program

Language: Jupyter Notebook - Size: 2.12 MB - Last synced at: about 1 month ago - Pushed at: about 4 years ago - Stars: 52 - Forks: 59

AymenDaoudi/Electric-Vehicle-Charging-Simulator

Simulation, Ingestion and ETL-ing data of millions of EV charging sessions by thousands of EVs in thousands of stations around the world.

Language: Python - Size: 9.65 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

LearningJournal/SparkProgrammingInScala

Apache Spark Course Material

Language: Scala - Size: 50.9 MB - Last synced at: about 1 month ago - Pushed at: about 2 years ago - Stars: 88 - Forks: 159

jibbs1703/Tickit-Data-Pipeline

This repository demonstrates the creation of a robust data pipeline using an orchestrator with on-prem and cloud resources. It collects data from on-premises SQL and NoSQL databases and loads it into a SQL database in the cloud.

Language: Python - Size: 50.8 KB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 0 - Forks: 0

sigpwned/jdbq

JDBI-inspired Database Access Framework for Java + BigQuery

Language: Java - Size: 206 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 3 - Forks: 1

EDAO-Project/Jazero

Semantic data lake for semantic-aware table search using the example-driven query-by-table query paradigm.

Language: Java - Size: 6.38 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

uber/marmaray 📦

Generic Data Ingestion & Dispersal Library for Hadoop

Language: Java - Size: 1.61 MB - Last synced at: about 2 months ago - Pushed at: about 2 years ago - Stars: 478 - Forks: 111

Azure/AzureDataLake

Samples and Docs for Azure Data Lake Store and Analytics

Size: 51.2 MB - Last synced at: 1 day ago - Pushed at: about 2 years ago - Stars: 139 - Forks: 107

shinie19/sql-data-warehouse-project

Build a modern Data Warehouse from scratch with SQL Server, including ETL processes, data modeling and analytics.

Language: TSQL - Size: 677 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

suecodelabs/cnfuzz

Breaking Cloud Native Web APIs in their natural habitat.

Language: Go - Size: 1.83 MB - Last synced at: 3 days ago - Pushed at: about 2 years ago - Stars: 36 - Forks: 2

Pirate-Emperor/BigData-Pipeline

BigData Pipeline is a local testing environment for experimenting with various storage solutions (RDB, HDFS), query engines (Trino), schedulers (Airflow), and ETL/ELT tools (DBT). It supports MySQL, Hadoop, Hive, Kudu, and more.

Language: Dockerfile - Size: 7.95 MB - Last synced at: about 1 month ago - Pushed at: 6 months ago - Stars: 3 - Forks: 0

razevedo1994/razv-data-engineering

Portfolio of projects and studies conducted in data engineering.

Language: Jupyter Notebook - Size: 70.2 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 33 - Forks: 3

ndomah/Azure-Medallion-Pipeline

An end-to-end Azure pipeline using Medallion Architecture.

Size: 1.03 MB - Last synced at: 2 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

Azure/usql

U-SQL Examples and Issue Tracking

Language: C# - Size: 21.8 MB - Last synced at: 1 day ago - Pushed at: about 2 years ago - Stars: 234 - Forks: 680

Narius2030/DataLake-Solution-IMCP

This project involved the development and implementation of a Data Lake architecture to support an AI model capable of generating image captions. The architecture was designed to efficiently ingest, process, and centrally store large volumes of image and text data.

Language: Python - Size: 193 MB - Last synced at: about 2 months ago - Pushed at: 3 months ago - Stars: 1 - Forks: 0

cuebook/cuelake

Use SQL to build ELT pipelines on a data lakehouse.

Language: JavaScript - Size: 28 MB - Last synced at: about 1 month ago - Pushed at: almost 3 years ago - Stars: 285 - Forks: 28

bmsuisse/lakeapi

API for distributing Data Lake Data

Language: Python - Size: 15.2 MB - Last synced at: about 2 months ago - Pushed at: 6 months ago - Stars: 10 - Forks: 2

dominique-jacque/NBA-Data-Lake

NBA Data Lake Repository contains the setup_nba_data_lake.py script, which automates the creation of a data lake for NBA analytics using AWS services. The script integrates Amazon S3, AWS Glue, and Amazon Athena, and sets up the infrastructure needed to store and query NBA-related data.

Language: Python - Size: 9.77 KB - Last synced at: about 2 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

budproj/data

Analysis, extraction, infrastructure, and transformations of our data

Language: Dockerfile - Size: 2.11 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

shrikantnaidu/data-lakehouse-with-delta-lake

Setting up a data lakehouse with delta lake using docker

Language: Jupyter Notebook - Size: 6.58 MB - Last synced at: 4 days ago - Pushed at: 4 months ago - Stars: 2 - Forks: 0

LesiaUKR/goit-de-fp

Master's degree | Data Engineering | Final course projects | goit-de-fp

Language: Python - Size: 3.69 MB - Last synced at: about 1 month ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

imsanjoykb/Python-MySql-Operation

This Python MySQL Repo shows you how to use MySQL Connector Python to access MySQL databases. You will learn how to connect to MySQL database and perform common database operations such as SELECT, INSERT, UPDATE, & DELETE in Python.

Language: Jupyter Notebook - Size: 5.86 KB - Last synced at: about 1 month ago - Pushed at: almost 4 years ago - Stars: 18 - Forks: 9

anuveyatsu/cloudflare-data-fabric

Cloudflare Data Fabric: Use Cloudflare's global infrastructure to build a flexible, resilient framework for data solutions.

Size: 3.91 KB - Last synced at: 2 months ago - Pushed at: 6 months ago - Stars: 1 - Forks: 0

LearningJournal/Spark-Streaming-In-Python

Apache Spark 3 - Structured Streaming Course Material

Language: Python - Size: 19.4 MB - Last synced at: about 1 month ago - Pushed at: over 1 year ago - Stars: 121 - Forks: 159

ozkary/data-engineering-mta-turnstile

Data Engineering - Metropolitan Transportation Authority (MTA) Subway Data Analysis

Language: Jupyter Notebook - Size: 12.9 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 25 - Forks: 4

datarootsio/terraform-module-azure-datalake 📦

Terraform module for an Azure Data Lake

Language: HCL - Size: 1.54 MB - Last synced at: 3 days ago - Pushed at: over 2 years ago - Stars: 30 - Forks: 22

apache/kyuubi-docker

Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.

Language: Dockerfile - Size: 25.4 KB - Last synced at: about 15 hours ago - Pushed at: 5 months ago - Stars: 13 - Forks: 7

yasarsultan/taxi-trip-analysis

The NYC Taxi Trip Batch Data Pipeline automates processing of large-scale trip data using Apache Spark and Airflow, integrating AWS S3 and Google BigQuery for storage and analytics. It features scalable, containerized workflows with robust data validation.

Language: Python - Size: 114 KB - Last synced at: 2 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0

leo-the-nardo/combopurifier

Data Pipeline from AWS SQS/S3 to Kubernetes w/ Spark using Airflow, EKS & Data Lakehouse

Language: Python - Size: 27 MB - Last synced at: 3 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0

AuFeld/Data_Engineering_Projects

A collection of data engineering projects: data modeling, ETL pipelines, data lakes, infrastructure configuration on AWS, data warehousing, containerization, and a dashboard to monitor data pipeline KPIs

Language: Python - Size: 316 MB - Last synced at: 5 months ago - Pushed at: about 4 years ago - Stars: 14 - Forks: 1

emakhov/local-data-stack

A docker-compose collection for local data engineering and analytics

Language: Python - Size: 398 KB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0

PaznerA/php-minio-obj-store

CLI (Laravel Prompts) example usage of btrfs-like storage behind an S3 interface (MinIO) and in-memory (Redis) indexing with object metadata

Language: PHP - Size: 56.6 KB - Last synced at: about 1 month ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0

camunda-community-hub/zeeqs

GraphQL API for Zeebe data

Language: Kotlin - Size: 1.09 MB - Last synced at: 28 days ago - Pushed at: 7 months ago - Stars: 62 - Forks: 15

realtimedatalake/rtdl

rtdl makes it easy to build and maintain a real-time data lake

Language: Go - Size: 62.5 MB - Last synced at: 6 months ago - Pushed at: over 2 years ago - Stars: 45 - Forks: 9

aws-samples/aws-dbs-refarch-datalake

Reference Architectures for Datalakes on AWS

Language: HTML - Size: 4.52 MB - Last synced at: 5 days ago - Pushed at: almost 5 years ago - Stars: 79 - Forks: 31

OElesin/querypal

Web UI for Amazon Athena

Language: Vue - Size: 22.6 MB - Last synced at: 5 months ago - Pushed at: over 2 years ago - Stars: 55 - Forks: 26

santiagortiiz/Snowflake-Data-Pipelines

EPAM's Snowflake hands-on lab. We built a pipeline to read and load data from S3 into Snowflake, developed an ETL workflow to clean the data and stored it in a data warehouse with the 3NF and Star schemas for data mart analysis.

Language: Jupyter Notebook - Size: 30.6 MB - Last synced at: 2 months ago - Pushed at: over 2 years ago - Stars: 2 - Forks: 0

rafie-b/Data-Warehouse-AWS-Pipeline-Chat-API

notebook guide

Language: Jupyter Notebook - Size: 1.61 MB - Last synced at: about 1 month ago - Pushed at: 8 months ago - Stars: 0 - Forks: 0

datamindedbe/lighthouse 📦

Lighthouse is a library for data lakes built on top of Apache Spark. It provides high-level APIs in Scala to streamline data pipelines and apply best practices.

Language: Scala - Size: 1.21 MB - Last synced at: 8 months ago - Pushed at: 8 months ago - Stars: 60 - Forks: 10