GitHub topics: data-lake
jaimealruiz/proyecto-tfg
Design and implementation of an interconnection between LLMs and data spaces via Model Context Protocol (MCP)
Language: Python - Size: 97.7 KB - Last synced at: about 19 hours ago - Pushed at: about 19 hours ago - Stars: 0 - Forks: 0

Narius2030/IMCP-Support-Blinders
This project focuses on image captioning by creating two primary models: DarkNetLM and DarkNetVG2. Both models leverage the CSP DarkNet53 architecture, the backbone of YOLOv8, for feature extraction from images, combined with Transformers or LSTM to generate captions.
Language: Python - Size: 28.8 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 0 - Forks: 0

maris-development/beacon
A high-performance climate 🌍 data lake
Language: Rust - Size: 5.95 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0

ahmedd38/dataengineer-portfolio
📊 End-to-end ETL pipelines, Airflow DAGs, notebook-driven analytics & data warehousing
Size: 7.81 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 0 - Forks: 0

cds-snc/data-lake
Infrastructure for the Platform Data Lake
Language: Python - Size: 493 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 3 - Forks: 0

pixelsdb/pixels
An efficient storage and compute engine for both on-prem and cloud-native data analytics.
Language: Java - Size: 129 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 144 - Forks: 44

treeverse/lakeFS
lakeFS - Data version control for your data lake | Git for data
Language: Go - Size: 149 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 4,651 - Forks: 373

dlt-hub/dlt
data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
Language: Python - Size: 89.1 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 3,558 - Forks: 261

ulbmuenster/dataasee
DatAasee - A Metadata-Lake for Libraries
Language: Makefile - Size: 3.06 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 14 - Forks: 2

apache/kyuubi
Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
Language: Scala - Size: 59.9 MB - Last synced at: about 15 hours ago - Pushed at: 12 days ago - Stars: 2,181 - Forks: 936

mahmoudparsian/data-warehousing
This repository is a place for the Data Warehousing course at the Information Systems & Analytics department, Santa Clara University.
Language: Jupyter Notebook - Size: 502 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 8 - Forks: 2

lakekeeper/lakekeeper
Lakekeeper is an Apache-Licensed, secure, fast and easy to use Apache Iceberg REST Catalog written in Rust.
Language: Rust - Size: 12.8 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 621 - Forks: 45

Canner/vulcan-sql
Data API Framework for AI Agents and Data Apps
Language: TypeScript - Size: 70.6 MB - Last synced at: 1 day ago - Pushed at: 10 months ago - Stars: 678 - Forks: 33

Canner/wren-engine
🤖 The Semantic Engine for Model Context Protocol (MCP) Clients and AI Agents 🔥
Language: Java - Size: 20.9 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 307 - Forks: 74

hyparam/icebird
Icebird: JavaScript Iceberg Client
Language: JavaScript - Size: 224 KB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 23 - Forks: 0

linkml/linkml-store
wrapper for multiple linkml storage engines
Language: Python - Size: 10.1 MB - Last synced at: 8 days ago - Pushed at: 9 days ago - Stars: 21 - Forks: 2

smart-data-lake/smart-data-lake
Smart Automation Tool for building modern Data Lakes and Data Pipelines
Language: Scala - Size: 43.2 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 123 - Forks: 22

nodestream-proj/nodestream
A Declarative framework for Building, Maintaining, and Analyzing Graph Data
Language: Python - Size: 3.46 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 44 - Forks: 13

Northwestern-CSSI/sciscinet-v2
Language: HTML - Size: 405 KB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 0 - Forks: 0

jlsilva01/adls-azure
Procedure for creating an Azure Data Lake Storage using Terraform, through an MS Learn Sandbox subscription
Language: HCL - Size: 27.3 KB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 7 - Forks: 2

aws-solutions-library-samples/data-lakes-on-aws
Enterprise-grade, production-hardened, serverless data lake on AWS
Language: Python - Size: 13.8 MB - Last synced at: 5 days ago - Pushed at: about 1 month ago - Stars: 449 - Forks: 143

Canner/vulcan-sql-examples
Curated VulcanSQL showcases
Language: Jupyter Notebook - Size: 5.61 MB - Last synced at: 8 days ago - Pushed at: over 1 year ago - Stars: 22 - Forks: 3

EdgeLake/EdgeLake
Data Lake on the Edge
Language: Python - Size: 5.55 MB - Last synced at: about 8 hours ago - Pushed at: about 9 hours ago - Stars: 18 - Forks: 1

mauropelucchi/pyconlt2025
Data warehouses meet data lakes @ PYCONLT 2025
Size: 0 Bytes - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 0 - Forks: 0

richban/opendata-stack-platform
Open Data Stack Platform: a collection of projects and pipelines built with open data stack tools for a scalable, observable data platform.
Language: Python - Size: 1.15 MB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 15 - Forks: 2

datopian/r2-bucket-uploader
Cloudflare R2 bucket File Uploader with multipart upload enabled. Tested with files up to 10 GB size.
Language: TypeScript - Size: 71.3 KB - Last synced at: 22 days ago - Pushed at: 22 days ago - Stars: 115 - Forks: 11

san089/goodreads_etl_pipeline
An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.
Language: Python - Size: 1.31 MB - Last synced at: 22 days ago - Pushed at: about 5 years ago - Stars: 1,365 - Forks: 224

swapnaxdata/Nyc-Taxi-Data-Engineering-Project
Language: Jupyter Notebook - Size: 2.94 MB - Last synced at: 22 days ago - Pushed at: 23 days ago - Stars: 0 - Forks: 0

MatsMoll/aligned
The DBT of ML, as Aligned describes data dependencies in ML systems and reduces technical data debt
Language: Python - Size: 6.08 MB - Last synced at: 11 days ago - Pushed at: about 2 months ago - Stars: 58 - Forks: 2

garystafford/tickit-data-lake-demo
Resources for video demonstrations and blog posts related to DataOps on AWS
Language: Python - Size: 42.1 MB - Last synced at: 4 days ago - Pushed at: over 3 years ago - Stars: 175 - Forks: 108

Elkoumy/real_time_data_lake
🚀 Scalable near-real-time data pipeline using Apache Iceberg, Spark, Kafka, and Trino. ACID-compliant JSON ingestion, processing, and analytics. Dockerized for easy deployment. #DataEngineering #DataLake
Language: Python - Size: 233 KB - Last synced at: 25 days ago - Pushed at: 25 days ago - Stars: 0 - Forks: 0

alanchn31/Data-Engineering-Projects
Personal Data Engineering Projects
Language: Jupyter Notebook - Size: 2.92 MB - Last synced at: 28 days ago - Pushed at: over 2 years ago - Stars: 921 - Forks: 203

bytedance/bitsail
BitSail is a distributed high-performance data integration engine which supports batch, streaming and incremental scenarios. BitSail is widely used to synchronize hundreds of trillions of data every day.
Language: Java - Size: 26.4 MB - Last synced at: 27 days ago - Pushed at: over 1 year ago - Stars: 1,654 - Forks: 331

stevehoober254/dataengineer-portfolio
📊 End-to-end ETL pipelines, Airflow DAGs, notebook-driven analytics & data warehousing
Size: 6.84 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

JuanPabloDiaz/Databricks-Certified-Data-Engineer-Associate Fork of derar-alhussein/Databricks-Certified-Data-Engineer-Associate
The resources of the preparation course for Databricks Data Engineer Associate certification exam
Language: Python - Size: 3.22 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

maxi-k/btrblocks
BtrBlocks: Efficient Columnar Compression for Data Lakes (SIGMOD 2023 Paper)
Language: C++ - Size: 247 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 238 - Forks: 20

san089/Udacity-Data-Engineering-Projects
A few projects related to Data Engineering, including Data Modeling, Infrastructure setup on cloud, Data Warehousing and Data Lake development.
Language: Python - Size: 2.03 MB - Last synced at: about 1 month ago - Pushed at: over 2 years ago - Stars: 1,596 - Forks: 510

yahwang/Awesome-Data-Engineering
📒(GitBook) A curated list of awesome Data Engineering resources
Size: 3.16 MB - Last synced at: 4 days ago - Pushed at: about 1 month ago - Stars: 35 - Forks: 4

awslabs/amazon-s3-find-and-forget
Amazon S3 Find and Forget is a solution to handle data erasure requests from data lakes stored on Amazon S3, for example, pursuant to the European General Data Protection Regulation (GDPR)
Language: Python - Size: 8.39 MB - Last synced at: 1 day ago - Pushed at: 10 days ago - Stars: 243 - Forks: 36

Teradata/kylo
Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is licensed under Apache 2.0. Contributed by Teradata Inc.
Language: Java - Size: 84.3 MB - Last synced at: 27 days ago - Pushed at: over 2 years ago - Stars: 1,111 - Forks: 577

Cesar312/retail-sales-data-lake-predictive-analytics
Language: Jupyter Notebook - Size: 509 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

utndatasystems/virtual
🗜️Compressing Parquet files using functions (TRL @NeurIPS'24, EDBT Best Demo'25)
Language: Python - Size: 693 KB - Last synced at: 13 days ago - Pushed at: 16 days ago - Stars: 0 - Forks: 1

zkan/swu-ds525
DS525
Language: Jupyter Notebook - Size: 11.6 MB - Last synced at: about 1 month ago - Pushed at: about 1 year ago - Stars: 16 - Forks: 7

KentHsu/Udacity-Data-Engineering-Nanodgree
Udacity Data Engineering Nanodegree Program
Language: Jupyter Notebook - Size: 2.12 MB - Last synced at: about 1 month ago - Pushed at: about 4 years ago - Stars: 52 - Forks: 59

AymenDaoudi/Electric-Vehicle-Charging-Simulator
Simulation, Ingestion and ETL-ing data of millions of EV charging sessions by thousands of EVs in thousands of stations around the world.
Language: Python - Size: 9.65 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

LearningJournal/SparkProgrammingInScala
Apache Spark Course Material
Language: Scala - Size: 50.9 MB - Last synced at: about 1 month ago - Pushed at: about 2 years ago - Stars: 88 - Forks: 159

jibbs1703/Tickit-Data-Pipeline
This repository demonstrates the creation of a robust data pipeline using an Orchestrator, on-prem and cloud resources. It collects data from on-premises SQL and NoSQL databases and loads it into a SQL database in the cloud.
Language: Python - Size: 50.8 KB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 0 - Forks: 0

sigpwned/jdbq
JDBI-inspired Database Access Framework for Java + BigQuery
Language: Java - Size: 206 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 3 - Forks: 1

EDAO-Project/Jazero
Semantic data lake for semantic-aware table search using the example-driven query-by-table query paradigm.
Language: Java - Size: 6.38 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

uber/marmaray 📦
Generic Data Ingestion & Dispersal Library for Hadoop
Language: Java - Size: 1.61 MB - Last synced at: about 2 months ago - Pushed at: about 2 years ago - Stars: 478 - Forks: 111

Azure/AzureDataLake
Samples and Docs for Azure Data Lake Store and Analytics
Size: 51.2 MB - Last synced at: 1 day ago - Pushed at: about 2 years ago - Stars: 139 - Forks: 107

shinie19/sql-data-warehouse-project
Build a modern Data Warehouse from scratch with SQL Server, including ETL processes, data modeling and analytics.
Language: TSQL - Size: 677 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

suecodelabs/cnfuzz
Breaking Cloud Native Web APIs in their natural habitat.
Language: Go - Size: 1.83 MB - Last synced at: 3 days ago - Pushed at: about 2 years ago - Stars: 36 - Forks: 2

Pirate-Emperor/BigData-Pipeline
BigData Pipeline is a local testing environment for experimenting with various storage solutions (RDB, HDFS), query engines (Trino), schedulers (Airflow), and ETL/ELT tools (DBT). It supports MySQL, Hadoop, Hive, Kudu, and more.
Language: Dockerfile - Size: 7.95 MB - Last synced at: about 1 month ago - Pushed at: 6 months ago - Stars: 3 - Forks: 0

razevedo1994/razv-data-engineering
Portfolio of projects and studies conducted in data engineering.
Language: Jupyter Notebook - Size: 70.2 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 33 - Forks: 3

ndomah/Azure-Medallion-Pipeline
An end-to-end Azure pipeline using Medallion Architecture.
Size: 1.03 MB - Last synced at: 2 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

Azure/usql
U-SQL Examples and Issue Tracking
Language: C# - Size: 21.8 MB - Last synced at: 1 day ago - Pushed at: about 2 years ago - Stars: 234 - Forks: 680

Narius2030/DataLake-Solution-IMCP
This project involved the development and implementation of a Data Lake architecture to support an AI model capable of generating image captions. The architecture was designed to efficiently ingest, process, and centrally store large volumes of image and text data.
Language: Python - Size: 193 MB - Last synced at: about 2 months ago - Pushed at: 3 months ago - Stars: 1 - Forks: 0

cuebook/cuelake
Use SQL to build ELT pipelines on a data lakehouse.
Language: JavaScript - Size: 28 MB - Last synced at: about 1 month ago - Pushed at: almost 3 years ago - Stars: 285 - Forks: 28

bmsuisse/lakeapi
API for distributing Data Lake Data
Language: Python - Size: 15.2 MB - Last synced at: about 2 months ago - Pushed at: 6 months ago - Stars: 10 - Forks: 2

dominique-jacque/NBA-Data-Lake
NBA Data Lake Repository contains the setup_nba_data_lake.py script, which automates the creation of a data lake for NBA analytics using AWS services. The script integrates Amazon S3, AWS Glue, and Amazon Athena, and sets up the infrastructure needed to store and query NBA-related data.
Language: Python - Size: 9.77 KB - Last synced at: about 2 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

budproj/data
Analysis, extraction, infrastructure, and transformations of our data
Language: Dockerfile - Size: 2.11 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

shrikantnaidu/data-lakehouse-with-delta-lake
Setting up a data lakehouse with delta lake using docker
Language: Jupyter Notebook - Size: 6.58 MB - Last synced at: 4 days ago - Pushed at: 4 months ago - Stars: 2 - Forks: 0

LesiaUKR/goit-de-fp
Master's degree | Data Engineering | Final course projects | goit-de-fp
Language: Python - Size: 3.69 MB - Last synced at: about 1 month ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

imsanjoykb/Python-MySql-Operation
This Python MySQL Repo shows you how to use MySQL Connector Python to access MySQL databases. You will learn how to connect to MySQL database and perform common database operations such as SELECT, INSERT, UPDATE, & DELETE in Python.
Language: Jupyter Notebook - Size: 5.86 KB - Last synced at: about 1 month ago - Pushed at: almost 4 years ago - Stars: 18 - Forks: 9

anuveyatsu/cloudflare-data-fabric
Cloudflare Data Fabric: Use Cloudflare's global infrastructure to build a flexible, resilient framework for data solutions.
Size: 3.91 KB - Last synced at: 2 months ago - Pushed at: 6 months ago - Stars: 1 - Forks: 0

LearningJournal/Spark-Streaming-In-Python
Apache Spark 3 - Structured Streaming Course Material
Language: Python - Size: 19.4 MB - Last synced at: about 1 month ago - Pushed at: over 1 year ago - Stars: 121 - Forks: 159

ozkary/data-engineering-mta-turnstile
Data Engineering - Metropolitan Transportation Authority (MTA) Subway Data Analysis
Language: Jupyter Notebook - Size: 12.9 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 25 - Forks: 4

datarootsio/terraform-module-azure-datalake 📦
Terraform module for an Azure Data Lake
Language: HCL - Size: 1.54 MB - Last synced at: 3 days ago - Pushed at: over 2 years ago - Stars: 30 - Forks: 22

apache/kyuubi-docker
Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
Language: Dockerfile - Size: 25.4 KB - Last synced at: about 15 hours ago - Pushed at: 5 months ago - Stars: 13 - Forks: 7

yasarsultan/taxi-trip-analysis
The NYC Taxi Trip Batch Data Pipeline automates processing of large-scale trip data using Apache Spark and Airflow, integrating AWS S3 and Google BigQuery for storage and analytics. It features scalable, containerized workflows with robust data validation.
Language: Python - Size: 114 KB - Last synced at: 2 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0

leo-the-nardo/combopurifier
Data Pipeline from AWS SQS/S3 to Kubernetes w/ Spark using Airflow, EKS & Data Lakehouse
Language: Python - Size: 27 MB - Last synced at: 3 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0

AuFeld/Data_Engineering_Projects
A collection of data engineering projects: data modeling, ETL pipelines, data lakes, infrastructure configuration on AWS, data warehousing, containerization, and a dashboard to monitor data pipeline KPIs
Language: Python - Size: 316 MB - Last synced at: 5 months ago - Pushed at: about 4 years ago - Stars: 14 - Forks: 1

emakhov/local-data-stack
A docker-compose collection for local data engineering and analytics
Language: Python - Size: 398 KB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0

PaznerA/php-minio-obj-store
CLI (Laravel Prompts) example usage of btrfs-like storage behind an S3 interface (MinIO) and in-memory (Redis) indexing with object metadata
Language: PHP - Size: 56.6 KB - Last synced at: about 1 month ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0

camunda-community-hub/zeeqs
GraphQL API for Zeebe data
Language: Kotlin - Size: 1.09 MB - Last synced at: 28 days ago - Pushed at: 7 months ago - Stars: 62 - Forks: 15

realtimedatalake/rtdl
rtdl makes it easy to build and maintain a real-time data lake
Language: Go - Size: 62.5 MB - Last synced at: 6 months ago - Pushed at: over 2 years ago - Stars: 45 - Forks: 9

aws-samples/aws-dbs-refarch-datalake
Reference Architectures for Datalakes on AWS
Language: HTML - Size: 4.52 MB - Last synced at: 5 days ago - Pushed at: almost 5 years ago - Stars: 79 - Forks: 31

OElesin/querypal
Web UI for Amazon Athena
Language: Vue - Size: 22.6 MB - Last synced at: 5 months ago - Pushed at: over 2 years ago - Stars: 55 - Forks: 26

santiagortiiz/Snowflake-Data-Pipelines
EPAM's Snowflake hands-on lab. We built a pipeline to read and load data from S3 into Snowflake, developed an ETL workflow to clean the data and stored it in a data warehouse with the 3NF and Star schemas for data mart analysis.
Language: Jupyter Notebook - Size: 30.6 MB - Last synced at: 2 months ago - Pushed at: over 2 years ago - Stars: 2 - Forks: 0

rafie-b/Data-Warehouse-AWS-Pipeline-Chat-API
notebook guide
Language: Jupyter Notebook - Size: 1.61 MB - Last synced at: about 1 month ago - Pushed at: 8 months ago - Stars: 0 - Forks: 0

datamindedbe/lighthouse 📦
Lighthouse is a library for data lakes built on top of Apache Spark. It provides high-level APIs in Scala to streamline data pipelines and apply best practices.
Language: Scala - Size: 1.21 MB - Last synced at: 8 months ago - Pushed at: 8 months ago - Stars: 60 - Forks: 10

datasphere-oss/datasphere
DataSphere is the first open-source cloud-native data observability platform that helps you trace the whole data infrastructure in your warehouses, lakes and databases.
Language: Java - Size: 119 MB - Last synced at: about 1 month ago - Pushed at: almost 3 years ago - Stars: 5 - Forks: 4

k0rsakov/infrastructure_for_data_engineer_kafka
infrastructure_for_data_engineer_kafka
Language: Python - Size: 17.6 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

k0rsakov/infrastructure_for_data_engineer_S3
Infrastructure for data engineer S3
Language: Python - Size: 11.7 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

k0rsakov/all_about_DuckDB
Everything you need to know about DuckDB
Language: Jupyter Notebook - Size: 33.2 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

manuelandersen/football-pipeline
DE Zoomcamp 2024 Final Project 🧙
Language: Python - Size: 975 KB - Last synced at: about 2 months ago - Pushed at: 10 months ago - Stars: 0 - Forks: 0

SAP-samples/hana-cloud-relational-data-lake-onboarding
This is an end-to-end onboarding sample for SAP HANA Cloud, relational data lake. It shows how to create schema, load data, and execute queries.
Size: 14.7 MB - Last synced at: 2 months ago - Pushed at: 10 months ago - Stars: 11 - Forks: 11

sskender/cloud-data-engineering-technologies
Cloud Data Engineering Technologies FER labs
Language: Go - Size: 4.94 MB - Last synced at: about 2 months ago - Pushed at: 10 months ago - Stars: 0 - Forks: 0

harrisskordas/Data-Engineering-Lab---Regeneration
Data Engineering Lab, powered by TITAN and ReGeneration
Language: TSQL - Size: 3.7 MB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 0 - Forks: 0

treeverse/lakeFS-hooks
a simple lakeFS webhook for pre-commit and pre-merge validation of data objects
Language: Python - Size: 53.7 KB - Last synced at: about 1 month ago - Pushed at: over 1 year ago - Stars: 12 - Forks: 1

agutiernc/nyc-citi-bike-insights
Data Engineering Project using NYC Citi Bike data for years 2019, 2020, and 2023
Language: Python - Size: 1.03 MB - Last synced at: 3 months ago - Pushed at: about 1 year ago - Stars: 2 - Forks: 1

WinThitiwat/Data_Lake_with_Spark
ETL process to S3 Data Lake through EMR, Spark, Hadoop, Schema-on-Read
Language: Jupyter Notebook - Size: 536 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

wherobots/havasu
The spatial table format for spatial lakehouse
Size: 9.77 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 13 - Forks: 0

OElesin/hadoop-cluster-docker Fork of kiwenlau/hadoop-cluster-docker
Run Hadoop - Spark Cluster within Docker Containers. Now we have added persistence so your data is not lost even when you shut down your Docker containers
Language: Shell - Size: 392 KB - Last synced at: about 1 year ago - Pushed at: about 8 years ago - Stars: 1 - Forks: 2

zkan/data-engineering-on-gcp
Data Engineering on Google Cloud Platform (GCP)
Language: Python - Size: 4.96 MB - Last synced at: about 1 month ago - Pushed at: over 1 year ago - Stars: 1 - Forks: 31

msambol/hl7-data-lake
Build an HL7 Data Lake
Language: TypeScript - Size: 5.04 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 2 - Forks: 2

juliaobenauer/Data-Lakes-with-Spark
Udacity project within the Data Engineer Nanodegree
Language: Python - Size: 394 KB - Last synced at: about 1 year ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

FINRAOS/herd-mdl
Herd-MDL, a turnkey managed data lake in the cloud. See https://finraos.github.io/herd-mdl/ for more information.
Language: Java - Size: 3.33 MB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 15 - Forks: 14

gabrieledarrigo/mfh-dl-performace-testing Fork of SESARLab/mfh-dl-performace-testing
Apache Hive and Apache Druid performance testing for MIND Foods HUB Data Lake
Size: 6.42 MB - Last synced at: about 1 year ago - Pushed at: about 3 years ago - Stars: 0 - Forks: 0
