An open API service providing repository metadata for many open source software ecosystems.

GitHub topics: data-pipeline

lightfeed/lightfeed

Lightfeed SDK to search and filter web data

Language: Python - Size: 112 KB - Last synced at: 33 minutes ago - Pushed at: 35 minutes ago - Stars: 3 - Forks: 1

bookbot-kids/speechline

End-to-end, offline audio categorization, transcription, and segmentation.

Language: Python - Size: 13.9 MB - Last synced at: about 4 hours ago - Pushed at: about 5 hours ago - Stars: 1 - Forks: 1

esther-poniatowski/meandra

Python framework for modular and flexible data workflows with structured configurations

Language: Python - Size: 40 KB - Last synced at: about 16 hours ago - Pushed at: about 17 hours ago - Stars: 0 - Forks: 0

keanteng/wqd7007-project

Big Data Pipeline for NYC Taxi Trips

Language: Jupyter Notebook - Size: 7.37 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0

chalk-ai/docs

Docs for Chalk AI

Language: MDX - Size: 5.68 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 3 - Forks: 2

alwi2404/ETL-Pipeline-for-Region-Segmentation-and-Product-Performance-Analysis

An ETL Project using SQL Server Integration Services (SSIS) for Region Segmentation and Sales Performance Analysis with real-world data pipelines and business insights.

Size: 24.4 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0

bruin-data/ingestr

ingestr is a CLI tool to copy data between any databases with a single command seamlessly.

Language: Python - Size: 167 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 2,951 - Forks: 79

Indexical-Metrics-Measure-Advisory/watchmen

Watchmen Platform is a low-code data platform for data pipelines, metadata management, analysis, indicator objective analysis, and quality management.

Language: TypeScript - Size: 20.6 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 7 - Forks: 3

JKA098/Pokemon-Feistiness-Apache-Spark-Job

This README assumes that, before running the Spark analytics job, you have already installed the correct versions of Java, Hadoop, and Spark, and that you are working inside Ubuntu.

Language: Python - Size: 184 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0

mrmonkeyman68/-E-Commerce_Sales_Analysis

This project dives deep into the sales, delivery, and customer feedback data of major grocery delivery platforms – Blinkit, Swiggy Instamart, and JioMart. It is designed to showcase my ability to clean, analyze, and visualize data using Microsoft Excel.

Size: 11.7 MB - Last synced at: 3 days ago - Pushed at: 4 days ago - Stars: 0 - Forks: 0

funinkina/whatsappchatanalyzer

A Next.js app that analyzes your WhatsApp chats and gives useful, quirky insights

Language: Go - Size: 8.03 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 1 - Forks: 0

tungtt/datapipe-quickstart

Playground for testing Dagster as a data pipeline orchestrator.

Language: Python - Size: 25.4 KB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 0 - Forks: 0

moses000/mysoftware-nocNetIntel

AI-powered NOC assistant for forecasting network outages, analyzing root causes, and recommending proactive resolutions using LLM and time-series intelligence.

Size: 21.5 KB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 0 - Forks: 0

apache/flink-cdc

Flink CDC is a streaming data integration tool

Language: Java - Size: 40.9 MB - Last synced at: 3 days ago - Pushed at: 13 days ago - Stars: 6,051 - Forks: 2,010

chalk-ai/chalk-go

Go client for Chalk

Language: Go - Size: 4.92 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 4 - Forks: 0

eZWALT/eZAutoML

A democratized, lightweight, and modern framework for Python automated machine learning

Language: Python - Size: 1.41 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 0 - Forks: 0

elementary-data/elementary

The dbt-native data observability solution for data & analytics engineers. Monitor your data pipelines in minutes. Available as self-hosted or cloud service with premium features.

Language: HTML - Size: 205 MB - Last synced at: 4 days ago - Pushed at: 5 days ago - Stars: 2,058 - Forks: 184

jvalue/jayvee

Jayvee is a domain-specific language and runtime for automated processing of data pipelines

Language: TypeScript - Size: 10.8 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 175 - Forks: 15

KasparasPlayz/crypto-currency-data-pipeline

The "crypto-currency-data-pipeline" project collects and processes real-time cryptocurrency data from various exchanges. It transforms this data into a structured format for analysis and visualization, enabling users to track market trends effectively.

Size: 3.91 KB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 0 - Forks: 0

apache/seatunnel-web

SeaTunnel is a distributed, high-performance data integration platform for the synchronization and transformation of massive data (offline & real-time).

Language: Java - Size: 17.4 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 677 - Forks: 302

chalk-ai/chalk-elixir

Elixir client for Chalk

Language: Elixir - Size: 57.6 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 3 - Forks: 2

datazip-inc/olake

Fastest open-source tool for replicating Databases to Data Lake in Open Table Formats like Apache Iceberg. ⚡ Efficient, quick and scalable data ingestion for real-time analytics. Supporting Postgres, MongoDB and MySQL

Language: Go - Size: 31.5 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 839 - Forks: 74

abhimehro/Seatek_Analysis

R-based analysis tier for Seatek sensor data processing and Excel workbook generation. Part of a three-tier analysis system that works in conjunction with a Python-based visualization project.

Language: Python - Size: 49.2 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 1 - Forks: 0

Ajax20230/TFX-Pipeline-Automation

The TFX Automation Bot is a cutting-edge Python-based tool designed to streamline machine learning pipelines and optimize TensorFlow Extended (TFX) workflows. This bot automates model training, validation, deployment, and monitoring, making AI development seamless and efficient.

Language: Python - Size: 5.86 KB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 0 - Forks: 0

rudderlabs/rudder-server

Privacy and Security focused Segment-alternative, in Golang and React

Language: Go - Size: 308 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 4,176 - Forks: 330

SonicEXEDVP/real-time-data-pipeline

📡 Real-time data pipeline with Kafka, Flink, Iceberg, Trino, MinIO, and Superset. Ideal for learning data systems.

Size: 1000 Bytes - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 1 - Forks: 0

victoria217-bottino/google-news-scraper

📰 Google News Scraper: a Python tool to fetch, decode, and process Google News articles by keyword and time range. Extract clean article text, decode URLs, and perform NLP effortlessly. Perfect for news aggregation, analysis, or building bots. Includes progress tracking with `tqdm` and customizable features for advanced use cases. 🚀

Size: 1000 Bytes - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 3 - Forks: 1

Multiwoven/multiwoven

🔥🔥🔥 Open source composable CDP - alternative to Hightouch and Census.

Language: Ruby - Size: 6.17 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 1,600 - Forks: 71

Nachoxt17/Earthquake-Data-Pipeline-End-to-End-Azure-Fabric

Full data pipeline to collect, process, and visualize earthquake event data, supporting real-time seismic risk monitoring.

Size: 2.93 KB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 0 - Forks: 0

sandy-sp/gittxt

Gittxt is an AI-focused CLI and plugin tool for extracting, filtering, and packaging text from GitHub repos. Build LLM-compatible datasets, prep code for prompt engineering, and power AI workflows with structured .txt, .json, .md, or .zip outputs.

Language: Python - Size: 7.69 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 0 - Forks: 1

loggdme/kyro

Kyro is a collection of utilities and examples for creating efficient data pipelines in Go, with parallel queues, rate limiters, and much more.

Language: Go - Size: 15.6 KB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 0 - Forks: 0

airbytehq/airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.

Language: Python - Size: 666 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 18,044 - Forks: 4,503

debezium/debezium

Change data capture for a variety of databases. Please log issues at https://issues.redhat.com/browse/DBZ.

Language: Java - Size: 58.5 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 11,354 - Forks: 2,647

starlake-ai/starlake

Declarative text based tool for data analysts and engineers to extract, load, transform and orchestrate their data pipelines.

Language: Scala - Size: 170 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 107 - Forks: 23

ConduitIO/conduit

Conduit streams data between data stores. Kafka Connect replacement. No JVM required.

Language: Go - Size: 12.6 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 490 - Forks: 51

apache/shardingsphere

Empowering Data Intelligence with Distributed SQL for Sharding, Scalability, and Security Across All Databases.

Language: Java - Size: 631 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 20,213 - Forks: 6,802

Saravanan9698/Mental_Health_Survey

A standalone depression-checking application that analyzes a user's mental health by asking a series of highly curated psychiatric questions, running on AWS EC2 and Elastic Beanstalk

Language: Jupyter Notebook - Size: 10.7 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 1 - Forks: 0

Naeem1144/segmentation-project

Customer Segmentation using Machine learning models for clustering analysis

Language: Jupyter Notebook - Size: 16.8 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 0 - Forks: 0

instill-ai/console

📺 Instill Console for 🔮 Instill Core: https://github.com/instill-ai/instill-core

Language: TypeScript - Size: 13.1 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 39 - Forks: 10

tadod12/tadod

TADOD - Data Pipeline for TLC Trip Record Data using Modern Tech Stack

Language: Scala - Size: 78.5 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 2 - Forks: 0

DevDizzle/profit_scout_pipeline

This GCP-based data pipeline automates the ingestion of SEC filings (10-K/Q) and stock prices. It uses AI (Gemini/Vertex AI) via microservices (Cloud Run, Pub/Sub) to generate filing/news summaries and calculate financial ratios, storing results in BigQuery and Cloud Storage for analysis.

Language: Python - Size: 410 KB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 0 - Forks: 0

gireeshbharmshetty/scala-data-pipeline

A simple data transformation pipeline in Scala reading CSVs, joining data, and aggregating results.

Language: Scala - Size: 3.91 KB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 0 - Forks: 0

reugn/go-streams

A lightweight stream processing library for Go

Language: Go - Size: 587 KB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 2,028 - Forks: 164

hedisam/pipeline

A simple data processing pipeline supporting FIFO, fixed & dynamic worker pools, and broadcast stages.

Language: Go - Size: 1.45 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 5 - Forks: 0

anilkeshwani/speech-text-alignment

Functionality for speech data processing including time alignment, encoding with speech encoders (tokenizers) and data preprocessing of common datasets

Language: Python - Size: 1.32 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 0 - Forks: 0

superlinked/superlinked

Superlinked is a Python framework for AI Engineers building high-performance search & recommendation applications that combine structured and unstructured data.

Language: Jupyter Notebook - Size: 103 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 1,069 - Forks: 76

logleads/LogverzCore

LOGVERZ CORE. Logverz Core is the "backend" component of the Logverz application bundle (LogverzReleases Repository). Logverz is a serverless adaptive data pipeline, the fastest route from AWS S3 to instant reports.

Language: JavaScript - Size: 6.17 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 3 - Forks: 0

DataSQRL/sqrl

Flexible development framework for building streaming data applications in SQL with Kafka, Flink, Postgres, GraphQL, and more.

Language: Java - Size: 41.5 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 104 - Forks: 15

KhotChaitanya/Customer_Segmentation_ETL_SSIS

An ETL Project using SQL Server Integration Services (SSIS) for Customer Segmentation and Sales Performance Analysis with real-world data pipelines and business insights.

Size: 29.3 KB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 0 - Forks: 0

instill-ai/cli

⌨️ Instill CLI for 🔮 Instill Core: https://github.com/instill-ai/instill-core

Language: Go - Size: 630 KB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 22 - Forks: 3

Seddryck/Streamistry

Streamistry is a lightweight library designed to support pipeline, streaming, and ETL development for data engineering and integration. Its versatility makes it an excellent tool for building robust, scalable data workflows and optimizing data processing tasks. With features such as accumulators, windows, and sinks, it efficiently handles streaming

Language: C# - Size: 2.22 MB - Last synced at: 4 days ago - Pushed at: 12 days ago - Stars: 5 - Forks: 0

husskhosravi/athena-glue-s3-pipeline

End-to-end project using S3, Glue, Athena, and QuickSight to build a secure, automated data processing and visualisation workflow

Size: 463 KB - Last synced at: 12 days ago - Pushed at: 13 days ago - Stars: 0 - Forks: 0

alankrantas/loc-documentation

End-user documentation for FST Network's Logic Operating Centre (LOC), a serverless SaaS data product platform

Language: MDX - Size: 19.6 MB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 0 - Forks: 0

spotify/klio

Smarter data pipelines for audio.

Language: Python - Size: 73.7 MB - Last synced at: about 20 hours ago - Pushed at: over 1 year ago - Stars: 850 - Forks: 50

hamamiasma/Python-Data-Engineering-Projekt-

Extract, Transform, Load (ETL) Pipeline using Python to process CSV and JSON and XML files into unified structured data.

Language: Python - Size: 10.4 MB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 0 - Forks: 0

scribe-org/Scribe-Data

Wikidata and Wikipedia language data extraction

Language: Python - Size: 345 MB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 39 - Forks: 79

thiagoneye/course-apache_beam

Apache Beam studies.

Language: Jupyter Notebook - Size: 2.93 KB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 0 - Forks: 0

PFund-Software-Ltd/pfeed

Data Engine for Manual/Algo Trading: Download/Stream -> Clean -> Store. Supports Data Lakehouse Architecture. Clean Once and Forget.

Language: Python - Size: 3.57 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 23 - Forks: 5

whylabs/whylogs

An open-source data logging library for machine learning models and data pipelines. 📚 Provides visibility into data quality & model performance over time. 🛡️ Supports privacy-preserving data collection, ensuring safety & robustness. 📈

Language: Jupyter Notebook - Size: 181 MB - Last synced at: 13 days ago - Pushed at: 4 months ago - Stars: 2,712 - Forks: 126

ReusJimenez/python-data-engineering

Hands-on data engineering labs with Python. ⚙️

Language: Jupyter Notebook - Size: 27.7 MB - Last synced at: 16 days ago - Pushed at: 17 days ago - Stars: 0 - Forks: 0

childmindresearch/bids2table

Efficiently index large-scale BIDS neuroimaging datasets and derivatives

Language: Jupyter Notebook - Size: 395 KB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 17 - Forks: 7

chofste/ETL

Language: Python - Size: 2.54 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 4 - Forks: 0

ubisoft/mobydq

🐳 Tool to automate data quality checks on data pipelines

Language: Vue - Size: 188 MB - Last synced at: 3 days ago - Pushed at: over 2 years ago - Stars: 255 - Forks: 62

infoslack/awesome-kafka

A list about Apache Kafka

Size: 96.7 KB - Last synced at: 11 days ago - Pushed at: about 2 months ago - Stars: 581 - Forks: 163

AbsaOSS/pramen

Resilient data pipeline framework running on Apache Spark

Language: Scala - Size: 3.83 MB - Last synced at: 19 days ago - Pushed at: 19 days ago - Stars: 24 - Forks: 3

InfuseAI/awesome-public-dbt-projects

A curated list of awesome public dbt projects

Size: 39.1 KB - Last synced at: 7 days ago - Pushed at: over 1 year ago - Stars: 126 - Forks: 16

tracebloc/data-ingestors

tracebloc data pipeline for training/test dataset setup

Language: Python - Size: 3.93 MB - Last synced at: 13 days ago - Pushed at: 20 days ago - Stars: 2 - Forks: 0

mikecerton/UserInsight-Streaming-Data-Pipeline

UserInsight-Streaming-Data-Pipeline is a real-time pipeline that ingests API data into Kafka, processes it with Spark, stores it in S3, and uses AWS Lambda to load it into Redshift. The data is then used to create a dashboard in Looker. [Data Engineer]

Language: Python - Size: 325 KB - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 1 - Forks: 0

snowplow/snowplow

The leader in Next-Generation Customer Data Infrastructure

Language: Scala - Size: 25.5 MB - Last synced at: 19 days ago - Pushed at: about 2 months ago - Stars: 6,915 - Forks: 1,191

AgnostiqHQ/covalent

Pythonic tool for orchestrating machine-learning/high performance/quantum-computing workflows in heterogeneous compute environments.

Language: Python - Size: 370 MB - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 821 - Forks: 103

95xin/Data-Engineering-Project---Automatic-Batch-Data-Processing

Data Engineering Project - Automated Batch Data Processing

Language: Python - Size: 996 KB - Last synced at: 18 days ago - Pushed at: 21 days ago - Stars: 1 - Forks: 0

1kbgz/tributary

Streaming reactive and dataflow graphs in Python

Language: Python - Size: 10.3 MB - Last synced at: 14 days ago - Pushed at: 24 days ago - Stars: 452 - Forks: 36

kevinrhz/CineSteam

A PostgreSQL database integrating IMDB's Top 1000 Movies and Steam's 2025 Game dataset for data analysis, SQL queries, and ETL with Python.

Language: Python - Size: 36.7 MB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 0 - Forks: 0

montara-io/dbt-command-center

Never sift through endless dbt™ logs again. dbt Command Center is a free, open-source, local web application that provides a user-friendly interface to monitor and manage dbt runs.

Language: TypeScript - Size: 3.55 MB - Last synced at: 7 days ago - Pushed at: 21 days ago - Stars: 28 - Forks: 0

agnostk/nasa-space-activity

A scalable, cloud-native ETL pipeline that extracts, transforms, and enriches data from NASA APIs using AWS Glue, Lightsail, and RDS — all orchestrated with Terraform. Features modular design, medallion architecture (bronze, silver, gold), image metadata extraction and classification with PyTorch, and a bonus Mosaic Generator app.

Language: Python - Size: 2.52 MB - Last synced at: 22 days ago - Pushed at: 22 days ago - Stars: 0 - Forks: 0

gervais-amoah/finance_pipeline

A complete ETL pipeline for currency exchange rates using three data sources: Kaggle CSV, Frankfurter API, and x-rates.com (web scraping). Transforms and stores data in a SQLite database and CSV files for future analysis. Built with Python, requests, BeautifulSoup, and pandas. Automation-ready and modular for easy scaling.

Language: Python - Size: 10.5 MB - Last synced at: 23 days ago - Pushed at: 23 days ago - Stars: 0 - Forks: 0

swapnaxdata/Nyc-Taxi-Data-Engineering-Project

Language: Jupyter Notebook - Size: 2.94 MB - Last synced at: 23 days ago - Pushed at: 24 days ago - Stars: 0 - Forks: 0

theatashaikh/end-to-end-azure-data-engineering-and-analytics-project

End-to-End Azure Data Engineering Project: Tokyo Olympics 2021 Analysis. A complete data pipeline built on Microsoft Azure to ingest, process, and analyze Olympic data.

Size: 373 KB - Last synced at: 23 days ago - Pushed at: 24 days ago - Stars: 0 - Forks: 0

RaccoonOnion/job-scraping

A job post processing pipeline built with Scrapy (scraping), MongoDB (storage), and Redis (deduplication). Containerized and easy to run & deploy!

Language: Python - Size: 190 KB - Last synced at: 24 days ago - Pushed at: 24 days ago - Stars: 0 - Forks: 0

sergiolm96/Cluster-Spark-Hadoop

🔍 Web log analysis with PySpark and Hadoop on Docker to extract IPs, errors, and access patterns.

Language: Jupyter Notebook - Size: 20.5 KB - Last synced at: 26 days ago - Pushed at: 26 days ago - Stars: 0 - Forks: 0

zazuko/barnard59

An intuitive and flexible RDF pipeline solution designed to simplify and automate ETL processes for efficient data management.

Language: JavaScript - Size: 3.66 MB - Last synced at: 2 days ago - Pushed at: about 1 month ago - Stars: 30 - Forks: 2

josephmachado/efficient_data_processing_spark

Code for "Efficient Data Processing in Spark" Course

Language: Python - Size: 23.9 MB - Last synced at: 27 days ago - Pushed at: 7 months ago - Stars: 292 - Forks: 62

GoogleCloudPlatform/data-science-on-gcp

Source code accompanying book: Data Science on the Google Cloud Platform, Valliappa Lakshmanan, O'Reilly 2017

Language: Jupyter Notebook - Size: 6.51 MB - Last synced at: 28 days ago - Pushed at: about 2 months ago - Stars: 1,383 - Forks: 721

olirice/flupy

Fluent data pipelines for python and your shell

Language: Python - Size: 434 KB - Last synced at: 6 days ago - Pushed at: 8 months ago - Stars: 192 - Forks: 15

pipeline-tools/gusty

Making DAG construction easier

Language: Python - Size: 434 KB - Last synced at: 19 days ago - Pushed at: 2 months ago - Stars: 261 - Forks: 12

letfontenelle/data-pipeline-azure-medallion-architecture

This project focuses on building a data pipeline to analyze large sensor datasets collected from water bodies across European countries. By leveraging Azure Services, the pipeline ingests, processes, and visualizes data to gain insights into water quality trends and determinands.

Language: Jupyter Notebook - Size: 9.77 KB - Last synced at: 28 days ago - Pushed at: 28 days ago - Stars: 0 - Forks: 0

tvdboom/ATOM

Automated Tool for Optimized Modelling

Language: HTML - Size: 826 MB - Last synced at: 6 days ago - Pushed at: 10 months ago - Stars: 158 - Forks: 14

codetiger/dataflow-rs

An event-driven workflow orchestration engine written in Rust that empowers you to define and execute data pipelines as code

Language: Rust - Size: 4.88 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

illuin-tech/data-pipeline

Toolkit for describing data transformation pipelines by compositing simple reusable components.

Language: Java - Size: 1.27 MB - Last synced at: 28 days ago - Pushed at: about 1 month ago - Stars: 6 - Forks: 0

mensenvau/airflow-mysql-to-s3

🚀 Apache Airflow Dynamic MySQL to S3

Language: Python - Size: 535 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

airscholar/e2e-data-engineering

An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All components are containerized with Docker for easy deployment and scalability.

Language: Python - Size: 289 KB - Last synced at: 30 days ago - Pushed at: 3 months ago - Stars: 240 - Forks: 117

dataflint/spark

Performance Observability for Apache Spark

Language: TypeScript - Size: 18.8 MB - Last synced at: 29 days ago - Pushed at: about 1 month ago - Stars: 246 - Forks: 25

sjlewis25/pizza-delivery-pipeline

Simulates a real-world data pipeline for a pizza delivery service using AWS services and Terraform. Ingests and processes delivery data with S3, triggers Lambda functions for processing, and stores structured data in DynamoDB. Highlights use of automation, event-driven triggers, and real-time cloud-based data workflows.

Language: HCL - Size: 25.4 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

abeltavares/real-time-data-pipeline

📡 Real-time data pipeline with Kafka, Flink, Iceberg, Trino, MinIO, and Superset. Ideal for learning data systems.

Language: Python - Size: 1010 KB - Last synced at: 19 days ago - Pushed at: 4 months ago - Stars: 43 - Forks: 4

FAIRDataPipeline/FAIR-CLI

Command line interface for the FAIR Data Pipeline

Language: Python - Size: 5.98 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 4 - Forks: 1

jugnuarora/france-courses-enrollments

Data pipeline for course enrollments in France. Every month, providers report the enrollments in their programs. The idea is to collect the listed courses and their enrollments each month, then examine enrollment trends and compare training providers across different courses.

Language: Python - Size: 60.1 MB - Last synced at: 29 days ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

bytedance/bitsail

BitSail is a distributed high-performance data integration engine which supports batch, streaming and incremental scenarios. BitSail is widely used to synchronize hundreds of trillions of data every day.

Language: Java - Size: 26.4 MB - Last synced at: 27 days ago - Pushed at: over 1 year ago - Stars: 1,654 - Forks: 331

chaseofthejungle/intro-to-ci-cd-pipelines

A brief overview of what CI/CD pipelines are and why they are so important.

Size: 7.81 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

adilkhash/Data-Engineering-HowTo

A list of useful resources to learn Data Engineering from scratch

Size: 56.6 KB - Last synced at: about 1 month ago - Pushed at: 11 months ago - Stars: 3,750 - Forks: 534

fremantle-industries/slurpee

A GUI frontend to manage blockchain ingestion with slurp

Language: Elixir - Size: 2.69 MB - Last synced at: 18 days ago - Pushed at: about 2 years ago - Stars: 21 - Forks: 1