GitHub topics: hadoop | Ecosyste.ms: Repos

AbsaOSS/spline

Data Lineage Tracking And Visualization Solution

Language: Scala - Size: 8.44 MB - Last synced at: about 10 hours ago - Pushed at: about 11 hours ago - Stars: 632 - Forks: 159

QubitPi/hadoop Fork of apache/hadoop

hadoop.qubitpi.org 📚 My personal study notes on my favorite software Hadoop

Language: Java - Size: 264 MB - Last synced at: 32 minutes ago - Pushed at: 35 minutes ago - Stars: 0 - Forks: 0

QubitPi/jupiter

Collection of Docker Images Commonly Used in Data-Intensive Applications

Size: 1.22 MB - Last synced at: about 16 hours ago - Pushed at: about 17 hours ago - Stars: 0 - Forks: 1

apache/nutch

Apache Nutch is an extensible and scalable web crawler

Language: Java - Size: 132 MB - Last synced at: about 7 hours ago - Pushed at: 3 months ago - Stars: 3,036 - Forks: 1,257

PoSungKim/development_study

Open-source-based development content

Language: Java - Size: 5.52 MB - Last synced at: about 18 hours ago - Pushed at: about 19 hours ago - Stars: 3 - Forks: 0

apache/doris-website

Apache Doris Website

Language: TypeScript - Size: 451 MB - Last synced at: about 20 hours ago - Pushed at: about 21 hours ago - Stars: 102 - Forks: 326

GridProtectionAlliance/openPDC

Open Source Phasor Data Concentrator

Language: C# - Size: 2.48 GB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 138 - Forks: 57

apache/hadoop

Apache Hadoop

Language: Java - Size: 567 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 15,127 - Forks: 9,059

apache/calcite

Apache Calcite

Language: Java - Size: 102 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 4,867 - Forks: 2,427

A collection of ready-to-use Docker development environments for multiple Linux distributions (Ubuntu, Debian, Alpine, Arch, Kali). Includes shared configurations, utility scripts, and comprehensive documentation for reproducible development setups across teams and CI/CD pipelines.

Language: Shell - Size: 14.6 KB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 1 - Forks: 0

hoangsonww/End-to-End-Data-Pipeline

📈 A scalable, production-ready data pipeline for real-time streaming & batch processing, integrating Kafka, Spark, Airflow, AWS, Kubernetes, and MLflow. Supports end-to-end data ingestion, transformation, storage, monitoring, and AI/ML serving with CI/CD automation using Terraform & GitHub Actions.

Language: Python - Size: 31.1 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 41 - Forks: 24

apache/incubator-wayang

Apache Wayang(incubating) is the first cross-platform data processing system.

Language: Java - Size: 19.1 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 222 - Forks: 96

amrajib/velox

Velox is a work-in-progress, experimental Rust-based operating system written for fun and learning

Language: Rust - Size: 232 KB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 0 - Forks: 0

apache/ignite

Apache Ignite

Language: Java - Size: 444 MB - Last synced at: 1 day ago - Pushed at: 2 days ago - Stars: 4,944 - Forks: 1,923

crystalloide/Big-Data-Cluster Fork of mrugankray/Big-Data-Cluster

The goal of the project is to create a Docker cluster providing access to Hadoop, HDFS, Hive, PySpark, Sqoop, Airflow, Kafka, Flume, Postgres, Cassandra, Hue, Zeppelin, Kadmin, Kafka Control Center, and pgAdmin. This cluster is intended exclusively for development use. Do not use it in production.

Language: Shell - Size: 467 KB - Last synced at: 1 day ago - Pushed at: 2 days ago - Stars: 1 - Forks: 0

isxcode/spark-yun

Big data computing platform based on Spark <至轻云: ultra-lightweight big data computing platform / data center / master data>

Language: Java - Size: 5.97 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 207 - Forks: 55

hortonworks/cloudbreak

CDP Public Cloud is an integrated analytics and data management platform deployed on cloud services. It offers broad data analytics and artificial intelligence functionality along with secure user access and data governance features.

Language: Java - Size: 221 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 358 - Forks: 238

Ismail-Hossain-1/Distributed-System---MapReduce

Distributed System example code for .txt data with hadoop

Language: Java - Size: 53.7 KB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0

ego-creator/hepmassClassification

PySpark pipeline for particle classification in high-energy physics (HEPMASS dataset). Includes distributed preprocessing, model training (logistic regression, decision trees), evaluation, and key visualizations. Optimized for Hadoop/Spark.

Size: 1.95 KB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0

h2oai/h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.

Language: Jupyter Notebook - Size: 597 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 7,204 - Forks: 2,024

wgzhao/Addax

A fast and versatile ETL tool that can transfer data between RDBMS and NoSQL seamlessly

Language: Java - Size: 45.2 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 1,289 - Forks: 313

joseanmos/Easy_Park

Easy_Park offers a smart solution for parking, ensuring safety and efficiency with real-time distance measurements. 🚗🔧 This project integrates an ultrasonic sensor with a microcontroller to provide clear visual and auditory alerts for drivers.

Size: 5.5 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 1 - Forks: 0

hoangsonww/Moodify-Emotion-Music-App

🎹 Moodify - an emotion-based music recommendation system that uses AI/ML models to analyze text, speech, and facial expressions, providing personalized music recommendations across web and mobile platforms.

Language: JavaScript - Size: 236 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 56 - Forks: 23

infernoYam1n/BlazeData

A Curated Data Science & Deep Learning Repository. BlazeData is a high-impact, modular repository designed to accelerate my journey in Data Science, Machine Learning, and Deep Learning.

Language: Jupyter Notebook - Size: 119 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0

velpurinagasai99/Spark

Projects from TrendyTech

Language: Scala - Size: 1.36 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0

NitchayaninT/BigDataProcessing

For a big data processing course. Lectures cover the use of Hadoop, MongoDB, Spark, etc.

Language: Jupyter Notebook - Size: 2.62 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0

demonfire2134/calc

Calc is a simple calculator application that performs basic arithmetic operations. It features a user-friendly interface, allowing users to quickly add, subtract, multiply, and divide numbers.

Size: 1.95 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0

marcelmay/hadoop-hdfs-fsimage-exporter

Exports Hadoop HDFS content statistics to Prometheus

Language: Java - Size: 552 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 155 - Forks: 47

apache/hive

Apache Hive

Language: Java - Size: 704 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 5,727 - Forks: 4,750

isxcode/flink-yun

Streaming data analysis platform based on Flink <至流云: ultra-lightweight real-time stream analysis platform / data synchronization / real-time synchronization>

Language: Java - Size: 2.8 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 21 - Forks: 17

xiongshengxiao/CloudEon

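CloudEon uses Kubernetes to install and deploy open-source big data components, making it possible to run an open-source big data platform in containers. This lets you spend less attention on managing and maintaining the underlying resources.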

Language: FreeMarker - Size: 80.4 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 1 - Forks: 0

groda/big_data

Big Data essentials: Hadoop, MapReduce, Spark. Explore tutorials and demos in Jupyter notebooks—most are self-contained and live, ready to run with a click.

Language: Jupyter Notebook - Size: 51.9 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 78 - Forks: 27

prestodb/presto

The official home of the Presto distributed SQL query engine for big data

Language: Java - Size: 231 MB - Last synced at: 3 days ago - Pushed at: 4 days ago - Stars: 16,383 - Forks: 5,463

openaire/iis

Information Inference Service of the OpenAIRE system

Language: Java - Size: 71.8 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 21 - Forks: 11

trinodb/trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)

Language: Java - Size: 262 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 11,448 - Forks: 3,233

agile-lab-dev/wasp

WASP is a framework for building complex real-time big data applications. It relies on a kind of Kappa/Lambda architecture, mainly leveraging Kafka and Spark. If you need to ingest huge amounts of heterogeneous data and analyze it through complex pipelines, this is the framework for you.

Language: Scala - Size: 7.7 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 31 - Forks: 12

miguno/avro-hadoop-starter 📦

Example MapReduce jobs in Java, Hive, Pig, and Hadoop Streaming that work on Avro data.

Language: Java - Size: 650 KB - Last synced at: 2 days ago - Pushed at: over 9 years ago - Stars: 115 - Forks: 83

terascope/teraslice

Scalable data processing pipelines in JavaScript

Language: TypeScript - Size: 110 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 59 - Forks: 14

deeplearning4j/deeplearning4j

Suite of tools for deploying and training deep learning models using the JVM. Highlights include model import for keras, tensorflow, and onnx/pytorch, a modular and tiny c++ library for running math code and a java based math library on top of the core c++ library. Also includes samediff: a pytorch/tensorflow like library for running deep learn...

Language: Java - Size: 728 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 14,018 - Forks: 3,851

linkedin/venice

Venice, Derived Data Platform for Planet-Scale Workloads.

Language: Java - Size: 55 MB - Last synced at: 5 days ago - Pushed at: 6 days ago - Stars: 542 - Forks: 98

asier-ortiz/apuntes-ia-ds

Materials and notes from the AI and Data Science course

Language: Jupyter Notebook - Size: 228 MB - Last synced at: 3 days ago - Pushed at: 6 days ago - Stars: 0 - Forks: 0

asier-ortiz/ai-course-notebooks

AI course Notebooks and Exercises

Language: Jupyter Notebook - Size: 455 MB - Last synced at: 3 days ago - Pushed at: 6 days ago - Stars: 1 - Forks: 0

Fradhyle/Voo-ong

Final team project of the 10th big data cohort at 더조은컴퓨터아카데미

Language: HiveQL - Size: 57.7 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 0 - Forks: 0

ahmedmohamedfoua/Big-Data-Project---Predicting-Trip-Fares-with-Spark-Hive

This repository provides a complete workflow for predicting ride-sharing trip fares in New York City using Spark and Hive. Explore the data, models, and results while leveraging the power of big data! 🐙🚀

Language: Java - Size: 906 KB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 0 - Forks: 0

HariSekhon/Knowledge-Base

Large Tech Knowledge Base from 20 years in DevOps, Linux, Cloud, Big Data, AWS, GCP etc - gradually porting my large private knowledge base to public

Language: Shell - Size: 183 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 180 - Forks: 34

nielsbasjes/logparser

Easy parsing of Apache HTTPD and NGINX access logs with Java, Hadoop, Hive, Flink, Beam, Storm, Drill, ...

Language: Java - Size: 2.8 MB - Last synced at: 3 days ago - Pushed at: 7 days ago - Stars: 159 - Forks: 42

apache/kyuubi

Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.

Language: Scala - Size: 60 MB - Last synced at: 6 days ago - Pushed at: 9 days ago - Stars: 2,203 - Forks: 943

umbertocicciaa/air-flights-big-data-unical

Spark + Hadoop + Hive + Streamlit application for analyzing an air flight dataset for a university big data exam

Language: Python - Size: 320 KB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 0 - Forks: 0

apache/ozone

Scalable, reliable, distributed storage system optimized for data analytics and object store workloads.

Language: Java - Size: 103 MB - Last synced at: 6 days ago - Pushed at: 7 days ago - Stars: 934 - Forks: 542

zeekling/hadoop_book

Hadoop study notes.

Size: 216 KB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 2 - Forks: 0

ShapeLayer/election-result-8values-spark

Evaluate South Korea's political situation using the result of the 21st National Assembly election and 8values political quiz.

Language: Typst - Size: 1.83 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 0 - Forks: 0

mxagar/spark_big_data_guide

This repository contains my personal guide on Spark and topics related to Big Data.

Language: Jupyter Notebook - Size: 6.44 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 1 - Forks: 1

MadhukarSaiBabu/Aviation-Trend-Analysis-using-MapReduce-and-R

Developed a data-driven solution leveraging Hadoop MapReduce, Hive, and R to analyze air travel data. Identified trends in passenger volume, route utilization, and peak travel periods, providing actionable insights for optimizing airline operations and improving the passenger experience.

Size: 1.34 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 1 - Forks: 0

smart-data-lake/smart-data-lake

Smart Automation Tool for building modern Data Lakes and Data Pipelines

Language: Scala - Size: 44.1 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 124 - Forks: 24

Diakkoo/docker-hadoop-container

Flask and HDFS containers written with Dockerfiles; docker-compose is used to deploy the HDFS cluster more quickly

Language: Python - Size: 22.5 KB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 0 - Forks: 0

lynnlangit/learning-hadoop-and-spark

Companion to Learning Hadoop and Learning Spark courses on LinkedIn Learning

Language: HTML - Size: 13.6 MB - Last synced at: 3 days ago - Pushed at: 6 months ago - Stars: 195 - Forks: 165

PriyankaJhaTheDeveloper/DataAnalystPortfolioProjects

This repository contains my Data Analytics portfolio projects ranging from SQL, Python, Tableau, Excel, and Hadoop (HiveQL).

Language: Jupyter Notebook - Size: 26.1 MB - Last synced at: 9 days ago - Pushed at: over 3 years ago - Stars: 498 - Forks: 88

timebusker/timebusker.github.io

timebusker.github.io

Language: HTML - Size: 235 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 1 - Forks: 0

Jim-by/Personalized-Recommendation-System

End-to-end personalised recommender system for e-commerce: synthetic data, PySpark, Delta Lake, model training, evaluation, monitoring, A/B test.

Language: Python - Size: 86.9 KB - Last synced at: 3 days ago - Pushed at: 10 days ago - Stars: 0 - Forks: 0

TuLe142857/PersonalizedPageRankHadoop

Implementation of Personalized PageRank using Hadoop

Language: Java - Size: 19.7 MB - Last synced at: about 13 hours ago - Pushed at: 10 days ago - Stars: 0 - Forks: 0

pavithra19/apache_spark_people_data_processor

This project is a data processing application built with Apache Spark and Scala. It is designed to efficiently process, analyze, and transform large datasets of people data. It leverages Spark’s distributed computing capabilities to handle scalable data ingestion, cleaning, and reporting. Shell scripts are included for Hadoop deployment.

Language: Scala - Size: 1.82 MB - Last synced at: 3 days ago - Pushed at: 10 days ago - Stars: 1 - Forks: 0

HariSekhon/DevOps-Python-tools

80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Functions, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.

Language: Python - Size: 3.11 MB - Last synced at: 9 days ago - Pushed at: about 2 months ago - Stars: 798 - Forks: 348

keanteng/wqd7007-project

Uncovering Patterns in Urban Transportation: A Big Data Analysis of NYC Taxi Trips

Language: Jupyter Notebook - Size: 8.47 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 1 - Forks: 0

Qihoo360/hbox

AI on Hadoop

Language: Java - Size: 126 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 1,731 - Forks: 386

hexnn/Stark

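An easy-to-use, ultra-high-performance big data governance engine built on Spark + Spark MLlib + Debezium + Deequ. Suited to unified batch and stream data integration and data analysis; supports CDC real-time data capture, machine learning models, data quality validation, data labeling, sensitive data identification, data modeling, algorithm modeling, and OLAP data analysis.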

Language: Scala - Size: 136 KB - Last synced at: 9 days ago - Pushed at: 12 days ago - Stars: 29 - Forks: 1

BsoBird/filesystem-catalog-original

A prototype for implementing datalake catalog management based on arbitrary file systems

Language: Java - Size: 161 KB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 1 - Forks: 0

apache/tez

Apache Tez

Language: Java - Size: 29.3 MB - Last synced at: 6 days ago - Pushed at: 18 days ago - Stars: 496 - Forks: 433

iterative/dvc-hdfs

HDFS/WebHDFS plugin for dvc

Language: Python - Size: 102 KB - Last synced at: 8 days ago - Pushed at: 12 days ago - Stars: 2 - Forks: 1

apache/doris-thirdparty

Self-managed thirdparty dependencies for Apache Doris

Size: 515 MB - Last synced at: about 9 hours ago - Pushed at: 4 days ago - Stars: 41 - Forks: 44

Tencent/APIJSON

🏆 Real-time, coding-free, full-featured, and secure ORM library 🚀 Provides APIs and docs without backend coding; the data and structure of the returned JSON can be customized by frontend (client) users

Language: Java - Size: 69.8 MB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 17,973 - Forks: 2,230

TmohamedashrafT/High-Availability-Bigdata-Cluster

A highly available, fully distributed big data cluster built with Docker, integrating Hadoop HDFS, YARN, ZooKeeper, HBase, Hive, Spark, and Tez. Designed for scalability, fault tolerance, and seamless data processing in a containerized environment.

Language: Shell - Size: 16.6 KB - Last synced at: 3 days ago - Pushed at: 14 days ago - Stars: 0 - Forks: 0

criteo/tf-yarn

Train TensorFlow models on YARN in just a few lines of code!

Language: Python - Size: 775 KB - Last synced at: 9 days ago - Pushed at: over 1 year ago - Stars: 89 - Forks: 29

PasanAbeysekara/Taxi-Pickup-Hotspot-Analysis-using-Hadoop-MapReduce

This project analyzes one month of NYC Yellow Taxi trip data (January 2016) to identify the busiest taxi pickup locations. It utilizes the Hadoop MapReduce framework to process the data and a lookup table to map location IDs to human-readable zone names.

Language: Java - Size: 5.65 MB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 0 - Forks: 0

marcelmay/hfsa

Hadoop FSImage Analyzer (HFSA)

Language: Java - Size: 3.43 MB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 59 - Forks: 24

snowlift/trino-storage

Storage connector for Trino

Language: Java - Size: 2.69 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 111 - Forks: 36

BryanTheLai/Data-Engineering-Assignment-RDSY2S2

This repository contains a data engineering project aimed at processing and analyzing scraped data using PySpark, Redis, and Neo4j. The goal is to efficiently store, process, and analyze text data.

Language: Jupyter Notebook - Size: 1.8 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 0 - Forks: 0

PramithaMJ/job-analysis-MapReduce

Technical Skills Analysis using MapReduce - hadoop

Language: Shell - Size: 883 KB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 0 - Forks: 0

stackabletech/kafka-operator

Stackable Operator for Apache Kafka

Language: Rust - Size: 2.62 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 24 - Forks: 7

ddediu/AdhereR

Computation of adherence to medications from Electronic Healthcare Data in R

Language: HTML - Size: 191 MB - Last synced at: 9 days ago - Pushed at: over 1 year ago - Stars: 31 - Forks: 6

MoRan1607/BigDataGuide

Big data learning: learn big data from scratch, including study videos for each stage of learning and interview materials

Size: 154 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 2,958 - Forks: 903

Malisha4065/HadoopConfiguration

Apache Hadoop Cluster configuration with original apache/hadoop:3.4.1 docker image (with YARN)

Language: Shell - Size: 6.84 KB - Last synced at: 17 days ago - Pushed at: 17 days ago - Stars: 3 - Forks: 0

apache/calcite-avatica

Apache Calcite Avatica

Language: Java - Size: 32 MB - Last synced at: about 8 hours ago - Pushed at: 3 months ago - Stars: 260 - Forks: 233

confluentinc/kafka-connect-hdfs

Kafka Connect HDFS connector

Language: Java - Size: 2.72 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 22 - Forks: 395

apache/drill

Apache Drill is a distributed MPP query layer for self describing data

Language: Java - Size: 68 MB - Last synced at: 6 days ago - Pushed at: 24 days ago - Stars: 1,975 - Forks: 982

itsSwapnil/Data-Interpolation-with-Radial-Basis-Function

A PySpark-based solution for cleaning and interpolating battery sensor data using forward/backward fill and Radial Basis Function (RBF) spatial interpolation. Outputs a clean, fully interpolated dataset in CSV format for advanced analysis.

Language: Python - Size: 13.7 KB - Last synced at: 18 days ago - Pushed at: 18 days ago - Stars: 0 - Forks: 0

jathavaan/bds-seoul-hadoop

Language: Python - Size: 81.1 KB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 0 - Forks: 0

mjakubowski84/parquet4s

Read and write Parquet in Scala. Use Scala classes as schema. No need to start a cluster.

Language: Scala - Size: 2.33 MB - Last synced at: 14 days ago - Pushed at: 28 days ago - Stars: 291 - Forks: 67

PramithaMJ/hadoop-cluster-manager

Complete Apache Hadoop 3.4.1 cluster installation and management toolkit with automated scripts, comprehensive documentation, and production-ready configuration templates for single-node and multi-node deployments.

Language: Shell - Size: 33.2 KB - Last synced at: 19 days ago - Pushed at: 19 days ago - Stars: 0 - Forks: 0

itsSwapnil/pyspark-incremental-airflow

This repository contains an Airflow DAG that orchestrates an incremental data pipeline using PySpark scripts. The pipeline automates daily data processing, syncs results to S3, performs housekeeping, and loops until a target date threshold is reached.

Language: Python - Size: 13.7 KB - Last synced at: 19 days ago - Pushed at: 20 days ago - Stars: 0 - Forks: 0

rociobenitez/bigdata-fundamentals-notes

Practical collection of notes on Java, Scala, Hadoop, Spark, Hive, and more.

Size: 1.1 MB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 0 - Forks: 0

cloudposse/terraform-aws-emr-cluster

Terraform module to provision an Elastic MapReduce (EMR) cluster on AWS

Language: HCL - Size: 4.06 MB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 74 - Forks: 82

apache/hadoop-hdfs 📦

Mirror of Apache Hadoop HDFS

Language: Java - Size: 34.5 MB - Last synced at: 6 days ago - Pushed at: over 6 years ago - Stars: 199 - Forks: 115

DivitMittal/Datathon-BigData

Efficient Data Processing ETL Pipeline for Event Records

Language: Jupyter Notebook - Size: 4.1 MB - Last synced at: 18 days ago - Pushed at: 22 days ago - Stars: 0 - Forks: 0

TrivadisPF/platys-modern-data-platform

Support for generating modern platforms dynamically with services such as Kafka, Spark, Streamsets, HDFS, ....

Language: Python - Size: 533 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 75 - Forks: 16

baifendian/swordfish

Open-source distributed workflow scheduling tool that also supports streaming tasks.

Language: Java - Size: 3.84 MB - Last synced at: 19 days ago - Pushed at: over 7 years ago - Stars: 39 - Forks: 18

VladimirZelenokor1/Big-Data-Project---Predicting-Trip-Fares-with-Spark-Hive

A CRISP-DM–based big data pipeline for predicting NYC ride-sharing trip fares: ingesting 2024 TLC data via Sqoop into HDFS/Hive, performing ETL and feature engineering with Spark & PySpark, training and tuning Linear Regression & Gradient Boosted Tree models, and outlining end-to-end deployment.

Language: Java - Size: 906 KB - Last synced at: 15 days ago - Pushed at: 23 days ago - Stars: 1 - Forks: 0

benedekh/bigdata-projects

Student projects in the Big Data field.

Language: Java - Size: 198 KB - Last synced at: 24 days ago - Pushed at: 24 days ago - Stars: 19 - Forks: 12

HariSekhon/DevOps-Bash-tools

1000+ DevOps Bash Scripts - AWS, GCP, Kubernetes, Docker, CI/CD, APIs, SQL, PostgreSQL, MySQL, Hive, Impala, Kafka, Hadoop, Jenkins, GitHub, GitLab, BitBucket, Azure DevOps, TeamCity, Spotify, MP3, LDAP, Code/Build Linting, pkg mgmt for Linux, Mac, Python, Perl, Ruby, NodeJS, Golang, Advanced dotfiles: .bashrc, .vimrc, .gitconfig, .screenrc, tmux..

Language: Shell - Size: 11.2 MB - Last synced at: 24 days ago - Pushed at: 24 days ago - Stars: 6,769 - Forks: 1,265

awesome-kyuubi/hadoop-testing

Testing Sandbox for Hadoop Ecosystem Components

Language: Jinja - Size: 2.12 MB - Last synced at: 4 days ago - Pushed at: 3 months ago - Stars: 36 - Forks: 16

apache/calcite-site

Apache Calcite Website

Language: HTML - Size: 103 MB - Last synced at: 6 days ago - Pushed at: 25 days ago - Stars: 5 - Forks: 9