GitHub topics: hadoop
AbsaOSS/spline
Data Lineage Tracking And Visualization Solution
Language: Scala - Size: 8.44 MB - Last synced at: about 10 hours ago - Pushed at: about 11 hours ago - Stars: 632 - Forks: 159

QubitPi/hadoop Fork of apache/hadoop
hadoop.qubitpi.org 📚 My personal study notes on my favorite software Hadoop
Language: Java - Size: 264 MB - Last synced at: 32 minutes ago - Pushed at: 35 minutes ago - Stars: 0 - Forks: 0

QubitPi/jupiter
Collection of Docker Images Commonly Used in Data-Intensive Applications
Size: 1.22 MB - Last synced at: about 16 hours ago - Pushed at: about 17 hours ago - Stars: 0 - Forks: 1

apache/nutch
Apache Nutch is an extensible and scalable web crawler
Language: Java - Size: 132 MB - Last synced at: about 7 hours ago - Pushed at: 3 months ago - Stars: 3,036 - Forks: 1,257

PoSungKim/development_study
Study materials on open-source-based development
Language: Java - Size: 5.52 MB - Last synced at: about 18 hours ago - Pushed at: about 19 hours ago - Stars: 3 - Forks: 0

apache/doris-website
Apache Doris Website
Language: TypeScript - Size: 451 MB - Last synced at: about 20 hours ago - Pushed at: about 21 hours ago - Stars: 102 - Forks: 326

GridProtectionAlliance/openPDC
Open Source Phasor Data Concentrator
Language: C# - Size: 2.48 GB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 138 - Forks: 57

apache/hadoop
Apache Hadoop
Language: Java - Size: 567 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 15,127 - Forks: 9,059

apache/calcite
Apache Calcite
Language: Java - Size: 102 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 4,867 - Forks: 2,427

EX539/docker-dev-env
A collection of ready-to-use Docker development environments for multiple Linux distributions (Ubuntu, Debian, Alpine, Arch, Kali). Includes shared configurations, utility scripts, and comprehensive documentation for reproducible development setups across teams and CI/CD pipelines.
Language: Shell - Size: 14.6 KB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 1 - Forks: 0

hoangsonww/End-to-End-Data-Pipeline
📈 A scalable, production-ready data pipeline for real-time streaming & batch processing, integrating Kafka, Spark, Airflow, AWS, Kubernetes, and MLflow. Supports end-to-end data ingestion, transformation, storage, monitoring, and AI/ML serving with CI/CD automation using Terraform & GitHub Actions.
Language: Python - Size: 31.1 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 41 - Forks: 24

apache/incubator-wayang
Apache Wayang (incubating) is the first cross-platform data processing system.
Language: Java - Size: 19.1 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 222 - Forks: 96

amrajib/velox
Velox is a work-in-progress, experimental Rust-based operating system written for fun and learning
Language: Rust - Size: 232 KB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 0 - Forks: 0

apache/ignite
Apache Ignite
Language: Java - Size: 444 MB - Last synced at: 1 day ago - Pushed at: 2 days ago - Stars: 4,944 - Forks: 1,923

crystalloide/Big-Data-Cluster Fork of mrugankray/Big-Data-Cluster
The goal of the project is to create a Docker cluster providing access to Hadoop, HDFS, Hive, PySpark, Sqoop, Airflow, Kafka, Flume, Postgres, Cassandra, Hue, Zeppelin, Kadmin, Kafka Control Center, and pgAdmin. This cluster is intended exclusively for development use. Do not use it in production.
Language: Shell - Size: 467 KB - Last synced at: 1 day ago - Pushed at: 2 days ago - Stars: 1 - Forks: 0

isxcode/spark-yun
Big data computing platform based on Spark <至轻云: ultra-lightweight big data computing platform / data center / master data>
Language: Java - Size: 5.97 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 207 - Forks: 55

hortonworks/cloudbreak
CDP Public Cloud is an integrated analytics and data management platform deployed on cloud services. It offers broad data analytics and artificial intelligence functionality along with secure user access and data governance features.
Language: Java - Size: 221 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 358 - Forks: 238

Ismail-Hossain-1/Distributed-System---MapReduce
Distributed System example code for .txt data with Hadoop
Language: Java - Size: 53.7 KB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0

ego-creator/hepmassClassification
PySpark pipeline for particle classification in high-energy physics (HEPMASS dataset). Includes distributed preprocessing, model training (logistic regression, decision trees), evaluation, and key visualizations. Optimized for Hadoop/Spark.
Size: 1.95 KB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0

h2oai/h2o-3
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
Language: Jupyter Notebook - Size: 597 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 7,204 - Forks: 2,024

wgzhao/Addax
A fast and versatile ETL tool that can transfer data between RDBMS and NoSQL seamlessly
Language: Java - Size: 45.2 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 1,289 - Forks: 313

joseanmos/Easy_Park
Easy_Park offers a smart solution for parking, ensuring safety and efficiency with real-time distance measurements. 🚗🔧 This project integrates an ultrasonic sensor with a microcontroller to provide clear visual and auditory alerts for drivers.
Size: 5.5 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 1 - Forks: 0

hoangsonww/Moodify-Emotion-Music-App
🎹 Moodify - an emotion-based music recommendation system that uses AI/ML models to analyze text, speech, and facial expressions, providing personalized music recommendations across web and mobile platforms.
Language: JavaScript - Size: 236 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 56 - Forks: 23

infernoYam1n/BlazeData
A Curated Data Science & Deep Learning Repository - BlazeData is a high-impact, modular repository designed to accelerate my journey in Data Science, Machine Learning, and Deep Learning.
Language: Jupyter Notebook - Size: 119 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0

velpurinagasai99/Spark
Projects from TrendyTech
Language: Scala - Size: 1.36 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0

NitchayaninT/BigDataProcessing
For a big data processing course. Lectures cover the use of Hadoop, MongoDB, Spark, etc.
Language: Jupyter Notebook - Size: 2.62 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0

demonfire2134/calc
Calc is a simple calculator application that performs basic arithmetic operations. It features a user-friendly interface, allowing users to quickly add, subtract, multiply, and divide numbers.
Size: 1.95 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0

marcelmay/hadoop-hdfs-fsimage-exporter
Exports Hadoop HDFS content statistics to Prometheus
Language: Java - Size: 552 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 155 - Forks: 47

apache/hive
Apache Hive
Language: Java - Size: 704 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 5,727 - Forks: 4,750

isxcode/flink-yun
Streaming data analysis platform based on Flink <至流云: ultra-lightweight real-time stream analysis platform / data synchronization / real-time synchronization>
Language: Java - Size: 2.8 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 21 - Forks: 17

xiongshengxiao/CloudEon
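CloudEon uses Kubernetes to install and deploy open-source big data components, making it possible to run an open-source big data platform in containers. This lets you spend less attention on managing and maintaining the underlying resources.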
Language: FreeMarker - Size: 80.4 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 1 - Forks: 0

groda/big_data
Big Data essentials: Hadoop, MapReduce, Spark. Explore tutorials and demos in Jupyter notebooks—most are self-contained and live, ready to run with a click.
Language: Jupyter Notebook - Size: 51.9 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 78 - Forks: 27

prestodb/presto
The official home of the Presto distributed SQL query engine for big data
Language: Java - Size: 231 MB - Last synced at: 3 days ago - Pushed at: 4 days ago - Stars: 16,383 - Forks: 5,463

openaire/iis
Information Inference Service of the OpenAIRE system
Language: Java - Size: 71.8 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 21 - Forks: 11

trinodb/trino
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Language: Java - Size: 262 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 11,448 - Forks: 3,233

agile-lab-dev/wasp
WASP is a framework to build complex real-time big data applications. It relies on a kind of Kappa/Lambda architecture, mainly leveraging Kafka and Spark. If you need to ingest huge amounts of heterogeneous data and analyze it through complex pipelines, this is the framework for you.
Language: Scala - Size: 7.7 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 31 - Forks: 12

miguno/avro-hadoop-starter 📦
Example MapReduce jobs in Java, Hive, Pig, and Hadoop Streaming that work on Avro data.
Language: Java - Size: 650 KB - Last synced at: 2 days ago - Pushed at: over 9 years ago - Stars: 115 - Forks: 83

terascope/teraslice
Scalable data processing pipelines in JavaScript
Language: TypeScript - Size: 110 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 59 - Forks: 14

deeplearning4j/deeplearning4j
Suite of tools for deploying and training deep learning models using the JVM. Highlights include model import for keras, tensorflow, and onnx/pytorch, a modular and tiny c++ library for running math code and a java based math library on top of the core c++ library. Also includes samediff: a pytorch/tensorflow like library for running deep learn...
Language: Java - Size: 728 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 14,018 - Forks: 3,851

linkedin/venice
Venice, Derived Data Platform for Planet-Scale Workloads.
Language: Java - Size: 55 MB - Last synced at: 5 days ago - Pushed at: 6 days ago - Stars: 542 - Forks: 98

asier-ortiz/apuntes-ia-ds
Materials and notes from the AI and Data Science course
Language: Jupyter Notebook - Size: 228 MB - Last synced at: 3 days ago - Pushed at: 6 days ago - Stars: 0 - Forks: 0

asier-ortiz/ai-course-notebooks
AI course Notebooks and Exercises
Language: Jupyter Notebook - Size: 455 MB - Last synced at: 3 days ago - Pushed at: 6 days ago - Stars: 1 - Forks: 0

Fradhyle/Voo-ong
Final team project of the 10th Big Data cohort at 더조은컴퓨터아카데미
Language: HiveQL - Size: 57.7 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 0 - Forks: 0

ahmedmohamedfoua/Big-Data-Project---Predicting-Trip-Fares-with-Spark-Hive
This repository provides a complete workflow for predicting ride-sharing trip fares in New York City using Spark and Hive. Explore the data, models, and results while leveraging the power of big data! 🐙🚀
Language: Java - Size: 906 KB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 0 - Forks: 0

HariSekhon/Knowledge-Base
Large Tech Knowledge Base from 20 years in DevOps, Linux, Cloud, Big Data, AWS, GCP etc - gradually porting my large private knowledge base to public
Language: Shell - Size: 183 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 180 - Forks: 34

nielsbasjes/logparser
Easy parsing of Apache HTTPD and NGINX access logs with Java, Hadoop, Hive, Flink, Beam, Storm, Drill, ...
Language: Java - Size: 2.8 MB - Last synced at: 3 days ago - Pushed at: 7 days ago - Stars: 159 - Forks: 42

apache/kyuubi
Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
Language: Scala - Size: 60 MB - Last synced at: 6 days ago - Pushed at: 9 days ago - Stars: 2,203 - Forks: 943

umbertocicciaa/air-flights-big-data-unical
Spark + Hadoop + Hive + Streamlit application for analyzing an air flight dataset, built for a university big data exam
Language: Python - Size: 320 KB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 0 - Forks: 0

apache/ozone
Scalable, reliable, distributed storage system optimized for data analytics and object store workloads.
Language: Java - Size: 103 MB - Last synced at: 6 days ago - Pushed at: 7 days ago - Stars: 934 - Forks: 542

zeekling/hadoop_book
Hadoop study notes.
Size: 216 KB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 2 - Forks: 0

ShapeLayer/election-result-8values-spark
Evaluate South Korea's political situation using the result of the 21st National Assembly election and 8values political quiz.
Language: Typst - Size: 1.83 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 0 - Forks: 0

mxagar/spark_big_data_guide
This repository contains my personal guide on Spark and topics related to Big Data.
Language: Jupyter Notebook - Size: 6.44 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 1 - Forks: 1

MadhukarSaiBabu/Aviation-Trend-Analysis-using-MapReduce-and-R
Developed a data-driven solution leveraging Hadoop MapReduce, Hive, and R to analyze air travel data. Identified trends in passenger volume, route utilization, and peak travel periods, providing actionable insights for optimizing airline operations and improving the passenger experience.
Size: 1.34 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 1 - Forks: 0

smart-data-lake/smart-data-lake
Smart Automation Tool for building modern Data Lakes and Data Pipelines
Language: Scala - Size: 44.1 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 124 - Forks: 24

Diakkoo/docker-hadoop-container
Flask and HDFS containers written with Dockerfiles, with docker-compose for faster deployment of an HDFS cluster
Language: Python - Size: 22.5 KB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 0 - Forks: 0

lynnlangit/learning-hadoop-and-spark
Companion to Learning Hadoop and Learning Spark courses on LinkedIn Learning
Language: HTML - Size: 13.6 MB - Last synced at: 3 days ago - Pushed at: 6 months ago - Stars: 195 - Forks: 165

PriyankaJhaTheDeveloper/DataAnalystPortfolioProjects
This repository contains my Data Analytics portfolio projects ranging from SQL, Python, Tableau, Excel, and Hadoop (HiveQL).
Language: Jupyter Notebook - Size: 26.1 MB - Last synced at: 9 days ago - Pushed at: over 3 years ago - Stars: 498 - Forks: 88

timebusker/timebusker.github.io
timebusker.github.io
Language: HTML - Size: 235 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 1 - Forks: 0

Jim-by/Personalized-Recommendation-System
End-to-end personalised recommender system for e-commerce: synthetic data, PySpark, Delta Lake, model training, evaluation, monitoring, A/B test.
Language: Python - Size: 86.9 KB - Last synced at: 3 days ago - Pushed at: 10 days ago - Stars: 0 - Forks: 0

TuLe142857/PersonalizedPageRankHadoop
Implementation of Personalized PageRank using Hadoop
Language: Java - Size: 19.7 MB - Last synced at: about 13 hours ago - Pushed at: 10 days ago - Stars: 0 - Forks: 0

pavithra19/apache_spark_people_data_processor
This project is a data processing application built with Apache Spark and Scala. It is designed to efficiently process, analyze, and transform large datasets of people data. It leverages Spark’s distributed computing capabilities to handle scalable data ingestion, cleaning, and reporting. Shell scripts are included for Hadoop deployment.
Language: Scala - Size: 1.82 MB - Last synced at: 3 days ago - Pushed at: 10 days ago - Stars: 1 - Forks: 0

HariSekhon/DevOps-Python-tools
80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Functions, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.
Language: Python - Size: 3.11 MB - Last synced at: 9 days ago - Pushed at: about 2 months ago - Stars: 798 - Forks: 348

keanteng/wqd7007-project
Uncovering Patterns in Urban Transportation: A Big Data Analysis of NYC Taxi Trips
Language: Jupyter Notebook - Size: 8.47 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 1 - Forks: 0

Qihoo360/hbox
AI on Hadoop
Language: Java - Size: 126 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 1,731 - Forks: 386

hexnn/Stark
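An easy-to-use, ultra-high-performance big data governance engine built on Spark+SparkMLlib+Debezium+Deequ. Suited to unified batch and stream data integration and data analysis; supports CDC real-time data capture, machine learning algorithm models, data quality validation, data labeling, sensitive data identification, data modeling, algorithm modeling, and OLAP data analysis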
Language: Scala - Size: 136 KB - Last synced at: 9 days ago - Pushed at: 12 days ago - Stars: 29 - Forks: 1

BsoBird/filesystem-catalog-original
A prototype for implementing datalake catalog management based on arbitrary file systems
Language: Java - Size: 161 KB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 1 - Forks: 0

apache/tez
Apache Tez
Language: Java - Size: 29.3 MB - Last synced at: 6 days ago - Pushed at: 18 days ago - Stars: 496 - Forks: 433

iterative/dvc-hdfs
HDFS/WebHDFS plugin for dvc
Language: Python - Size: 102 KB - Last synced at: 8 days ago - Pushed at: 12 days ago - Stars: 2 - Forks: 1

apache/doris-thirdparty
Self-managed thirdparty dependencies for Apache Doris
Size: 515 MB - Last synced at: about 9 hours ago - Pushed at: 4 days ago - Stars: 41 - Forks: 44

Tencent/APIJSON
🏆 Real-time, coding-free, powerful and secure ORM 🚀 providing APIs and docs without backend coding, while the data and structure of the returned JSON can be customized by frontend (client) users
Language: Java - Size: 69.8 MB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 17,973 - Forks: 2,230

TmohamedashrafT/High-Availability-Bigdata-Cluster
A highly available, fully distributed big data cluster built with Docker, integrating Hadoop HDFS, YARN, ZooKeeper, HBase, Hive, Spark, and Tez. Designed for scalability, fault tolerance, and seamless data processing in a containerized environment.
Language: Shell - Size: 16.6 KB - Last synced at: 3 days ago - Pushed at: 14 days ago - Stars: 0 - Forks: 0

criteo/tf-yarn
Train TensorFlow models on YARN in just a few lines of code!
Language: Python - Size: 775 KB - Last synced at: 9 days ago - Pushed at: over 1 year ago - Stars: 89 - Forks: 29

PasanAbeysekara/Taxi-Pickup-Hotspot-Analysis-using-Hadoop-MapReduce
This project analyzes one month of NYC Yellow Taxi trip data (January 2016) to identify the busiest taxi pickup locations. It utilizes the Hadoop MapReduce framework to process the data and a lookup table to map location IDs to human-readable zone names.
Language: Java - Size: 5.65 MB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 0 - Forks: 0

marcelmay/hfsa
Hadoop FSImage Analyzer (HFSA)
Language: Java - Size: 3.43 MB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 59 - Forks: 24

snowlift/trino-storage
Storage connector for Trino
Language: Java - Size: 2.69 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 111 - Forks: 36

BryanTheLai/Data-Engineering-Assignment-RDSY2S2
This repository contains a data engineering project aimed at processing and analyzing scraped data using PySpark, Redis, and Neo4j. The goal is to efficiently store, process, and analyze text data.
Language: Jupyter Notebook - Size: 1.8 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 0 - Forks: 0

PramithaMJ/job-analysis-MapReduce
Technical Skills Analysis using MapReduce - hadoop
Language: Shell - Size: 883 KB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 0 - Forks: 0

stackabletech/kafka-operator
Stackable Operator for Apache Kafka
Language: Rust - Size: 2.62 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 24 - Forks: 7

ddediu/AdhereR
Computation of adherence to medications from Electronic Healthcare Data in R
Language: HTML - Size: 191 MB - Last synced at: 9 days ago - Pushed at: over 1 year ago - Stars: 31 - Forks: 6

MoRan1607/BigDataGuide
Big data learning: learn big data from scratch, with study videos and interview materials for every stage of big data learning
Size: 154 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 2,958 - Forks: 903

Malisha4065/HadoopConfiguration
Apache Hadoop Cluster configuration with original apache/hadoop:3.4.1 docker image (with YARN)
Language: Shell - Size: 6.84 KB - Last synced at: 17 days ago - Pushed at: 17 days ago - Stars: 3 - Forks: 0

apache/calcite-avatica
Apache Calcite Avatica
Language: Java - Size: 32 MB - Last synced at: about 8 hours ago - Pushed at: 3 months ago - Stars: 260 - Forks: 233

confluentinc/kafka-connect-hdfs
Kafka Connect HDFS connector
Language: Java - Size: 2.72 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 22 - Forks: 395

apache/drill
Apache Drill is a distributed MPP query layer for self describing data
Language: Java - Size: 68 MB - Last synced at: 6 days ago - Pushed at: 24 days ago - Stars: 1,975 - Forks: 982

itsSwapnil/Data-Interpolation-with-Radial-Basis-Function
A PySpark-based solution for cleaning and interpolating battery sensor data using forward/backward fill and Radial Basis Function (RBF) spatial interpolation. Outputs a clean, fully interpolated dataset in CSV format for advanced analysis.
Language: Python - Size: 13.7 KB - Last synced at: 18 days ago - Pushed at: 18 days ago - Stars: 0 - Forks: 0

jathavaan/bds-seoul-hadoop
Language: Python - Size: 81.1 KB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 0 - Forks: 0

mjakubowski84/parquet4s
Read and write Parquet in Scala. Use Scala classes as schema. No need to start a cluster.
Language: Scala - Size: 2.33 MB - Last synced at: 14 days ago - Pushed at: 28 days ago - Stars: 291 - Forks: 67

PramithaMJ/hadoop-cluster-manager
Complete Apache Hadoop 3.4.1 cluster installation and management toolkit with automated scripts, comprehensive documentation, and production-ready configuration templates for single-node and multi-node deployments.
Language: Shell - Size: 33.2 KB - Last synced at: 19 days ago - Pushed at: 19 days ago - Stars: 0 - Forks: 0

itsSwapnil/pyspark-incremental-airflow
This repository contains an Airflow DAG that orchestrates an incremental data pipeline using PySpark scripts. The pipeline automates daily data processing, syncs results to S3, performs housekeeping, and loops until a target date threshold is reached.
Language: Python - Size: 13.7 KB - Last synced at: 19 days ago - Pushed at: 20 days ago - Stars: 0 - Forks: 0

rociobenitez/bigdata-fundamentals-notes
Practical collection of notes on Java, Scala, Hadoop, Spark, Hive, and more.
Size: 1.1 MB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 0 - Forks: 0

cloudposse/terraform-aws-emr-cluster
Terraform module to provision an Elastic MapReduce (EMR) cluster on AWS
Language: HCL - Size: 4.06 MB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 74 - Forks: 82

apache/hadoop-hdfs 📦
Mirror of Apache Hadoop HDFS
Language: Java - Size: 34.5 MB - Last synced at: 6 days ago - Pushed at: over 6 years ago - Stars: 199 - Forks: 115

DivitMittal/Datathon-BigData
Efficient Data Processing ETL Pipeline for Event Records
Language: Jupyter Notebook - Size: 4.1 MB - Last synced at: 18 days ago - Pushed at: 22 days ago - Stars: 0 - Forks: 0

TrivadisPF/platys-modern-data-platform
Support for generating modern platforms dynamically with services such as Kafka, Spark, Streamsets, HDFS, ....
Language: Python - Size: 533 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 75 - Forks: 16

baifendian/swordfish
Open-source distributed workflow scheduling tool that also supports streaming tasks.
Language: Java - Size: 3.84 MB - Last synced at: 19 days ago - Pushed at: over 7 years ago - Stars: 39 - Forks: 18

VladimirZelenokor1/Big-Data-Project---Predicting-Trip-Fares-with-Spark-Hive
A CRISP-DM–based big data pipeline for predicting NYC ride-sharing trip fares: ingesting 2024 TLC data via Sqoop into HDFS/Hive, performing ETL and feature engineering with Spark & PySpark, training and tuning Linear Regression & Gradient Boosted Tree models, and outlining end-to-end deployment.
Language: Java - Size: 906 KB - Last synced at: 15 days ago - Pushed at: 23 days ago - Stars: 1 - Forks: 0

benedekh/bigdata-projects
Student projects in the Big Data field.
Language: Java - Size: 198 KB - Last synced at: 24 days ago - Pushed at: 24 days ago - Stars: 19 - Forks: 12

HariSekhon/DevOps-Bash-tools
1000+ DevOps Bash Scripts - AWS, GCP, Kubernetes, Docker, CI/CD, APIs, SQL, PostgreSQL, MySQL, Hive, Impala, Kafka, Hadoop, Jenkins, GitHub, GitLab, BitBucket, Azure DevOps, TeamCity, Spotify, MP3, LDAP, Code/Build Linting, pkg mgmt for Linux, Mac, Python, Perl, Ruby, NodeJS, Golang, Advanced dotfiles: .bashrc, .vimrc, .gitconfig, .screenrc, tmux..
Language: Shell - Size: 11.2 MB - Last synced at: 24 days ago - Pushed at: 24 days ago - Stars: 6,769 - Forks: 1,265

awesome-kyuubi/hadoop-testing
Testing Sandbox for Hadoop Ecosystem Components
Language: Jinja - Size: 2.12 MB - Last synced at: 4 days ago - Pushed at: 3 months ago - Stars: 36 - Forks: 16

apache/calcite-site
Apache Calcite Website
Language: HTML - Size: 103 MB - Last synced at: 6 days ago - Pushed at: 25 days ago - Stars: 5 - Forks: 9
