GitHub topics: apache-hadoop-framework

Repositories

divithraju/divith-aju-Hadoop-Pyspark-pipeline

This project demonstrates the creation of a scalable data processing pipeline for handling and analyzing log data from a hypothetical e-commerce platform. Leveraging Hadoop and PySpark, the pipeline is designed to process large volumes of log files, providing meaningful insights into user behavior, system performance, and sales metrics.

Language: Python - Size: 4.88 KB - Last synced at: about 2 months ago - Pushed at: 11 months ago - Stars: 2 - Forks: 0

DuarteDomingues/Large-Scale-Data-Computation-Word-Count-project

Large-scale data computing word count project

Language: Java - Size: 474 KB - Last synced at: over 2 years ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

bayudwiyansatria/library-java-apache-hadoop

Apache Hadoop. Apache Hadoop is a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Originally designed for computer clusters built from commodity hardware—still the common use—it has also found use on clusters of higher-end hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.

Language: Java - Size: 352 KB - Last synced at: 1 day ago - Pushed at: almost 4 years ago - Stars: 0 - Forks: 1