GitHub topics: big-data-processing

Repositories

akardapolov/dimension-db

Hybrid time-series and block-column storage database engine written in Java

Language: Java - Size: 1.15 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 0 - Forks: 0

santaclousdon/energy-data-analysis

A cloud model that analyzes a real-world dataset with BigQuery and other big-data analysis tools. I implemented a Docker image to run the app in cross-platform environments.

Language: Jupyter Notebook - Size: 20.5 MB - Last synced at: 2 days ago - Pushed at: 11 days ago - Stars: 0 - Forks: 0

hope-data-science/R4BD

R for Big Data (Chinese Version)

Language: R - Size: 37 MB - Last synced at: 2 days ago - Pushed at: 21 days ago - Stars: 8 - Forks: 1

LatiefDataVisionary/big-data-for-data-science-college-task

Language: Mermaid - Size: 3.73 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

rociobenitez/happiness-index-data-processing

Repository for Big Data Processing - Contains Jupyter Notebooks and Datasets for data analysis and processing tasks related to Big Data.

Language: Jupyter Notebook - Size: 8.22 MB - Last synced at: 21 days ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

JKA098/Pokemon-Feistiness-Apache-Spark-Job

This README assumes that, before running the Spark analytics job, you have already installed the correct versions of **Java**, **Hadoop**, and **Spark**, and that you are working in **Ubuntu**.

Language: Python - Size: 184 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

AlanShaiju/BDP_Lab_S6

Contains stuff for BDP Lab Semester 6 - KTU (2019 Scheme) - Computer Science(Data Science)

Language: R - Size: 893 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

bilgeswe/BigDataManagement

Building a Data Pipeline with Lakehouse Architecture on Microsoft Azure Platform

Language: TSQL - Size: 2.02 MB - Last synced at: 3 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

This repository contains an Apache Flink application for real-time sales analytics built using Docker Compose to orchestrate the necessary infrastructure components, including Apache Flink, Elasticsearch, and Postgres

Language: Java - Size: 302 KB - Last synced at: 4 months ago - Pushed at: over 1 year ago - Stars: 43 - Forks: 26

chandnii7/Big-Data-Processing-Pipeline

A pipeline that consumes Twitter data to extract meaningful insights about a variety of topics using the following technologies: Twitter API, Kafka, MongoDB, and Tableau.

Language: Python - Size: 3.21 MB - Last synced at: about 1 month ago - Pushed at: almost 4 years ago - Stars: 3 - Forks: 4

impresso/impresso-text-acquisition

🛠️ Python library to import OCR data in various formats into the canonical JSON format defined by the Impresso project.

Language: Jupyter Notebook - Size: 1.43 GB - Last synced at: 18 days ago - Pushed at: 19 days ago - Stars: 8 - Forks: 2

souvik-databricks/dlt-with-debug

A lightweight helper utility that lets developers do interactive pipeline development with a single source code for both DLT runs and non-DLT interactive notebook runs.

Language: Python - Size: 88.9 KB - Last synced at: 3 months ago - Pushed at: over 2 years ago - Stars: 49 - Forks: 8

Mileristovski/DataEngineer-SparkStreaming

Track a Boat is a real-time maritime tracking system built with Kafka, Spark Structured Streaming, and WebSockets. It lets you visualize vessel positions, analyze their trajectories, and predict their destinations on an interactive map.

Language: Jupyter Notebook - Size: 9.59 MB - Last synced at: 4 months ago - Pushed at: 5 months ago - Stars: 1 - Forks: 0

tashi-2004/Apache-Hadoop-Spark-Hive-CyberAnalytics

This project utilizes Apache Hadoop, Hive, and PySpark to process and analyze the UNSW-NB15 dataset, enabling advanced query analysis, machine learning modeling, and visualization. The project demonstrates efficient data ingestion, processing, and predictive analytics for network security insights.

Language: Jupyter Notebook - Size: 2.62 MB - Last synced at: 4 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0

OuchenOussama/hespressence

Kappa Architecture Based Sentiment Analysis System for User Comments

Language: Python - Size: 10.8 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 1

mohamedsaleh1984/twitter-spark

Fetch data from Twitter and push it through Kafka to Spark then HDFS

Language: Java - Size: 7.82 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0

tabletop-labs/tabletop

A curated selection of tools, libraries, and services that help tame your dataflow to productively build ambitious, data-driven, and reactive applications on a streaming lakehouse

Language: Go - Size: 290 KB - Last synced at: 23 days ago - Pushed at: about 2 years ago - Stars: 5 - Forks: 0

sourceduty/Big_Data

📈 Massive volumes of structured and unstructured data generated from various sources.

Size: 21.5 KB - Last synced at: 8 months ago - Pushed at: 8 months ago - Stars: 0 - Forks: 0

LokmanAa/Big-Data-Processing-Cloud

Through this project, I learned the fundamental skills necessary for handling big data, including cloud architecture design, distributed data processing, and the integration of cloud services for data storage and computation.

Language: Jupyter Notebook - Size: 2.48 MB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

Rifat392000/BigDataAnalytics

Language: Jupyter Notebook - Size: 18.4 MB - Last synced at: about 1 month ago - Pushed at: 11 months ago - Stars: 0 - Forks: 0

Adi3042/Data_Science

📊🚀 Explore the Data Science Universe! Unlock insights and master data skills with hands-on assignments spanning machine learning, visualization, and more. Your journey to becoming a data expert starts here! 🎯💡 DataScienceJourney

Language: Jupyter Notebook - Size: 4.63 MB - Last synced at: 5 months ago - Pushed at: 11 months ago - Stars: 0 - Forks: 0

Turnipdo/Docker-Spark-Setup

Setting up a Spark cluster in a Docker environment for improved repeatability and reliability. This project includes a simple transformation on a dataset containing approximately 31 million rows.

Language: Python - Size: 7.81 KB - Last synced at: 3 days ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

SayamAlt/Steel-Energy-Consumption-Prediction-using-PySpark

Built a machine learning model with PySpark that predicts the energy consumption of the steel industry, reaching an R² score of approximately 99.5%.

Language: Jupyter Notebook - Size: 823 KB - Last synced at: 5 months ago - Pushed at: 12 months ago - Stars: 0 - Forks: 0

khanovico/energy-data-analysis

A cloud model that analyzes a real-world dataset with BigQuery and other big-data analysis tools. I implemented a Docker image to run the app in cross-platform environments.

Language: Jupyter Notebook - Size: 20.5 MB - Last synced at: 4 months ago - Pushed at: 12 months ago - Stars: 0 - Forks: 0

Neri-kun/Licenta

Degree diploma project

Language: Jupyter Notebook - Size: 2.45 GB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

whykay-01/movie-recommender-system

Final project for the big data class at NYU, in which I developed a movie recommendation system using the MovieLens database and compared its performance against popularity-based models and other vanilla metrics.

Language: Jupyter Notebook - Size: 4.81 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

mtumilowicz/big-data-scala-spark-batch-workshop

Introduction to Spark Batch processing.

Language: Scala - Size: 385 KB - Last synced at: 3 months ago - Pushed at: about 1 year ago - Stars: 3 - Forks: 1

isandratskiy/awesome-bigdata-testing

A list of awesome big data testing frameworks, resources and other awesomeness.

Size: 1000 Bytes - Last synced at: about 1 year ago - Pushed at: over 3 years ago - Stars: 0 - Forks: 1

eskimo-sh/eskimo

Eskimo is a state-of-the-art Big Data Infrastructure and Management Web Console to build, manage, and operate Big Data 2.0 Analytics clusters on Kubernetes. This is the Git repository of Eskimo Community Edition.

Language: Java - Size: 39.9 MB - Last synced at: about 1 year ago - Pushed at: almost 2 years ago - Stars: 25 - Forks: 7

jamestiotio/dbsys

SUTD 2021 50.043 Database and Big Data Systems Code Dump

Language: Java - Size: 69.7 MB - Last synced at: about 1 year ago - Pushed at: about 3 years ago - Stars: 1 - Forks: 3

john-fotis/Movie-Recommender

A movie recommender written in Go that suggests movies considering various factors within a particular dataset, encompassing users, movies, and movie ratings.

Language: Go - Size: 1.45 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

Lefteris-Souflas/Redis-MongoDB-Assignment

Analyzing classified ads data from the used motorcycles market. Tasks involve utilizing Redis Bitmaps for analytics on seller actions and MongoDB for analyzing bike listings. Includes data installation, cleaning, and analysis.

Language: R - Size: 797 KB - Last synced at: 5 months ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

drshahizan/BDM

Course covers big data fundamentals, processes, technologies, platform ecosystem, and management for practical application development.

Language: Jupyter Notebook - Size: 102 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 48 - Forks: 46

Srking501/csc8101_coursework

A summative coursework for CSC8101 Engineering for AI

Language: Jupyter Notebook - Size: 168 KB - Last synced at: 5 months ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

giucris/yasp

Yet Another SPark Framework

Language: Scala - Size: 228 KB - Last synced at: about 1 year ago - Pushed at: over 2 years ago - Stars: 10 - Forks: 1

pyajs/veronica

Big data processing and machine learning platform, just like using SQL

Language: Python - Size: 6.57 MB - Last synced at: 6 months ago - Pushed at: 10 months ago - Stars: 9 - Forks: 0

franck-mahieu/datasets-toolbox

datasets-toolbox is a set of scripts for generating, transforming, and validating dataset files that are too large to open in an editor. It also provides a ping script.

Language: JavaScript - Size: 24.4 KB - Last synced at: over 1 year ago - Pushed at: almost 4 years ago - Stars: 0 - Forks: 0

adnanrahin/NFL-Big-Data-Bowl-2022

The 2022 Big Data Bowl data contains Next Gen Stats player tracking, play, game, player, and PFF scouting data for all 2018-2020 Special Teams plays. Here, you'll find a summary of each data set in the 2022 Data Bowl, a list of key variables to join on, and a description of each variable.

Language: Scala - Size: 1.02 MB - Last synced at: about 1 month ago - Pushed at: over 1 year ago - Stars: 1 - Forks: 0

IncredibleProgress/sweetheart.py 📦

rock-solid pillars for enterprise-grade solutions

Language: Python - Size: 3.46 MB - Last synced at: 2 months ago - Pushed at: over 1 year ago - Stars: 2 - Forks: 0

software-competence-center-hagenberg/AVUBDI

GitHub repository for a versatile usable Big Data infrastructure (AVUBDI)

Language: Shell - Size: 364 KB - Last synced at: about 1 year ago - Pushed at: over 4 years ago - Stars: 3 - Forks: 1

JamesHanZhang/table-data-format-transform-app

Batch or individual conversion between Excel, Markdown, CSV, and SQL data source formats

Language: Python - Size: 321 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 1 - Forks: 0

VincianeDesbois/Hopitaux_Production

Study of French hospital production. (2021)

Language: Jupyter Notebook - Size: 22.3 MB - Last synced at: 2 months ago - Pushed at: almost 2 years ago - Stars: 0 - Forks: 1

vishu-tyagi/BigQuery-ELT

BigQuery data pipeline with dbt, Spark, Docker, Airflow, Terraform, GCP

Language: Python - Size: 1.19 MB - Last synced at: almost 2 years ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

alessiococchieri/BDA-project-sparkify

This Git repo showcases my analysis of the Sparkify dataset with PySpark on Apache Spark in cluster mode and JupyterLab on Docker. The goal was to identify at-risk customers and develop retention strategies. The analysis tested multiple machine learning models and uncovered insights into customer behavior and churn patterns.

Language: Jupyter Notebook - Size: 4.11 MB - Last synced at: 8 months ago - Pushed at: over 2 years ago - Stars: 1 - Forks: 1

pranjalihande/Ethereum-Analysis

Analysis of Ethereum Transactions and Smart Contracts

Size: 0 Bytes - Last synced at: about 2 years ago - Pushed at: about 2 years ago - Stars: 0 - Forks: 0

zaid-24/Crack-Detection-using-CNN

Crack detection model using YOLOv7

Language: Jupyter Notebook - Size: 396 KB - Last synced at: about 2 years ago - Pushed at: about 2 years ago - Stars: 0 - Forks: 0

anjijava16/GCP_Data_Enginner_Utils

GCP_Data_Enginner

Language: Shell - Size: 1.21 MB - Last synced at: almost 2 years ago - Pushed at: almost 4 years ago - Stars: 7 - Forks: 0

AbhimanyuW/BigData-EthereumAnalysis

Coursework on Ethereum analysis using PySpark, part of the curriculum at Queen Mary University of London.

Language: Jupyter Notebook - Size: 493 KB - Last synced at: over 2 years ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

vvittis/FlinkSampling

Reservoir sampling for group-by queries on the Flink platform. Effectively answers single-aggregate queries.

Language: Java - Size: 69.3 MB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 4 - Forks: 1

RghdE/CapstoneTwo_EducationalLandscape

Big Data and AI Engineering bootcamp 2nd capstone project. Using Big Data tools to predict the probability of university enrollment for Egypt's high school students. 🏫 📚 🔬

Language: Jupyter Notebook - Size: 20.7 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

DuarteDomingues/Large-Scale-Data-Computation-Word-Count-project

Large-scale data computing word count project

Language: Java - Size: 474 KB - Last synced at: over 2 years ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

kochlisGit/Big-Data-Algorithms

Implementation of algorithms for big data using Python, NumPy, and pandas.

Language: Python - Size: 28.8 MB - Last synced at: over 2 years ago - Pushed at: over 5 years ago - Stars: 2 - Forks: 0

bdnf/BigData-Engineering-Projects

Data modeling with Cassandra, building Data Warehouse using Redshift and creation of Data Lake using Spark and Airflow

Language: Jupyter Notebook - Size: 2.68 MB - Last synced at: over 2 years ago - Pushed at: over 5 years ago - Stars: 2 - Forks: 0

ScratchyCode/Computer-Vision

Software based on artificial intelligence methods for automating big data analysis.

Language: Python - Size: 103 KB - Last synced at: about 2 years ago - Pushed at: about 3 years ago - Stars: 0 - Forks: 0

theGuyWithBlackTie/electricChargingStations

Language: Jupyter Notebook - Size: 8.68 MB - Last synced at: over 1 year ago - Pushed at: over 3 years ago - Stars: 4 - Forks: 0

ChristopherLiew/chris-liew-technical-blog

Tech blog / notes from my various endeavours and exploits

Size: 3.13 MB - Last synced at: over 2 years ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

jpmorgen/BigMultiPipe

Provides tools for parallel pipeline processing of large data structures

Language: Python - Size: 4.94 MB - Last synced at: over 2 years ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

Faisal-AlDhuwayhi/Immigration-Data-Engineering

A Capstone Project that covers several aspects of Data Engineering (Data Exploration, Cleaning, Modeling, Pipelining, Processing)

Language: Jupyter Notebook - Size: 2.19 MB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

Faisal-AlDhuwayhi/Data-Lake

Building Data Lake and ETL pipelines using Amazon EMR, S3, and Apache Spark

Language: Python - Size: 2.93 KB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

VladOnMyOwn/ctr-poisson-bootstrap

Here I demonstrate the performance difference between the Poisson bootstrap and the classic bootstrap by estimating the confidence interval for the difference in CTRs between two user groups.

Language: Jupyter Notebook - Size: 6.66 MB - Last synced at: over 2 years ago - Pushed at: almost 3 years ago - Stars: 1 - Forks: 0

felipefrizzo/terraform-aws-kinesis-firehose

This code creates a Kinesis Firehose in AWS to send CloudWatch log data to S3.

Language: HCL - Size: 20.5 KB - Last synced at: over 2 years ago - Pushed at: almost 4 years ago - Stars: 25 - Forks: 22

leonardoGemin/BigDataComputing_UniPd

Collection of homework (mostly Spark-based) from the course "Big Data Computing" - University of Padua.

Language: Java - Size: 5.66 MB - Last synced at: about 2 months ago - Pushed at: almost 3 years ago - Stars: 0 - Forks: 0

louiecai/Sentiment-Analysis-API

Sentiment-Analysis-API

Language: Python - Size: 26.3 MB - Last synced at: over 2 years ago - Pushed at: about 3 years ago - Stars: 1 - Forks: 0

Anirban166/Big-Data-ft.-Genomics

Analysis, organization and querying of large genomic datasets using C++, Monsoon and various data structures.

Language: C++ - Size: 25.5 MB - Last synced at: 4 months ago - Pushed at: about 3 years ago - Stars: 2 - Forks: 0

StarPlatinumStudio/Flink-SQL-Practice

Flink SQL in Practice: a Chinese-language blog column

Language: Java - Size: 80.9 MB - Last synced at: over 2 years ago - Pushed at: about 3 years ago - Stars: 18 - Forks: 5

matthewdowns/the-biggest-data

Experiment to record as much data as possible in a given amount of time using a distributed timeseries database.

Language: Shell - Size: 12.7 KB - Last synced at: over 2 years ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

almersesunan/Portofolio

Welcome, feel free to navigate through my projects. Detailed information about each project can be found in its directory.

Language: Jupyter Notebook - Size: 32.8 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

superminority/jsv

A compact way to represent a stream of similar JSON objects

Language: Python - Size: 298 KB - Last synced at: 5 days ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

Enkhai/CURE-spark

CURE clustering algorithm implementation in Scala with Spark

Language: Scala - Size: 233 MB - Last synced at: over 2 years ago - Pushed at: about 4 years ago - Stars: 0 - Forks: 0

SCCH-KVS/AVUBDI

GitHub repository for a versatile usable Big Data infrastructure (AVUBDI)

Language: Shell - Size: 385 KB - Last synced at: over 2 years ago - Pushed at: over 4 years ago - Stars: 0 - Forks: 0

siddharths067/Easy-Airflow-Deployment

A Docker Compose Template to deploy Airflow with sync from a remote repository

Language: Shell - Size: 1000 Bytes - Last synced at: over 2 years ago - Pushed at: almost 5 years ago - Stars: 0 - Forks: 0

lucamoroz/BigDataComputing-UniPD

Collection of homework (mostly Spark-based) from the course "Big Data Computing" - University of Padua.

Language: Java - Size: 10 MB - Last synced at: 8 days ago - Pushed at: about 5 years ago - Stars: 1 - Forks: 0

levindoneto/pandas-simple-csv-parser

Simple CSV parser for huge volumes of data. It uses the pandas library for Python to extract specific columns from a CSV file and quickly write them to one or more files (each column in a separate file, or all of them in the same output).

Language: Python - Size: 3.11 MB - Last synced at: over 2 years ago - Pushed at: over 6 years ago - Stars: 1 - Forks: 0

Bennyhwanggggg/Basic-Scala-Computations

Using Scala for big data computations for basic tasks

Language: Scala - Size: 22.6 MB - Last synced at: over 2 years ago - Pushed at: over 6 years ago - Stars: 0 - Forks: 0

Bennyhwanggggg/Basic-Hadoop-MapReduce

Standard Hadoop MapReduce Tasks using Java

Language: Java - Size: 22.6 MB - Last synced at: over 2 years ago - Pushed at: over 6 years ago - Stars: 0 - Forks: 0

Related Keywords

big-data-processing (75), big-data (32), big-data-analytics (21), spark (19), python (15), pyspark (9), sql (9), machine-learning (8), kafka (7), data-visualization (6), data-science (6), scala (6), java (6), python3 (5), docker (5), apache-spark (5), data-engineering (5), etl-pipeline (4), etl (4), bigquery (4), batch-processing (3), clustering (3), hadoop-mapreduce (3), data-analytics (3), data-lake (3), data (3), hadoop (3), apache-flink (3), big-data-platform (3), seaborn (3), json (3), data-analysis (3), jupyter-notebook (3), sentiment-analysis (2), terraform (2), elt (2), r (2), zeppelin (2), docker-swarm (2), etl-framework (2), template-project (2), database (2), stream-processing (2), docker-compose (2), mlflow (2), scikit-learn (2), airflow (2), big-data-projects (2), cloud-computing (2), xgboost (2), exploratory-data-analysis (2), csv (2), pandas (2), distributed-computing (2), hdfs (2), ubuntu (2), spark-sql (2), deep-learning (2), data-processing (2), computer-vision (2), parquet-files (2), nlp (2), bigdata (2), pyspark-notebook (2), data-warehouse (2), ai (2), dataflow (1), dataproc (1), a-priori (1), spark-scala (1), rethinkdb (1), big-data-visualization (1), apache-hadoop-framework (1), apache-pig (1), topic (1), data-computing (1), streaming-tuples (1), streaming-data (1), stratum (1), sampling (1), reservoir-sampling (1), group-by (1), word-count (1), ethereum-analysis (1), shell-script (1), pubsub (1), notebook (1), gcp-storage (1), gcp (1), dataproc-cluster (1), py-script (1), nginx-unit (1), jupyter (1), rdd (1), transform-data (1), toolbox (1), ping-launch (1), ping (1), jsonlines (1), jsonl (1)