GitHub topics: apache-spark

Repositories

kubeflow/spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.

Language: Go - Size: 25.5 MB - Last synced at: about 2 hours ago - Pushed at: about 3 hours ago - Stars: 2,952 - Forks: 1,411

streamnative/pulsar-spark

Spark Connector to read and write with Pulsar

Language: Scala - Size: 726 KB - Last synced at: about 2 hours ago - Pushed at: about 3 hours ago - Stars: 113 - Forks: 51

ac-gomes/spark-iceberg-hive

Language: Jupyter Notebook - Size: 1010 KB - Last synced at: about 12 hours ago - Pushed at: about 13 hours ago - Stars: 0 - Forks: 0

teragrep/pth_10

Data Processing Language (DPL) translator for Apache Spark

Language: Java - Size: 1.17 MB - Last synced at: about 17 hours ago - Pushed at: about 18 hours ago - Stars: 1 - Forks: 9

miroslav-reiter/Big_Data_Apache_Spark_Hive_Hadoop_Airflow

🗄️ Materials for online courses and training on Big Data, Apache Spark, Hive, Apache Hadoop, Apache Airflow

Language: Jupyter Notebook - Size: 189 KB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 0 - Forks: 0

This repository covers data management and big data technologies, including databases, querying, and big data processing. Topics include Hadoop (MapReduce, HDFS), Apache Spark, data security, and optimization techniques. Students will learn Spark’s architecture, data distribution, parallel computing, and memory caching to enhance big data solutions.

Language: Jupyter Notebook - Size: 65.2 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 0 - Forks: 0

databricks/LearningSparkV2

This is the GitHub repo for Learning Spark: Lightning-Fast Data Analytics [2nd Edition]

Language: Scala - Size: 75.2 MB - Last synced at: 3 days ago - Pushed at: 5 months ago - Stars: 1,303 - Forks: 773

chernistry/tabularasa-bi-core

A PoC for BI DE challenges, feat. Java, Apache Spark, Kafka, Spring Boot, PostgreSQL, and Docker for AdTech data processing.

Language: Java - Size: 148 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0

LucaCanali/Miscellaneous

Includes notes on using Apache Spark, with a drill-down on Spark for Physics, how to run TPCDS on PySpark, and how to create histograms with Spark. Also includes tools for stress testing and measuring CPU performance, and Jupyter notebook examples for using various DB systems.

Language: Jupyter Notebook - Size: 34.3 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 449 - Forks: 152

dsgrid/dsgrid

Python package for working with demand-side grid projects, datasets and queries

Language: Python - Size: 8.76 MB - Last synced at: 1 day ago - Pushed at: 3 days ago - Stars: 28 - Forks: 5

microsoft/Mobius

C# and F# language binding and extensions to Apache Spark

Language: C# - Size: 6.44 MB - Last synced at: 4 days ago - Pushed at: over 1 year ago - Stars: 939 - Forks: 211

AndGeo69/StreamingCotiles

A streaming implementation of COTILES algorithm using Apache Spark's Structured Streaming API

Language: Python - Size: 2.74 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 0 - Forks: 0

Anonymous0-0paper/SWG

AutoPipe: LLM Assisted Automatic Stream Processing Pipeline Generation

Language: Python - Size: 105 KB - Last synced at: 4 days ago - Pushed at: 6 days ago - Stars: 0 - Forks: 0

treeverse/lakeFS

lakeFS - Data version control for your data lake | Git for data

Language: Go - Size: 150 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 4,712 - Forks: 379

sparklyr/sparklyr

R interface for Apache Spark

Language: R - Size: 97 MB - Last synced at: 1 day ago - Pushed at: 3 months ago - Stars: 962 - Forks: 308

AbsaOSS/hyperdrive

Extensible streaming ingestion pipeline on top of Apache Spark

Language: Scala - Size: 1.63 MB - Last synced at: about 20 hours ago - Pushed at: about 21 hours ago - Stars: 45 - Forks: 13

mlflow/mlflow

Open source platform for the machine learning lifecycle

Language: Python - Size: 795 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 20,791 - Forks: 4,576

feathr-ai/feathr

Feathr – A scalable, unified data and AI engineering platform for enterprise

Language: Scala - Size: 29.4 MB - Last synced at: 2 days ago - Pushed at: about 1 year ago - Stars: 1,898 - Forks: 232

O2-Czech-Republic/proxima-platform

The Proxima platform.

Language: Java - Size: 9.4 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 21 - Forks: 7

newfront/hitchhikers_guide_to_deltalake_streaming

Don't Panic. This guide will help you when it feels like the end of the world.

Language: Jupyter Notebook - Size: 230 KB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 23 - Forks: 9

microsoft/SynapseML

Simple and Distributed Machine Learning

Language: Scala - Size: 157 MB - Last synced at: 7 days ago - Pushed at: 8 days ago - Stars: 5,138 - Forks: 847

graphframes/graphframes

GraphFrames is a package for Apache Spark which provides DataFrame-based Graphs

Language: Scala - Size: 3.95 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 1,058 - Forks: 251

geoHeil/awesome-tools

curated list of awesome tools and libraries for specific domains

Size: 958 KB - Last synced at: 4 days ago - Pushed at: 8 days ago - Stars: 47 - Forks: 11

oceanbase/spark-connector-oceanbase

Apache Spark Connectors for OceanBase.

Language: Scala - Size: 283 KB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 2 - Forks: 4

anaregdesign/openaivec

Pandas extension, Tabular calculation with LLM, Spark UDF Builder

Language: Python - Size: 1.37 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 12 - Forks: 1

srimantapal205/Subject-Wise-Question---Answer

This branch focuses on building Data Engineering interview questions and answers

Size: 445 KB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 0 - Forks: 0

Bhargav129/Spark

This repo helps you understand the core components of Apache Spark, starting with a deep dive into the Catalyst Optimizer.

Size: 43.9 KB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 0 - Forks: 0

Azure/azure-event-hubs-spark

Enabling Continuous Data Processing with Apache Spark and Azure Event Hubs

Language: Scala - Size: 19.6 MB - Last synced at: 1 day ago - Pushed at: 4 months ago - Stars: 236 - Forks: 178

techsparksguru/data_ai_for_all

Data Analysis, Analytics, Science, AI & ML, LLM etc.

Language: Jupyter Notebook - Size: 23.3 MB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 14 - Forks: 3

datatweets/airflow-pyspark-k8s

Run Apache Airflow with KubernetesExecutor and PySpark on Kubernetes using Helm charts and Kind for local development

Language: Python - Size: 283 KB - Last synced at: 4 days ago - Pushed at: 11 days ago - Stars: 1 - Forks: 0

exacaster/lighter

REST API for Apache Spark on K8S or YARN

Language: Java - Size: 6.68 MB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 98 - Forks: 23

LucaCanali/sparkMeasure

This is the development repository for sparkMeasure, a tool and library designed for efficient analysis and troubleshooting of Apache Spark jobs. It focuses on easing the collection and examination of Spark metrics, making it a practical choice for both developers and data engineers.

Language: Scala - Size: 1.96 MB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 758 - Forks: 151

astrolabsoftware/fink

Fink documentation website

Size: 41.9 MB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 3 - Forks: 2

ayushlingayat/Spark-Learning

These are my Spark learning notes, a space I revisit often to revise and strengthen my Spark concepts...🐉🀄

Language: Python - Size: 4.88 KB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 0 - Forks: 0

astrolabsoftware/fink-broker

Astronomy Broker based on Apache Spark

Language: Python - Size: 98.4 MB - Last synced at: 12 days ago - Pushed at: 13 days ago - Stars: 70 - Forks: 14

divithraju/divith-aju-Hadoop-Pyspark-pipeline

This project demonstrates the creation of a scalable data processing pipeline for handling and analyzing log data from a hypothetical e-commerce platform. Leveraging Hadoop and PySpark, the pipeline is designed to process large volumes of log files, providing meaningful insights into user behavior, system performance, and sales metrics.

Language: Python - Size: 4.88 KB - Last synced at: 13 days ago - Pushed at: 10 months ago - Stars: 2 - Forks: 0

dataflint/spark

A modern replacement for the Apache Spark UI

Language: TypeScript - Size: 17.8 MB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 254 - Forks: 30

miguno/kafka-storm-starter 📦

[PROJECT IS NO LONGER MAINTAINED] Code examples that show how to integrate Apache Kafka 0.8+ with Apache Storm 0.9+ and Apache Spark Streaming 1.1+, while using Apache Avro as the data serialization format.

Language: Scala - Size: 393 KB - Last synced at: 6 days ago - Pushed at: about 3 years ago - Stars: 724 - Forks: 329

GoogleCloudPlatform/dataproc-templates

Dataproc templates and pipelines for solving in-cloud data tasks

Language: Python - Size: 18.6 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 129 - Forks: 99

aws-samples/iceberg-streaming-examples

This repo contains examples of high throughput ingestion using Apache Spark and Apache Iceberg. These examples cover IoT and CDC scenarios using best practices. The code can be deployed into any Spark compatible engine like Amazon EMR Serverless or AWS Glue. A fully local developer environment is also provided.

Language: Java - Size: 443 KB - Last synced at: 12 days ago - Pushed at: 7 months ago - Stars: 26 - Forks: 5

OKDP/okdp-spark-auth-filter

OAuth2/OIDC authentication filter for Apache Spark Apps/History UIs

Language: Java - Size: 879 KB - Last synced at: 14 days ago - Pushed at: 15 days ago - Stars: 8 - Forks: 8

AdityaSreevatsaK/PySpark-Pipeline

A collection of PySpark projects showcasing scalable data processing, transformation pipelines, and big data analytics using Apache Spark.

Language: Jupyter Notebook - Size: 1.78 MB - Last synced at: 15 days ago - Pushed at: 16 days ago - Stars: 0 - Forks: 0

dotnet/spark

.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.

Language: C# - Size: 4.87 MB - Last synced at: 15 days ago - Pushed at: about 1 month ago - Stars: 2,058 - Forks: 327

data-tools/big-data-types

A library to transform Scala product types and schemas from different systems into other schemas. Any implemented type automatically gets methods to convert it into the rest of the types and vice versa. E.g., a Spark Schema can be transformed into a BigQuery table.

Language: Scala - Size: 3.74 MB - Last synced at: 8 days ago - Pushed at: 17 days ago - Stars: 13 - Forks: 3

opencypher/morpheus

Morpheus brings the leading graph query language, Cypher, onto the leading distributed processing platform, Spark.

Language: Scala - Size: 29.6 MB - Last synced at: 5 days ago - Pushed at: almost 5 years ago - Stars: 341 - Forks: 62

benedekh/bigdata-projects

Student projects in Big Data field.

Language: Java - Size: 198 KB - Last synced at: 18 days ago - Pushed at: 19 days ago - Stars: 19 - Forks: 12

lw-lin/CoolplaySpark

Coolplay Spark: Spark source code analysis, Spark libraries, and more

Language: Scala - Size: 9.54 MB - Last synced at: 19 days ago - Pushed at: about 3 years ago - Stars: 3,488 - Forks: 1,410

mrpowers-io/quinn

pyspark methods to enhance developer productivity 📣 👯 🎉

Language: Python - Size: 1.98 MB - Last synced at: 6 days ago - Pushed at: 3 months ago - Stars: 672 - Forks: 98

tansu-io/example-spark

Tansu schema-backed topics, instantly accessible as Apache Iceberg tables in Apache Spark

Language: Just - Size: 13.7 KB - Last synced at: 23 days ago - Pushed at: 23 days ago - Stars: 0 - Forks: 0

priyanshubiswas-tech/priyanshubiswas-tech

Data Engineer | Python, SQL, Spark, Hadoop, Airflow, DBT, AWS | Building pipelines, solving data problems, and sharing projects.

Size: 3.86 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 0 - Forks: 0

helioribeiro/helioribeiro.github.io

REPOSITORY FOR MY SOFTWARE DEVELOPMENT AND DATA SCIENCE PORTFOLIO.

Language: CSS - Size: 62.9 MB - Last synced at: 24 days ago - Pushed at: 24 days ago - Stars: 0 - Forks: 0

Salmon-Brain/dead-salmon-brain

Apache Spark-based framework for analyzing A/B experiments

Language: Java - Size: 407 KB - Last synced at: 19 days ago - Pushed at: 8 months ago - Stars: 15 - Forks: 0

guidok91/spark-structured-streaming-kafka

Spark Structured Streaming data pipeline that processes movie ratings data in real-time.

Language: Python - Size: 192 KB - Last synced at: 25 days ago - Pushed at: 25 days ago - Stars: 13 - Forks: 4

spark-notebook/spark-notebook

Interactive and Reactive Data Science using Scala and Spark.

Language: JavaScript - Size: 15.8 MB - Last synced at: 19 days ago - Pushed at: about 2 years ago - Stars: 3,152 - Forks: 653

CodelyTV/spark-best_practices_and_deploy-course

Deploy Spark course examples

Language: Scala - Size: 82.8 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 5 - Forks: 1

G-Research/fasttrackml

Experiment tracking server focused on speed and scalability

Language: Go - Size: 5.4 MB - Last synced at: 9 days ago - Pushed at: 5 months ago - Stars: 105 - Forks: 20

ptyadana/SQL-Data-Analysis-and-Visualization-Projects

SQL data analysis & visualization projects using MySQL, PostgreSQL, SQLite, Tableau, Apache Spark and pySpark.

Language: Jupyter Notebook - Size: 35.1 MB - Last synced at: 25 days ago - Pushed at: almost 3 years ago - Stars: 1,436 - Forks: 553

san089/goodreads_etl_pipeline

An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.

Language: Python - Size: 1.31 MB - Last synced at: 25 days ago - Pushed at: over 5 years ago - Stars: 1,378 - Forks: 227

JuanParias29/BigDataProcessingProject

This repository contains a large-scale data analysis and processing project based on the CRISP-DM methodology, focused on answering business questions in the education domain.

Language: Jupyter Notebook - Size: 4.3 MB - Last synced at: 27 days ago - Pushed at: 27 days ago - Stars: 0 - Forks: 0

josephmachado/efficient_data_processing_spark

Code for "Efficient Data Processing in Spark" Course

Language: Python - Size: 23.9 MB - Last synced at: 27 days ago - Pushed at: 27 days ago - Stars: 304 - Forks: 63

rizkipragustono/data_analysis_spark

Exploration: Data Analysis using Spark

Language: Jupyter Notebook - Size: 8.79 KB - Last synced at: 18 days ago - Pushed at: 27 days ago - Stars: 0 - Forks: 0

lykmapipo/Python-Joblib-Cookbook

A step-by-step guide to master various aspects of Joblib for parallel computing in Python

Language: Python - Size: 44.9 KB - Last synced at: 16 days ago - Pushed at: over 1 year ago - Stars: 7 - Forks: 1

lykmapipo/Scala-Spark-Product-Sales-Analysis

Scala application to process and analyze product sales using Spark

Language: Scala - Size: 125 KB - Last synced at: 16 days ago - Pushed at: over 1 year ago - Stars: 2 - Forks: 1

lykmapipo/Python-Spark-Log-Analysis

Python scripts to process and analyze log files using PySpark.

Language: Python - Size: 131 KB - Last synced at: 16 days ago - Pushed at: 11 months ago - Stars: 6 - Forks: 0

lykmapipo/NYC-TLC-Trip-Data

Python scripts to download, process, and analyze the New York City Taxi and Limousine Commission (TLC) Trip Record Data dataset

Language: Jupyter Notebook - Size: 100 MB - Last synced at: 16 days ago - Pushed at: 10 months ago - Stars: 5 - Forks: 1

airscholar/e2e-data-engineering

An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All components are containerized with Docker for easy deployment and scalability.

Language: Python - Size: 289 KB - Last synced at: 25 days ago - Pushed at: 4 months ago - Stars: 250 - Forks: 123

openscoring/openscoring

REST web service for the true real-time scoring (<1 ms) of Scikit-Learn, R and Apache Spark models

Language: Java - Size: 869 KB - Last synced at: 25 days ago - Pushed at: 10 months ago - Stars: 583 - Forks: 171

Hydrospheredata/mist

Serverless proxy for Spark cluster

Language: Scala - Size: 9.96 MB - Last synced at: 23 days ago - Pushed at: over 4 years ago - Stars: 326 - Forks: 72

Devinterview-io/apache-spark-interview-questions

🟣 Apache Spark interview questions and answers to help you prepare for your next machine learning and data science interview in 2025.

Size: 31.3 KB - Last synced at: 28 days ago - Pushed at: 29 days ago - Stars: 9 - Forks: 8

mohitsarawgi/Leave-Authorization-system

Facilitates online submission of diverse leave request types and routes them to appropriate authorities for approval. Monitors leave balances accurately to avoid exceeding allocated time-off limits, fostering a smooth workflow and improved employee contentment.

Language: JavaScript - Size: 13.2 MB - Last synced at: 29 days ago - Pushed at: 29 days ago - Stars: 0 - Forks: 1

tirthajyoti/Spark-with-Python

Fundamentals of Spark with Python (using PySpark), code examples

Language: Jupyter Notebook - Size: 8.97 MB - Last synced at: 23 days ago - Pushed at: over 2 years ago - Stars: 347 - Forks: 271

vesko-vujovic/vesko-vujovic.github.io

Personal blog about Data Engineering

Language: CSS - Size: 10.9 MB - Last synced at: 29 days ago - Pushed at: 29 days ago - Stars: 0 - Forks: 0

infoslack/awesome-kafka

A list about Apache Kafka

Size: 96.7 KB - Last synced at: 19 days ago - Pushed at: 3 months ago - Stars: 582 - Forks: 164

aloneguid/parquet-dotnet

Fully managed Apache Parquet implementation

Language: C# - Size: 121 MB - Last synced at: 27 days ago - Pushed at: 4 months ago - Stars: 732 - Forks: 160

sbl-sdsc/mmtf-pyspark

Methods for the parallel and distributed analysis and mining of the Protein Data Bank using MMTF and Apache Spark.

Language: Python - Size: 524 MB - Last synced at: about 1 month ago - Pushed at: about 2 years ago - Stars: 67 - Forks: 27

awslabs/amazon-emr-vscode-toolkit

A VS Code Extension to make it easier to manage and develop Spark jobs on EMR

Language: TypeScript - Size: 907 KB - Last synced at: 28 days ago - Pushed at: 4 months ago - Stars: 37 - Forks: 5

Ditectrev/Amazon-Web-Services-Certified-AWS-Certified-Machine-Learning-MLS-C01-Practice-Tests-Exams-Question

⛳️ PASS: Amazon Web Services Certified (AWS Certified) Machine Learning Specialty (MLS-C01) by learning based on our Questions & Answers (Q&A) Practice Tests Exams.

Size: 6.88 MB - Last synced at: 28 days ago - Pushed at: 8 months ago - Stars: 63 - Forks: 38

intel/BigDL

BigDL: Distributed TensorFlow, Keras and PyTorch on Apache Spark/Flink & Ray

Language: Jupyter Notebook - Size: 356 MB - Last synced at: 27 days ago - Pushed at: 3 months ago - Stars: 2,674 - Forks: 731

mahmoudparsian/data-algorithms-book

MapReduce, Spark, Java, and Scala for Data Algorithms Book

Language: Java - Size: 397 MB - Last synced at: 26 days ago - Pushed at: 8 months ago - Stars: 1,075 - Forks: 661

mahmoudparsian/big-data-mapreduce-course

Big Data Modeling, MapReduce, Spark, PySpark @ Santa Clara University

Language: HTML - Size: 601 MB - Last synced at: 17 days ago - Pushed at: 7 months ago - Stars: 158 - Forks: 143

1duo/awesome-ai-infrastructures

Infrastructures™ for Machine Learning Training/Inference in Production.

Size: 11.8 MB - Last synced at: 29 days ago - Pushed at: about 6 years ago - Stars: 416 - Forks: 74

tweag/sparkle

Haskell on Apache Spark.

Language: Haskell - Size: 1.1 MB - Last synced at: 25 days ago - Pushed at: over 2 years ago - Stars: 448 - Forks: 27

big-data-europe/docker-spark

Apache Spark docker image

Language: Shell - Size: 7.78 MB - Last synced at: 26 days ago - Pushed at: about 2 years ago - Stars: 2,055 - Forks: 702

awslabs/amazon-emr-cli

A command-line interface for packaging, deploying, and running your EMR Serverless Spark jobs

Language: Python - Size: 150 KB - Last synced at: 28 days ago - Pushed at: about 1 year ago - Stars: 42 - Forks: 14

archivesunleashed/aut

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.

Language: Scala - Size: 39.5 MB - Last synced at: 5 days ago - Pushed at: over 1 year ago - Stars: 144 - Forks: 33

japila-books/spark-structured-streaming-internals

The Internals of Spark Structured Streaming

Size: 119 MB - Last synced at: 23 days ago - Pushed at: over 2 years ago - Stars: 420 - Forks: 172

mattjw/sparkql

sparkql: Apache Spark SQL DataFrame schema management for sensible humans

Language: Python - Size: 4.59 MB - Last synced at: 21 days ago - Pushed at: over 1 year ago - Stars: 13 - Forks: 4

anqorithm/RealTime-StockStream

RealTime StockStream is a streamlined simulation system for processing live stock market data. It uses Apache Kafka for data input, Apache Spark for data handling, and Apache Cassandra for data storage, making it a powerful yet easy-to-use tool for financial data analysis.

Language: Python - Size: 5.36 MB - Last synced at: 5 days ago - Pushed at: 4 months ago - Stars: 26 - Forks: 3

kwartile/connected-component

Map Reduce Implementation of Connected Component on Apache Spark

Language: Scala - Size: 26.4 KB - Last synced at: 2 days ago - Pushed at: over 3 years ago - Stars: 85 - Forks: 18

hendhamdi/Sentiment-Analysis-Spark-NLP

This project uses a Spark pipeline (PySpark) to analyze the sentiment of user reviews.

Language: HTML - Size: 433 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

jlsilva01/projeto-ed-satc

Template repository for developing the final project of the Data Engineering course in the Software Engineering program at UNISATC.

Size: 662 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 4 - Forks: 3

umbertogriffo/apache-spark-best-practices-and-tuning

https://umbertogriffo.gitbook.io/apache-spark-best-practices-and-tuning/

Size: 1.78 MB - Last synced at: about 1 month ago - Pushed at: over 4 years ago - Stars: 5 - Forks: 2

srafay/Hadoop-hands-on

Learning how to tame the Big Data with Hadoop and related technologies

Language: PigLatin - Size: 96.7 KB - Last synced at: 6 days ago - Pushed at: over 5 years ago - Stars: 23 - Forks: 21

Peippo1/marketing-analytics-pipeline

A scalable marketing analytics pipeline built with Apache Spark and Delta Lake, designed to process, transform, and export data for advanced business insights.

Language: Python - Size: 404 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

lynnlangit/learning-hadoop-and-spark

Companion to Learning Hadoop and Learning Spark courses on LinkedIn Learning

Language: HTML - Size: 13.6 MB - Last synced at: 7 days ago - Pushed at: 6 months ago - Stars: 195 - Forks: 165

ArshTiwari2004/Sahyog

Centralized Disaster Response and Inventory Management System that leverages AI and Google Cloud Technologies to predict disasters, optimize resource management, and provide real-time coordination.

Language: JavaScript - Size: 14.2 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 4 - Forks: 3

cerndb/dist-keras 📦

Distributed Deep Learning, with a focus on distributed training, using Keras and Apache Spark.

Language: Python - Size: 54.6 MB - Last synced at: 29 days ago - Pushed at: almost 7 years ago - Stars: 623 - Forks: 167

vishnu812-tech/Data-Engineering-Essentials

GitHub profile for learning new languages and developing projects

Language: Python - Size: 6.84 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

alimiheb/ChicagoEnergyUsageAnalysis

The Chicago Energy Usage Analysis project aims to explore energy consumption patterns in Chicago using big data techniques. Leveraging Apache Spark, it processes a dataset of approximately 30,000 records to provide actionable insights for urban planning and energy efficiency initiatives.

Language: Java - Size: 8.68 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

msamij/zig-flow

Data Engineering pipeline.

Language: Java - Size: 555 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

Related Keywords

apache-spark (1,693), spark (480), python (343), pyspark (323), scala (249), big-data (231), machine-learning (198), apache-kafka (152), docker (150), data-engineering (122), spark-sql (121), java (114), data-science (100), hadoop (100), sql (91), spark-streaming (88), python3 (85), bigdata (85), apache-airflow (79), jupyter-notebook (78), kafka (77), aws (67), etl (63), databricks (59), docker-compose (59), apache (59), data-analysis (58), etl-pipeline (52), hdfs (48), aws-s3 (48), cassandra (44), big-data-analytics (42), postgresql (40), mongodb (38), distributed-computing (38), mapreduce (38), kubernetes (38), pandas (37), data-pipeline (36), airflow (33), streaming (32), structured-streaming (31), deep-learning (31), elasticsearch (30), apache-hadoop (28), spark-mllib (27), azure (27), data-visualization (27), delta-lake (26), hadoop-mapreduce (25), sbt (24), spring-boot (23), r (23), stream-processing (23), sparksql (23), mysql (22), mllib (22), data (22), apache-flink (22), rdd (21), apache-iceberg (21), hive (21), sentiment-analysis (20), dataframe (20), jupyter (20), data-processing (20), spark-ml (19), data-lake (18), database (18), aws-glue (18), flask (18), java8 (17), dataengineering (17), databricks-notebooks (17), api (17), machine-learning-algorithms (17), parquet (16), pipeline (16), kafka-streams (16), kmeans-clustering (16), powerbi (16), recommender-system (15), analytics (15), collaborative-filtering (15), manning (15), classification (15), sparkwithjava (15), twitter (15), data-analytics (15), emr (15), spark-structured-streaming (15), logistic-regression (15), linear-regression (14), azure-databricks (14), zookeeper (14), s3 (14), hadoop-hdfs (14), apache-cassandra (14), tableau (14), docker-image (14)