apache-spark | Topic | Ecosyste.ms: Repos

Topic: "apache-spark"

mlflow/mlflow

The open source developer platform to build AI/LLM applications and models with confidence. Enhance your AI applications with end-to-end tracking, observability, and evaluations, all in one integrated platform.

Language: Python - Size: 870 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 22,707 - Forks: 4,933

microsoft/SynapseML

Simple and Distributed Machine Learning

Language: Scala - Size: 166 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 5,170 - Forks: 851

treeverse/lakeFS

lakeFS - Data version control for your data lake | Git for data

Language: Go - Size: 160 MB - Last synced at: about 5 hours ago - Pushed at: about 4 hours ago - Stars: 4,954 - Forks: 402

lw-lin/CoolplaySpark

Coolplay Spark: Spark source code analysis, Spark libraries, and more

Language: Scala - Size: 9.54 MB - Last synced at: 5 months ago - Pushed at: over 3 years ago - Stars: 3,488 - Forks: 1,410

spark-notebook/spark-notebook

Interactive and Reactive Data Science using Scala and Spark.

Language: JavaScript - Size: 15.8 MB - Last synced at: 2 days ago - Pushed at: over 2 years ago - Stars: 3,153 - Forks: 653

kubeflow/spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.

Language: Go - Size: 25.8 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 3,051 - Forks: 1,434

intel/BigDL

BigDL: Distributed TensorFlow, Keras and PyTorch on Apache Spark/Flink & Ray

Language: Jupyter Notebook - Size: 356 MB - Last synced at: 7 days ago - Pushed at: 19 days ago - Stars: 2,687 - Forks: 732

dotnet/spark

.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.

Language: C# - Size: 4.88 MB - Last synced at: 16 days ago - Pushed at: about 1 month ago - Stars: 2,080 - Forks: 328

big-data-europe/docker-spark

Apache Spark docker image

Language: Shell - Size: 7.78 MB - Last synced at: 5 months ago - Pushed at: over 2 years ago - Stars: 2,055 - Forks: 702

feathr-ai/feathr

Feathr – A scalable, unified data and AI engineering platform for enterprise

Language: Scala - Size: 29.4 MB - Last synced at: 16 days ago - Pushed at: over 1 year ago - Stars: 1,908 - Forks: 235

awesome-spark/awesome-spark

A curated list of awesome Apache Spark packages and resources.

Language: Shell - Size: 231 KB - Last synced at: 17 days ago - Pushed at: about 1 year ago - Stars: 1,833 - Forks: 340

OryxProject/oryx 📦

Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning

Language: Java - Size: 7.12 MB - Last synced at: about 1 month ago - Pushed at: about 4 years ago - Stars: 1,786 - Forks: 404

japila-books/apache-spark-internals

The Internals of Apache Spark

Size: 147 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 1,505 - Forks: 458

ptyadana/SQL-Data-Analysis-and-Visualization-Projects

SQL data analysis & visualization projects using MySQL, PostgreSQL, SQLite, Tableau, Apache Spark and PySpark.

Language: Jupyter Notebook - Size: 35.1 MB - Last synced at: 5 months ago - Pushed at: over 3 years ago - Stars: 1,436 - Forks: 553

san089/goodreads_etl_pipeline

An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.

Language: Python - Size: 1.31 MB - Last synced at: 5 months ago - Pushed at: over 5 years ago - Stars: 1,378 - Forks: 227

databricks/LearningSparkV2

This is the GitHub repo for Learning Spark: Lightning-Fast Data Analytics [2nd Edition]

Language: Scala - Size: 75.2 MB - Last synced at: 12 days ago - Pushed at: 9 months ago - Stars: 1,349 - Forks: 783

lensacom/sparkit-learn

PySpark + Scikit-learn = Sparkit-learn

Language: Python - Size: 444 KB - Last synced at: 6 days ago - Pushed at: almost 5 years ago - Stars: 1,154 - Forks: 256

graphframes/graphframes

GraphFrames is a package for Apache Spark which provides DataFrame-based Graphs

Language: Scala - Size: 5.4 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 1,086 - Forks: 254

databricks/spark-sklearn 📦

(Deprecated) Scikit-learn integration package for Apache Spark

Language: Python - Size: 782 KB - Last synced at: 12 days ago - Pushed at: almost 6 years ago - Stars: 1,077 - Forks: 228

mahmoudparsian/data-algorithms-book

MapReduce, Spark, Java, and Scala for Data Algorithms Book

Language: Java - Size: 397 MB - Last synced at: 5 months ago - Pushed at: about 1 year ago - Stars: 1,075 - Forks: 661

sparklyr/sparklyr

R interface for Apache Spark

Language: R - Size: 99 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 966 - Forks: 309

microsoft/Mobius

C# and F# language binding and extensions to Apache Spark

Language: C# - Size: 6.44 MB - Last synced at: 7 days ago - Pushed at: almost 2 years ago - Stars: 940 - Forks: 208

aloneguid/parquet-dotnet

Fully managed Apache Parquet implementation

Language: C# - Size: 122 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 796 - Forks: 172

LucaCanali/sparkMeasure

This is the development repository for sparkMeasure, a tool and library designed for efficient analysis and troubleshooting of Apache Spark jobs. It focuses on easing the collection and examination of Spark metrics, making it a practical choice for both developers and data engineers.

Language: Scala - Size: 1.92 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 788 - Forks: 157

lw-lin/streaming-readings

Papers and readings related to streaming systems

Size: 6.84 KB - Last synced at: 5 months ago - Pushed at: over 3 years ago - Stars: 733 - Forks: 154

miguno/kafka-storm-starter 📦

[PROJECT IS NO LONGER MAINTAINED] Code examples that show how to integrate Apache Kafka 0.8+ with Apache Storm 0.9+ and Apache Spark Streaming 1.1+, while using Apache Avro as the data serialization format.

Language: Scala - Size: 393 KB - Last synced at: 23 days ago - Pushed at: over 3 years ago - Stars: 724 - Forks: 329

mrpowers-io/quinn

pyspark methods to enhance developer productivity 📣 👯 🎉

Language: Python - Size: 1.98 MB - Last synced at: 14 days ago - Pushed at: 8 months ago - Stars: 674 - Forks: 99

nchammas/flintrock

A command-line tool for launching Apache Spark clusters.

Language: Python - Size: 785 KB - Last synced at: 19 days ago - Pushed at: 11 months ago - Stars: 648 - Forks: 117

cerndb/dist-keras 📦

Distributed Deep Learning, with a focus on distributed training, using Keras and Apache Spark.

Language: Python - Size: 54.6 MB - Last synced at: about 1 month ago - Pushed at: over 7 years ago - Stars: 623 - Forks: 167

apache-spark-on-k8s/spark (fork of apache/spark) 📦

Apache Spark enhanced with native Kubernetes scheduler back-end: NOTE this repository is being ARCHIVED as all new development for the kubernetes scheduler back-end is now on https://github.com/apache/spark/

Language: Scala - Size: 260 MB - Last synced at: about 1 month ago - Pushed at: almost 6 years ago - Stars: 613 - Forks: 117

openscoring/openscoring

REST web service for the true real-time scoring (<1 ms) of Scikit-Learn, R and Apache Spark models

Language: Java - Size: 869 KB - Last synced at: 4 months ago - Pushed at: about 1 year ago - Stars: 583 - Forks: 172

infoslack/awesome-kafka

A list about Apache Kafka

Size: 96.7 KB - Last synced at: 11 days ago - Pushed at: 8 months ago - Stars: 582 - Forks: 166

japila-books/spark-sql-internals

The Internals of Spark SQL

Size: 1.57 GB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 477 - Forks: 136

rjurney/Agile_Data_Code_2

Code for Agile Data Science 2.0, O'Reilly 2017, Second Edition

Language: Jupyter Notebook - Size: 23.2 MB - Last synced at: 20 days ago - Pushed at: over 1 year ago - Stars: 461 - Forks: 310

LucaCanali/Miscellaneous

Includes notes on using Apache Spark, with drill down on Spark for Physics, how to run TPCDS on PySpark, how to create histograms with Spark. Also tools for stress testing, measuring CPUs' performance, and I/O latency heat maps. Jupyter notebooks examples for using various DB systems.

Language: Jupyter Notebook - Size: 35.3 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 455 - Forks: 154

cartershanklin/pyspark-cheatsheet

PySpark Cheat Sheet - example code to help you learn PySpark and develop apps faster

Language: Python - Size: 16.8 MB - Last synced at: 7 months ago - Pushed at: about 1 year ago - Stars: 451 - Forks: 220

tweag/sparkle

Haskell on Apache Spark.

Language: Haskell - Size: 1.11 MB - Last synced at: 27 days ago - Pushed at: 3 months ago - Stars: 448 - Forks: 27

1duo/awesome-ai-infrastructures

Infrastructures™ for Machine Learning Training/Inference in Production.

Size: 11.8 MB - Last synced at: 17 days ago - Pushed at: over 6 years ago - Stars: 427 - Forks: 75

japila-books/spark-structured-streaming-internals

The Internals of Spark Structured Streaming

Size: 119 MB - Last synced at: 5 months ago - Pushed at: almost 3 years ago - Stars: 420 - Forks: 172

ekampf/PySpark-Boilerplate

A boilerplate for writing PySpark Jobs

Language: Python - Size: 10.7 KB - Last synced at: 5 months ago - Pushed at: almost 2 years ago - Stars: 396 - Forks: 155

awesome-spark/spark-gotchas 📦

Spark Gotchas. A subjective compilation of the Apache Spark tips and tricks

Size: 188 KB - Last synced at: 7 days ago - Pushed at: over 8 years ago - Stars: 364 - Forks: 79

tirthajyoti/Spark-with-Python

Fundamentals of Spark with Python (using PySpark), code examples

Language: Jupyter Notebook - Size: 8.97 MB - Last synced at: 5 months ago - Pushed at: about 3 years ago - Stars: 347 - Forks: 271

datamechanics/delight 📦

A Spark UI and Spark History Server alternative with CPU and Memory metrics! Delight is free, cross-platform, and open-source.

Language: Scala - Size: 2.31 MB - Last synced at: about 1 month ago - Pushed at: over 1 year ago - Stars: 346 - Forks: 55

opencypher/morpheus

Morpheus brings the leading graph query language, Cypher, onto the leading distributed processing platform, Spark.

Language: Scala - Size: 29.6 MB - Last synced at: 26 days ago - Pushed at: about 5 years ago - Stars: 341 - Forks: 62

dmmiller612/sparktorch

Train and run PyTorch models on Apache Spark.

Language: Python - Size: 8.83 MB - Last synced at: about 1 month ago - Pushed at: over 2 years ago - Stars: 340 - Forks: 46

miguno/wirbelsturm 📦

[PROJECT IS NO LONGER MAINTAINED] Wirbelsturm is a Vagrant and Puppet based tool to perform 1-click local and remote deployments, with a focus on big data tech like Kafka.

Language: Shell - Size: 309 KB - Last synced at: 23 days ago - Pushed at: over 3 years ago - Stars: 329 - Forks: 72

Hydrospheredata/mist

Serverless proxy for Spark cluster

Language: Scala - Size: 9.96 MB - Last synced at: 5 months ago - Pushed at: about 5 years ago - Stars: 326 - Forks: 72

dataflint/spark

Drop-in replacement for Apache Spark UI

Language: TypeScript - Size: 19 MB - Last synced at: 28 days ago - Pushed at: 28 days ago - Stars: 318 - Forks: 41

microsoft/data-accelerator

Data Accelerator for Apache Spark simplifies onboarding to Streaming of Big Data. It offers a rich, easy to use experience to help with creation, editing and management of Spark jobs on Azure HDInsights or Databricks while enabling the full power of the Spark engine.

Language: C# - Size: 401 MB - Last synced at: 8 days ago - Pushed at: 7 months ago - Stars: 306 - Forks: 91

josephmachado/efficient_data_processing_spark

Code for "Efficient Data Processing in Spark" Course

Language: Python - Size: 23.9 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 304 - Forks: 63

MingChen0919/learning-apache-spark

Notes on Apache Spark (pyspark)

Language: HTML - Size: 20.1 MB - Last synced at: 5 months ago - Pushed at: over 6 years ago - Stars: 299 - Forks: 186

lifeomic/sparkflow

Easy-to-use library to bring TensorFlow to Apache Spark

Language: Python - Size: 8.79 MB - Last synced at: about 1 month ago - Pushed at: about 2 years ago - Stars: 296 - Forks: 45

cuebook/cuelake

Use SQL to build ELT pipelines on a data lakehouse.

Language: JavaScript - Size: 28 MB - Last synced at: 6 months ago - Pushed at: over 3 years ago - Stars: 288 - Forks: 28

svenkreiss/pysparkling

A pure Python implementation of Apache Spark's RDD and DStream interfaces.

Language: Python - Size: 3.45 MB - Last synced at: 6 days ago - Pushed at: about 1 year ago - Stars: 271 - Forks: 45

hortonworks-spark/spark-atlas-connector

A Spark Atlas connector to track data lineage in Apache Atlas

Language: Scala - Size: 903 KB - Last synced at: 6 days ago - Pushed at: almost 3 years ago - Stars: 266 - Forks: 149

jaceklaskowski/spark-workshop

Apache Spark™ and Scala Workshops

Language: HTML - Size: 57 MB - Last synced at: 5 months ago - Pushed at: over 1 year ago - Stars: 264 - Forks: 148

PiercingDan/spark-Jupyter-AWS

A guide on how to set up Jupyter with PySpark painlessly on AWS EC2 clusters, with S3 I/O support

Language: Jupyter Notebook - Size: 220 KB - Last synced at: 6 months ago - Pushed at: almost 8 years ago - Stars: 261 - Forks: 18

Mellanox/SparkRDMA 📦

This is an archive of the SparkRDMA project. The new repository with RDMA shuffle acceleration for Apache Spark is here: https://github.com/Nvidia/sparkucx

Language: Java - Size: 259 KB - Last synced at: about 1 month ago - Pushed at: over 6 years ago - Stars: 253 - Forks: 73

airscholar/e2e-data-engineering

An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All components are containerized with Docker for easy deployment and scalability.

Language: Python - Size: 289 KB - Last synced at: 5 months ago - Pushed at: 9 months ago - Stars: 250 - Forks: 123

Azure/azure-event-hubs-spark

Enabling Continuous Data Processing with Apache Spark and Azure Event Hubs

Language: Scala - Size: 19.6 MB - Last synced at: 6 days ago - Pushed at: 9 months ago - Stars: 238 - Forks: 179

Chabane/bigdata-playground

A complete example of a big data application using: Kubernetes (kops/aws), Apache Spark SQL/Streaming/MLlib, Apache Flink, Scala, Python, Apache Kafka, Apache HBase, Apache Parquet, Apache Avro, Apache Storm, Twitter API, MongoDB, NodeJS, Angular, GraphQL

Language: TypeScript - Size: 3.08 MB - Last synced at: 6 months ago - Pushed at: almost 7 years ago - Stars: 209 - Forks: 74

Azure/azure-cosmosdb-spark 📦

Apache Spark Connector for Azure Cosmos DB

Size: 192 MB - Last synced at: 6 days ago - Pushed at: 8 months ago - Stars: 203 - Forks: 120

lynnlangit/learning-hadoop-and-spark

Companion to Learning Hadoop and Learning Spark courses on LinkedIn Learning

Language: HTML - Size: 13.6 MB - Last synced at: 28 days ago - Pushed at: 11 months ago - Stars: 202 - Forks: 166

databrickslabs/automl-toolkit 📦

Toolkit for Apache Spark ML for feature clean-up, feature importance calculation suite, information gain selection, distributed SMOTE, model selection and training, hyperparameter optimization and selection, and model interpretability.

Language: HTML - Size: 158 MB - Last synced at: about 1 month ago - Pushed at: over 4 years ago - Stars: 193 - Forks: 44

vinta/albedo

A recommender system for discovering GitHub repos, built with Apache Spark

Language: Scala - Size: 448 KB - Last synced at: 3 months ago - Pushed at: 4 months ago - Stars: 181 - Forks: 36

IBM/spark-tpc-ds-performance-test 📦

Use the TPC-DS benchmark to test Spark SQL performance

Language: TSQL - Size: 354 MB - Last synced at: about 1 month ago - Pushed at: over 5 years ago - Stars: 181 - Forks: 98

whylabs/whylogs-java 📦

Profile and monitor your ML data pipeline end-to-end

Language: Java - Size: 5.95 MB - Last synced at: about 1 month ago - Pushed at: about 4 years ago - Stars: 178 - Forks: 7

lamastex/scalable-data-science

Scalable Data Science: course sets in big data using Apache Spark on Databricks, and their mathematical, statistical and computational foundations using SageMath.

Language: HTML - Size: 1.24 GB - Last synced at: 5 months ago - Pushed at: 9 months ago - Stars: 168 - Forks: 93

mahmoudparsian/big-data-mapreduce-course

Big Data Modeling, MapReduce, Spark, PySpark @ Santa Clara University

Language: HTML - Size: 614 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 161 - Forks: 143

radanalyticsio/spark-operator

Operator for managing the Spark clusters on Kubernetes and OpenShift.

Language: Java - Size: 3.39 MB - Last synced at: about 1 month ago - Pushed at: almost 4 years ago - Stars: 158 - Forks: 62

BitwiseInc/Hydrograph

A visual ETL development and debugging tool for big data

Language: Java - Size: 33.5 MB - Last synced at: 9 days ago - Pushed at: almost 3 years ago - Stars: 154 - Forks: 108

qubole/spark-on-lambda

Apache Spark on AWS Lambda

Language: Scala - Size: 111 MB - Last synced at: 6 months ago - Pushed at: almost 3 years ago - Stars: 151 - Forks: 33

archivesunleashed/aut

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.

Language: Scala - Size: 39.5 MB - Last synced at: about 24 hours ago - Pushed at: over 1 year ago - Stars: 147 - Forks: 34

SANSA-Stack/SANSA-Stack

Big Data RDF Processing and Analytics Stack built on Apache Spark and Apache Jena http://sansa-stack.github.io/SANSA-Stack/

Language: Scala - Size: 65.2 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 146 - Forks: 30

GoogleCloudPlatform/dataproc-templates

Dataproc templates and pipelines for solving in-cloud data tasks

Language: Python - Size: 18.8 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 135 - Forks: 106

gtkcyber/griffon-vm

Griffon Data Science Virtual Machine

Size: 896 KB - Last synced at: 4 days ago - Pushed at: over 3 years ago - Stars: 132 - Forks: 26

MemVerge/splash

Splash, a flexible Spark shuffle manager that supports user-defined storage backends for shuffle data storage and exchange

Language: Scala - Size: 666 KB - Last synced at: 2 months ago - Pushed at: 11 months ago - Stars: 128 - Forks: 29

jleetutorial/scala-spark-tutorial

Project for James' Apache Spark with Scala course

Language: Scala - Size: 1.14 MB - Last synced at: 6 months ago - Pushed at: over 5 years ago - Stars: 127 - Forks: 252

LearningJournal/Spark-Streaming-In-Python

Apache Spark 3 - Structured Streaming Course Material

Language: Python - Size: 19.4 MB - Last synced at: about 2 months ago - Pushed at: about 2 years ago - Stars: 122 - Forks: 164

zero323/pyspark-stubs 📦

Apache (Py)Spark type annotations (stub files).

Language: Python - Size: 1.3 MB - Last synced at: about 1 month ago - Pushed at: about 3 years ago - Stars: 117 - Forks: 37

streamnative/pulsar-spark

Spark Connector to read and write with Pulsar

Language: Scala - Size: 722 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 116 - Forks: 53

vivek-bombatkar/Spark-with-Python---My-learning-notes-

ETL pipeline using pyspark (Spark - Python)

Language: CSS - Size: 10.7 MB - Last synced at: 4 days ago - Pushed at: over 5 years ago - Stars: 116 - Forks: 82

G-Research/fasttrackml

Experiment tracking server focused on speed and scalability

Language: Go - Size: 5.4 MB - Last synced at: 2 months ago - Pushed at: 9 months ago - Stars: 111 - Forks: 20

jgperrin/net.jgp.books.spark.ch01

Spark in Action, 2nd edition - chapter 1 - Introduction

Language: Scala - Size: 6.91 MB - Last synced at: 2 months ago - Pushed at: over 2 years ago - Stars: 106 - Forks: 69

exacaster/lighter

REST API for Apache Spark on K8S or YARN

Language: Java - Size: 6 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 100 - Forks: 25

dimajix/flowman

Flowman is an ETL framework powered by Apache Spark. With its declarative approach, Flowman simplifies the development of complex data pipelines.

Language: Scala - Size: 18.3 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 96 - Forks: 19

igor-suhorukov/openstreetmap_h3

High-performance data loader for OSM planet dumps. Transforms an OpenStreetMap World/Region PBF dump into a PostGIS pgsnapshot (lossless) OSM schema representation partitioned by H3 regions and/or into ArrowIPC/Parquet dumps

Language: Java - Size: 6.01 MB - Last synced at: 28 days ago - Pushed at: 3 months ago - Stars: 95 - Forks: 9

adrianulbona/osm-parquetizer

A converter for the OSM PBFs to Parquet files

Language: Java - Size: 75.2 KB - Last synced at: 4 days ago - Pushed at: about 2 years ago - Stars: 95 - Forks: 33

chermenin/spark-states

Custom state store providers for Apache Spark

Language: Scala - Size: 267 KB - Last synced at: 5 months ago - Pushed at: 9 months ago - Stars: 92 - Forks: 25

itsjafer/jupyterlab-sparkmonitor

JupyterLab extension that enables monitoring launched Apache Spark jobs from within a notebook

Language: JavaScript - Size: 4.08 MB - Last synced at: 5 days ago - Pushed at: almost 3 years ago - Stars: 92 - Forks: 23

LearningJournal/SparkProgrammingInScala

Apache Spark Course Material

Language: Scala - Size: 50.9 MB - Last synced at: 7 months ago - Pushed at: over 2 years ago - Stars: 88 - Forks: 159

groda/big_data

Big Data essentials: Hadoop, MapReduce, Spark. Explore tutorials and demos in Jupyter notebooks—most are self-contained and live, ready to run with a click.

Language: Jupyter Notebook - Size: 62.5 MB - Last synced at: 29 days ago - Pushed at: 29 days ago - Stars: 85 - Forks: 27