An open API service providing repository metadata for many open source software ecosystems.

Topic: "distributed-training"

GokuMohandas/Made-With-ML

Learn how to design, develop, deploy and iterate on production-grade ML applications.

Language: Jupyter Notebook - Size: 3.82 MB - Last synced at: 13 days ago - Pushed at: 9 months ago - Stars: 38,445 - Forks: 6,092

huggingface/pytorch-image-models

The largest collection of PyTorch image encoders / backbones. Including train, eval, inference, export scripts, and pretrained weights -- ResNet, ResNeXT, EfficientNet, NFNet, Vision Transformer (ViT), MobileNetV4, MobileNet-V3 & V2, RegNet, DPN, CSPNet, Swin Transformer, MaxViT, CoAtNet, ConvNeXt, and more

Language: Python - Size: 26.5 MB - Last synced at: about 12 hours ago - Pushed at: 6 days ago - Stars: 34,024 - Forks: 4,908

PaddlePaddle/Paddle

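PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (PaddlePaddle core framework: high-performance single-machine and distributed training and cross-platform deployment for deep learning & machine learning)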

Language: C++ - Size: 472 MB - Last synced at: about 12 hours ago - Pushed at: 2 days ago - Stars: 22,737 - Forks: 5,722

PaddlePaddle/PaddleNLP

Easy-to-use and powerful LLM and SLM library with awesome model zoo.

Language: Python - Size: 110 MB - Last synced at: about 12 hours ago - Pushed at: 1 day ago - Stars: 12,557 - Forks: 3,020

skypilot-org/skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 16+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.

Language: Python - Size: 148 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 7,975 - Forks: 638

IDEA-CCNL/Fengshenbang-LM

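Fengshenbang-LM (封神榜大模型) is an open-source large-model ecosystem led by the Cognitive Computing and Natural Language Research Center of the IDEA Research Institute, intended to serve as infrastructure for Chinese AIGC and cognitive intelligence.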

Language: Python - Size: 84.5 MB - Last synced at: 26 days ago - Pushed at: 9 months ago - Stars: 4,106 - Forks: 380

FedML-AI/FedML

FEDML - The unified and scalable ML library for large-scale distributed training, model serving, and federated learning. FEDML Launch, a cross-cloud scheduler, further enables running any AI jobs on any GPU cloud or on-premise cluster. Built on this library, TensorOpera AI (https://TensorOpera.ai) is your generative AI platform at scale.

Language: Python - Size: 892 MB - Last synced at: 6 days ago - Pushed at: about 2 months ago - Stars: 3,849 - Forks: 744

bytedance/byteps

A high performance and generic framework for distributed DNN training

Language: Python - Size: 20.2 MB - Last synced at: 25 days ago - Pushed at: over 1 year ago - Stars: 3,675 - Forks: 491

tensorflow/adanet

Fast and flexible AutoML with learning guarantees.

Language: Jupyter Notebook - Size: 2.46 MB - Last synced at: 6 days ago - Pushed at: over 1 year ago - Stars: 3,464 - Forks: 529

determined-ai/determined

Determined is an open-source machine learning platform that simplifies distributed training, hyperparameter tuning, experiment tracking, and resource management. Works with PyTorch and TensorFlow.

Language: Go - Size: 198 MB - Last synced at: 13 days ago - Pushed at: about 2 months ago - Stars: 3,132 - Forks: 364

alpa-projects/alpa 📦

Training and serving large-scale neural networks with auto parallelization.

Language: Python - Size: 7.11 MB - Last synced at: 1 day ago - Pushed at: over 1 year ago - Stars: 3,129 - Forks: 359

learning-at-home/hivemind

Decentralized deep learning in PyTorch. Built to train models on thousands of volunteers across the world.

Language: Python - Size: 12.1 MB - Last synced at: 10 days ago - Pushed at: 16 days ago - Stars: 2,167 - Forks: 184

intelligent-machine-learning/dlrover

DLRover: An Automatic Distributed Deep Learning System

Language: Python - Size: 164 MB - Last synced at: 10 days ago - Pushed at: 11 days ago - Stars: 1,421 - Forks: 176

pytorch/gloo

Collective communications library with various primitives for multi-machine training.

Language: C++ - Size: 1.45 MB - Last synced at: 6 days ago - Pushed at: 7 days ago - Stars: 1,291 - Forks: 325

tensorlayer/HyperPose

Library for Fast and Flexible Human Pose Estimation

Language: Python - Size: 9.66 MB - Last synced at: 7 days ago - Pushed at: about 2 years ago - Stars: 1,261 - Forks: 274

DeepRec-AI/DeepRec

DeepRec is a high-performance recommendation deep learning framework based on TensorFlow. It is hosted as an incubation project in the LF AI & Data Foundation.

Language: C++ - Size: 764 MB - Last synced at: 24 days ago - Pushed at: 3 months ago - Stars: 1,092 - Forks: 361

mryab/efficient-dl-systems

Efficient Deep Learning Systems course materials (HSE, YSDA)

Language: Jupyter Notebook - Size: 68.7 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 814 - Forks: 132

alibaba/Megatron-LLaMA Fork of NVIDIA/Megatron-LM

Best practice for training LLaMA models in Megatron-LM

Language: Python - Size: 4.18 MB - Last synced at: 6 days ago - Pushed at: over 1 year ago - Stars: 650 - Forks: 56

Guitaricet/relora

Official code for ReLoRA from the paper Stack More Layers Differently: High-Rank Training Through Low-Rank Updates

Language: Jupyter Notebook - Size: 1.89 MB - Last synced at: 8 days ago - Pushed at: about 1 year ago - Stars: 452 - Forks: 38

petuum/adaptdl

Resource-adaptive cluster scheduler for deep learning training.

Language: Python - Size: 2.53 MB - Last synced at: 25 days ago - Pushed at: about 2 years ago - Stars: 436 - Forks: 79

Oneflow-Inc/libai

LiBai(李白): A Toolbox for Large-Scale Distributed Parallel Training

Language: Python - Size: 34.6 MB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 402 - Forks: 56

LambdaLabsML/distributed-training-guide

Best practices & guides on how to write distributed PyTorch training code

Language: Python - Size: 429 KB - Last synced at: 27 days ago - Pushed at: 2 months ago - Stars: 388 - Forks: 27

pytorch/torchx

TorchX is a universal job launcher for PyTorch applications. TorchX is designed to have fast iteration time for training/research and support for E2E production ML pipelines when you're ready.

Language: Python - Size: 33.9 MB - Last synced at: about 4 hours ago - Pushed at: about 4 hours ago - Stars: 361 - Forks: 130

DataCanvasIO/HyperGBM

A full pipeline AutoML tool for tabular data

Language: Python - Size: 11 MB - Last synced at: 23 days ago - Pushed at: 24 days ago - Stars: 347 - Forks: 47

sail-sg/oat

🌾 OAT: A research-friendly framework for LLM online alignment, including preference learning, reinforcement learning, etc.

Language: Python - Size: 2.27 MB - Last synced at: 17 days ago - Pushed at: 18 days ago - Stars: 325 - Forks: 21

maudzung/YOLO3D-YOLOv4-PyTorch

YOLO3D: End-to-end real-time 3D Oriented Object Bounding Box Detection from LiDAR Point Cloud (ECCV 2018)

Language: Python - Size: 12.2 MB - Last synced at: 26 days ago - Pushed at: over 4 years ago - Stars: 300 - Forks: 44

PKU-DAIR/Hetu Fork of Hsword/Hetu

A high-performance distributed deep learning system targeting large-scale and automated distributed training.

Language: Python - Size: 88.1 MB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 297 - Forks: 33

lsds/KungFu

Fast and Adaptive Distributed Machine Learning for TensorFlow, PyTorch and MindSpore.

Language: Go - Size: 1.88 MB - Last synced at: 26 days ago - Pushed at: about 1 year ago - Stars: 293 - Forks: 59

DeNA/HandyRL

HandyRL is a handy and simple framework based on Python and PyTorch for distributed reinforcement learning that is applicable to your own environments.

Language: Python - Size: 612 KB - Last synced at: 24 days ago - Pushed at: 2 months ago - Stars: 287 - Forks: 43

HMUNACHI/NanoDL

A Jax-based library for designing and training small transformers.

Language: Python - Size: 44.4 MB - Last synced at: 6 days ago - Pushed at: 8 months ago - Stars: 286 - Forks: 10

aws-samples/awsome-distributed-training

Collection of best practices, reference architectures, model training examples and utilities to train large models on AWS.

Language: Shell - Size: 141 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 284 - Forks: 118

alibaba/EasyParallelLibrary

Easy Parallel Library (EPL) is a general and efficient deep learning framework for distributed model training.

Language: Python - Size: 771 KB - Last synced at: about 1 month ago - Pushed at: about 2 years ago - Stars: 267 - Forks: 49

awslabs/deeplearning-cfn

Distributed Deep Learning on AWS Using CloudFormation (CFN), MXNet and TensorFlow

Language: Python - Size: 3.33 MB - Last synced at: 17 days ago - Pushed at: about 5 years ago - Stars: 254 - Forks: 104

dougsouza/pytorch-sync-batchnorm-example

How to use Cross Replica / Synchronized Batchnorm in PyTorch

Size: 19.5 KB - Last synced at: about 1 year ago - Pushed at: almost 6 years ago - Stars: 247 - Forks: 24

foundation-model-stack/fms-fsdp

🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash attention v2.

Language: Python - Size: 785 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 243 - Forks: 38

synxlin/deep-gradient-compression

[ICLR 2018] Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training

Language: Python - Size: 316 KB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 207 - Forks: 45

chairc/Integrated-Design-Diffusion-Model

IDDM (Industrial, landscape, animate, spectrogram...), supports DDPM, DDIM, PLMS, webui and distributed training. PyTorch implementation of diffusion models, generative models, and distributed training

Language: Python - Size: 2.73 MB - Last synced at: about 17 hours ago - Pushed at: 8 days ago - Stars: 201 - Forks: 25

zh320/realtime-semantic-segmentation-pytorch

PyTorch implementation of over 30 real-time semantic segmentation models, e.g. BiSeNetv1, BiSeNetv2, CGNet, ContextNet, DABNet, DDRNet, EDANet, ENet, ERFNet, ESPNet, ESPNetv2, FastSCNN, ICNet, LEDNet, LinkNet, PP-LiteSeg, SegNet, ShelfNet, STDC, SwiftNet, with support for knowledge distillation, distributed training, Optuna, etc.

Language: Python - Size: 10.9 MB - Last synced at: about 1 month ago - Pushed at: about 2 months ago - Stars: 189 - Forks: 28

wenwei202/terngrad

Ternary Gradients to Reduce Communication in Distributed Deep Learning (TensorFlow)

Language: Python - Size: 5.59 MB - Last synced at: about 1 year ago - Pushed at: over 6 years ago - Stars: 181 - Forks: 48

ZJU-OpenKS/OpenKS

OpenKS - a domain-generalizable knowledge learning and computing platform

Language: Python - Size: 388 MB - Last synced at: 8 months ago - Pushed at: 8 months ago - Stars: 157 - Forks: 67

huggingface/chug

Minimal sharded dataset loaders, decoders, and utils for multi-modal document, image, and text datasets.

Language: Python - Size: 146 KB - Last synced at: 5 days ago - Pushed at: about 1 year ago - Stars: 157 - Forks: 11

PaddlePaddle/PLSC

Paddle Large Scale Classification Tools, supporting ArcFace, CosFace, PartialFC, and Data Parallel + Model Parallel. Models include ResNet, ViT, Swin, DeiT, CaiT, FaceViT, MoCo, MAE, ConvMAE, CAE.

Language: Python - Size: 2.9 MB - Last synced at: about 1 year ago - Pushed at: almost 2 years ago - Stars: 144 - Forks: 32

aws/sagemaker-xgboost-container

This is the Docker container based on the open source framework XGBoost (https://xgboost.readthedocs.io/en/latest/) that allows customers to use their own XGBoost scripts in SageMaker.

Language: Python - Size: 830 KB - Last synced at: 5 days ago - Pushed at: about 2 months ago - Stars: 136 - Forks: 85

microsoft/nnscaler

nnScaler: Compiling DNN models for Parallel Training

Language: Python - Size: 1.51 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 108 - Forks: 14

alibaba/TePDist

TePDist (TEnsor Program DISTributed) is an HLO-level automatic distributed system for DL models.

Language: C++ - Size: 48.1 MB - Last synced at: about 1 month ago - Pushed at: about 2 years ago - Stars: 93 - Forks: 9

richardkxu/distributed-pytorch

Distributed, mixed-precision training with PyTorch

Language: Python - Size: 1.26 MB - Last synced at: about 2 years ago - Pushed at: over 4 years ago - Stars: 86 - Forks: 24

omerbsezer/Fast-Kubeflow

This repo covers Kubeflow Environment with LABs: Kubeflow GUI, Jupyter Notebooks on pods, Kubeflow Pipelines, Experiments, KALE, KATIB (AutoML: Hyperparameter Tuning), KFServe (Model Serving), Training Operators (Distributed Training), Projects, etc.

Language: Python - Size: 160 KB - Last synced at: 8 days ago - Pushed at: about 1 year ago - Stars: 80 - Forks: 19

G-U-N/a-PyTorch-Tutorial-to-Class-Incremental-Learning

a PyTorch Tutorial to Class-Incremental Learning | a Distributed Training Template of CIL with core code less than 100 lines.

Language: Python - Size: 22.1 MB - Last synced at: about 1 year ago - Pushed at: almost 3 years ago - Stars: 78 - Forks: 6

PanJinquan/Pytorch-Base-Trainer

PyTorch distributed training framework

Language: Python - Size: 34.6 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 77 - Forks: 10

bindog/pytorch-model-parallel

A memory balanced and communication efficient FullyConnected layer with CrossEntropyLoss model parallel implementation in PyTorch

Language: Python - Size: 85 KB - Last synced at: about 1 year ago - Pushed at: almost 5 years ago - Stars: 74 - Forks: 20

bytedance/ps-lite Fork of dmlc/ps-lite

A lightweight parameter server interface

Language: C++ - Size: 1.18 MB - Last synced at: 9 months ago - Pushed at: over 2 years ago - Stars: 71 - Forks: 24

bryanyzhu/Video-Tutorial-CVPR2020

A Comprehensive Tutorial on Video Modeling

Language: Jupyter Notebook - Size: 15.7 MB - Last synced at: about 1 month ago - Pushed at: almost 5 years ago - Stars: 66 - Forks: 12

tanyuqian/redco

NAACL '24 (Best Demo Paper Runner-Up) / MLSys @ NeurIPS '23 - RedCoast: A Lightweight Tool to Automate Distributed Training and Inference

Language: Python - Size: 11.5 MB - Last synced at: 6 days ago - Pushed at: 5 months ago - Stars: 65 - Forks: 7

taishan1994/pytorch-distributed-NLP

PyTorch distributed training

Language: Python - Size: 1.79 MB - Last synced at: about 1 month ago - Pushed at: almost 2 years ago - Stars: 65 - Forks: 14

hkproj/pytorch-transformer-distributed

Distributed training (multi-node) of a Transformer model

Language: Python - Size: 4.03 MB - Last synced at: about 1 month ago - Pushed at: about 1 year ago - Stars: 63 - Forks: 26

NoteDance/Note

Machine learning library, Distributed training, Deep learning, Reinforcement learning, Models, TensorFlow, PyTorch

Language: Python - Size: 9.77 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 61 - Forks: 2

pinpoint-apm/pinpoint-node-agent

Pinpoint Node.js agent

Language: JavaScript - Size: 2.36 MB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 57 - Forks: 11

AI-Hypercomputer/gpu-recipes

Recipes for reproducing training and serving benchmarks for large machine learning models using GPUs on Google Cloud.

Language: Python - Size: 338 KB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 56 - Forks: 15

awslabs/dynamic-training-with-apache-mxnet-on-aws 📦

Dynamic training with Apache MXNet reduces cost and time for training deep neural networks by leveraging AWS cloud elasticity and scale. The system reduces training cost and time by dynamically updating the training cluster size during training, with minimal impact on model training accuracy.

Language: Python - Size: 17.6 MB - Last synced at: 17 days ago - Pushed at: over 2 years ago - Stars: 56 - Forks: 17

aws-samples/aws-do-eks

Create, List, Update, Delete Amazon EKS clusters. Deploy and manage software on EKS. Run distributed model training and inference examples.

Language: Shell - Size: 26.3 MB - Last synced at: 18 days ago - Pushed at: 19 days ago - Stars: 55 - Forks: 32

andreped/GradientAccumulator

🎯 Accumulated Gradients for TensorFlow 2

Language: Python - Size: 5.3 MB - Last synced at: 4 days ago - Pushed at: about 1 year ago - Stars: 53 - Forks: 11

GATECH-EIC/BNS-GCN

[MLSys 2022] "BNS-GCN: Efficient Full-Graph Training of Graph Convolutional Networks with Partition-Parallelism and Random Boundary Node Sampling" by Cheng Wan, Youjie Li, Ang Li, Nam Sung Kim, Yingyan Lin

Language: Python - Size: 75.2 KB - Last synced at: 3 months ago - Pushed at: over 1 year ago - Stars: 53 - Forks: 11

l294265421/my-llm

All about large language models

Size: 188 MB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 46 - Forks: 5

AdrianBZG/LLM-distributed-finetune

Efficiently tune any LLM from HuggingFace using distributed training (multiple GPUs) and DeepSpeed. Uses Ray AIR to orchestrate the training on multiple AWS GPU instances

Language: Python - Size: 20.9 MB - Last synced at: 12 months ago - Pushed at: almost 2 years ago - Stars: 46 - Forks: 6

uw-mad-dash/shockwave

Artifact for "Shockwave: Fair and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning" [NSDI '23]

Language: Python - Size: 5.77 MB - Last synced at: 8 days ago - Pushed at: over 2 years ago - Stars: 43 - Forks: 1

array2d/deepx

Large-scale Auto-Distributed Training/Inference Unified Framework | Memory-Compute-Control Decoupled Architecture | Multi-language SDK & Heterogeneous Hardware Support

Language: C++ - Size: 1.95 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 41 - Forks: 4

saareliad/FTPipe

FTPipe and related pipeline model parallelism research.

Language: Python - Size: 11.4 MB - Last synced at: 23 days ago - Pushed at: almost 2 years ago - Stars: 41 - Forks: 7

aws-samples/amazon-sagemaker-protein-classification

Implementation of Protein Classification based on subcellular localization using ProtBert(Rostlab/prot_bert_bfd_localization) model from Hugging Face library, based on BERT model trained on large corpus of protein sequences.

Language: Jupyter Notebook - Size: 108 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 40 - Forks: 23

HongxinXiang/pytorch-multi-GPU-training-tutorial

Language: Python - Size: 57.6 MB - Last synced at: over 1 year ago - Pushed at: over 2 years ago - Stars: 39 - Forks: 13

megvii-research/basecls

A codebase & model zoo for pretrained backbone based on MegEngine.

Language: Python - Size: 1.32 MB - Last synced at: about 2 months ago - Pushed at: about 2 years ago - Stars: 33 - Forks: 3

aws-samples/TensorFlow-in-SageMaker-workshop

Running your TensorFlow models in Amazon SageMaker

Language: Jupyter Notebook - Size: 97.7 KB - Last synced at: 24 days ago - Pushed at: over 4 years ago - Stars: 33 - Forks: 18

determined-ai/determined-examples

Example ML projects that use the Determined library.

Language: Python - Size: 2.73 MB - Last synced at: 5 days ago - Pushed at: 8 months ago - Stars: 32 - Forks: 3

4paradigm/OpenEmbedding

OpenEmbedding is an open source framework for TensorFlow distributed training acceleration.

Language: C++ - Size: 1.75 MB - Last synced at: 12 days ago - Pushed at: about 2 years ago - Stars: 31 - Forks: 6

GATECH-EIC/PipeGCN

[ICLR 2022] "PipeGCN: Efficient Full-Graph Training of Graph Convolutional Networks with Pipelined Feature Communication" by Cheng Wan, Youjie Li, Cameron R. Wolfe, Anastasios Kyrillidis, Nam Sung Kim, Yingyan Lin

Language: Python - Size: 31.3 KB - Last synced at: 3 months ago - Pushed at: about 2 years ago - Stars: 31 - Forks: 7

pabaq/Coursera-TensorFlow-Advanced-Techniques-Specialization

Programming assignments and labs from the TensorFlow Advanced Techniques Specialization offered by deeplearning.ai on Coursera.

Language: Jupyter Notebook - Size: 248 MB - Last synced at: 5 months ago - Pushed at: about 2 years ago - Stars: 27 - Forks: 8

AshishKumar4/FlaxDiff

A simple, easy-to-understand library for diffusion models using Flax and Jax. Includes detailed notebooks on DDPM, DDIM, and EDM with simplified mathematical explanations. Made as part of my journey for learning and experimenting with generative AI.

Language: Jupyter Notebook - Size: 238 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 25 - Forks: 0

CyprienQuemeneur/fedpylot

FedPylot: Navigating Federated Learning for Real-Time Object Detection in Internet of Vehicles

Language: Jupyter Notebook - Size: 35.5 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 23 - Forks: 10

knagrecha/saturn

Saturn accelerates the training of large-scale deep learning models with a novel joint optimization approach.

Language: Python - Size: 107 KB - Last synced at: 6 days ago - Pushed at: over 1 year ago - Stars: 23 - Forks: 5

Azure/DistributedDeepLearning 📦

Tutorials on running distributed deep learning on Batch AI

Language: Shell - Size: 604 KB - Last synced at: 3 days ago - Pushed at: over 6 years ago - Stars: 23 - Forks: 15

graykode/horovod-ansible

Create Horovod cluster easily using Ansible

Language: HCL - Size: 217 KB - Last synced at: about 9 hours ago - Pushed at: almost 6 years ago - Stars: 22 - Forks: 5

cornell-zhang/HOGA

Hop-Wise Graph Attention for Scalable and Generalizable Learning on Circuits

Language: Python - Size: 635 KB - Last synced at: 8 months ago - Pushed at: 8 months ago - Stars: 21 - Forks: 2

saforem2/ezpz

Train across all your devices, ezpz 🍋

Language: Python - Size: 5.71 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 20 - Forks: 5

OpenRL-Lab/Ray_Tutorial

Tutorial for Ray

Language: Python - Size: 20.5 KB - Last synced at: 2 months ago - Pushed at: about 1 year ago - Stars: 20 - Forks: 3

sayakpaul/Distributed-Training-in-TensorFlow-2-with-AI-Platform

Contains code to demonstrate distributed training in TensorFlow 2 with AI Platform and custom Docker containers.

Language: Python - Size: 51.8 KB - Last synced at: about 1 month ago - Pushed at: about 4 years ago - Stars: 20 - Forks: 1

ShinoharaHare/LLM-Training

A distributed training framework for large language models powered by Lightning.

Language: Python - Size: 281 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 19 - Forks: 4

alexrenz/AdaPM

A fully adaptive, zero-tuning parameter manager that enables efficient distributed machine learning training

Language: C++ - Size: 2.41 MB - Last synced at: over 1 year ago - Pushed at: about 2 years ago - Stars: 19 - Forks: 8

SLAMPAI/large-scale-pretraining-transfer

Code for reproducing the experiments on large-scale pre-training and transfer learning for the paper "Effect of large-scale pre-training on full and few-shot transfer learning for natural and medical images" (https://arxiv.org/abs/2106.00116)

Language: Jupyter Notebook - Size: 401 KB - Last synced at: 6 months ago - Pushed at: almost 3 years ago - Stars: 18 - Forks: 4

aws-samples/amazon-sagemaker-visual-transformer

Implementation of Image Classification using Visual Transformers in Amazon SageMaker based on the ideas from research paper - Visual Transformers: Token-based Image Representation and Processing for Computer Vision.

Language: Jupyter Notebook - Size: 646 KB - Last synced at: about 2 years ago - Pushed at: over 4 years ago - Stars: 16 - Forks: 8

raywan-110/AdaQP

Adaptive Message Quantization and Parallelization for Distributed Full-graph GNN Training

Language: Python - Size: 87.9 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 15 - Forks: 0

AberHu/ImageNet-training

PyTorch ImageNet training code with various tricks, lr schedulers, distributed training, mixed precision training, DALI dataloader, etc.

Language: Python - Size: 51.8 KB - Last synced at: about 2 years ago - Pushed at: over 4 years ago - Stars: 15 - Forks: 6

senli1073/SeisT

[TGRS] SeisT: A Foundational Deep-Learning Model for Earthquake Monitoring Tasks

Language: Python - Size: 24.4 MB - Last synced at: 21 days ago - Pushed at: 6 months ago - Stars: 14 - Forks: 0

JYWa/MATCHA

Communication-efficient decentralized SGD (PyTorch)

Language: Python - Size: 19.5 KB - Last synced at: about 2 years ago - Pushed at: about 5 years ago - Stars: 14 - Forks: 6

daekeun-ml/sm-distributed-training-step-by-step

This repository provides hands-on labs on PyTorch-based Distributed Training and SageMaker Distributed Training. It is written to make it easy for beginners to get started, and guides you through step-by-step modifications to the code based on the most basic BERT use cases.

Language: Jupyter Notebook - Size: 1.3 MB - Last synced at: about 1 month ago - Pushed at: almost 2 years ago - Stars: 13 - Forks: 2

hysts/pytorch_yolov3 📦

A PyTorch Implementation of YOLOv3

Language: Python - Size: 1.23 MB - Last synced at: about 2 years ago - Pushed at: about 6 years ago - Stars: 13 - Forks: 0

zh320/medical-segmentation-pytorch

PyTorch implementation of medical semantic segmentation models, e.g. UNet, UNet++, DUCKNet, ResUNet, ResUNet++, with support for knowledge distillation, distributed training, Optuna, etc.

Language: Python - Size: 71.3 KB - Last synced at: 24 days ago - Pushed at: 2 months ago - Stars: 12 - Forks: 4

18520339/ml-distributed-training

Reduce the training time of CNNs by leveraging the power of multiple GPUs with two approaches, Multi-Worker and Parameter Server Training, using TensorFlow 2

Language: Jupyter Notebook - Size: 8.05 MB - Last synced at: 4 days ago - Pushed at: over 2 years ago - Stars: 12 - Forks: 3

cake-lab/DELI

Optimizing loading training data from cloud bucket storage for cloud-based distributed deep learning. Official repository for Quantifying and Improving Performance of Distributed Deep Learning with Cloud Storage, to be published in IC2E 2021

Language: Jupyter Notebook - Size: 716 KB - Last synced at: about 2 years ago - Pushed at: over 3 years ago - Stars: 11 - Forks: 1

redis-applied-ai/redis-feast-ray 📦

A demo pipeline of using Redis as an online feature store with Feast for orchestration and Ray for training and model serving

Language: Python - Size: 23.9 MB - Last synced at: 6 months ago - Pushed at: over 2 years ago - Stars: 10 - Forks: 0

AlibabaPAI/FlashModels

Fast and easy distributed model training examples.

Language: Python - Size: 42.9 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 9 - Forks: 4

Shenggan/atp

Adaptive Tensor Parallelism for Foundation Models

Language: Python - Size: 3.22 MB - Last synced at: 29 days ago - Pushed at: over 2 years ago - Stars: 9 - Forks: 0