GitHub topics: fastertransformer
InternLM/lmdeploy
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
Language: Python - Size: 8.18 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 6,504 - Forks: 556

Curt-Park/serving-codegen-gptj-triton
A serving example of CodeGen-350M-Mono (GPT-J) on Triton Inference Server with Docker and Kubernetes
Language: Python - Size: 5.47 MB - Last synced at: 2 months ago - Pushed at: about 2 years ago - Stars: 20 - Forks: 0

RajeshThallam/fastertransformer-converter
A code sample for serving Large Language Models (LLMs) on a Google Kubernetes Engine (GKE) cluster with GPUs, running NVIDIA Triton Inference Server with the FasterTransformer backend.
Language: Python - Size: 139 KB - Last synced at: 3 months ago - Pushed at: about 2 years ago - Stars: 0 - Forks: 0
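Serving a model through the FasterTransformer backend, as this repository does, requires a Triton model configuration. The fragment below is an illustrative sketch only (the parameter values and checkpoint path are assumptions, not taken from this repository); the `fastertransformer` backend name and parameters such as `tensor_para_size` and `model_checkpoint_path` follow NVIDIA's fastertransformer_backend examples:

```
# config.pbtxt (illustrative sketch; values are assumptions)
name: "fastertransformer"
backend: "fastertransformer"
max_batch_size: 8

parameters {
  key: "tensor_para_size"        # number of GPUs for tensor parallelism
  value: { string_value: "1" }
}
parameters {
  key: "pipeline_para_size"      # number of pipeline-parallel stages
  value: { string_value: "1" }
}
parameters {
  key: "model_checkpoint_path"   # hypothetical path to converted weights
  value: { string_value: "/models/fastertransformer/1/1-gpu" }
}
```

Each served model gets one such file in its Triton model-repository directory; the converter in this repository produces the checkpoint that `model_checkpoint_path` points at.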

detail-novelist/novelist-triton-server
Deploy KoGPT with Triton Inference Server
Language: Shell - Size: 7.81 KB - Last synced at: 3 months ago - Pushed at: over 2 years ago - Stars: 14 - Forks: 0

clam004/triton-ft-api
A tutorial on deploying a scalable autoregressive causal language model (transformer) with NVIDIA Triton Inference Server
Language: Python - Size: 52.7 KB - Last synced at: 2 months ago - Pushed at: over 2 years ago - Stars: 5 - Forks: 0
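A client of a deployment like the ones above talks to Triton over its HTTP/REST v2 inference protocol. The sketch below builds such a request body with only the standard library; the tensor names (`input_ids`, `input_lengths`, `request_output_len`) follow the FasterTransformer GPT examples and are assumptions that may differ per deployment:

```python
import json

def build_infer_request(token_ids, output_len=32):
    """Build a Triton v2 HTTP inference request body (illustrative sketch).

    Tensor names below are assumptions modeled on the FasterTransformer
    backend's GPT examples; check your model's config.pbtxt for the
    actual names and datatypes.
    """
    return json.dumps({
        "inputs": [
            {"name": "input_ids", "shape": [1, len(token_ids)],
             "datatype": "UINT32", "data": list(token_ids)},
            {"name": "input_lengths", "shape": [1, 1],
             "datatype": "UINT32", "data": [len(token_ids)]},
            {"name": "request_output_len", "shape": [1, 1],
             "datatype": "UINT32", "data": [output_len]},
        ]
    })

# POST this body to http://<host>:8000/v2/models/<model_name>/infer
body = build_infer_request([101, 2023, 2003, 102])
```

In practice the official `tritonclient` package wraps this protocol, but the raw JSON shape makes it clear what the server expects.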
