Topic: "visual-language-learning"
haotian-liu/LLaVA
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA), built towards GPT-4V-level capabilities and beyond.
Language: Python - Size: 13.4 MB - Last synced at: 1 day ago - Pushed at: 9 months ago - Stars: 22,343 - Forks: 2,456
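Visual instruction tuning trains the model on image-grounded conversations. As a minimal sketch (field names and paths are illustrative, not LLaVA's exact schema), one training sample pairs an image with a human/assistant dialogue, using an `<image>` placeholder where visual tokens are spliced into the text:

```python
import json

# Hypothetical single training sample in a conversation-style JSON layout
# commonly used for visual instruction tuning (all values illustrative).
sample = {
    "id": "000001",
    "image": "images/example.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is unusual about this image?"},
        {"from": "gpt", "value": "The scene shows an object in an unexpected place."},
    ],
}

def render_prompt(sample):
    """Flatten the conversation into one training string, keeping the
    <image> placeholder where the vision encoder's tokens are inserted."""
    return "\n".join(f"{t['from'].upper()}: {t['value']}" for t in sample["conversations"])

print(render_prompt(sample))
```

Serializing samples this way (`json.dumps(sample)`) keeps the dataset human-readable while letting the training loop locate the image slot by searching for the placeholder token.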

NExT-GPT/NExT-GPT
Code and models for ICML 2024 paper, NExT-GPT: Any-to-Any Multimodal Large Language Model
Language: Python - Size: 127 MB - Last synced at: 21 days ago - Pushed at: 6 months ago - Stars: 3,480 - Forks: 349

EvolvingLMMs-Lab/Otter
🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability.
Language: Python - Size: 7.39 MB - Last synced at: 6 days ago - Pushed at: about 1 year ago - Stars: 3,248 - Forks: 214

InternLM/InternLM-XComposer
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
Language: Python - Size: 199 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 2,815 - Forks: 173

RLHF-V/RLHF-V
[CVPR'24] RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
Language: Python - Size: 70.6 MB - Last synced at: 1 day ago - Pushed at: 8 months ago - Stars: 276 - Forks: 8

mlpc-ucsd/BLIVA
(AAAI 2024) BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions
Language: Python - Size: 12.3 MB - Last synced at: 1 day ago - Pushed at: about 1 year ago - Stars: 256 - Forks: 23

thomas-yanxin/KarmaVLM
🧘🏻♂️ KarmaVLM (相生): A family of high-efficiency, powerful visual language models.
Language: Python - Size: 2.68 MB - Last synced at: 1 day ago - Pushed at: about 1 year ago - Stars: 88 - Forks: 3

AdrianBZG/llama-multimodal-vqa
Multimodal Instruction Tuning for Llama 3
Language: Python - Size: 31.3 KB - Last synced at: 1 day ago - Pushed at: about 1 year ago - Stars: 48 - Forks: 10

Skyline-9/Shotluck-Holmes
[ACM MMGR '24] 🔍 Shotluck Holmes: A family of small-scale LLVMs for shot-level video understanding
Language: Python - Size: 26.3 MB - Last synced at: about 1 month ago - Pushed at: 6 months ago - Stars: 11 - Forks: 0

xinyanghuang7/Basic-Visual-Language-Model
Build a simple, basic multimodal large model from scratch. 🤖
Language: Python - Size: 34.7 MB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 3 - Forks: 0

ashleykleynhans/llava-docker
Docker image for LLaVA: Large Language and Vision Assistant
Language: Shell - Size: 126 KB - Last synced at: 1 day ago - Pushed at: 10 months ago - Stars: 1 - Forks: 0

MuhammadAliS/CLIP
PyTorch implementation of OpenAI's CLIP model for image classification, visual search, and visual question answering (VQA).
Language: Jupyter Notebook - Size: 15.7 MB - Last synced at: 8 months ago - Pushed at: 8 months ago - Stars: 0 - Forks: 0
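CLIP's zero-shot classification and visual search both reduce to scaled cosine similarity between image and text embeddings. A minimal NumPy sketch of that scoring step, using random vectors as stand-ins for encoder outputs (no model download; `logit_scale` and the 512-dim embeddings are assumptions for illustration):

```python
import numpy as np

def clip_scores(image_embs, text_embs, logit_scale=100.0):
    # L2-normalize both sets of embeddings, as CLIP does before the dot product
    image_embs = image_embs / np.linalg.norm(image_embs, axis=-1, keepdims=True)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=-1, keepdims=True)
    # Scaled cosine similarities: one row per image, one column per text label
    logits = logit_scale * image_embs @ text_embs.T
    # Softmax over labels yields zero-shot classification probabilities
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
img = rng.normal(size=(1, 512))    # stand-in for one encoded image
txt = rng.normal(size=(3, 512))    # stand-ins for prompts like "a photo of a cat"
probs = clip_scores(img, txt)
print(probs.shape)  # (1, 3); each row sums to 1
```

Visual search is the same computation transposed: score one text query against many image embeddings and rank by similarity.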

ecoxial2007/EffVideoQA
Efficient Video Question Answering
Language: Python - Size: 3.74 MB - Last synced at: 9 months ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0
