GitHub topics: multimodal-large-language-models
NishilBalar/Awesome-LVLM-Hallucination
Up-to-date curated list of state-of-the-art research, papers, and resources on hallucination in large vision-language models
Size: 189 KB - Last synced at: about 2 hours ago - Pushed at: about 3 hours ago - Stars: 125 - Forks: 6

lll6gg/UI-R1
Code for "UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning"
Language: Python - Size: 1.08 MB - Last synced at: about 14 hours ago - Pushed at: about 14 hours ago - Stars: 91 - Forks: 6

yaotingwangofficial/Awesome-MCoT
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
Size: 4.59 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 553 - Forks: 13

Glasgow-AI4BioMed/RRG-BioNLP-ACL2024 (fork of X-iZhang/RRG-BioNLP-ACL2024)
Code for the paper "Gla-AI4BioMed at RRG24: Visual Instruction-tuned Adaptation for Radiology Report Generation" (BioNLP ACL'24)
Language: Python - Size: 591 KB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 0 - Forks: 0

YingqingHe/Awesome-LLMs-meet-Multimodal-Generation
🔥🔥🔥 A curated list of papers on LLMs-based multimodal generation (image, video, 3D and audio).
Language: HTML - Size: 12.7 MB - Last synced at: about 14 hours ago - Pushed at: about 1 month ago - Stars: 472 - Forks: 26

X-iZhang/RRG-BioNLP-ACL2024
🔬 Med-CXRGen, developed by Glasgow AI4BioMed Lab, brings vision-language adaptation to biomedical radiology via visual instruction tuning. (BioNLP ACL'24)
Language: Python - Size: 591 KB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 1 - Forks: 1

OpenGVLab/PIIP
[NeurIPS 2024 Spotlight ⭐️] Parameter-Inverted Image Pyramid Networks (PIIP)
Language: Python - Size: 11.7 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 89 - Forks: 2

RainBowLuoCS/OpenOmni
OpenOmni: Official implementation of Advancing Open-Source Omnimodal Large Language Models with Progressive Multimodal Alignment and Real-Time Self-Aware Emotional Speech Synthesis
Language: Python - Size: 8.4 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 47 - Forks: 5

modelscope/modelscope-agent
ModelScope-Agent: An agent framework connecting models in ModelScope with the world
Language: Python - Size: 67.6 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 3,110 - Forks: 348

JoeJoe1313/PaliGemma-Image-Segmentation
An app with FastAPI, Docker, transformers, JAX/Flax for performing image segmentation with PaliGemma 2 mix
Language: Python - Size: 7.46 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0

MME-Benchmarks/Video-MME
✨✨[CVPR 2025] Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Size: 16.7 MB - Last synced at: 3 days ago - Pushed at: 24 days ago - Stars: 542 - Forks: 20

VITA-MLLM/Woodpecker
✨✨Woodpecker: Hallucination Correction for Multimodal Large Language Models
Language: Python - Size: 21.2 MB - Last synced at: 1 day ago - Pushed at: 5 months ago - Stars: 635 - Forks: 30

AIDC-AI/Awesome-Unified-Multimodal-Models
Awesome Unified Multimodal Models
Size: 3.92 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 3 - Forks: 0

X-iZhang/Libra
Space for MLLM on Biomedical Radiology Analysis
Language: Python - Size: 13.4 MB - Last synced at: 3 days ago - Pushed at: 4 days ago - Stars: 6 - Forks: 1

LLaVA-VL/LLaVA-Plus-Codebase
LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills
Language: Python - Size: 19 MB - Last synced at: 1 day ago - Pushed at: over 1 year ago - Stars: 740 - Forks: 58

LoupXpro/AlphaExtract
AlphaExtract is a sophisticated PDF summarization tool that combines cutting-edge AI technology with efficient document processing. The project is built using Python and leverages Meta's LLaMA 4 MOE Maverick model along with Groq's inference engine to provide fast and accurate PDF summaries.
Language: Python - Size: 12.6 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 0 - Forks: 0

JinXins/Awesome-Token-Merge-for-MLLMs
A paper list about Token Merge, Reduce, Resample, Drop for MLLMs.
Size: 103 KB - Last synced at: 3 days ago - Pushed at: 4 months ago - Stars: 54 - Forks: 0

hustvl/EVF-SAM
Official code of "EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model"
Language: Python - Size: 5.94 MB - Last synced at: 3 days ago - Pushed at: about 2 months ago - Stars: 403 - Forks: 18

xjywhu/Awesome-Multimodal-LLM-for-Code
Multimodal Large Language Models for Code Generation under Multimodal Scenarios
Size: 129 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 66 - Forks: 2

scofield7419/Video-of-Thought
Video Chain of Thought: code for the ICML 2024 paper "Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition"
Language: Python - Size: 1.72 MB - Last synced at: 3 days ago - Pushed at: 2 months ago - Stars: 137 - Forks: 7

akusayudodograu/Agentic-RAG-Story-Generation-with-Multimodal-GenAI
Multimodal Agentic GenAI Workflow – Seamlessly blends retrieval and generation for intelligent storytelling
Size: 1000 Bytes - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 8 - Forks: 1

X-LANCE/SLAM-LLM
Speech, Language, Audio, Music Processing with Large Language Model
Language: Python - Size: 169 MB - Last synced at: 5 days ago - Pushed at: 16 days ago - Stars: 794 - Forks: 76

SkyworkAI/Vitron
NeurIPS 2024 Paper: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
Language: Python - Size: 667 MB - Last synced at: 3 days ago - Pushed at: 7 months ago - Stars: 531 - Forks: 33

ritzz-ai/GUI-R1
Official implementation of GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents
Language: Python - Size: 974 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 70 - Forks: 5

multimodal-ai-lab/DEFAME
Fact-checking system for textual and visual inputs.
Language: Python - Size: 29.6 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 19 - Forks: 2

BradyFU/Awesome-Multimodal-Large-Language-Models
✨✨ Latest Advances on Multimodal Large Language Models
Size: 82.9 MB - Last synced at: 6 days ago - Pushed at: 17 days ago - Stars: 14,921 - Forks: 957

zjunlp/OceanGPT
[ACL 2024] OceanGPT: A Large Language Model for Ocean Science Tasks
Language: Python - Size: 36.2 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 47 - Forks: 5

The-Martyr/CausalMM
[ICLR 2025] Mitigating Modality Prior-Induced Hallucinations in Multimodal Large Language Models via Deciphering Attention Causality
Language: Python - Size: 7.1 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 25 - Forks: 2

richard-peng-xia/awesome-multimodal-in-medical-imaging
A collection of resources on applications of multi-modal learning in medical imaging.
Size: 234 KB - Last synced at: 6 days ago - Pushed at: about 1 month ago - Stars: 730 - Forks: 65

Adm-2005/Test-It
Test-it! is an AI testing tool designed to generate comprehensive testing instructions and code for digital products based on snapshots and code snippets.
Language: JavaScript - Size: 1.18 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 2 - Forks: 1

thisisiron/LLaVA-Pool
🌋 A flexible framework for training and configuring Vision-Language Models
Language: Python - Size: 3.09 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 1 - Forks: 0

willxxy/ECG-Bench
A Unified Framework for Benchmarking Generative Electrocardiogram-Language Models (ELMs)
Language: Python - Size: 6.67 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 9 - Forks: 2

piomin/spring-ai-showcase
Sample Spring AI Application with several use cases
Language: Java - Size: 3.94 MB - Last synced at: 7 days ago - Pushed at: 10 days ago - Stars: 31 - Forks: 15

AIDC-AI/Parrot
🎉 The code repository for "Parrot: Multilingual Visual Instruction Tuning" in PyTorch.
Language: Python - Size: 25.2 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 36 - Forks: 1

GerrySant/multimodalhugs
MultimodalHugs is an extension of Hugging Face that offers a generalized framework for training, evaluating, and using multimodal AI models with minimal code differences, ensuring seamless compatibility with Hugging Face pipelines.
Language: Python - Size: 4.24 MB - Last synced at: 9 days ago - Pushed at: 10 days ago - Stars: 3 - Forks: 2

jingyi0000/R1-VL
R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization
Language: Python - Size: 2.3 MB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 276 - Forks: 0

Wang-ML-Lab/multimodal-needle-in-a-haystack
[NAACL 2025 Oral] Multimodal Needle in a Haystack (MMNeedle): Benchmarking Long-Context Capability of Multimodal Large Language Models
Language: Python - Size: 16.6 MB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 42 - Forks: 3

The-Martyr/Awesome-Modality-Priors-in-MLLMs
Latest Advances on Modality Priors in Multimodal Large Language Models
Size: 76.2 KB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 15 - Forks: 1

X-PLUG/MobileAgent
Mobile-Agent: The Powerful Mobile Device Operation Assistant Family
Language: Python - Size: 383 MB - Last synced at: 11 days ago - Pushed at: about 1 month ago - Stars: 4,149 - Forks: 412

IrohXu/Awesome-Multimodal-LLM-Autonomous-Driving
[WACV 2024 Survey Paper] Multimodal Large Language Models for Autonomous Driving
Size: 15 MB - Last synced at: 1 day ago - Pushed at: about 1 year ago - Stars: 286 - Forks: 11

AIDC-AI/Ovis
A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings.
Language: Python - Size: 5.56 MB - Last synced at: 11 days ago - Pushed at: about 2 months ago - Stars: 899 - Forks: 56

Paranioar/Awesome_Matching_Pretraining_Transfering
A paper list covering large multi-modality models (perception, generation, unification), parameter-efficient finetuning, vision-language pretraining, and conventional image-text matching, for preliminary insight.
Size: 369 KB - Last synced at: 4 days ago - Pushed at: 5 months ago - Stars: 423 - Forks: 48

apple/ml-slowfast-llava
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
Language: Python - Size: 375 KB - Last synced at: 7 days ago - Pushed at: 8 months ago - Stars: 217 - Forks: 13

rese1f/MovieChat
[CVPR 2024] MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
Language: Python - Size: 78.7 MB - Last synced at: 11 days ago - Pushed at: 3 months ago - Stars: 612 - Forks: 41

friedrichor/Awesome-Multimodal-Papers
A curated list of awesome Multimodal studies.
Language: HTML - Size: 63.3 MB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 186 - Forks: 18

burglarhobbit/Awesome-Medical-Large-Language-Models
Curated papers on Large Language Models in Healthcare and Medical domain
Size: 42 KB - Last synced at: 12 days ago - Pushed at: 4 months ago - Stars: 302 - Forks: 35

deepglint/unicom
Large-Scale Visual Representation Model
Language: Python - Size: 22.8 MB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 661 - Forks: 29

zjunlp/Deco
[ICLR 2025] MLLM can see? Dynamic Correction Decoding for Hallucination Mitigation
Language: Python - Size: 17.6 MB - Last synced at: 13 days ago - Pushed at: 5 months ago - Stars: 64 - Forks: 5

declare-lab/Auto-Scaling
[arXiv 2024] Official implementation of the paper "Towards Robust Instruction Tuning on Multimodal Large Language Models"
Language: Jupyter Notebook - Size: 67.5 MB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 8 - Forks: 1

gautierdag/plancraft
Plancraft is a Minecraft environment and agent suite for testing the planning capabilities of LLMs
Language: Python - Size: 124 MB - Last synced at: 12 days ago - Pushed at: 15 days ago - Stars: 10 - Forks: 0

UKPLab/arxiv2025-misleading-visualizations
Code and datasets accompanying the arXiv preprint: "Protecting multimodal large language models against misleading visualizations"
Language: JavaScript - Size: 22.7 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 2 - Forks: 0

RauhanAhmed/AlphaExtract
AlphaExtract is a sophisticated PDF summarization tool that combines cutting-edge AI technology with efficient document processing. The project is built using Python and leverages Meta's LLaMA 4 MOE Maverick model along with Groq's inference engine to provide fast and accurate PDF summaries.
Language: Python - Size: 5.58 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 0 - Forks: 0

THUDM/VisualAgentBench
Towards Large Multimodal Models as Visual Foundation Agents
Language: Python - Size: 5.62 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 207 - Forks: 6

CristianoPatricio/CBVLM
Code for the paper "CBVLM: Training-free Explainable Concept-based Large Vision Language Models for Medical Image Classification".
Language: Python - Size: 903 KB - Last synced at: 17 days ago - Pushed at: 17 days ago - Stars: 7 - Forks: 1

AMD-AIG-AIMA/gpt-fast
The GPT-Fast for Multimodal Models on AMD GPUs
Language: Python - Size: 6.03 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 2 - Forks: 0

BAAI-DCAI/Bunny
A family of lightweight multimodal models.
Language: Python - Size: 28.5 MB - Last synced at: 9 days ago - Pushed at: 6 months ago - Stars: 1,015 - Forks: 74

SuperBruceJia/Awesome-Large-Vision-Language-Model
Awesome Large Vision-Language Model: A Curated List of Large Vision-Language Models
Size: 103 KB - Last synced at: 9 days ago - Pushed at: 8 months ago - Stars: 28 - Forks: 3

mediacontentatlas/mediacontentatlas
Code for Media Content Atlas
Language: Python - Size: 1.45 MB - Last synced at: 18 days ago - Pushed at: 18 days ago - Stars: 1 - Forks: 0

shaopengw/Awesome-Music-Generation
Awesome music generation model: MG²
Language: Python - Size: 3.16 MB - Last synced at: 3 days ago - Pushed at: about 1 month ago - Stars: 154 - Forks: 10

rese1f/aurora
[ICLR 2025] AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
Language: Python - Size: 25.3 MB - Last synced at: 19 days ago - Pushed at: 19 days ago - Stars: 94 - Forks: 4

puar-playground/Self-Visual-RAG
Implementation of MLLM-based Self-Vision-RAG models
Language: Python - Size: 1.56 MB - Last synced at: 19 days ago - Pushed at: 19 days ago - Stars: 2 - Forks: 1

taco-group/Re-Align
A novel alignment framework that leverages image retrieval to mitigate hallucinations in Vision Language Models.
Language: Python - Size: 18.6 MB - Last synced at: 19 days ago - Pushed at: 19 days ago - Stars: 40 - Forks: 1

CoolGuy2982/Eco
A multimodal AI app that gives you eco-friendly insights from just a picture. It can understand what you want to know just by looking at the picture, offers recycling advice, locations, and alternative products, helps subvert greenwashing, and much more.
Language: HTML - Size: 34.4 MB - Last synced at: 19 days ago - Pushed at: 19 days ago - Stars: 1 - Forks: 0

ai4colonoscopy/IntelliScope
Frontiers in Intelligent Colonoscopy [ColonSurvey | ColonINST | ColonGPT]
Language: Python - Size: 30.9 MB - Last synced at: 19 days ago - Pushed at: 20 days ago - Stars: 64 - Forks: 5

VachanVY/Transfusion.torch
PyTorch Implementation of Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
Language: Python - Size: 2.07 MB - Last synced at: 12 days ago - Pushed at: 7 months ago - Stars: 21 - Forks: 5

MSR3D/MSR3D
[NeurIPS 2024] Official code repository for MSR3D paper
Language: Python - Size: 75.7 MB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 50 - Forks: 2

UKPLab/5pils
Code associated with the EMNLP 2024 Main paper: "Image, tell me your story!" Predicting the original meta-context of visual misinformation.
Language: Python - Size: 3.38 MB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 38 - Forks: 4

ChocoWu/SeTok
Code for the paper "Towards Semantic Equivalence of Tokenization in Multimodal LLM"
Language: Python - Size: 2.1 MB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 54 - Forks: 0

HenryHZY/Awesome-Multimodal-LLM
Research Trends in LLM-guided Multimodal Learning.
Size: 17.6 KB - Last synced at: 8 days ago - Pushed at: over 1 year ago - Stars: 358 - Forks: 16

ictnlp/LLaMA-Omni
LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve speech capabilities at the GPT-4o level.
Language: Python - Size: 3.27 MB - Last synced at: 21 days ago - Pushed at: 22 days ago - Stars: 2,891 - Forks: 195

YangLing0818/RPG-DiffusionMaster
[ICML 2024] Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs (RPG)
Language: Jupyter Notebook - Size: 64.2 MB - Last synced at: 22 days ago - Pushed at: 3 months ago - Stars: 1,793 - Forks: 102

NVIDIA/audio-flamingo
PyTorch implementation of Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities.
Language: Python - Size: 4.84 MB - Last synced at: 22 days ago - Pushed at: 22 days ago - Stars: 456 - Forks: 27

315386775/OpenPathoFoundation
A collection and overview of resources related to large AI models for pathology
Size: 229 KB - Last synced at: 22 days ago - Pushed at: 23 days ago - Stars: 3 - Forks: 0

Video-Bench/Video-Bench
Video Generation Benchmark
Language: Python - Size: 8.95 MB - Last synced at: 15 days ago - Pushed at: 24 days ago - Stars: 16 - Forks: 2

UKPLab/naacl2025-cove
Code associated with the NAACL 2025 paper "COVE: COntext and VEracity prediction for out-of-context images"
Language: Python - Size: 2.51 MB - Last synced at: 22 days ago - Pushed at: 23 days ago - Stars: 1 - Forks: 0

GLUS-video/GLUS
[CVPR 2025] Official PyTorch Implementation of GLUS: Global-Local Reasoning Unified into A Single Large Language Model for Video Segmentation
Language: Jupyter Notebook - Size: 66.4 MB - Last synced at: 22 days ago - Pushed at: 23 days ago - Stars: 31 - Forks: 2

RaptorMai/MLLM-CompBench
[NeurIPS'24] MLLM-CompBench evaluates the comparative reasoning of MLLMs with 40K image pairs and questions across 8 dimensions of relative comparison: visual attribute, existence, state, emotion, temporality, spatiality, quantity, and quality. CompBench covers diverse visual domains, including animals, fashion, sports, and scenes.
Language: Jupyter Notebook - Size: 10.7 MB - Last synced at: 23 days ago - Pushed at: 23 days ago - Stars: 38 - Forks: 2

mbzuai-oryx/ALM-Bench
[CVPR 2025 🔥] ALM-Bench is a multilingual, multi-modal, culturally diverse benchmark for 100 languages across 19 categories. It assesses the next generation of LMMs on cultural inclusivity.
Language: Python - Size: 26.6 MB - Last synced at: 23 days ago - Pushed at: 24 days ago - Stars: 36 - Forks: 2

khoi03/Multimodal-ChatBot
A chatbot that can process and analyze various forms of media, including text, images, videos, and other data types.
Language: Python - Size: 2.94 MB - Last synced at: 23 days ago - Pushed at: 24 days ago - Stars: 0 - Forks: 0

joanrod/star-vector
StarVector is a foundation model for SVG generation that transforms vectorization into a code generation task. Using a vision-language modeling architecture, StarVector processes both visual and textual inputs to produce high-quality SVG code with remarkable precision.
Language: Python - Size: 6.3 MB - Last synced at: 25 days ago - Pushed at: 25 days ago - Stars: 3,570 - Forks: 186

Hoar012/TDC-Video
Size: 3.05 MB - Last synced at: 26 days ago - Pushed at: 26 days ago - Stars: 0 - Forks: 0

PanguIR/MRAGSurvey
A Survey of Multimodal Retrieval-Augmented Generation
Size: 4.92 MB - Last synced at: 26 days ago - Pushed at: 26 days ago - Stars: 11 - Forks: 1

leoli51/youtube-conspiracy-detection
Code for the paper "Evaluating AI capabilities in detecting conspiracy theories on YouTube".
Language: Jupyter Notebook - Size: 12.9 MB - Last synced at: 26 days ago - Pushed at: 26 days ago - Stars: 0 - Forks: 0

Hyeongkeun/LAVCap
Official PyTorch implementation of "LAVCap: LLM-based Audio-Visual Captioning using Optimal Transport" (ICASSP 2025)
Language: Python - Size: 3.58 MB - Last synced at: 27 days ago - Pushed at: 27 days ago - Stars: 3 - Forks: 0

Victorwz/MLM_Filter
Official implementation of our paper "Finetuned Multimodal Language Models are High-Quality Image-Text Data Filters".
Language: Python - Size: 30.7 MB - Last synced at: 27 days ago - Pushed at: 27 days ago - Stars: 51 - Forks: 1

gaotiexinqu/V2P-Bench
V2P-Bench: Evaluating Video-Language Understanding with Visual Prompts for Better Human-Model Interaction
Size: 16.9 MB - Last synced at: 28 days ago - Pushed at: 28 days ago - Stars: 5 - Forks: 0

OpenGVLab/MM-NIAH
[NeurIPS 2024] Needle In A Multimodal Haystack (MM-NIAH): A comprehensive benchmark designed to systematically evaluate the capability of existing MLLMs to comprehend long multimodal documents.
Language: Python - Size: 2.83 MB - Last synced at: 23 days ago - Pushed at: 6 months ago - Stars: 115 - Forks: 6

NotYuSheng/Multimodal-Large-Language-Model
Localized Multimodal Large Language Model (MLLM) integrated with Streamlit and Ollama for text and image processing tasks.
Language: Python - Size: 7.37 MB - Last synced at: 20 days ago - Pushed at: 28 days ago - Stars: 4 - Forks: 2

baaivision/EVE
EVE Series: Encoder-Free Vision-Language Models from BAAI
Language: Python - Size: 6.95 MB - Last synced at: 28 days ago - Pushed at: 2 months ago - Stars: 320 - Forks: 8

zjysteven/lmms-finetune
A minimal codebase for finetuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, llava-onevision, llama-3.2-vision, qwen-vl, qwen2-vl, phi3-v etc.
Language: Python - Size: 13 MB - Last synced at: 29 days ago - Pushed at: 2 months ago - Stars: 284 - Forks: 29

genieincodebottle/genaicodelab
Comprehensive resources on Generative AI, including a detailed Codebase and tutorials
Language: Jupyter Notebook - Size: 47.6 MB - Last synced at: 30 days ago - Pushed at: 30 days ago - Stars: 41 - Forks: 4

X-PLUG/mPLUG-DocOwl
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
Language: Python - Size: 105 MB - Last synced at: 29 days ago - Pushed at: 5 months ago - Stars: 2,151 - Forks: 128

PE51K/spbu-diploma
MLLM application to Chinese speech practice as my SPBU diploma project
Language: Jupyter Notebook - Size: 66.4 KB - Last synced at: 22 days ago - Pushed at: 22 days ago - Stars: 0 - Forks: 0

weihaox/UMBRAE
[ECCV 2024] UMBRAE: Unified Multimodal Brain Decoding | Unveiling the 'Dark Side' of Brain Modality
Language: Jupyter Notebook - Size: 34.6 MB - Last synced at: 15 days ago - Pushed at: 8 months ago - Stars: 46 - Forks: 3

cambrian-mllm/cambrian
Cambrian-1 is a family of multimodal LLMs with a vision-centric design.
Language: Python - Size: 1.99 MB - Last synced at: 28 days ago - Pushed at: 6 months ago - Stars: 1,885 - Forks: 129

Haochen-Wang409/ross
[ICLR'25] Reconstructive Visual Instruction Tuning
Language: Python - Size: 12.8 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 75 - Forks: 3

ictnlp/LLaVA-Mini
LLaVA-Mini is a unified large multimodal model (LMM) that can support the understanding of images, high-resolution images, and videos in an efficient manner.
Language: Python - Size: 54.6 MB - Last synced at: about 1 month ago - Pushed at: 4 months ago - Stars: 441 - Forks: 19

FoundationVision/Liquid
Liquid: Language Models are Scalable and Unified Multi-modal Generators
Language: Python - Size: 31.6 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 353 - Forks: 24

OmniMMI/OmniMMI
[CVPR 2025] OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts
Language: Python - Size: 25.3 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 8 - Forks: 0

xmed-lab/MedRegA
[ICLR 2025] MedRegA: Interpretable Bilingual Multimodal Large Language Model for Diverse Biomedical Tasks
Language: Python - Size: 5.61 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 25 - Forks: 1
