GitHub topics: large-multimodal-models
hiyamdebary/EarthDial
[CVPR 2025 🔥] EarthDial: Turning Multi-Sensory Earth Observations to Interactive Dialogues.
Language: Python - Size: 8.46 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 42 - Forks: 3

luisKING2008/Stream-Omni
Stream-Omni enables seamless interactions across text, vision, and speech using a large language model. This repository includes the model, datasets, and tools for developers to explore multimodal capabilities.
Language: Python - Size: 10.6 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0

mbzuai-oryx/GeoPixel
GeoPixel: A Pixel Grounding Large Multimodal Model for Remote Sensing is specifically developed for high-resolution remote sensing image analysis, offering advanced multi-target pixel grounding capabilities.
Language: Python - Size: 29.8 MB - Last synced at: 3 days ago - Pushed at: 24 days ago - Stars: 96 - Forks: 9

visresearch/LLaVA-STF
The official implementation of "Learning Compact Vision Tokens for Efficient Large Multimodal Models"
Language: Python - Size: 2.62 MB - Last synced at: 2 days ago - Pushed at: 11 days ago - Stars: 27 - Forks: 2

ictnlp/Stream-Omni
Stream-Omni is an end-to-end language-vision-speech chatbot that simultaneously supports interaction across various modality combinations.
Language: Python - Size: 10.6 MB - Last synced at: 5 days ago - Pushed at: 6 days ago - Stars: 1 - Forks: 0

friedrichor/Awesome-Multimodal-Papers
A curated list of awesome Multimodal studies.
Size: 63.4 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 207 - Forks: 18

NastyMarcus/PhyX
PhyX: Does Your Model Have the "Wits" for Physical Reasoning?
Language: Python - Size: 45.8 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 39 - Forks: 1

Wang-ML-Lab/interpretable-foundation-models
[ICML 2024] Probabilistic Conceptual Explainers (PACE): Trustworthy Conceptual Explanations for Vision Foundation Models
Language: Python - Size: 52.7 KB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 15 - Forks: 3

ritzz-ai/GUI-R1
Official implementation of GUI-R1: A Generalist R1-Style Vision-Language Action Model for GUI Agents
Language: Python - Size: 974 KB - Last synced at: 8 days ago - Pushed at: about 2 months ago - Stars: 112 - Forks: 11

D2I-Group/awesome-vision-time-series
This is an official repository for "Harnessing Vision Models for Time Series Analysis: A Survey".
Language: Python - Size: 3.71 MB - Last synced at: 9 days ago - Pushed at: 10 days ago - Stars: 26 - Forks: 1

TinyLLaVA/TinyLLaVA_Factory
A Framework of Small-scale Large Multimodal Models
Language: Python - Size: 6.01 MB - Last synced at: 9 days ago - Pushed at: about 2 months ago - Stars: 835 - Forks: 86

lingyzhu0101/Awesome_VCM
[Paper List'25] Paper List of Visual Data Coding for Machines, including Image/Video Coding for Machines, Feature Compression, Point Cloud Compression for Machines and Image/Video Coding for Machines with Large Multimodal Models
Size: 1.16 MB - Last synced at: 7 days ago - Pushed at: 2 months ago - Stars: 22 - Forks: 0

NVlabs/describe-anything
Implementation for Describe Anything: Detailed Localized Image and Video Captioning
Language: Python - Size: 66 MB - Last synced at: 10 days ago - Pushed at: about 2 months ago - Stars: 1,155 - Forks: 64

showlab/WorldGUI
Enable AI to control your PC. This repo includes the WorldGUI Benchmark and GUI-Thinker Agent Framework.
Language: Python - Size: 74.6 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 71 - Forks: 6

OBI-Future/OBI-Bench
[ICLR'25] The first benchmark aiming to evaluate whether LMMs can assist with oracle bone inscription processing tasks
Language: Python - Size: 27.6 MB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 3 - Forks: 0

thaoshibe/awesome-personalized-lmms
A curated list of Awesome Personalized Large Multimodal Models resources
Size: 1.61 MB - Last synced at: 12 days ago - Pushed at: 25 days ago - Stars: 25 - Forks: 0

Md-Emon-Hasan/Gen-AI-on-going
Generative AI (Gen AI) is a branch of artificial intelligence that creates new content such as text, images, audio, or code using models like GPT or Gemini. It powers applications like AI chatbots, image generation tools, and creative assistants across various industries.
Language: Jupyter Notebook - Size: 4.27 MB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 1 - Forks: 0

richard-peng-xia/awesome-multimodal-in-medical-imaging
A collection of resources on applications of multi-modal learning in medical imaging.
Size: 309 KB - Last synced at: 19 days ago - Pushed at: 19 days ago - Stars: 753 - Forks: 68

JinjieNi/MixEval
The official evaluation suite and dynamic data release for MixEval.
Language: Python - Size: 9.37 MB - Last synced at: 9 days ago - Pushed at: 7 months ago - Stars: 242 - Forks: 41

pichayakorn/carla-temporal-collage-prompting
[InCIT 2024] Temporal Collage Prompting: A Cost-Effective Simulator-Based Driving Accident Video Recognition With GPT-4o
Language: Python - Size: 108 MB - Last synced at: 23 days ago - Pushed at: 23 days ago - Stars: 0 - Forks: 0

tsunghan-wu/reverse_vlm
🔥 Official implementation of "Generate, but Verify: Reducing Visual Hallucination in Vision-Language Models with Retrospective Resampling"
Language: Python - Size: 417 KB - Last synced at: 23 days ago - Pushed at: 23 days ago - Stars: 28 - Forks: 3

RainBowLuoCS/OpenOmni
OpenOmni: Official implementation of Advancing Open-Source Omnimodal Large Language Models with Progressive Multimodal Alignment and Real-Time Self-Aware Emotional Speech Synthesis
Language: Python - Size: 8.45 MB - Last synced at: 24 days ago - Pushed at: 24 days ago - Stars: 51 - Forks: 5

HarryYancy/SolidGeo
SolidGeo: Measuring Multimodal Spatial Math Reasoning in Solid Geometry
Language: Python - Size: 57.1 MB - Last synced at: 25 days ago - Pushed at: 25 days ago - Stars: 2 - Forks: 0

CuriousDima/mirk
Mirk seamlessly integrates classical CV models with large visual models, enabling richly contextualized and detailed video analysis and understanding.
Language: Python - Size: 6.45 MB - Last synced at: 10 days ago - Pushed at: 6 months ago - Stars: 1 - Forks: 0

OpenAdaptAI/OpenAdapt
Open Source Generative Process Automation (i.e., Generative RPA). AI-first process automation with large language (LLMs), action (LAMs), multimodal (LMMs), and visual language (VLMs) models
Language: Python - Size: 28.9 MB - Last synced at: 27 days ago - Pushed at: 3 months ago - Stars: 1,277 - Forks: 181

Baizhige/EEGUnity
An open-source tool for processing large-scale EEG datasets
Language: Python - Size: 4.98 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 41 - Forks: 3

ictnlp/LLaVA-Mini
LLaVA-Mini is a unified large multimodal model (LMM) that can support the understanding of images, high-resolution images, and videos in an efficient manner.
Language: Python - Size: 54.6 MB - Last synced at: about 1 month ago - Pushed at: 5 months ago - Stars: 479 - Forks: 22

showlab/LOVA3
(NeurIPS 2024) Official PyTorch implementation of LOVA3
Language: Python - Size: 6.01 MB - Last synced at: about 1 month ago - Pushed at: 3 months ago - Stars: 85 - Forks: 2

MMMU-Benchmark/MMMU
This repo contains evaluation code for the paper "MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI"
Language: Python - Size: 186 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 431 - Forks: 35

yu-rp/apiprompting
[ECCV 2024] API: Attention Prompting on Image for Large Vision-Language Models
Language: Python - Size: 8.63 MB - Last synced at: 16 days ago - Pushed at: 9 months ago - Stars: 88 - Forks: 5

taco-group/GenAI4AD
A comprehensive and critical synthesis of the emerging role of GenAI across the autonomous driving stack.
Size: 4.65 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 13 - Forks: 1

zjysteven/lmms-finetune
A minimal codebase for finetuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, llava-onevision, llama-3.2-vision, qwen-vl, qwen2-vl, phi3-v etc.
Language: Python - Size: 13 MB - Last synced at: about 1 month ago - Pushed at: 4 months ago - Stars: 296 - Forks: 33
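
As rough context for entries like this one, LMM fine-tuning codebases typically start from a Hub checkpoint and freeze parts of the model. The snippet below is a minimal, hedged sketch using the Hugging Face transformers API, not lmms-finetune's own interface; the llava-hf/llava-1.5-7b-hf checkpoint is assumed purely as an example.

```python
# Hedged sketch (not lmms-finetune's API): load a LLaVA-1.5 checkpoint with
# Hugging Face transformers as a typical starting point for LMM fine-tuning.
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # example Hub checkpoint (assumed)
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16
)

# A common recipe: freeze the vision tower and train only the projector and LM.
for name, param in model.named_parameters():
    if "vision_tower" in name:
        param.requires_grad = False
```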

pritamqu/VCRBench
VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large Video Language Models
Language: Python - Size: 1.14 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

ShareGPT4Omni/ShareGPT4Video
[NeurIPS 2024] An official implementation of ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
Language: Python - Size: 7.73 MB - Last synced at: 30 days ago - Pushed at: 9 months ago - Stars: 1,057 - Forks: 41

pritamqu/RRPO
Self-alignment of Large Video Language Models with Refined Regularized Preference Optimization
Language: Python - Size: 4.35 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 2 - Forks: 0

WisconsinAIVision/YoChameleon
🦎 Yo'Chameleon: Your Personalized Chameleon (CVPR 2025)
Language: Python - Size: 10.6 MB - Last synced at: about 1 month ago - Pushed at: about 2 months ago - Stars: 128 - Forks: 1

LLaVA-VL/LLaVA-Plus-Codebase
LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills
Language: Python - Size: 19 MB - Last synced at: about 1 month ago - Pushed at: over 1 year ago - Stars: 740 - Forks: 58

ShareGPT4Omni/ShareGPT4V
[ECCV 2024] ShareGPT4V: Improving Large Multi-modal Models with Better Captions
Language: Python - Size: 644 KB - Last synced at: about 2 months ago - Pushed at: 12 months ago - Stars: 217 - Forks: 6

shikiw/OPERA
[CVPR 2024 Highlight] OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation
Language: Python - Size: 15.7 MB - Last synced at: 2 months ago - Pushed at: 10 months ago - Stars: 332 - Forks: 28

mbzuai-oryx/Camel-Bench
[NAACL 2025 🔥] CAMEL-Bench is an Arabic benchmark for evaluating multimodal models across eight domains with 29,000 questions.
Language: Python - Size: 14 MB - Last synced at: about 2 months ago - Pushed at: 2 months ago - Stars: 31 - Forks: 1

360CVGroup/Inner-Adaptor-Architecture
An LMM architecture that addresses catastrophic forgetting (AAAI 2025)
Language: Python - Size: 4.45 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 40 - Forks: 4

sshh12/multi_token
Embed arbitrary modalities (images, audio, documents, etc.) into large language models.
Language: Python - Size: 1.22 MB - Last synced at: about 2 months ago - Pushed at: about 1 year ago - Stars: 184 - Forks: 15
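
As a rough illustration of the underlying idea, and only a conceptual sketch rather than multi_token's actual API, non-text features are typically projected into the LLM's embedding space as a handful of "soft tokens"; the module name and dimensions below are assumptions for the example.

```python
# Conceptual sketch: project a non-text embedding into an LLM's hidden space
# so it can be consumed as a short sequence of pseudo-tokens.
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    def __init__(self, modality_dim: int, llm_hidden_dim: int, num_tokens: int = 4):
        super().__init__()
        self.num_tokens = num_tokens
        self.proj = nn.Linear(modality_dim, llm_hidden_dim * num_tokens)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, modality_dim) -> (batch, num_tokens, llm_hidden_dim)
        out = self.proj(feats)
        return out.view(feats.size(0), self.num_tokens, -1)

# Example: map a 512-d encoder embedding to four 4096-d pseudo-tokens.
projector = ModalityProjector(modality_dim=512, llm_hidden_dim=4096)
tokens = projector(torch.randn(2, 512))
print(tokens.shape)  # torch.Size([2, 4, 4096])
```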

shikiw/Modality-Integration-Rate
The official code of the paper "Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate".
Language: Python - Size: 17.7 MB - Last synced at: about 1 month ago - Pushed at: 7 months ago - Stars: 98 - Forks: 2

Haochen-Wang409/ross
[ICLR'25] Reconstructive Visual Instruction Tuning
Language: Python - Size: 12.8 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 75 - Forks: 3

SliMM-X/CoMP-MM
Official repository of "CoMP: Continual Multimodal Pre-training for Vision Foundation Models"
Language: Python - Size: 2.12 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 19 - Forks: 1

eric-ai-lab/ProbMed
"Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA"
Language: Python - Size: 263 KB - Last synced at: 3 months ago - Pushed at: 4 months ago - Stars: 16 - Forks: 1

VITA-MLLM/VITA
✨✨ VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
Language: Python - Size: 15.7 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 2,184 - Forks: 164

rohit901/VANE-Bench
[NAACL'25] Contains code and documentation for our VANE-Bench paper.
Language: Python - Size: 38.3 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 11 - Forks: 1

Ruiyang-061X/Uncertainty-o
✨ Official code for our paper: "Uncertainty-o: One Model-agnostic Framework for Unveiling Epistemic Uncertainty in Large Multimodal Models".
Language: Python - Size: 5.96 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 8 - Forks: 1

xk0720/FoundationForge
Exploratory journey of working with large foundation models
Language: Python - Size: 1000 Bytes - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 1 - Forks: 0

ParadoxZW/LLaVA-UHD-Better
A bug-free and improved implementation of LLaVA-UHD, based on the code from the official repo
Language: Python - Size: 102 KB - Last synced at: 3 months ago - Pushed at: 10 months ago - Stars: 33 - Forks: 3

xyz9911/FLAME
[AAAI-25 Oral] Official Implementation of "FLAME: Learning to Navigate with Multimodal LLM in Urban Environments"
Language: Python - Size: 8.57 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 30 - Forks: 3

inst-it/inst-it
Official repository of "Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning"
Language: Python - Size: 2.66 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 27 - Forks: 0

bowen-upenn/Agent_Rationality
[NAACL 2025] Towards Rationality in Language and Multimodal Agents: A Survey
Size: 7.97 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 25 - Forks: 0

JinjieNi/MixEval-X
The official GitHub repo for MixEval-X, the first any-to-any, real-world benchmark.
Language: Python - Size: 1.24 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 12 - Forks: 0

visual-haystacks/mirage
🔥 [ICLR 2025] Official PyTorch Model "Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark"
Language: Python - Size: 11.2 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 8 - Forks: 0

visual-haystacks/vhs_benchmark
🔥 [ICLR 2025] Official Benchmark Toolkits for "Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark"
Language: Python - Size: 5.22 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 24 - Forks: 1

MileBench/MileBench
This repo contains evaluation code for the paper "MileBench: Benchmarking MLLMs in Long Context"
Language: Python - Size: 3.52 MB - Last synced at: 4 months ago - Pushed at: 12 months ago - Stars: 29 - Forks: 1

YanqiDai/MMRole
(ICLR'25) A Comprehensive Framework for Developing and Evaluating Multimodal Role-Playing Agents
Language: Python - Size: 47.2 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 41 - Forks: 2

astra-vision/LatteCLIP
[WACV 2025] LatteCLIP: Unsupervised CLIP Fine-Tuning via LMM-Synthetic Texts
Language: Jupyter Notebook - Size: 1.86 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 6 - Forks: 0

2toinf/IVM
[NeurIPS-2024] The official implementation of "Instruction-Guided Visual Masking"
Language: Jupyter Notebook - Size: 70 MB - Last synced at: 7 months ago - Pushed at: 7 months ago - Stars: 29 - Forks: 2

shijian2001/VQAPromptBench
A Benchmark for VQA prompt sensitivity
Language: Python - Size: 267 KB - Last synced at: 8 months ago - Pushed at: 8 months ago - Stars: 7 - Forks: 1

ShareGPT4Omni/ShareGPT4Omni
ShareGPT4Omni: Towards Building Omni Large Multi-modal Models with Comprehensive Multi-modal Annotations
Size: 0 Bytes - Last synced at: 2 months ago - Pushed at: about 1 year ago - Stars: 8 - Forks: 0

AIFEG/BenchLMM
[ECCV 2024] BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models
Language: Python - Size: 103 MB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 82 - Forks: 6

h4nwei/2AFC-LMMs
Official implementation of 2AFC-LMMs
Language: Python - Size: 321 MB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 8 - Forks: 0

zchoi/Multi-Modal-Large-Language-Learning
A curated collection of multi-modal large language model papers and projects, along with popular training strategies, e.g., PEFT and LoRA.
Size: 55.7 KB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 16 - Forks: 0

VisualWebBench/VisualWebBench
Evaluation framework for paper "VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?"
Language: Python - Size: 3.17 MB - Last synced at: 12 months ago - Pushed at: about 1 year ago - Stars: 36 - Forks: 1

thunlp/LEGENT
Open Platform for Embodied Agents
Language: Python - Size: 1.72 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 128 - Forks: 6

jameszhou-gl/icl-distribution-shift
Code for "Adapting Large Multimodal Models to Distribution Shifts: The Role of In-Context Learning"
Size: 1.95 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 2 - Forks: 0

MMStar-Benchmark/MMStar
This repo contains evaluation code for the paper "Are We on the Right Way for Evaluating Large Vision-Language Models"
Language: Python - Size: 3.41 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 84 - Forks: 1
