GitHub topics: large-multimodal-models
hiyamdebary/EarthDial
[CVPR 2025 🔥] EarthDial: Turning Multi-Sensory Earth Observations to Interactive Dialogues.
Language: Python - Size: 8.46 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 42 - Forks: 3

luisKING2008/Stream-Omni
Stream-Omni enables seamless interactions across text, vision, and speech using a large language model. This repository includes the model, datasets, and tools for developers to explore multimodal capabilities.
Language: Python - Size: 10.6 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0

mbzuai-oryx/GeoPixel
GeoPixel: A Pixel Grounding Large Multimodal Model for Remote Sensing is specifically developed for high-resolution remote sensing image analysis, offering advanced multi-target pixel grounding capabilities.
Language: Python - Size: 29.8 MB - Last synced at: 3 days ago - Pushed at: 24 days ago - Stars: 96 - Forks: 9

visresearch/LLaVA-STF
The official implementation of "Learning Compact Vision Tokens for Efficient Large Multimodal Models"
Language: Python - Size: 2.62 MB - Last synced at: 2 days ago - Pushed at: 11 days ago - Stars: 27 - Forks: 2

ictnlp/Stream-Omni
Stream-Omni is an end-to-end language-vision-speech chatbot that simultaneously supports interaction across various modality combinations.
Language: Python - Size: 10.6 MB - Last synced at: 5 days ago - Pushed at: 6 days ago - Stars: 1 - Forks: 0

friedrichor/Awesome-Multimodal-Papers
A curated list of awesome Multimodal studies.
Size: 63.4 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 207 - Forks: 18

NastyMarcus/PhyX
PhyX: Does Your Model Have the "Wits" for Physical Reasoning?
Language: Python - Size: 45.8 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 39 - Forks: 1

Wang-ML-Lab/interpretable-foundation-models
[ICML 2024] Probabilistic Conceptual Explainers (PACE): Trustworthy Conceptual Explanations for Vision Foundation Models
Language: Python - Size: 52.7 KB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 15 - Forks: 3

ritzz-ai/GUI-R1
Official implementation of GUI-R1: A Generalist R1-Style Vision-Language Action Model for GUI Agents
Language: Python - Size: 974 KB - Last synced at: 8 days ago - Pushed at: about 2 months ago - Stars: 112 - Forks: 11

D2I-Group/awesome-vision-time-series
This is an official repository for "Harnessing Vision Models for Time Series Analysis: A Survey".
Language: Python - Size: 3.71 MB - Last synced at: 9 days ago - Pushed at: 10 days ago - Stars: 26 - Forks: 1

TinyLLaVA/TinyLLaVA_Factory
A Framework of Small-scale Large Multimodal Models
Language: Python - Size: 6.01 MB - Last synced at: 9 days ago - Pushed at: about 2 months ago - Stars: 835 - Forks: 86

lingyzhu0101/Awesome_VCM
[Paper List'25] Paper List of Visual Data Coding for Machines, including Image/Video Coding for Machines, Feature Compression, Point Cloud Compression for Machines and Image/Video Coding for Machines with Large Multimodal Models
Size: 1.16 MB - Last synced at: 7 days ago - Pushed at: 2 months ago - Stars: 22 - Forks: 0

NVlabs/describe-anything
Implementation for Describe Anything: Detailed Localized Image and Video Captioning
Language: Python - Size: 66 MB - Last synced at: 10 days ago - Pushed at: about 2 months ago - Stars: 1,155 - Forks: 64

showlab/WorldGUI
Enable AI to control your PC. This repo includes the WorldGUI Benchmark and GUI-Thinker Agent Framework.
Language: Python - Size: 74.6 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 71 - Forks: 6

OBI-Future/OBI-Bench
[ICLR'25] The first benchmark aiming to evaluate whether LMMs can assist with oracle bone inscription processing tasks
Language: Python - Size: 27.6 MB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 3 - Forks: 0

thaoshibe/awesome-personalized-lmms
A curated list of Awesome Personalized Large Multimodal Models resources
Size: 1.61 MB - Last synced at: 12 days ago - Pushed at: 25 days ago - Stars: 25 - Forks: 0

Md-Emon-Hasan/Gen-AI-on-going
Generative AI (Gen AI) is a branch of artificial intelligence that creates new content such as text, images, audio, or code using models like GPT or Gemini. It powers applications like AI chatbots, image generation tools, and creative assistants across various industries.
Language: Jupyter Notebook - Size: 4.27 MB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 1 - Forks: 0

richard-peng-xia/awesome-multimodal-in-medical-imaging
A collection of resources on applications of multi-modal learning in medical imaging.
Size: 309 KB - Last synced at: 19 days ago - Pushed at: 19 days ago - Stars: 753 - Forks: 68

JinjieNi/MixEval
The official evaluation suite and dynamic data release for MixEval.
Language: Python - Size: 9.37 MB - Last synced at: 9 days ago - Pushed at: 7 months ago - Stars: 242 - Forks: 41

pichayakorn/carla-temporal-collage-prompting
[InCIT 2024] Temporal Collage Prompting: A Cost-Effective Simulator-Based Driving Accident Video Recognition With GPT-4o
Language: Python - Size: 108 MB - Last synced at: 23 days ago - Pushed at: 23 days ago - Stars: 0 - Forks: 0

tsunghan-wu/reverse_vlm
🔥 Official implementation of "Generate, but Verify: Reducing Visual Hallucination in Vision-Language Models with Retrospective Resampling"
Language: Python - Size: 417 KB - Last synced at: 23 days ago - Pushed at: 23 days ago - Stars: 28 - Forks: 3

RainBowLuoCS/OpenOmni
OpenOmni: Official implementation of Advancing Open-Source Omnimodal Large Language Models with Progressive Multimodal Alignment and Real-Time Self-Aware Emotional Speech Synthesis
Language: Python - Size: 8.45 MB - Last synced at: 24 days ago - Pushed at: 24 days ago - Stars: 51 - Forks: 5

HarryYancy/SolidGeo
SolidGeo: Measuring Multimodal Spatial Math Reasoning in Solid Geometry
Language: Python - Size: 57.1 MB - Last synced at: 25 days ago - Pushed at: 25 days ago - Stars: 2 - Forks: 0

CuriousDima/mirk
Mirk seamlessly integrates classical CV models with large visual models, enabling richly contextualized and detailed video analysis and understanding.
Language: Python - Size: 6.45 MB - Last synced at: 10 days ago - Pushed at: 6 months ago - Stars: 1 - Forks: 0

OpenAdaptAI/OpenAdapt
Open Source Generative Process Automation (i.e., Generative RPA). AI-first process automation with large language (LLMs), action (LAMs), multimodal (LMMs), and visual language (VLMs) models
Language: Python - Size: 28.9 MB - Last synced at: 27 days ago - Pushed at: 3 months ago - Stars: 1,277 - Forks: 181

Baizhige/EEGUnity
An open-source tool for processing large-scale EEG datasets
Language: Python - Size: 4.98 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 41 - Forks: 3

ictnlp/LLaVA-Mini
LLaVA-Mini is a unified large multimodal model (LMM) that can support the understanding of images, high-resolution images, and videos in an efficient manner.
Language: Python - Size: 54.6 MB - Last synced at: about 1 month ago - Pushed at: 5 months ago - Stars: 479 - Forks: 22

showlab/LOVA3
(NeurIPS 2024) Official PyTorch implementation of LOVA3
Language: Python - Size: 6.01 MB - Last synced at: about 1 month ago - Pushed at: 3 months ago - Stars: 85 - Forks: 2

MMMU-Benchmark/MMMU
This repo contains evaluation code for the paper "MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI"
Language: Python - Size: 186 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 431 - Forks: 35

yu-rp/apiprompting
[ECCV 2024] API: Attention Prompting on Image for Large Vision-Language Models
Language: Python - Size: 8.63 MB - Last synced at: 16 days ago - Pushed at: 9 months ago - Stars: 88 - Forks: 5

taco-group/GenAI4AD
A comprehensive and critical synthesis of the emerging role of GenAI across the autonomous driving stack.
Size: 4.65 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 13 - Forks: 1

zjysteven/lmms-finetune
A minimal codebase for finetuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, llava-onevision, llama-3.2-vision, qwen-vl, qwen2-vl, phi3-v etc.
Language: Python - Size: 13 MB - Last synced at: about 1 month ago - Pushed at: 4 months ago - Stars: 296 - Forks: 33
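
As rough context for entries like this one, LMM fine-tuning codebases typically start from a Hub checkpoint and freeze parts of the model. The snippet below is a minimal, hedged sketch using the Hugging Face transformers API, not lmms-finetune's own interface; the llava-hf/llava-1.5-7b-hf checkpoint is assumed purely as an example.

```python
# Hedged sketch (not lmms-finetune's API): load a LLaVA-1.5 checkpoint with
# Hugging Face transformers as a typical starting point for LMM fine-tuning.
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # example Hub checkpoint (assumed)
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16
)

# A common recipe: freeze the vision tower and train only the projector and LM.
for name, param in model.named_parameters():
    if "vision_tower" in name:
        param.requires_grad = False
```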

pritamqu/VCRBench
VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large Video Language Models
Language: Python - Size: 1.14 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

ShareGPT4Omni/ShareGPT4Video
[NeurIPS 2024] An official implementation of ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
Language: Python - Size: 7.73 MB - Last synced at: 30 days ago - Pushed at: 9 months ago - Stars: 1,057 - Forks: 41

pritamqu/RRPO
Self-alignment of Large Video Language Models with Refined Regularized Preference Optimization
Language: Python - Size: 4.35 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 2 - Forks: 0

WisconsinAIVision/YoChameleon
🦎 Yo'Chameleon: Your Personalized Chameleon (CVPR 2025)
Language: Python - Size: 10.6 MB - Last synced at: about 1 month ago - Pushed at: about 2 months ago - Stars: 128 - Forks: 1

LLaVA-VL/LLaVA-Plus-Codebase
LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills
Language: Python - Size: 19 MB - Last synced at: about 1 month ago - Pushed at: over 1 year ago - Stars: 740 - Forks: 58

ShareGPT4Omni/ShareGPT4V
[ECCV 2024] ShareGPT4V: Improving Large Multi-modal Models with Better Captions
Language: Python - Size: 644 KB - Last synced at: about 2 months ago - Pushed at: 12 months ago - Stars: 217 - Forks: 6

shikiw/OPERA
[CVPR 2024 Highlight] OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation
Language: Python - Size: 15.7 MB - Last synced at: 2 months ago - Pushed at: 10 months ago - Stars: 332 - Forks: 28

mbzuai-oryx/Camel-Bench
[NAACL 2025 🔥] CAMEL-Bench is an Arabic benchmark for evaluating multimodal models across eight domains with 29,000 questions.
Language: Python - Size: 14 MB - Last synced at: about 2 months ago - Pushed at: 2 months ago - Stars: 31 - Forks: 1

360CVGroup/Inner-Adaptor-Architecture
An LMM architecture that addresses catastrophic forgetting (AAAI 2025)
Language: Python - Size: 4.45 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 40 - Forks: 4

sshh12/multi_token
Embed arbitrary modalities (images, audio, documents, etc.) into large language models.
Language: Python - Size: 1.22 MB - Last synced at: about 2 months ago - Pushed at: about 1 year ago - Stars: 184 - Forks: 15
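
As a rough illustration of the underlying idea, and only a conceptual sketch rather than multi_token's actual API, non-text features are typically projected into the LLM's embedding space as a handful of "soft tokens"; the module name and dimensions below are assumptions for the example.

```python
# Conceptual sketch: project a non-text embedding into an LLM's hidden space
# so it can be consumed as a short sequence of pseudo-tokens.
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    def __init__(self, modality_dim: int, llm_hidden_dim: int, num_tokens: int = 4):
        super().__init__()
        self.num_tokens = num_tokens
        self.proj = nn.Linear(modality_dim, llm_hidden_dim * num_tokens)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, modality_dim) -> (batch, num_tokens, llm_hidden_dim)
        out = self.proj(feats)
        return out.view(feats.size(0), self.num_tokens, -1)

# Example: map a 512-d encoder embedding to four 4096-d pseudo-tokens.
projector = ModalityProjector(modality_dim=512, llm_hidden_dim=4096)
tokens = projector(torch.randn(2, 512))
print(tokens.shape)  # torch.Size([2, 4, 4096])
```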

shikiw/Modality-Integration-Rate
The official code of the paper "Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate".
Language: Python - Size: 17.7 MB - Last synced at: about 1 month ago - Pushed at: 7 months ago - Stars: 98 - Forks: 2

Haochen-Wang409/ross
[ICLR'25] Reconstructive Visual Instruction Tuning
Language: Python - Size: 12.8 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 75 - Forks: 3

SliMM-X/CoMP-MM
Official repository of "CoMP: Continual Multimodal Pre-training for Vision Foundation Models"
Language: Python - Size: 2.12 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 19 - Forks: 1

eric-ai-lab/ProbMed
"Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA"
Language: Python - Size: 263 KB - Last synced at: 3 months ago - Pushed at: 4 months ago - Stars: 16 - Forks: 1

VITA-MLLM/VITA
✨✨ VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
Language: Python - Size: 15.7 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 2,184 - Forks: 164

rohit901/VANE-Bench
[NAACL'25] Contains code and documentation for our VANE-Bench paper.
Language: Python - Size: 38.3 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 11 - Forks: 1

Ruiyang-061X/Uncertainty-o
✨ Official code for our paper: "Uncertainty-o: One Model-agnostic Framework for Unveiling Epistemic Uncertainty in Large Multimodal Models".
Language: Python - Size: 5.96 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 8 - Forks: 1

xk0720/FoundationForge
Exploratory journey of working with large foundation models
Language: Python - Size: 1000 Bytes - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 1 - Forks: 0

ParadoxZW/LLaVA-UHD-Better
A bug-free and improved implementation of LLaVA-UHD, based on the code from the official repo
Language: Python - Size: 102 KB - Last synced at: 3 months ago - Pushed at: 10 months ago - Stars: 33 - Forks: 3

xyz9911/FLAME
[AAAI-25 Oral] Official Implementation of "FLAME: Learning to Navigate with Multimodal LLM in Urban Environments"
Language: Python - Size: 8.57 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 30 - Forks: 3

inst-it/inst-it
Official repository of "Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning"
Language: Python - Size: 2.66 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 27 - Forks: 0

bowen-upenn/Agent_Rationality
[NAACL 2025] Towards Rationality in Language and Multimodal Agents: A Survey
Size: 7.97 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 25 - Forks: 0

JinjieNi/MixEval-X
The official GitHub repo for MixEval-X, the first any-to-any, real-world benchmark.
Language: Python - Size: 1.24 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 12 - Forks: 0

visual-haystacks/mirage
🔥 [ICLR 2025] Official PyTorch Model "Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark"
Language: Python - Size: 11.2 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 8 - Forks: 0

visual-haystacks/vhs_benchmark
🔥 [ICLR 2025] Official Benchmark Toolkits for "Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark"
Language: Python - Size: 5.22 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 24 - Forks: 1

MileBench/MileBench
This repo contains evaluation code for the paper "MileBench: Benchmarking MLLMs in Long Context"
Language: Python - Size: 3.52 MB - Last synced at: 4 months ago - Pushed at: 12 months ago - Stars: 29 - Forks: 1

YanqiDai/MMRole
(ICLR'25) A Comprehensive Framework for Developing and Evaluating Multimodal Role-Playing Agents
Language: Python - Size: 47.2 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 41 - Forks: 2

astra-vision/LatteCLIP
[WACV 2025] LatteCLIP: Unsupervised CLIP Fine-Tuning via LMM-Synthetic Texts
Language: Jupyter Notebook - Size: 1.86 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 6 - Forks: 0

2toinf/IVM
[NeurIPS-2024] The official implementation of "Instruction-Guided Visual Masking"
Language: Jupyter Notebook - Size: 70 MB - Last synced at: 7 months ago - Pushed at: 7 months ago - Stars: 29 - Forks: 2

shijian2001/VQAPromptBench
A Benchmark for VQA prompt sensitivity
Language: Python - Size: 267 KB - Last synced at: 8 months ago - Pushed at: 8 months ago - Stars: 7 - Forks: 1

ShareGPT4Omni/ShareGPT4Omni
ShareGPT4Omni: Towards Building Omni Large Multi-modal Models with Comprehensive Multi-modal Annotations
Size: 0 Bytes - Last synced at: 2 months ago - Pushed at: about 1 year ago - Stars: 8 - Forks: 0

AIFEG/BenchLMM
[ECCV 2024] BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models
Language: Python - Size: 103 MB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 82 - Forks: 6

h4nwei/2AFC-LMMs
Official implementation of 2AFC-LMMs
Language: Python - Size: 321 MB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 8 - Forks: 0

zchoi/Multi-Modal-Large-Language-Learning
A curated collection of multi-modal large language model papers and projects, along with popular training strategies, e.g., PEFT and LoRA.
Size: 55.7 KB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 16 - Forks: 0

VisualWebBench/VisualWebBench
Evaluation framework for paper "VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?"
Language: Python - Size: 3.17 MB - Last synced at: 12 months ago - Pushed at: about 1 year ago - Stars: 36 - Forks: 1

thunlp/LEGENT
Open Platform for Embodied Agents
Language: Python - Size: 1.72 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 128 - Forks: 6

jameszhou-gl/icl-distribution-shift
Code for "Adapting Large Multimodal Models to Distribution Shifts: The Role of In-Context Learning"
Size: 1.95 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 2 - Forks: 0

MMStar-Benchmark/MMStar
This repo contains evaluation code for the paper "Are We on the Right Way for Evaluating Large Vision-Language Models"
Language: Python - Size: 3.41 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 84 - Forks: 1
