An open API service providing repository metadata for many open source software ecosystems.

Topic: "multimodal-large-language-models"

BradyFU/Awesome-Multimodal-Large-Language-Models

✨✨ Latest Advances on Multimodal Large Language Models

Size: 82.9 MB - Last synced at: 4 days ago - Pushed at: 10 days ago - Stars: 15,444 - Forks: 1,002

X-PLUG/MobileAgent

Mobile-Agent: The Powerful Mobile Device Operation Assistant Family

Language: Python - Size: 383 MB - Last synced at: 9 days ago - Pushed at: about 2 months ago - Stars: 4,281 - Forks: 430

joanrod/star-vector

StarVector is a foundation model for SVG generation that transforms vectorization into a code generation task. Using a vision-language modeling architecture, StarVector processes both visual and textual inputs to produce high-quality SVG code with remarkable precision.

Language: Python - Size: 6.3 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 3,570 - Forks: 186

modelscope/modelscope-agent

ModelScope-Agent: An agent framework connecting models in ModelScope with the world

Language: Python - Size: 68.1 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 3,157 - Forks: 358

ictnlp/LLaMA-Omni

LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve speech capabilities at the GPT-4o level.

Language: Python - Size: 3.28 MB - Last synced at: 18 days ago - Pushed at: 21 days ago - Stars: 2,923 - Forks: 197

VITA-MLLM/VITA

✨✨VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

Language: Python - Size: 15.7 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 2,184 - Forks: 164

X-PLUG/mPLUG-DocOwl

mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding

Language: Python - Size: 105 MB - Last synced at: 9 days ago - Pushed at: 10 days ago - Stars: 2,177 - Forks: 126

cambrian-mllm/cambrian

Cambrian-1 is a family of multimodal LLMs with a vision-centric design.

Language: Python - Size: 1.99 MB - Last synced at: 11 days ago - Pushed at: 7 months ago - Stars: 1,905 - Forks: 132

YangLing0818/RPG-DiffusionMaster

[ICML 2024] Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs (RPG)

Language: Jupyter Notebook - Size: 64.2 MB - Last synced at: 18 days ago - Pushed at: 4 months ago - Stars: 1,802 - Forks: 101

BAAI-DCAI/Bunny

A family of lightweight multimodal models.

Language: Python - Size: 28.5 MB - Last synced at: about 1 month ago - Pushed at: 7 months ago - Stars: 1,015 - Forks: 74

ByteDance-Seed/Seed1.5-VL

Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning, achieving state-of-the-art performance on 38 out of 60 public benchmarks.

Language: Jupyter Notebook - Size: 140 MB - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 949 - Forks: 26

AIDC-AI/Ovis

A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings.

Language: Python - Size: 5.56 MB - Last synced at: 6 days ago - Pushed at: 3 months ago - Stars: 925 - Forks: 57

X-LANCE/SLAM-LLM

Speech, Language, Audio, Music Processing with Large Language Model

Language: Python - Size: 169 MB - Last synced at: 11 days ago - Pushed at: about 2 months ago - Stars: 811 - Forks: 79

richard-peng-xia/awesome-multimodal-in-medical-imaging

A collection of resources on applications of multi-modal learning in medical imaging.

Size: 309 KB - Last synced at: 5 days ago - Pushed at: 6 days ago - Stars: 753 - Forks: 68

LLaVA-VL/LLaVA-Plus-Codebase

LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills

Language: Python - Size: 19 MB - Last synced at: 30 days ago - Pushed at: over 1 year ago - Stars: 740 - Forks: 58

deepglint/unicom

Large-Scale Visual Representation Model

Language: Python - Size: 22.9 MB - Last synced at: 19 days ago - Pushed at: 19 days ago - Stars: 668 - Forks: 31

VITA-MLLM/Woodpecker

✨✨Woodpecker: Hallucination Correction for Multimodal Large Language Models

Language: Python - Size: 21.2 MB - Last synced at: 17 days ago - Pushed at: 6 months ago - Stars: 634 - Forks: 30

rese1f/MovieChat

[CVPR 2024] MovieChat: From Dense Token to Sparse Memory for Long Video Understanding

Language: Python - Size: 78.7 MB - Last synced at: 6 days ago - Pushed at: 4 months ago - Stars: 621 - Forks: 41

yaotingwangofficial/Awesome-MCoT

Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey

Size: 4.63 MB - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 576 - Forks: 15

MME-Benchmarks/Video-MME

✨✨[CVPR 2025] Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Size: 16.7 MB - Last synced at: 23 days ago - Pushed at: about 1 month ago - Stars: 551 - Forks: 20

SkyworkAI/Vitron

NeurIPS 2024 Paper: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing

Language: Python - Size: 667 MB - Last synced at: 8 days ago - Pushed at: 8 months ago - Stars: 541 - Forks: 34

NVIDIA/audio-flamingo

PyTorch implementation of Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities.

Language: Python - Size: 4.98 MB - Last synced at: 17 days ago - Pushed at: about 1 month ago - Stars: 484 - Forks: 27

ictnlp/LLaVA-Mini

LLaVA-Mini is a unified large multimodal model (LMM) that can support the understanding of images, high-resolution images, and videos in an efficient manner.

Language: Python - Size: 54.6 MB - Last synced at: 17 days ago - Pushed at: 5 months ago - Stars: 479 - Forks: 22

YingqingHe/Awesome-LLMs-meet-Multimodal-Generation

🔥🔥🔥 A curated list of papers on LLMs-based multimodal generation (image, video, 3D and audio).

Language: HTML - Size: 12.7 MB - Last synced at: 29 days ago - Pushed at: 2 months ago - Stars: 472 - Forks: 26

Paranioar/Awesome_Matching_Pretraining_Transfering

A paper list on large multi-modality models (perception, generation, unification), parameter-efficient finetuning, vision-language pretraining, and conventional image-text matching, for preliminary insight.

Size: 369 KB - Last synced at: about 1 month ago - Pushed at: 6 months ago - Stars: 423 - Forks: 48

Coobiw/MPP-LLaVA

Personal Project: MPP-Qwen14B & MPP-Qwen-Next (Multimodal Pipeline Parallel based on Qwen-LM). Supports [video/image/multi-image] {sft/conversations}. Don't let poverty limit your imagination! Train your own 8B/14B LLaVA-training-like MLLM on RTX3090/4090 24GB.

Language: Jupyter Notebook - Size: 73.1 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 420 - Forks: 23

hustvl/EVF-SAM

Official code of "EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model"

Language: Python - Size: 5.94 MB - Last synced at: 17 days ago - Pushed at: 3 months ago - Stars: 406 - Forks: 19

jingyi0000/R1-VL

R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization

Language: Python - Size: 2.36 MB - Last synced at: 7 days ago - Pushed at: about 1 month ago - Stars: 392 - Forks: 0

HenryHZY/Awesome-Multimodal-LLM

Research Trends in LLM-guided Multimodal Learning.

Size: 17.6 KB - Last synced at: 24 days ago - Pushed at: over 1 year ago - Stars: 358 - Forks: 16

FoundationVision/Liquid

Liquid: Language Models are Scalable and Unified Multi-modal Generators

Language: Python - Size: 31.6 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 353 - Forks: 24

baaivision/EVE

EVE Series: Encoder-Free Vision-Language Models from BAAI

Language: Python - Size: 6.95 MB - Last synced at: 16 days ago - Pushed at: 3 months ago - Stars: 326 - Forks: 8

burglarhobbit/Awesome-Medical-Large-Language-Models

Curated papers on Large Language Models in Healthcare and Medical domain

Size: 53.7 KB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 316 - Forks: 37

tsujuifu/pytorch_mgie

A Gradio demo of MGIE

Language: Python - Size: 32.5 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 307 - Forks: 24

X-PLUG/Youku-mPLUG

Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Pre-training Dataset and Benchmarks

Language: Python - Size: 15.1 MB - Last synced at: 9 days ago - Pushed at: over 1 year ago - Stars: 297 - Forks: 11

zjysteven/lmms-finetune

A minimal codebase for finetuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, llava-onevision, llama-3.2-vision, qwen-vl, qwen2-vl, phi3-v etc.

Language: Python - Size: 13 MB - Last synced at: 22 days ago - Pushed at: 3 months ago - Stars: 296 - Forks: 33

IrohXu/Awesome-Multimodal-LLM-Autonomous-Driving

[WACV 2024 Survey Paper] Multimodal Large Language Models for Autonomous Driving

Size: 15 MB - Last synced at: about 20 hours ago - Pushed at: about 1 year ago - Stars: 289 - Forks: 11

AIDC-AI/Awesome-Unified-Multimodal-Models

Awesome Unified Multimodal Models

Size: 6.97 MB - Last synced at: 4 days ago - Pushed at: 17 days ago - Stars: 279 - Forks: 6

VITA-MLLM/Freeze-Omni

✨✨Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM

Language: Python - Size: 10.1 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 257 - Forks: 16

mbzuai-oryx/LLMVoX

LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM

Language: Python - Size: 132 MB - Last synced at: 23 days ago - Pushed at: 23 days ago - Stars: 246 - Forks: 27

apple/ml-slowfast-llava

SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models

Language: Python - Size: 375 KB - Last synced at: about 1 month ago - Pushed at: 9 months ago - Stars: 217 - Forks: 13

THUDM/VisualAgentBench

Towards Large Multimodal Models as Visual Foundation Agents

Language: Python - Size: 5.56 MB - Last synced at: 15 days ago - Pushed at: about 2 months ago - Stars: 214 - Forks: 6

JUNJIE99/MLVU

🔥🔥MLVU: Multi-task Long Video Understanding Benchmark

Language: Python - Size: 2.51 MB - Last synced at: 8 days ago - Pushed at: 9 days ago - Stars: 201 - Forks: 1

friedrichor/Awesome-Multimodal-Papers

A curated list of awesome Multimodal studies.

Language: HTML - Size: 63.3 MB - Last synced at: 28 days ago - Pushed at: 28 days ago - Stars: 192 - Forks: 19

cyw-3d/SAR3D

Official repository for "SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE"

Language: Python - Size: 11.8 MB - Last synced at: about 4 hours ago - Pushed at: about 5 hours ago - Stars: 154 - Forks: 1

shaopengw/Awesome-Music-Generation

Awesome music generation model: MG²

Language: Python - Size: 3.16 MB - Last synced at: 15 days ago - Pushed at: 2 months ago - Stars: 154 - Forks: 10

baaivision/DenseFusion

DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception

Language: Python - Size: 18.1 MB - Last synced at: 14 days ago - Pushed at: 6 months ago - Stars: 145 - Forks: 1

scofield7419/Video-of-Thought

Video Chain of Thought, Codes for ICML 2024 paper: "Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition"

Language: Python - Size: 1.72 MB - Last synced at: 14 days ago - Pushed at: 3 months ago - Stars: 141 - Forks: 7

dvlab-research/VisionReasoner

The official implementation of "VisionReasoner: Unified Visual Perception and Reasoning via Reinforcement Learning"

Language: Python - Size: 12.1 MB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 130 - Forks: 8

pipixin321/HolmesVAD

Official implementation of "Holmes-VAD: Towards Unbiased and Explainable Video Anomaly Detection via Multi-modal LLM"

Language: Python - Size: 19.3 MB - Last synced at: 14 days ago - Pushed at: 3 months ago - Stars: 129 - Forks: 5

NishilBalar/Awesome-LVLM-Hallucination

An up-to-date curated list of state-of-the-art research, papers, and resources on hallucination in large vision-language models

Size: 189 KB - Last synced at: 29 days ago - Pushed at: 29 days ago - Stars: 125 - Forks: 6

zjukg/KoPA

[Paper][ACM MM 2024] Making Large Language Models Perform Better in Knowledge Graph Completion

Language: Python - Size: 2.85 MB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 123 - Forks: 8

shufangxun/LLaVA-MoD

[ICLR 2025] LLaVA-MoD: Making LLaVA Tiny via MoE-Knowledge Distillation

Language: Python - Size: 3.41 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 120 - Forks: 7

danilop/multimodal-chat

A multimodal chat interface with many tools.

Language: Python - Size: 254 KB - Last synced at: 14 days ago - Pushed at: 3 months ago - Stars: 119 - Forks: 18

OpenGVLab/MM-NIAH

[NeurIPS 2024] Needle In A Multimodal Haystack (MM-NIAH): A comprehensive benchmark designed to systematically evaluate the capability of existing MLLMs to comprehend long multimodal documents.

Language: Python - Size: 2.83 MB - Last synced at: 2 days ago - Pushed at: 7 months ago - Stars: 117 - Forks: 6

rese1f/aurora

[ICLR 2025] AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark

Language: Python - Size: 25.3 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 108 - Forks: 5

lll6gg/UI-R1

Code for "UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning"

Language: Python - Size: 1.04 MB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 104 - Forks: 6

invictus717/MiCo

Explore the Limits of Omni-modal Pretraining at Scale

Language: Python - Size: 11.6 MB - Last synced at: 2 months ago - Pushed at: 9 months ago - Stars: 97 - Forks: 4

X-PLUG/mPLUG-HalOwl

mPLUG-HalOwl: Multimodal Hallucination Evaluation and Mitigation

Language: Python - Size: 13.9 MB - Last synced at: 9 days ago - Pushed at: over 1 year ago - Stars: 95 - Forks: 2

OpenGVLab/PIIP

[NeurIPS 2024 Spotlight ⭐️] Parameter-Inverted Image Pyramid Networks (PIIP)

Language: Python - Size: 11.7 MB - Last synced at: 2 days ago - Pushed at: 24 days ago - Stars: 91 - Forks: 2

mu-cai/matryoshka-mm

Matryoshka Multimodal Models

Language: Python - Size: 26.1 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 90 - Forks: 5

LINs-lab/DynMoE

[ICLR 2025] Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models

Language: Python - Size: 57.3 MB - Last synced at: 28 days ago - Pushed at: 4 months ago - Stars: 89 - Forks: 11

showlab/LOVA3

(NeurIPS 2024) Official PyTorch implementation of LOVA3

Language: Python - Size: 6.01 MB - Last synced at: 19 days ago - Pushed at: 3 months ago - Stars: 85 - Forks: 2

Sreyan88/GAMA

Code for the paper: GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities

Language: Python - Size: 15.6 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 84 - Forks: 9

zjunlp/Deco

[ICLR 2025] MLLM can see? Dynamic Correction Decoding for Hallucination Mitigation

Language: Python - Size: 17.6 MB - Last synced at: 6 days ago - Pushed at: 6 months ago - Stars: 82 - Forks: 7

Haochen-Wang409/ross

[ICLR'25] Reconstructive Visual Instruction Tuning

Language: Python - Size: 12.8 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 75 - Forks: 3

xjywhu/Awesome-Multimodal-LLM-for-Code

Multimodal Large Language Models for Code Generation under Multimodal Scenarios

Size: 234 KB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 72 - Forks: 2

ritzz-ai/GUI-R1

Official implementation of GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents

Language: Python - Size: 974 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 70 - Forks: 5

vincentlux/Awesome-Multimodal-LLM

Reading list for Multimodal Large Language Models

Size: 110 KB - Last synced at: 26 days ago - Pushed at: almost 2 years ago - Stars: 68 - Forks: 7

keshik6/HourVideo

[NeurIPS 2024] Official code for HourVideo: 1-Hour Video Language Understanding

Language: Jupyter Notebook - Size: 8.16 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 67 - Forks: 3

AviSoori1x/seemore

From scratch implementation of a vision language model in pure PyTorch

Language: Jupyter Notebook - Size: 20 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 67 - Forks: 4

ai4colonoscopy/IntelliScope

Frontiers in Intelligent Colonoscopy [ColonSurvey | ColonINST | ColonGPT]

Language: Python - Size: 32.3 MB - Last synced at: 17 days ago - Pushed at: 17 days ago - Stars: 66 - Forks: 4

gyxxyg/TRACE

[ICLR 2025] TRACE: Temporal Grounding Video LLM via Causal Event Modeling

Language: Python - Size: 45.7 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 60 - Forks: 0

ChocoWu/SeTok

Codes for Paper: Towards Semantic Equivalence of Tokenization in Multimodal LLM

Language: Python - Size: 2.1 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 54 - Forks: 0

JinXins/Awesome-Token-Merge-for-MLLMs

A paper list about Token Merge, Reduce, Resample, Drop for MLLMs.

Size: 103 KB - Last synced at: about 1 month ago - Pushed at: 5 months ago - Stars: 54 - Forks: 0

zjunlp/OceanGPT

[沧渊] [ACL 2024] OceanGPT: A Large Language Model for Ocean Science Tasks

Language: Python - Size: 38.4 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 53 - Forks: 7

Hoar012/RAP-MLLM

[CVPR 2025] RAP: Retrieval-Augmented Personalization

Language: Python - Size: 60.9 MB - Last synced at: 18 days ago - Pushed at: 18 days ago - Stars: 53 - Forks: 1

RainBowLuoCS/OpenOmni

OpenOmni: Official implementation of Advancing Open-Source Omnimodal Large Language Models with Progressive Multimodal Alignment and Real-Time Self-Aware Emotional Speech Synthesis

Language: Python - Size: 8.45 MB - Last synced at: 10 days ago - Pushed at: 11 days ago - Stars: 51 - Forks: 5

Victorwz/MLM_Filter

Official implementation of our paper "Finetuned Multimodal Language Models are High-Quality Image-Text Data Filters".

Language: Python - Size: 30.7 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 51 - Forks: 1

MSR3D/MSR3D

[NeurIPS 2024] Official code repository for MSR3D paper

Language: Python - Size: 75.7 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 50 - Forks: 2

IDEA-FinAI/ChartMoE

[ICLR2025 Oral] ChartMoE: Mixture of Diversely Aligned Expert Connector for Chart Understanding

Language: Jupyter Notebook - Size: 9.76 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 50 - Forks: 1

richard-peng-xia/RULE

[EMNLP'24] RULE: Reliable Multimodal RAG for Factuality in Medical Vision Language Models

Language: Python - Size: 7.2 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 50 - Forks: 3

Hangover3832/ComfyUI-Hangover-Moondream 📦

Moondream is a lightweight multimodal large language model

Language: Python - Size: 2.14 MB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 46 - Forks: 7

weihaox/UMBRAE

[ECCV 2024] UMBRAE: Unified Multimodal Brain Decoding | Unveiling the 'Dark Side' of Brain Modality

Language: Jupyter Notebook - Size: 34.6 MB - Last synced at: about 1 month ago - Pushed at: 9 months ago - Stars: 46 - Forks: 3

OpenKG-ORG/EasyDetect

An Easy-to-use Hallucination Detection Framework for LLMs.

Language: Python - Size: 12 MB - Last synced at: 10 months ago - Pushed at: about 1 year ago - Stars: 45 - Forks: 3

Wang-ML-Lab/multimodal-needle-in-a-haystack

[NAACL 2025 Oral] Multimodal Needle in a Haystack (MMNeedle): Benchmarking Long-Context Capability of Multimodal Large Language Models

Language: Python - Size: 16.6 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 42 - Forks: 3

taco-group/Re-Align

A novel alignment framework that leverages image retrieval to mitigate hallucinations in Vision Language Models.

Language: Python - Size: 18.6 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 40 - Forks: 1

scofield7419/EmpathyEar

Multimodal Empathetic Chatbot

Language: Python - Size: 423 MB - Last synced at: 5 days ago - Pushed at: 11 months ago - Stars: 40 - Forks: 6

pipixin321/HolmesVAU

[CVPR 2025] Official implementation of "Holmes-VAU: Towards Long-term Video Anomaly Understanding at Any Granularity"

Language: Python - Size: 60.1 MB - Last synced at: 2 months ago - Pushed at: 3 months ago - Stars: 39 - Forks: 2

mbzuai-oryx/ALM-Bench

[CVPR 2025 🔥] ALM-Bench is a multilingual multi-modal diverse cultural benchmark for 100 languages across 19 categories. It assesses the next generation of LMMs on cultural inclusivity.

Language: Python - Size: 26.7 MB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 38 - Forks: 2

UKPLab/5pils

Code associated with the EMNLP 2024 Main paper: "Image, tell me your story!" Predicting the original meta-context of visual misinformation.

Language: Python - Size: 3.38 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 38 - Forks: 4

RaptorMai/MLLM-CompBench

[NeurIPS'25] MLLM-CompBench evaluates the comparative reasoning of MLLMs with 40K image pairs and questions across 8 dimensions of relative comparison: visual attribute, existence, state, emotion, temporality, spatiality, quantity, and quality. CompBench covers diverse visual domains, including animals, fashion, sports, and scenes.

Language: Jupyter Notebook - Size: 10.7 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 38 - Forks: 2

cocacola-lab/MineLand

Simulating Large-Scale Multi-Agent Interactions with Limited Multimodal Senses and Physical Needs

Language: Python - Size: 83.1 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 37 - Forks: 4

AIDC-AI/Parrot

🎉 The code repository for "Parrot: Multilingual Visual Instruction Tuning" in PyTorch.

Language: Python - Size: 25.2 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 36 - Forks: 1

VisualWebBench/VisualWebBench

Evaluation framework for paper "VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?"

Language: Python - Size: 3.17 MB - Last synced at: 12 months ago - Pushed at: about 1 year ago - Stars: 36 - Forks: 1

piomin/spring-ai-showcase

Sample Spring AI Application with several use cases

Language: Java - Size: 3.95 MB - Last synced at: 9 days ago - Pushed at: 14 days ago - Stars: 35 - Forks: 18

EternityYW/Gemini-Commonsense-Evaluation

Official implementation of "Gemini in Reasoning: Unveiling Commonsense in Multimodal Large Language Models"

Size: 16.9 MB - Last synced at: about 1 year ago - Pushed at: over 1 year ago - Stars: 35 - Forks: 2

whwu95/FreeVA

FreeVA: Offline MLLM as Training-Free Video Assistant

Language: Python - Size: 3.22 MB - Last synced at: 12 months ago - Pushed at: 12 months ago - Stars: 34 - Forks: 0

Lzcstan/DrugLAMP

A PyTorch-based system for highly accurate drug-target interaction predictions utilizing multi-modal large language models to discern structural affinities in drug-target pairs.

Language: Python - Size: 128 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 32 - Forks: 0

GLUS-video/GLUS

[CVPR 2025] Official PyTorch Implementation of GLUS: Global-Local Reasoning Unified into A Single Large Language Model for Video Segmentation

Language: Jupyter Notebook - Size: 66.4 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 31 - Forks: 2

zjunlp/EasyDetect

[ACL 2024] An Easy-to-use Hallucination Detection Framework for LLMs.

Language: Python - Size: 11.5 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 30 - Forks: 1
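
Each entry above ends with a metadata line in a `Key: value - Key: value` shape. A minimal sketch of turning such a line into a dictionary is shown below. It assumes only the field names visible in this listing (`Language`, `Size`, `Last synced at`, `Pushed at`, `Stars`, `Forks`). `parse_metadata` is a hypothetical helper written for illustration and is not part of the service's API.

```python
def parse_metadata(line: str) -> dict:
    """Split a 'Key: value - Key: value' listing line into a dict.

    Hypothetical helper: the field layout is inferred from the listing
    text, not from any documented API schema.
    """
    fields = {}
    # Fields are separated by ' - '; each field has the form 'Key: value'.
    for part in line.split(" - "):
        key, sep, value = part.partition(": ")
        if not sep:
            continue  # skip fragments that are not 'Key: value'
        fields[key.strip()] = value.strip()
    # Counts such as '4,281' use thousands separators; store them as ints.
    for key in ("Stars", "Forks"):
        if key in fields:
            fields[key] = int(fields[key].replace(",", ""))
    return fields


example = (
    "Language: Python - Size: 383 MB - Last synced at: 9 days ago - "
    "Pushed at: about 2 months ago - Stars: 4,281 - Forks: 430"
)
info = parse_metadata(example)
print(info["Language"], info["Stars"], info["Forks"])
```

Entries without a `Language` field (for example, curated paper lists) simply yield a dictionary without that key, so callers may prefer `info.get("Language")`.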

Related Topics
large-language-models (71), multimodal (44), llm (40), vision-language-model (30), mllm (24), deep-learning (18), large-multimodal-models (18), machine-learning (17), vlm (16), artificial-intelligence (15), multimodal-learning (15), benchmark (14), multimodal-deep-learning (14), llms (13), large-vision-language-models (12), chatbot (12), generative-ai (12), natural-language-processing (12), llava (12), reasoning (10), foundation-models (9), multimodality (9), computer-vision (8), large-language-model (8), instruction-tuning (8), video (8), ai (8), visual-question-answering (8), llama (7), transformers (7), retrieval-augmented-generation (7), video-understanding (7), multimodal-data (7), vision-language (7), hallucination (6), dataset (6), vision-language-models (6), vision-transformer (6), rag (6), python (6), chatgpt (5), reinforcement-learning (5), agentic-ai (5), awesome-list (5), visual-instruction-tuning (5), medical-image-analysis (5), streamlit (5), qwen (5), huggingface-transformers (4), pytorch (4), vision-and-language (4), huggingface (4), instruction-following (4), in-context-learning (4), hallucination-detection (4), clip (4), hallucination-mitigation (4), docker (4), knowledge-graph (4), segmentation (4), mixture-of-experts (4), radiology-report-generation (4), chest-xrays (4), long-video-understanding (4), multi-modality (4), llama3 (4), video-language-model (4), question-answering (4), video-question-answering (4), gemini-pro (4), text-to-image-generation (4), nlp (4), evaluation (3), alignment (3), code-generation (3), safety (3), audio (3), chain-of-thought (3), prompt-engineering (3), deepseek-r1 (3), mllm-reasoning (3), neurips-2024 (3), agent (3), llms-benchmarking (3), text-to-speech (3), gpt-4 (3), agentic-workflow (3), benchmarking (3), supervised-finetuning (3), multi-modal (3), gradio (3), gemini-api (3), text-to-image (3), fact-checking (3), misinformation (3), reasoning-language-models (3), fine-tuning (3), ai-agents (3), gpt (3), vision-language-transformer (3)