An open API service providing repository metadata for many open source software ecosystems.

GitHub topics: multimodal-large-language-models

NishilBalar/Awesome-LVLM-Hallucination

An up-to-date curated list of state-of-the-art research work, papers & resources on hallucination in large vision-language models

Size: 189 KB - Last synced at: about 2 hours ago - Pushed at: about 3 hours ago - Stars: 125 - Forks: 6

lll6gg/UI-R1

Code for "UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning"

Language: Python - Size: 1.08 MB - Last synced at: about 14 hours ago - Pushed at: about 14 hours ago - Stars: 91 - Forks: 6

yaotingwangofficial/Awesome-MCoT

Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey

Size: 4.59 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 553 - Forks: 13

Glasgow-AI4BioMed/RRG-BioNLP-ACL2024 (fork of X-iZhang/RRG-BioNLP-ACL2024)

Code for the paper "Gla-AI4BioMed at RRG24: Visual Instruction-tuned Adaptation for Radiology Report Generation" (BioNLP ACL'24)

Language: Python - Size: 591 KB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 0 - Forks: 0

YingqingHe/Awesome-LLMs-meet-Multimodal-Generation

🔥🔥🔥 A curated list of papers on LLMs-based multimodal generation (image, video, 3D and audio).

Language: HTML - Size: 12.7 MB - Last synced at: about 14 hours ago - Pushed at: about 1 month ago - Stars: 472 - Forks: 26

X-iZhang/RRG-BioNLP-ACL2024

🔬 Med-CXRGen, developed by Glasgow AI4BioMed Lab, brings vision-language adaptation to biomedical radiology via visual instruction tuning. (BioNLP ACL'24)

Language: Python - Size: 591 KB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 1 - Forks: 1

OpenGVLab/PIIP

[NeurIPS 2024 Spotlight ⭐️] Parameter-Inverted Image Pyramid Networks (PIIP)

Language: Python - Size: 11.7 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 89 - Forks: 2

RainBowLuoCS/OpenOmni

OpenOmni: Official implementation of Advancing Open-Source Omnimodal Large Language Models with Progressive Multimodal Alignment and Real-Time Self-Aware Emotional Speech Synthesis

Language: Python - Size: 8.4 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 47 - Forks: 5

modelscope/modelscope-agent

ModelScope-Agent: An agent framework connecting models in ModelScope with the world

Language: Python - Size: 67.6 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 3,110 - Forks: 348

JoeJoe1313/PaliGemma-Image-Segmentation

An app with FastAPI, Docker, transformers, JAX/Flax for performing image segmentation with PaliGemma 2 mix

Language: Python - Size: 7.46 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0

MME-Benchmarks/Video-MME

✨✨[CVPR 2025] Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Size: 16.7 MB - Last synced at: 3 days ago - Pushed at: 24 days ago - Stars: 542 - Forks: 20

VITA-MLLM/Woodpecker

✨✨Woodpecker: Hallucination Correction for Multimodal Large Language Models

Language: Python - Size: 21.2 MB - Last synced at: 1 day ago - Pushed at: 5 months ago - Stars: 635 - Forks: 30

AIDC-AI/Awesome-Unified-Multimodal-Models

Awesome Unified Multimodal Models

Size: 3.92 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 3 - Forks: 0

X-iZhang/Libra

Space for MLLM on Biomedical Radiology Analysis

Language: Python - Size: 13.4 MB - Last synced at: 3 days ago - Pushed at: 4 days ago - Stars: 6 - Forks: 1

LLaVA-VL/LLaVA-Plus-Codebase

LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills

Language: Python - Size: 19 MB - Last synced at: 1 day ago - Pushed at: over 1 year ago - Stars: 740 - Forks: 58

LoupXpro/AlphaExtract

AlphaExtract is a sophisticated PDF summarization tool that combines cutting-edge AI technology with efficient document processing. The project is built using Python and leverages Meta's LLaMA 4 MOE Maverick model along with Groq's inference engine to provide fast and accurate PDF summaries.

Language: Python - Size: 12.6 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 0 - Forks: 0

JinXins/Awesome-Token-Merge-for-MLLMs

A paper list about Token Merge, Reduce, Resample, Drop for MLLMs.

Size: 103 KB - Last synced at: 3 days ago - Pushed at: 4 months ago - Stars: 54 - Forks: 0

hustvl/EVF-SAM

Official code of "EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model"

Language: Python - Size: 5.94 MB - Last synced at: 3 days ago - Pushed at: about 2 months ago - Stars: 403 - Forks: 18

xjywhu/Awesome-Multimodal-LLM-for-Code

Multimodal Large Language Models for Code Generation under Multimodal Scenarios

Size: 129 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 66 - Forks: 2

scofield7419/Video-of-Thought

Video Chain of Thought: code for the ICML 2024 paper "Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition"

Language: Python - Size: 1.72 MB - Last synced at: 3 days ago - Pushed at: 2 months ago - Stars: 137 - Forks: 7

akusayudodograu/Agentic-RAG-Story-Generation-with-Multimodal-GenAI

Multimodal Agentic GenAI Workflow – Seamlessly blends retrieval and generation for intelligent storytelling

Size: 1000 Bytes - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 8 - Forks: 1

X-LANCE/SLAM-LLM

Speech, Language, Audio, Music Processing with Large Language Model

Language: Python - Size: 169 MB - Last synced at: 5 days ago - Pushed at: 16 days ago - Stars: 794 - Forks: 76

SkyworkAI/Vitron

NeurIPS 2024 Paper: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing

Language: Python - Size: 667 MB - Last synced at: 3 days ago - Pushed at: 7 months ago - Stars: 531 - Forks: 33

ritzz-ai/GUI-R1

Official implementation of GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents

Language: Python - Size: 974 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 70 - Forks: 5

multimodal-ai-lab/DEFAME

Fact-checking system for textual and visual inputs.

Language: Python - Size: 29.6 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 19 - Forks: 2

BradyFU/Awesome-Multimodal-Large-Language-Models

✨✨ Latest Advances on Multimodal Large Language Models

Size: 82.9 MB - Last synced at: 6 days ago - Pushed at: 17 days ago - Stars: 14,921 - Forks: 957

zjunlp/OceanGPT

[ACL 2024] OceanGPT: A Large Language Model for Ocean Science Tasks

Language: Python - Size: 36.2 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 47 - Forks: 5

The-Martyr/CausalMM

[ICLR 2025] Mitigating Modality Prior-Induced Hallucinations in Multimodal Large Language Models via Deciphering Attention Causality

Language: Python - Size: 7.1 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 25 - Forks: 2

richard-peng-xia/awesome-multimodal-in-medical-imaging

A collection of resources on applications of multi-modal learning in medical imaging.

Size: 234 KB - Last synced at: 6 days ago - Pushed at: about 1 month ago - Stars: 730 - Forks: 65

Adm-2005/Test-It

Test-it! is an AI testing tool designed to generate comprehensive testing instructions and code for digital products based on snapshots and code snippets.

Language: JavaScript - Size: 1.18 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 2 - Forks: 1

thisisiron/LLaVA-Pool

🌋 A flexible framework for training and configuring Vision-Language Models

Language: Python - Size: 3.09 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 1 - Forks: 0

willxxy/ECG-Bench

A Unified Framework for Benchmarking Generative Electrocardiogram-Language Models (ELMs)

Language: Python - Size: 6.67 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 9 - Forks: 2

piomin/spring-ai-showcase

Sample Spring AI Application with several use cases

Language: Java - Size: 3.94 MB - Last synced at: 7 days ago - Pushed at: 10 days ago - Stars: 31 - Forks: 15

AIDC-AI/Parrot

🎉 The code repository for "Parrot: Multilingual Visual Instruction Tuning" in PyTorch.

Language: Python - Size: 25.2 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 36 - Forks: 1

GerrySant/multimodalhugs

MultimodalHugs is an extension of Hugging Face that offers a generalized framework for training, evaluating, and using multimodal AI models with minimal code differences, ensuring seamless compatibility with Hugging Face pipelines.

Language: Python - Size: 4.24 MB - Last synced at: 9 days ago - Pushed at: 10 days ago - Stars: 3 - Forks: 2

jingyi0000/R1-VL

R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization

Language: Python - Size: 2.3 MB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 276 - Forks: 0

Wang-ML-Lab/multimodal-needle-in-a-haystack

[NAACL 2025 Oral] Multimodal Needle in a Haystack (MMNeedle): Benchmarking Long-Context Capability of Multimodal Large Language Models

Language: Python - Size: 16.6 MB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 42 - Forks: 3

The-Martyr/Awesome-Modality-Priors-in-MLLMs

Latest Advances on Modality Priors in Multimodal Large Language Models

Size: 76.2 KB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 15 - Forks: 1

X-PLUG/MobileAgent

Mobile-Agent: The Powerful Mobile Device Operation Assistant Family

Language: Python - Size: 383 MB - Last synced at: 11 days ago - Pushed at: about 1 month ago - Stars: 4,149 - Forks: 412

IrohXu/Awesome-Multimodal-LLM-Autonomous-Driving

[WACV 2024 Survey Paper] Multimodal Large Language Models for Autonomous Driving

Size: 15 MB - Last synced at: 1 day ago - Pushed at: about 1 year ago - Stars: 286 - Forks: 11

AIDC-AI/Ovis

A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings.

Language: Python - Size: 5.56 MB - Last synced at: 11 days ago - Pushed at: about 2 months ago - Stars: 899 - Forks: 56

Paranioar/Awesome_Matching_Pretraining_Transfering

The Paper List of Large Multi-Modality Model (Perception, Generation, Unification), Parameter-Efficient Finetuning, Vision-Language Pretraining, Conventional Image-Text Matching for Preliminary Insight.

Size: 369 KB - Last synced at: 4 days ago - Pushed at: 5 months ago - Stars: 423 - Forks: 48

apple/ml-slowfast-llava

SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models

Language: Python - Size: 375 KB - Last synced at: 7 days ago - Pushed at: 8 months ago - Stars: 217 - Forks: 13

rese1f/MovieChat

[CVPR 2024] MovieChat: From Dense Token to Sparse Memory for Long Video Understanding

Language: Python - Size: 78.7 MB - Last synced at: 11 days ago - Pushed at: 3 months ago - Stars: 612 - Forks: 41

friedrichor/Awesome-Multimodal-Papers

A curated list of awesome Multimodal studies.

Language: HTML - Size: 63.3 MB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 186 - Forks: 18

burglarhobbit/Awesome-Medical-Large-Language-Models

Curated papers on Large Language Models in Healthcare and Medical domain

Size: 42 KB - Last synced at: 12 days ago - Pushed at: 4 months ago - Stars: 302 - Forks: 35

deepglint/unicom

Large-Scale Visual Representation Model

Language: Python - Size: 22.8 MB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 661 - Forks: 29

zjunlp/Deco

[ICLR 2025] MLLM can see? Dynamic Correction Decoding for Hallucination Mitigation

Language: Python - Size: 17.6 MB - Last synced at: 13 days ago - Pushed at: 5 months ago - Stars: 64 - Forks: 5

declare-lab/Auto-Scaling

[Arxiv 2024] Official Implementation of the paper: "Towards Robust Instruction Tuning on Multimodal Large Language Models"

Language: Jupyter Notebook - Size: 67.5 MB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 8 - Forks: 1

gautierdag/plancraft

Plancraft is a Minecraft environment and agent suite for testing planning capabilities in LLMs

Language: Python - Size: 124 MB - Last synced at: 12 days ago - Pushed at: 15 days ago - Stars: 10 - Forks: 0

UKPLab/arxiv2025-misleading-visualizations

Code and datasets accompanying the arXiv preprint: "Protecting multimodal large language models against misleading visualizations"

Language: JavaScript - Size: 22.7 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 2 - Forks: 0

RauhanAhmed/AlphaExtract

AlphaExtract is a sophisticated PDF summarization tool that combines cutting-edge AI technology with efficient document processing. The project is built using Python and leverages Meta's LLaMA 4 MOE Maverick model along with Groq's inference engine to provide fast and accurate PDF summaries.

Language: Python - Size: 5.58 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 0 - Forks: 0

THUDM/VisualAgentBench

Towards Large Multimodal Models as Visual Foundation Agents

Language: Python - Size: 5.62 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 207 - Forks: 6

CristianoPatricio/CBVLM

Code for the paper "CBVLM: Training-free Explainable Concept-based Large Vision Language Models for Medical Image Classification".

Language: Python - Size: 903 KB - Last synced at: 17 days ago - Pushed at: 17 days ago - Stars: 7 - Forks: 1

AMD-AIG-AIMA/gpt-fast

The GPT-Fast for Multimodal Models on AMD GPUs

Language: Python - Size: 6.03 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 2 - Forks: 0

BAAI-DCAI/Bunny

A family of lightweight multimodal models.

Language: Python - Size: 28.5 MB - Last synced at: 9 days ago - Pushed at: 6 months ago - Stars: 1,015 - Forks: 74

SuperBruceJia/Awesome-Large-Vision-Language-Model

Awesome Large Vision-Language Model: A Curated List of Large Vision-Language Model

Size: 103 KB - Last synced at: 9 days ago - Pushed at: 8 months ago - Stars: 28 - Forks: 3

mediacontentatlas/mediacontentatlas

Code for Media Content Atlas

Language: Python - Size: 1.45 MB - Last synced at: 18 days ago - Pushed at: 18 days ago - Stars: 1 - Forks: 0

shaopengw/Awesome-Music-Generation

Awesome music generation model: MG²

Language: Python - Size: 3.16 MB - Last synced at: 3 days ago - Pushed at: about 1 month ago - Stars: 154 - Forks: 10

rese1f/aurora

[ICLR 2025] AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark

Language: Python - Size: 25.3 MB - Last synced at: 19 days ago - Pushed at: 19 days ago - Stars: 94 - Forks: 4

puar-playground/Self-Visual-RAG

Implementation of MLLM-based Self-Vision-RAG models

Language: Python - Size: 1.56 MB - Last synced at: 19 days ago - Pushed at: 19 days ago - Stars: 2 - Forks: 1

taco-group/Re-Align

A novel alignment framework that leverages image retrieval to mitigate hallucinations in Vision Language Models.

Language: Python - Size: 18.6 MB - Last synced at: 19 days ago - Pushed at: 19 days ago - Stars: 40 - Forks: 1

CoolGuy2982/Eco

A multimodal AI app that gives you eco-friendly insights from just a picture. It can understand what you want to know just by looking at the picture, offers recycling advice, locations, and alternative products, helps subvert greenwashing, and much more.

Language: HTML - Size: 34.4 MB - Last synced at: 19 days ago - Pushed at: 19 days ago - Stars: 1 - Forks: 0

ai4colonoscopy/IntelliScope

Frontiers in Intelligent Colonoscopy [ColonSurvey | ColonINST | ColonGPT]

Language: Python - Size: 30.9 MB - Last synced at: 19 days ago - Pushed at: 20 days ago - Stars: 64 - Forks: 5

VachanVY/Transfusion.torch

PyTorch Implementation of Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

Language: Python - Size: 2.07 MB - Last synced at: 12 days ago - Pushed at: 7 months ago - Stars: 21 - Forks: 5

MSR3D/MSR3D

[NeurIPS 2024] Official code repository for MSR3D paper

Language: Python - Size: 75.7 MB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 50 - Forks: 2

UKPLab/5pils

Code associated with the EMNLP 2024 Main paper: "Image, tell me your story!" Predicting the original meta-context of visual misinformation.

Language: Python - Size: 3.38 MB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 38 - Forks: 4

ChocoWu/SeTok

Code for the paper: Towards Semantic Equivalence of Tokenization in Multimodal LLM

Language: Python - Size: 2.1 MB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 54 - Forks: 0

HenryHZY/Awesome-Multimodal-LLM

Research Trends in LLM-guided Multimodal Learning.

Size: 17.6 KB - Last synced at: 8 days ago - Pushed at: over 1 year ago - Stars: 358 - Forks: 16

ictnlp/LLaMA-Omni

LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve speech capabilities at the GPT-4o level.

Language: Python - Size: 3.27 MB - Last synced at: 21 days ago - Pushed at: 22 days ago - Stars: 2,891 - Forks: 195

YangLing0818/RPG-DiffusionMaster

[ICML 2024] Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs (RPG)

Language: Jupyter Notebook - Size: 64.2 MB - Last synced at: 22 days ago - Pushed at: 3 months ago - Stars: 1,793 - Forks: 102

NVIDIA/audio-flamingo

PyTorch implementation of Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities.

Language: Python - Size: 4.84 MB - Last synced at: 22 days ago - Pushed at: 22 days ago - Stars: 456 - Forks: 27

315386775/OpenPathoFoundation

Collecting and organizing resources related to large AI models for pathology

Size: 229 KB - Last synced at: 22 days ago - Pushed at: 23 days ago - Stars: 3 - Forks: 0

Video-Bench/Video-Bench

Video Generation Benchmark

Language: Python - Size: 8.95 MB - Last synced at: 15 days ago - Pushed at: 24 days ago - Stars: 16 - Forks: 2

UKPLab/naacl2025-cove

Code associated with the NAACL 2025 paper "COVE: COntext and VEracity prediction for out-of-context images"

Language: Python - Size: 2.51 MB - Last synced at: 22 days ago - Pushed at: 23 days ago - Stars: 1 - Forks: 0

GLUS-video/GLUS

[CVPR 2025] Official PyTorch Implementation of GLUS: Global-Local Reasoning Unified into A Single Large Language Model for Video Segmentation

Language: Jupyter Notebook - Size: 66.4 MB - Last synced at: 22 days ago - Pushed at: 23 days ago - Stars: 31 - Forks: 2

RaptorMai/MLLM-CompBench

[NeurIPS'25] MLLM-CompBench evaluates the comparative reasoning of MLLMs with 40K image pairs and questions across 8 dimensions of relative comparison: visual attribute, existence, state, emotion, temporality, spatiality, quantity, and quality. CompBench covers diverse visual domains, including animals, fashion, sports, and scenes

Language: Jupyter Notebook - Size: 10.7 MB - Last synced at: 23 days ago - Pushed at: 23 days ago - Stars: 38 - Forks: 2

mbzuai-oryx/ALM-Bench

[CVPR 2025 🔥] ALM-Bench is a multilingual, multimodal, culturally diverse benchmark for 100 languages across 19 categories. It assesses the next generation of LMMs on cultural inclusivity.

Language: Python - Size: 26.6 MB - Last synced at: 23 days ago - Pushed at: 24 days ago - Stars: 36 - Forks: 2

khoi03/Multimodal-ChatBot

A chatbot that can process and analyze various forms of media, including text, images, videos, and other data types.

Language: Python - Size: 2.94 MB - Last synced at: 23 days ago - Pushed at: 24 days ago - Stars: 0 - Forks: 0

joanrod/star-vector

StarVector is a foundation model for SVG generation that transforms vectorization into a code generation task. Using a vision-language modeling architecture, StarVector processes both visual and textual inputs to produce high-quality SVG code with remarkable precision.

Language: Python - Size: 6.3 MB - Last synced at: 25 days ago - Pushed at: 25 days ago - Stars: 3,570 - Forks: 186

Hoar012/TDC-Video

Size: 3.05 MB - Last synced at: 26 days ago - Pushed at: 26 days ago - Stars: 0 - Forks: 0

PanguIR/MRAGSurvey

A Survey of Multimodal Retrieval-Augmented Generation

Size: 4.92 MB - Last synced at: 26 days ago - Pushed at: 26 days ago - Stars: 11 - Forks: 1

leoli51/youtube-conspiracy-detection

Code for the paper "Evaluating AI capabilities in detecting conspiracy theories on YouTube".

Language: Jupyter Notebook - Size: 12.9 MB - Last synced at: 26 days ago - Pushed at: 26 days ago - Stars: 0 - Forks: 0

Hyeongkeun/LAVCap

Official Pytorch Implementation of 'LAVCap: LLM-based Audio-Visual Captioning using Optimal Transport' (ICASSP2025)

Language: Python - Size: 3.58 MB - Last synced at: 27 days ago - Pushed at: 27 days ago - Stars: 3 - Forks: 0

Victorwz/MLM_Filter

Official implementation of our paper "Finetuned Multimodal Language Models are High-Quality Image-Text Data Filters".

Language: Python - Size: 30.7 MB - Last synced at: 27 days ago - Pushed at: 27 days ago - Stars: 51 - Forks: 1

gaotiexinqu/V2P-Bench

V2P-Bench: Evaluating Video-Language Understanding with Visual Prompts for Better Human-Model Interaction

Size: 16.9 MB - Last synced at: 28 days ago - Pushed at: 28 days ago - Stars: 5 - Forks: 0

OpenGVLab/MM-NIAH

[NeurIPS 2024] Needle In A Multimodal Haystack (MM-NIAH): A comprehensive benchmark designed to systematically evaluate the capability of existing MLLMs to comprehend long multimodal documents.

Language: Python - Size: 2.83 MB - Last synced at: 23 days ago - Pushed at: 6 months ago - Stars: 115 - Forks: 6

NotYuSheng/Multimodal-Large-Language-Model

Localized Multimodal Large Language Model (MLLM) integrated with Streamlit and Ollama for text and image processing tasks.

Language: Python - Size: 7.37 MB - Last synced at: 20 days ago - Pushed at: 28 days ago - Stars: 4 - Forks: 2

baaivision/EVE

EVE Series: Encoder-Free Vision-Language Models from BAAI

Language: Python - Size: 6.95 MB - Last synced at: 28 days ago - Pushed at: 2 months ago - Stars: 320 - Forks: 8

zjysteven/lmms-finetune

A minimal codebase for finetuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, llava-onevision, llama-3.2-vision, qwen-vl, qwen2-vl, phi3-v etc.

Language: Python - Size: 13 MB - Last synced at: 29 days ago - Pushed at: 2 months ago - Stars: 284 - Forks: 29

genieincodebottle/genaicodelab

Comprehensive resources on Generative AI, including a detailed Codebase and tutorials

Language: Jupyter Notebook - Size: 47.6 MB - Last synced at: 30 days ago - Pushed at: 30 days ago - Stars: 41 - Forks: 4

X-PLUG/mPLUG-DocOwl

mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding

Language: Python - Size: 105 MB - Last synced at: 29 days ago - Pushed at: 5 months ago - Stars: 2,151 - Forks: 128

PE51K/spbu-diploma

MLLM application to Chinese speech practice as my SPBU diploma project

Language: Jupyter Notebook - Size: 66.4 KB - Last synced at: 22 days ago - Pushed at: 22 days ago - Stars: 0 - Forks: 0

weihaox/UMBRAE

[ECCV 2024] UMBRAE: Unified Multimodal Brain Decoding | Unveiling the 'Dark Side' of Brain Modality

Language: Jupyter Notebook - Size: 34.6 MB - Last synced at: 15 days ago - Pushed at: 8 months ago - Stars: 46 - Forks: 3

cambrian-mllm/cambrian

Cambrian-1 is a family of multimodal LLMs with a vision-centric design.

Language: Python - Size: 1.99 MB - Last synced at: 28 days ago - Pushed at: 6 months ago - Stars: 1,885 - Forks: 129

Haochen-Wang409/ross

[ICLR'25] Reconstructive Visual Instruction Tuning

Language: Python - Size: 12.8 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 75 - Forks: 3

ictnlp/LLaVA-Mini

LLaVA-Mini is a unified large multimodal model (LMM) that can support the understanding of images, high-resolution images, and videos in an efficient manner.

Language: Python - Size: 54.6 MB - Last synced at: about 1 month ago - Pushed at: 4 months ago - Stars: 441 - Forks: 19

FoundationVision/Liquid

Liquid: Language Models are Scalable and Unified Multi-modal Generators

Language: Python - Size: 31.6 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 353 - Forks: 24

OmniMMI/OmniMMI

[CVPR 2025] OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts

Language: Python - Size: 25.3 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 8 - Forks: 0

xmed-lab/MedRegA

[ICLR 2025] MedRegA: Interpretable Bilingual Multimodal Large Language Model for Diverse Biomedical Tasks

Language: Python - Size: 5.61 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 25 - Forks: 1

Related Keywords
multimodal-large-language-models (219), large-language-models (65), multimodal (40), llm (38), vision-language-model (25), mllm (23), large-multimodal-models (16), vlm (16), deep-learning (16), machine-learning (16), multimodal-learning (15), artificial-intelligence (14), llms (13), multimodal-deep-learning (13), large-vision-language-models (12), llava (12), generative-ai (12), chatbot (12), benchmark (12), natural-language-processing (11), multimodality (9), foundation-models (9), instruction-tuning (8), computer-vision (8), ai (8), video (7), vision-language (7), llama (7), multimodal-data (7), transformers (6), video-understanding (6), vision-transformer (6), rag (6), large-language-model (6), dataset (6), hallucination (6), visual-question-answering (6), retrieval-augmented-generation (6), vision-language-models (6), reasoning (6), python (6), agentic-ai (5), medical-image-analysis (5), streamlit (5), chatgpt (5), awesome-list (5), qwen (5), visual-instruction-tuning (5), huggingface (4), long-video-understanding (4), pytorch (4), mixture-of-experts (4), video-question-answering (4), docker (4), multi-modality (4), instruction-following (4), in-context-learning (4), llama3 (4), question-answering (4), agentic-workflow (4), vision-and-language (4), nlp (4), clip (4), gemini-pro (4), video-language-model (4), hallucination-mitigation (4), hallucination-detection (4), text-to-image-generation (3), neurips-2024 (3), gpt (3), text-to-image (3), misinformation (3), multi-modal (3), audio (3), evaluation (3), llms-benchmarking (3), aigc (3), radiology-report-generation (3), code-generation (3), chest-xrays (3), gemini-api (3), agentic-rag (3), internvl2 (3), mllm-reasoning (3), vision-language-transformer (3), large-vision-language-model (3), vision-language-learning (3), python3 (3), prompt-engineering (3), chain-of-thought (3), fact-checking (3), r1 (3), huggingface-transformers (3), pinecone (3), hallucinations (3), gradio (3), gpt-4 (3), text-to-speech (3), knowledge-graph (3), supervised-finetuning (3)