GitHub topics: multimodal-large-language-models
willxxy/ECG-Bench
A Unified Framework for Benchmarking Generative Electrocardiogram-Language Models (ELMs)
Language: Python - Size: 6.7 MB - Last synced at: about 6 hours ago - Pushed at: about 7 hours ago - Stars: 16 - Forks: 2

path2generalist/General-Level
On Path to Multimodal Generalist: General-Level and General-Bench
Language: Python - Size: 918 KB - Last synced at: about 17 hours ago - Pushed at: about 19 hours ago - Stars: 17 - Forks: 2

modelscope/ms-agent
MS-Agent: Lightweight Framework for Empowering Agents with Autonomous Exploration
Language: Python - Size: 74.1 MB - Last synced at: about 21 hours ago - Pushed at: about 24 hours ago - Stars: 3,263 - Forks: 374

AIDC-AI/Ovis
A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings.
Language: Python - Size: 6.38 MB - Last synced at: about 12 hours ago - Pushed at: 28 days ago - Stars: 985 - Forks: 64

sepiatone/unfaithful_shortcuts Fork of whitebox-research/c2-proving-ground-martinez-cot
Repository for the report 'Investigating Unfaithful Shortcuts in the CoT Reasoning for Multimodal Inputs'
Language: Python - Size: 9.94 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 0 - Forks: 0

robosense2025/track1
Track 1: Driving with Language
Language: Python - Size: 756 KB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 13 - Forks: 0

NVIDIA/audio-flamingo
PyTorch implementation of Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities.
Size: 7.08 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 496 - Forks: 29

PRAISELab-PicusLab/MMMED
MMMED is a benchmark dataset for evaluating Vision-Language Models (VLMs) on medical multiple-choice question answering (MCQA) tasks. It features 194 real-world medical questions from Spanish MIR residency exams, available in Spanish, English, and Italian.
Language: Jupyter Notebook - Size: 734 KB - Last synced at: 1 day ago - Pushed at: 2 days ago - Stars: 3 - Forks: 0

LoupXpro/AlphaExtract
AlphaExtract is a sophisticated PDF summarization tool that combines cutting-edge AI technology with efficient document processing. The project is built using Python and leverages Meta's LLaMA 4 MOE Maverick model along with Groq's inference engine to provide fast and accurate PDF summaries.
Language: Python - Size: 12.6 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 0 - Forks: 1

piomin/spring-ai-showcase
Sample Spring AI Application with several use cases
Language: Java - Size: 3.96 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 47 - Forks: 22

rese1f/MovieChat
[CVPR 2024] MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
Language: Python - Size: 78.7 MB - Last synced at: about 12 hours ago - Pushed at: 5 months ago - Stars: 634 - Forks: 41

thisisiron/vision-token-calculator
A calculator for vision tokens in VLMs.
Language: Python - Size: 23.4 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 1 - Forks: 0

baaivision/EVE
EVE Series: Encoder-Free Vision-Language Models from BAAI
Language: Python - Size: 7.29 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 332 - Forks: 8

LINs-lab/DynMoE
[ICLR 2025] Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models
Language: Python - Size: 57.3 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 115 - Forks: 14

xjywhu/Awesome-Multimodal-LLM-for-Code
Multimodal Large Language Models for Code Generation under Multimodal Scenarios
Size: 485 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 98 - Forks: 4

akusayudodograu/Agentic-RAG-Story-Generation-with-Multimodal-GenAI
Multimodal Agentic GenAI Workflow: seamlessly blends retrieval and generation for intelligent storytelling
Size: 1000 Bytes - Last synced at: 3 days ago - Pushed at: 4 days ago - Stars: 12 - Forks: 1

offline-function-calling/sdk
This repository consists of an SDK and a comprehensive set of resources to get started with function calling using offline, multimodal, large language models.
Language: Python - Size: 514 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 1 - Forks: 0

X-iZhang/Libra
[ACL 2025] Temporally-aware MLLM for Biomedical Radiology Analysis and Report Generation. Flexible toolkit with LLM backbone support, real-time validation, training resumption, and smart model saving.
Language: Python - Size: 13.7 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 15 - Forks: 2

BradyFU/Awesome-Multimodal-Large-Language-Models
Latest Advances on Multimodal Large Language Models
Size: 82.9 MB - Last synced at: 5 days ago - Pushed at: 11 days ago - Stars: 15,751 - Forks: 1,022

IrohXu/Awesome-Multimodal-LLM-Autonomous-Driving
[WACV 2024 Survey Paper] Multimodal Large Language Models for Autonomous Driving
Size: 15 MB - Last synced at: about 1 hour ago - Pushed at: over 1 year ago - Stars: 287 - Forks: 11

yu-rp/Dimple
Dimple, the first Discrete Diffusion Multimodal Large Language Model
Language: Python - Size: 3.96 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 71 - Forks: 3

Sandia7171717171/CharmBench
CharmBench offers a challenging benchmark for large vision-language models, providing datasets and evaluation tools to enhance multimodal reasoning. Check out our latest updates and contribute to the project by starring the repo!
Language: Jupyter Notebook - Size: 7.23 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 0 - Forks: 0

AIDC-AI/Ovis-U1
A unified model that seamlessly integrates multimodal understanding, text-to-image generation, and image editing within a single powerful framework.
Language: Python - Size: 10 MB - Last synced at: 6 days ago - Pushed at: 9 days ago - Stars: 267 - Forks: 5

thisisiron/LLaVA-Pool
A flexible framework for training and configuring Vision-Language Models
Language: Python - Size: 3.17 MB - Last synced at: 4 days ago - Pushed at: 6 days ago - Stars: 1 - Forks: 0

willxxy/ECG-Byte
[MLHC 2025] ECG-Byte: A Tokenizer for End-to-End Generative Electrocardiogram Language Modeling
Language: Python - Size: 28.5 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 19 - Forks: 0

zjysteven/Awesome-Byte-LLM
A curated list of papers and resources on byte-based large language models (LLMs): models that operate directly on raw bytes.
Size: 435 KB - Last synced at: 3 days ago - Pushed at: 7 months ago - Stars: 4 - Forks: 0

multimindlab/multimind-sdk
Your SDK solves all of this. One interface. Unified logic. Local + hosted models. Fine-tuning. Agent tools. Enterprise-ready. Hybrid RAG. Star if you like it!
Language: Python - Size: 46.7 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 27 - Forks: 2

Glasgow-AI4BioMed/Libra Fork of X-iZhang/Libra
[ACL 2025] Libra: Leveraging Temporal Images for Biomedical Radiology Analysis
Language: Python - Size: 13.6 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 1 - Forks: 0

The-Martyr/CausalMM
[ICLR 2025] Mitigating Modality Prior-Induced Hallucinations in Multimodal Large Language Models via Deciphering Attention Causality
Language: Python - Size: 7.1 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 32 - Forks: 3

GerrySant/multimodalhugs
MultimodalHugs is an extension of Hugging Face that offers a generalized framework for training, evaluating, and using multimodal AI models with minimal code differences, ensuring seamless compatibility with Hugging Face pipelines.
Language: Python - Size: 4.4 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 8 - Forks: 2

X-LANCE/SLAM-LLM
Speech, Language, Audio, Music Processing with Large Language Model
Language: Python - Size: 169 MB - Last synced at: 6 days ago - Pushed at: 26 days ago - Stars: 844 - Forks: 85

Glass1973/multimind-sdk-js
Access advanced AI features with MultiMind SDK for JavaScript. Simplify agent orchestration and fine-tuning without backend management.
Language: TypeScript - Size: 137 KB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 0 - Forks: 0

vincentlux/Awesome-Multimodal-LLM
Reading list for Multimodal Large Language Models
Size: 110 KB - Last synced at: 3 days ago - Pushed at: almost 2 years ago - Stars: 68 - Forks: 7

zjunlp/OceanGPT
[沧渊] [ACL 2024] OceanGPT: A Large Language Model for Ocean Science Tasks
Language: Python - Size: 40.6 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 62 - Forks: 8

Hoar012/TDC-Video
Official implementation of TDC.
Language: Python - Size: 5.86 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 9 - Forks: 1

DataArcTech/ChartMoE
[ICLR2025 Oral] ChartMoE: Mixture of Diversely Aligned Expert Connector for Chart Understanding
Language: Jupyter Notebook - Size: 9.77 MB - Last synced at: 5 days ago - Pushed at: 3 months ago - Stars: 85 - Forks: 4

X-PLUG/MobileAgent
Mobile-Agent: The Powerful Mobile Device Operation Assistant Family
Language: Python - Size: 386 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 4,414 - Forks: 451

MME-Benchmarks/Video-MME
[CVPR 2025] Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Size: 16.7 MB - Last synced at: 9 days ago - Pushed at: 2 months ago - Stars: 584 - Forks: 24

gautierdag/plancraft
Plancraft is a Minecraft environment and agent suite to test planning capabilities in LLMs
Language: Python - Size: 124 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 15 - Forks: 0

AIDC-AI/Awesome-Unified-Multimodal-Models
Awesome Unified Multimodal Models
Size: 9.27 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 374 - Forks: 11

Shubin-vadim/Arxplover
Comprehensive multimodal system for analyzing documents with support for extracting and processing text, tables, and images
Language: HTML - Size: 2.73 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 0 - Forks: 0

Siyou-Li/u2Tokenizer
A multiscale multimodal large language model for radiology report generation (RRG) tasks
Language: Python - Size: 22.9 MB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 100 - Forks: 10

Monoese/VoiceCordAI
A Discord bot for real-time voice chat with AI services like OpenAI and Google Gemini.
Language: Python - Size: 306 KB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 0 - Forks: 0

burglarhobbit/Awesome-Medical-Large-Language-Models
Curated papers on Large Language Models in Healthcare and Medical domain
Size: 53.7 KB - Last synced at: 4 days ago - Pushed at: about 1 month ago - Stars: 333 - Forks: 40

jingyi0000/R1-VL
R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization
Language: Python - Size: 2.37 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 400 - Forks: 0

richard-peng-xia/awesome-multimodal-in-medical-imaging
A collection of resources on applications of multi-modal learning in medical imaging.
Size: 327 KB - Last synced at: 11 days ago - Pushed at: about 1 month ago - Stars: 770 - Forks: 69

yaotingwangofficial/Awesome-MCoT
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
Size: 11.5 MB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 669 - Forks: 20

ictnlp/LLaVA-Mini
LLaVA-Mini is a unified large multimodal model (LMM) that can support the understanding of images, high-resolution images, and videos in an efficient manner.
Language: Python - Size: 54.6 MB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 499 - Forks: 22

declare-lab/Auto-Scaling
[Arxiv 2024] Official Implementation of the paper: "Towards Robust Instruction Tuning on Multimodal Large Language Models"
Language: Jupyter Notebook - Size: 67.6 MB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 9 - Forks: 1

abdur75648/V-Zen
V-Zen: Efficient GUI Understanding and Precise Grounding With A Novel Multimodal LLM Resources
Size: 5.86 KB - Last synced at: about 13 hours ago - Pushed at: 12 months ago - Stars: 7 - Forks: 3

RainBowLuoCS/Awesome-Unified-Multimodal-Understanding-and-Generation
Must-read papers on Unified Multimodal Understanding and Generation (constantly updating).
Size: 11.7 KB - Last synced at: 14 days ago - Pushed at: 29 days ago - Stars: 6 - Forks: 0

Video-Bench/Video-Bench
Video Generation Benchmark
Language: Python - Size: 10.1 MB - Last synced at: 11 days ago - Pushed at: about 1 month ago - Stars: 39 - Forks: 3

friedrichor/Awesome-Multimodal-Papers
A curated list of awesome Multimodal studies.
Size: 63.4 MB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 216 - Forks: 20

vaew/Awesome-spatial-visual-reasoning-MLLMs
Repository for awesome spatial/visual reasoning MLLMs. (focus more on embodied applications)
Language: Python - Size: 3.41 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 54 - Forks: 1

HyeonjeongHa/MM-PoisonRAG
Official PyTorch implementation of "MM-PoisonRAG: Disrupting Multimodal RAG with Local and Global Poisoning Attacks"
Language: Python - Size: 28.5 MB - Last synced at: 8 days ago - Pushed at: 3 months ago - Stars: 6 - Forks: 1

RainBowLuoCS/OpenOmni
OpenOmni: Official implementation of Advancing Open-Source Omnimodal Large Language Models with Progressive Multimodal Alignment and Real-Time Self-Aware Emotional Speech Synthesis
Language: Python - Size: 8.5 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 64 - Forks: 5

Hyeongkeun/LAVCap
Official PyTorch Implementation of 'LAVCap: LLM-based Audio-Visual Captioning using Optimal Transport' (ICASSP2025)
Language: Python - Size: 3.58 MB - Last synced at: 8 days ago - Pushed at: 3 months ago - Stars: 6 - Forks: 0

dvlab-research/VisionReasoner
The official implementation of "VisionReasoner: Unified Visual Perception and Reasoning via Reinforcement Learning"
Language: Python - Size: 12.3 MB - Last synced at: 13 days ago - Pushed at: about 1 month ago - Stars: 190 - Forks: 12

mashijie1028/GenHancer
(ICCV 2025) Enhance CLIP and MLLM's fine-grained visual representations with generative models.
Language: Python - Size: 2.69 MB - Last synced at: 17 days ago - Pushed at: 17 days ago - Stars: 53 - Forks: 0

VITA-MLLM/VITA
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
Language: Python - Size: 15.7 MB - Last synced at: 16 days ago - Pushed at: 4 months ago - Stars: 2,334 - Forks: 173

lqzxt/ChemTable
ChemTable is a large-scale benchmark designed to test the capabilities of multimodal large language models (MLLMs) in understanding real-world chemical tables, one of the most information-dense and visually complex formats in scientific literature.
Language: Python - Size: 4.25 MB - Last synced at: 17 days ago - Pushed at: 17 days ago - Stars: 2 - Forks: 0

shaopengw/Awesome-Music-Generation
Awesome music generation model: MG²
Language: Python - Size: 3.16 MB - Last synced at: 14 days ago - Pushed at: 4 months ago - Stars: 158 - Forks: 10

jolibrain/colette
Search and interact locally with technical documents of any kind
Language: HTML - Size: 7.74 MB - Last synced at: 13 days ago - Pushed at: 19 days ago - Stars: 11 - Forks: 6

dvlab-research/LSDBench
A benchmark that focuses on the sampling dilemma in long-video tasks. Through well-designed tasks, it evaluates the sampling efficiency of long-video VLMs.
Language: Python - Size: 2.57 MB - Last synced at: 9 days ago - Pushed at: 4 months ago - Stars: 16 - Forks: 0

eric-ai-lab/MSSBench
[ICLR 2025] Official codebase for the ICLR 2025 paper "Multimodal Situational Safety"
Language: Python - Size: 1.47 MB - Last synced at: 19 days ago - Pushed at: 19 days ago - Stars: 17 - Forks: 1

rivi89/Awesome-spatial-visual-reasoning-MLLMs
Language: Python - Size: 3.31 MB - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 1 - Forks: 0

manasa-26/VoiceAssist-RAG
Multimodal Voice RAG Agent using Speech-to-Text, FAISS Search, and Text-to-Speech
Language: Python - Size: 22.5 KB - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 1 - Forks: 0

diankun-wu/Spatial-MLLM
Official implementation of Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence
Language: Python - Size: 21.3 MB - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 234 - Forks: 7

MSR3D/MSR3D
[NeurIPS 2024] Official code repository for MSR3D paper
Language: Python - Size: 75.7 MB - Last synced at: 24 days ago - Pushed at: 24 days ago - Stars: 60 - Forks: 3

JinXins/Awesome-Token-Merge-for-MLLMs
A paper list about Token Merge, Reduce, Resample, Drop for MLLMs.
Size: 103 KB - Last synced at: 15 days ago - Pushed at: 6 months ago - Stars: 62 - Forks: 0

mbzuai-oryx/LLMVoX
LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM
Language: Python - Size: 132 MB - Last synced at: 23 days ago - Pushed at: about 2 months ago - Stars: 258 - Forks: 31

ryota-komatsu/slp2025
Materials for the 音学シンポジウム 2025 tutorial "Introduction to Multimodal Large Language Models"
Language: Jupyter Notebook - Size: 19.6 MB - Last synced at: 25 days ago - Pushed at: 25 days ago - Stars: 16 - Forks: 2

multimodal-ai-lab/DEFAME
Fact-checking system for textual and visual inputs.
Language: Python - Size: 30.1 MB - Last synced at: 26 days ago - Pushed at: 26 days ago - Stars: 20 - Forks: 4

AIDC-AI/Parrot
The code repository for "Parrot: Multilingual Visual Instruction Tuning" in PyTorch.
Language: Python - Size: 25.2 MB - Last synced at: 10 days ago - Pushed at: about 1 month ago - Stars: 87 - Forks: 2

YingqingHe/Awesome-LLMs-meet-Multimodal-Generation
A curated list of papers on LLMs-based multimodal generation (image, video, 3D and audio).
Language: HTML - Size: 12.7 MB - Last synced at: 19 days ago - Pushed at: 3 months ago - Stars: 479 - Forks: 28

GLUS-video/GLUS
[CVPR 2025] Official PyTorch Implementation of GLUS: Global-Local Reasoning Unified into A Single Large Language Model for Video Segmentation
Language: Jupyter Notebook - Size: 66.4 MB - Last synced at: 26 days ago - Pushed at: 26 days ago - Stars: 43 - Forks: 4

BUAADreamer/Qwen2-VL-History
A LLaMA-Factory fine-tuning case study for Qwen2-VL in the field of historical literature and museums
Size: 73.8 MB - Last synced at: 5 days ago - Pushed at: 10 months ago - Stars: 11 - Forks: 2

Glasgow-AI4BioMed/RRG-BioNLP-ACL2024 Fork of X-iZhang/RRG-BioNLP-ACL2024
[BioNLP ACL'24] Gla-AI4BioMed at RRG24: Visual Instruction-tuned Adaptation for Radiology Report Generation
Language: Python - Size: 737 KB - Last synced at: 28 days ago - Pushed at: 28 days ago - Stars: 1 - Forks: 0

Wang-ML-Lab/interpretable-foundation-models
[ICML 2024] Probabilistic Conceptual Explainers (PACE): Trustworthy Conceptual Explanations for Vision Foundation Models
Language: Python - Size: 52.7 KB - Last synced at: 28 days ago - Pushed at: 28 days ago - Stars: 15 - Forks: 3

UKPLab/arxiv2025-misleading-visualizations
Code and datasets accompanying the arXiv preprint: "Protecting multimodal large language models against misleading visualizations"
Language: JavaScript - Size: 22.6 MB - Last synced at: 24 days ago - Pushed at: 29 days ago - Stars: 2 - Forks: 0

ritzz-ai/GUI-R1
Official implementation of GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents
Language: Python - Size: 974 KB - Last synced at: 28 days ago - Pushed at: 2 months ago - Stars: 112 - Forks: 11

Paranioar/Awesome_Matching_Pretraining_Transfering
The Paper List of Large Multi-Modality Model (Perception, Generation, Unification), Parameter-Efficient Finetuning, Vision-Language Pretraining, Conventional Image-Text Matching for Preliminary Insight.
Size: 369 KB - Last synced at: 23 days ago - Pushed at: 7 months ago - Stars: 422 - Forks: 48

mbzuai-oryx/ALM-Bench
[CVPR 2025] ALM-Bench is a multilingual, multimodal, culturally diverse benchmark for 100 languages across 19 categories. It assesses the next generation of LMMs on cultural inclusivity.
Language: Python - Size: 26.7 MB - Last synced at: 23 days ago - Pushed at: about 2 months ago - Stars: 40 - Forks: 2

saky-semicolon/Multimodal-Readmission-Prediction
Multimodal fusion model for predicting 30-day hospital readmission using structured EHR data and BERT-based clinical text embeddings from the MIMIC-III dataset.
Language: Jupyter Notebook - Size: 1.69 MB - Last synced at: 30 days ago - Pushed at: 30 days ago - Stars: 0 - Forks: 0

ByteDance-Seed/Seed1.5-VL
Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning, achieving state-of-the-art performance on 38 out of 60 public benchmarks.
Language: Jupyter Notebook - Size: 140 MB - Last synced at: about 1 month ago - Pushed at: about 2 months ago - Stars: 1,189 - Forks: 43

zjunlp/EasyDetect
[ACL 2024] An Easy-to-use Hallucination Detection Framework for LLMs.
Language: Python - Size: 11.5 MB - Last synced at: 28 days ago - Pushed at: 5 months ago - Stars: 34 - Forks: 2

Hoar012/RAP-MLLM
[CVPR 2025] RAP: Retrieval-Augmented Personalization
Language: Python - Size: 60.9 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 57 - Forks: 1

xid32/NAACL_2025_TWM
We introduce temporal working memory (TWM), which aims to enhance the temporal modeling capabilities of Multimodal foundation models (MFMs). This plug-and-play module can be easily integrated into existing MFMs. With our TWM, nine state-of-the-art models exhibit significant performance improvements across QA, captioning, and retrieval tasks.
Language: Python - Size: 896 KB - Last synced at: 9 days ago - Pushed at: 6 months ago - Stars: 309 - Forks: 30

AstraZeneca/vlm
Official implementation for "Diffusion Instruction Tuning"
Language: Python - Size: 25.5 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 23 - Forks: 2

zjunlp/Deco
[ICLR 2025] MLLM can see? Dynamic Correction Decoding for Hallucination Mitigation
Language: Python - Size: 17.6 MB - Last synced at: 28 days ago - Pushed at: 7 months ago - Stars: 85 - Forks: 8

OpenGVLab/PIIP
[NeurIPS 2024 Spotlight] Parameter-Inverted Image Pyramid Networks (PIIP)
Language: Python - Size: 11.7 MB - Last synced at: 12 days ago - Pushed at: about 2 months ago - Stars: 92 - Forks: 2

luisrui/Modality-Interference-in-MLLMs
The source code for the paper "Diagnosing and Mitigating Modality Interference in Multimodal Large Language Models"
Language: Python - Size: 0 Bytes - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

hacheyz/FlowchartQA
Create flowchart QA datasets using Python and Mermaid, free of AIGC.
Language: Python - Size: 457 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

cyw-3d/SAR3D
Official repository for "SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE"
Language: Python - Size: 11.8 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 154 - Forks: 1

Vignesh010101/Intelligent-Health-LLM-System
An Intelligent Health LLM System for Personalized Medication Guidance and Support.
Language: Jupyter Notebook - Size: 620 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 3 - Forks: 0

alianoroozi/ai-hub
A collection of AI experiments, including model training, ML system development, and end-to-end pipelines.
Language: Jupyter Notebook - Size: 43.4 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

X-iZhang/RRG-BioNLP-ACL2024
[BioNLP ACL'24] Med-CXRGen, developed by Glasgow AI4BioMed Lab, brings vision-language adaptation to biomedical radiology via visual instruction tuning.
Language: Python - Size: 596 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 1

emanalytic/MultiModal-E-Commerce-Customer-Support-Chatbot
Multimodal Customer Service Chatbot
Language: Python - Size: 154 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 4 - Forks: 1

msamprovalaki/Exploring-Multimodal-Large-Language-Models-for-Medical-Image-Captioning
This repository includes the code for my Master Thesis, which investigates the application of Multimodal Large Language Models (MLLMs) for medical image captioning
Language: Python - Size: 5.45 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 6 - Forks: 0

natgluons/AI-docs-analyzer-API
Automate invoice analysis and identity verification, built with an open-source multimodal LLM and OCR (DocTR/TrOCR), using FastAPI, Supabase, PgVector, and Neo4j.
Language: Python - Size: 8.79 KB - Last synced at: 23 days ago - Pushed at: about 2 months ago - Stars: 1 - Forks: 0
