GitHub topics: multimodal-large-language-models

Repositories

willxxy/ECG-Bench

A Unified Framework for Benchmarking Generative Electrocardiogram-Language Models (ELMs)

Language: Python - Size: 6.7 MB - Last synced at: about 6 hours ago - Pushed at: about 7 hours ago - Stars: 16 - Forks: 2

path2generalist/General-Level

On Path to Multimodal Generalist: General-Level and General-Bench

Language: Python - Size: 918 KB - Last synced at: about 17 hours ago - Pushed at: about 19 hours ago - Stars: 17 - Forks: 2

modelscope/ms-agent

MS-Agent: Lightweight Framework for Empowering Agents with Autonomous Exploration

Language: Python - Size: 74.1 MB - Last synced at: about 21 hours ago - Pushed at: about 24 hours ago - Stars: 3,263 - Forks: 374

AIDC-AI/Ovis

A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings.

Language: Python - Size: 6.38 MB - Last synced at: about 12 hours ago - Pushed at: 28 days ago - Stars: 985 - Forks: 64

sepiatone/unfaithful_shortcuts Fork of whitebox-research/c2-proving-ground-martinez-cot

Repository for the report 'Investigating Unfaithful Shortcuts in the CoT Reasoning for Multimodal Inputs'

Language: Python - Size: 9.94 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 0 - Forks: 0

robosense2025/track1

Track 1: Driving with Language

Language: Python - Size: 756 KB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 13 - Forks: 0

NVIDIA/audio-flamingo

PyTorch implementation of Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities.

Size: 7.08 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 496 - Forks: 29

🩺 MMMED is a benchmark dataset for evaluating Vision-Language Models (VLMs) on medical multiple-choice question answering (MCQA) tasks. 🏥💡 It features 194 real-world medical questions from Spanish MIR residency exams, available in 🇪🇸 Spanish, 🇬🇧 English, and 🇮🇹 Italian.

Language: Jupyter Notebook - Size: 734 KB - Last synced at: 1 day ago - Pushed at: 2 days ago - Stars: 3 - Forks: 0

LoupXpro/AlphaExtract

AlphaExtract is a sophisticated PDF summarization tool that combines cutting-edge AI technology with efficient document processing. The project is built using Python and leverages Meta's LLaMA 4 MOE Maverick model along with Groq's inference engine to provide fast and accurate PDF summaries.

Language: Python - Size: 12.6 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 0 - Forks: 1

piomin/spring-ai-showcase

Sample Spring AI Application with several use cases

Language: Java - Size: 3.96 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 47 - Forks: 22

rese1f/MovieChat

[CVPR 2024] MovieChat: From Dense Token to Sparse Memory for Long Video Understanding

Language: Python - Size: 78.7 MB - Last synced at: about 12 hours ago - Pushed at: 5 months ago - Stars: 634 - Forks: 41

thisisiron/vision-token-calculator

🧮 A calculator for vision tokens in VLMs.

Language: Python - Size: 23.4 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 1 - Forks: 0

baaivision/EVE

EVE Series: Encoder-Free Vision-Language Models from BAAI

Language: Python - Size: 7.29 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 332 - Forks: 8

LINs-lab/DynMoE

[ICLR 2025] Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models

Language: Python - Size: 57.3 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 115 - Forks: 14

xjywhu/Awesome-Multimodal-LLM-for-Code

Multimodal Large Language Models for Code Generation under Multimodal Scenarios

Size: 485 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 98 - Forks: 4

akusayudodograu/Agentic-RAG-Story-Generation-with-Multimodal-GenAI

Multimodal Agentic GenAI Workflow – Seamlessly blends retrieval and generation for intelligent storytelling

Size: 1000 Bytes - Last synced at: 3 days ago - Pushed at: 4 days ago - Stars: 12 - Forks: 1

offline-function-calling/sdk

This repository consists of an SDK and a comprehensive set of resources to get started with function calling using offline, multimodal, large language models.

Language: Python - Size: 514 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 1 - Forks: 0

X-iZhang/Libra

[ACL 2025] ⚖️ Temporally-aware MLLM for Biomedical Radiology Analysis and Report Generation. Flexible toolkit with LLM backbone support, real-time validation, training resumption, and smart model saving.

Language: Python - Size: 13.7 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 15 - Forks: 2

BradyFU/Awesome-Multimodal-Large-Language-Models

✨✨ Latest Advances on Multimodal Large Language Models

Size: 82.9 MB - Last synced at: 5 days ago - Pushed at: 11 days ago - Stars: 15,751 - Forks: 1,022

IrohXu/Awesome-Multimodal-LLM-Autonomous-Driving

[WACV 2024 Survey Paper] Multimodal Large Language Models for Autonomous Driving

Size: 15 MB - Last synced at: about 1 hour ago - Pushed at: over 1 year ago - Stars: 287 - Forks: 11

yu-rp/Dimple

Dimple, the first Discrete Diffusion Multimodal Large Language Model

Language: Python - Size: 3.96 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 71 - Forks: 3

Sandia7171717171/CharmBench

CharmBench offers a challenging benchmark for large vision-language models, providing datasets and evaluation tools to enhance multimodal reasoning. Check out our latest updates and contribute to the project by starring the repo! 🌟👩💻

Language: Jupyter Notebook - Size: 7.23 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 0 - Forks: 0

AIDC-AI/Ovis-U1

A unified model that seamlessly integrates multimodal understanding, text-to-image generation, and image editing within a single powerful framework.

Language: Python - Size: 10 MB - Last synced at: 6 days ago - Pushed at: 9 days ago - Stars: 267 - Forks: 5

thisisiron/LLaVA-Pool

🌋 A flexible framework for training and configuring Vision-Language Models

Language: Python - Size: 3.17 MB - Last synced at: 4 days ago - Pushed at: 6 days ago - Stars: 1 - Forks: 0

willxxy/ECG-Byte

[MLHC 2025] ECG-Byte: A Tokenizer for End-to-End Generative Electrocardiogram Language Modeling

Language: Python - Size: 28.5 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 19 - Forks: 0

zjysteven/Awesome-Byte-LLM

A curated list of papers and resources on byte-based large language models (LLMs) — models that operate directly on raw bytes.

Size: 435 KB - Last synced at: 3 days ago - Pushed at: 7 months ago - Stars: 4 - Forks: 0

multimindlab/multimind-sdk

Your SDK solves all of this. One interface. Unified logic. Local + hosted models. Fine-tuning. Agent tools. Enterprise-ready. Hybrid RAG. Star 🌟 if you like it!

Language: Python - Size: 46.7 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 27 - Forks: 2

Glasgow-AI4BioMed/Libra Fork of X-iZhang/Libra

[ACL 2025] Libra: Leveraging Temporal Images for Biomedical Radiology Analysis

Language: Python - Size: 13.6 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 1 - Forks: 0

The-Martyr/CausalMM

[ICLR 2025] Mitigating Modality Prior-Induced Hallucinations in Multimodal Large Language Models via Deciphering Attention Causality

Language: Python - Size: 7.1 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 32 - Forks: 3

GerrySant/multimodalhugs

MultimodalHugs is an extension of Hugging Face that offers a generalized framework for training, evaluating, and using multimodal AI models with minimal code differences, ensuring seamless compatibility with Hugging Face pipelines.

Language: Python - Size: 4.4 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 8 - Forks: 2

X-LANCE/SLAM-LLM

Speech, Language, Audio, Music Processing with Large Language Model

Language: Python - Size: 169 MB - Last synced at: 6 days ago - Pushed at: 26 days ago - Stars: 844 - Forks: 85

Glass1973/multimind-sdk-js

Access advanced AI features with MultiMind SDK for JavaScript. Simplify agent orchestration and fine-tuning without backend management. 🌟🚀

Language: TypeScript - Size: 137 KB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 0 - Forks: 0

vincentlux/Awesome-Multimodal-LLM

Reading list for Multimodal Large Language Models

Size: 110 KB - Last synced at: 3 days ago - Pushed at: almost 2 years ago - Stars: 68 - Forks: 7

zjunlp/OceanGPT

[沧渊] [ACL 2024] OceanGPT: A Large Language Model for Ocean Science Tasks

Language: Python - Size: 40.6 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 62 - Forks: 8

Hoar012/TDC-Video

Official implementation of TDC.

Language: Python - Size: 5.86 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 9 - Forks: 1

DataArcTech/ChartMoE

[ICLR 2025 Oral] ChartMoE: Mixture of Diversely Aligned Expert Connector for Chart Understanding

Language: Jupyter Notebook - Size: 9.77 MB - Last synced at: 5 days ago - Pushed at: 3 months ago - Stars: 85 - Forks: 4

X-PLUG/MobileAgent

Mobile-Agent: The Powerful Mobile Device Operation Assistant Family

Language: Python - Size: 386 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 4,414 - Forks: 451

MME-Benchmarks/Video-MME

✨✨[CVPR 2025] Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Size: 16.7 MB - Last synced at: 9 days ago - Pushed at: 2 months ago - Stars: 584 - Forks: 24

gautierdag/plancraft

Plancraft is a Minecraft environment and agent suite to test planning capabilities in LLMs

Language: Python - Size: 124 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 15 - Forks: 0

AIDC-AI/Awesome-Unified-Multimodal-Models

Awesome Unified Multimodal Models

Size: 9.27 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 374 - Forks: 11

Shubin-vadim/Arxplover

Comprehensive multimodal system for analyzing documents with support for extracting and processing text, tables, and images

Language: HTML - Size: 2.73 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 0 - Forks: 0

Siyou-Li/u2Tokenizer

A multiscale multimodal large language model for radiology report generation (RRG) tasks

Language: Python - Size: 22.9 MB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 100 - Forks: 10

Monoese/VoiceCordAI

A Discord bot for real-time voice chat with AI services like OpenAI and Google Gemini.

Language: Python - Size: 306 KB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 0 - Forks: 0

burglarhobbit/Awesome-Medical-Large-Language-Models

Curated papers on Large Language Models in Healthcare and Medical domain

Size: 53.7 KB - Last synced at: 4 days ago - Pushed at: about 1 month ago - Stars: 333 - Forks: 40

jingyi0000/R1-VL

R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization

Language: Python - Size: 2.37 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 400 - Forks: 0

richard-peng-xia/awesome-multimodal-in-medical-imaging

A collection of resources on applications of multi-modal learning in medical imaging.

Size: 327 KB - Last synced at: 11 days ago - Pushed at: about 1 month ago - Stars: 770 - Forks: 69

yaotingwangofficial/Awesome-MCoT

Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey

Size: 11.5 MB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 669 - Forks: 20

ictnlp/LLaVA-Mini

LLaVA-Mini is a unified large multimodal model (LMM) that can support the understanding of images, high-resolution images, and videos in an efficient manner.

Language: Python - Size: 54.6 MB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 499 - Forks: 22

declare-lab/Auto-Scaling

[Arxiv 2024] Official Implementation of the paper: "Towards Robust Instruction Tuning on Multimodal Large Language Models"

Language: Jupyter Notebook - Size: 67.6 MB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 9 - Forks: 1

abdur75648/V-Zen

V-Zen: Efficient GUI Understanding and Precise Grounding With A Novel Multimodal LLM Resources

Size: 5.86 KB - Last synced at: about 13 hours ago - Pushed at: 12 months ago - Stars: 7 - Forks: 3

RainBowLuoCS/Awesome-Unified-Multimodal-Understanding-and-Generation

📰 Must-read papers on Unified Multimodal Understanding and Generation (constantly updating 🤗).

Size: 11.7 KB - Last synced at: 14 days ago - Pushed at: 29 days ago - Stars: 6 - Forks: 0

Video-Bench/Video-Bench

Video Generation Benchmark

Language: Python - Size: 10.1 MB - Last synced at: 11 days ago - Pushed at: about 1 month ago - Stars: 39 - Forks: 3

friedrichor/Awesome-Multimodal-Papers

A curated list of awesome Multimodal studies.

Size: 63.4 MB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 216 - Forks: 20

vaew/Awesome-spatial-visual-reasoning-MLLMs

Repository for awesome spatial/visual reasoning MLLMs. (focus more on embodied applications)

Language: Python - Size: 3.41 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 54 - Forks: 1

HyeonjeongHa/MM-PoisonRAG

Official PyTorch implementation of "MM-PoisonRAG: Disrupting Multimodal RAG with Local and Global Poisoning Attacks"

Language: Python - Size: 28.5 MB - Last synced at: 8 days ago - Pushed at: 3 months ago - Stars: 6 - Forks: 1

RainBowLuoCS/OpenOmni

OpenOmni: Official implementation of Advancing Open-Source Omnimodal Large Language Models with Progressive Multimodal Alignment and Real-Time Self-Aware Emotional Speech Synthesis

Language: Python - Size: 8.5 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 64 - Forks: 5

Hyeongkeun/LAVCap

Official PyTorch Implementation of 'LAVCap: LLM-based Audio-Visual Captioning using Optimal Transport' (ICASSP 2025)

Language: Python - Size: 3.58 MB - Last synced at: 8 days ago - Pushed at: 3 months ago - Stars: 6 - Forks: 0

dvlab-research/VisionReasoner

The official implementation of "VisionReasoner: Unified Visual Perception and Reasoning via Reinforcement Learning"

Language: Python - Size: 12.3 MB - Last synced at: 13 days ago - Pushed at: about 1 month ago - Stars: 190 - Forks: 12

mashijie1028/GenHancer

(ICCV 2025) Enhance CLIP and MLLM's fine-grained visual representations with generative models.

Language: Python - Size: 2.69 MB - Last synced at: 17 days ago - Pushed at: 17 days ago - Stars: 53 - Forks: 0

VITA-MLLM/VITA

✨✨VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

Language: Python - Size: 15.7 MB - Last synced at: 16 days ago - Pushed at: 4 months ago - Stars: 2,334 - Forks: 173

lqzxt/ChemTable

ChemTable is a large-scale benchmark designed to test the capabilities of multimodal large language models (MLLMs) in understanding real-world chemical tables—one of the most information-dense and visually complex formats in scientific literature.

Language: Python - Size: 4.25 MB - Last synced at: 17 days ago - Pushed at: 17 days ago - Stars: 2 - Forks: 0

shaopengw/Awesome-Music-Generation

Awesome music generation model: MG²

Language: Python - Size: 3.16 MB - Last synced at: 14 days ago - Pushed at: 4 months ago - Stars: 158 - Forks: 10

jolibrain/colette

Search and interact locally with technical documents of any kind

Language: HTML - Size: 7.74 MB - Last synced at: 13 days ago - Pushed at: 19 days ago - Stars: 11 - Forks: 6

dvlab-research/LSDBench

A benchmark that focuses on the sampling dilemma in long-video tasks. Through well-designed tasks, it evaluates the sampling efficiency of long-video VLMs.

Language: Python - Size: 2.57 MB - Last synced at: 9 days ago - Pushed at: 4 months ago - Stars: 16 - Forks: 0

eric-ai-lab/MSSBench

[ICLR 2025] Official codebase for the ICLR 2025 paper "Multimodal Situational Safety"

Language: Python - Size: 1.47 MB - Last synced at: 19 days ago - Pushed at: 19 days ago - Stars: 17 - Forks: 1

rivi89/Awesome-spatial-visual-reasoning-MLLMs

Language: Python - Size: 3.31 MB - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 1 - Forks: 0

manasa-26/VoiceAssist-RAG

Multimodal Voice RAG Agent using Speech-to-Text, FAISS Search, and Text-to-Speech

Language: Python - Size: 22.5 KB - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 1 - Forks: 0

diankun-wu/Spatial-MLLM

Official implementation of Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

Language: Python - Size: 21.3 MB - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 234 - Forks: 7

MSR3D/MSR3D

[NeurIPS 2024] Official code repository for MSR3D paper

Language: Python - Size: 75.7 MB - Last synced at: 24 days ago - Pushed at: 24 days ago - Stars: 60 - Forks: 3

JinXins/Awesome-Token-Merge-for-MLLMs

A paper list about Token Merge, Reduce, Resample, Drop for MLLMs.

Size: 103 KB - Last synced at: 15 days ago - Pushed at: 6 months ago - Stars: 62 - Forks: 0

mbzuai-oryx/LLMVoX

LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM

Language: Python - Size: 132 MB - Last synced at: 23 days ago - Pushed at: about 2 months ago - Stars: 258 - Forks: 31

ryota-komatsu/slp2025

Materials for the tutorial "Introduction to Multimodal Large Language Models" at 音学シンポジウム2025

Language: Jupyter Notebook - Size: 19.6 MB - Last synced at: 25 days ago - Pushed at: 25 days ago - Stars: 16 - Forks: 2

multimodal-ai-lab/DEFAME

Fact-checking system for textual and visual inputs.

Language: Python - Size: 30.1 MB - Last synced at: 26 days ago - Pushed at: 26 days ago - Stars: 20 - Forks: 4

AIDC-AI/Parrot

🎉 The code repository for "Parrot: Multilingual Visual Instruction Tuning" in PyTorch.

Language: Python - Size: 25.2 MB - Last synced at: 10 days ago - Pushed at: about 1 month ago - Stars: 87 - Forks: 2

YingqingHe/Awesome-LLMs-meet-Multimodal-Generation

🔥🔥🔥 A curated list of papers on LLMs-based multimodal generation (image, video, 3D and audio).

Language: HTML - Size: 12.7 MB - Last synced at: 19 days ago - Pushed at: 3 months ago - Stars: 479 - Forks: 28

GLUS-video/GLUS

[CVPR 2025] Official PyTorch Implementation of GLUS: Global-Local Reasoning Unified into A Single Large Language Model for Video Segmentation

Language: Jupyter Notebook - Size: 66.4 MB - Last synced at: 26 days ago - Pushed at: 26 days ago - Stars: 43 - Forks: 4

BUAADreamer/Qwen2-VL-History

A LLaMA-Factory fine-tuning case study for Qwen2-VL in the culture and tourism domain (historical literature and museums)

Size: 73.8 MB - Last synced at: 5 days ago - Pushed at: 10 months ago - Stars: 11 - Forks: 2

Glasgow-AI4BioMed/RRG-BioNLP-ACL2024 Fork of X-iZhang/RRG-BioNLP-ACL2024

[BioNLP ACL'24] Gla-AI4BioMed at RRG24: Visual Instruction-tuned Adaptation for Radiology Report Generation

Language: Python - Size: 737 KB - Last synced at: 28 days ago - Pushed at: 28 days ago - Stars: 1 - Forks: 0

Wang-ML-Lab/interpretable-foundation-models

[ICML 2024] Probabilistic Conceptual Explainers (PACE): Trustworthy Conceptual Explanations for Vision Foundation Models

Language: Python - Size: 52.7 KB - Last synced at: 28 days ago - Pushed at: 28 days ago - Stars: 15 - Forks: 3

UKPLab/arxiv2025-misleading-visualizations

Code and datasets accompanying the arXiv preprint: "Protecting multimodal large language models against misleading visualizations"

Language: JavaScript - Size: 22.6 MB - Last synced at: 24 days ago - Pushed at: 29 days ago - Stars: 2 - Forks: 0

ritzz-ai/GUI-R1

Official implementation of GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents

Language: Python - Size: 974 KB - Last synced at: 28 days ago - Pushed at: 2 months ago - Stars: 112 - Forks: 11

Paranioar/Awesome_Matching_Pretraining_Transfering

The Paper List of Large Multi-Modality Model (Perception, Generation, Unification), Parameter-Efficient Finetuning, Vision-Language Pretraining, Conventional Image-Text Matching for Preliminary Insight.

Size: 369 KB - Last synced at: 23 days ago - Pushed at: 7 months ago - Stars: 422 - Forks: 48

mbzuai-oryx/ALM-Bench

[CVPR 2025 🔥] ALM-Bench is a multilingual multi-modal diverse cultural benchmark for 100 languages across 19 categories. It assesses the next generation of LMMs on cultural inclusivity.

Language: Python - Size: 26.7 MB - Last synced at: 23 days ago - Pushed at: about 2 months ago - Stars: 40 - Forks: 2

saky-semicolon/Multimodal-Readmission-Prediction

Multimodal fusion model for predicting 30-day hospital readmission using structured EHR data and BERT-based clinical text embeddings from the MIMIC-III dataset.

Language: Jupyter Notebook - Size: 1.69 MB - Last synced at: 30 days ago - Pushed at: 30 days ago - Stars: 0 - Forks: 0

ByteDance-Seed/Seed1.5-VL

Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning, achieving state-of-the-art performance on 38 out of 60 public benchmarks.

Language: Jupyter Notebook - Size: 140 MB - Last synced at: about 1 month ago - Pushed at: about 2 months ago - Stars: 1,189 - Forks: 43

zjunlp/EasyDetect

[ACL 2024] An Easy-to-use Hallucination Detection Framework for LLMs.

Language: Python - Size: 11.5 MB - Last synced at: 28 days ago - Pushed at: 5 months ago - Stars: 34 - Forks: 2

Hoar012/RAP-MLLM

[CVPR 2025] RAP: Retrieval-Augmented Personalization

Language: Python - Size: 60.9 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 57 - Forks: 1

xid32/NAACL_2025_TWM

We introduce temporal working memory (TWM), which aims to enhance the temporal modeling capabilities of multimodal foundation models (MFMs). This plug-and-play module can be easily integrated into existing MFMs. With our TWM, nine state-of-the-art models exhibit significant performance improvements across QA, captioning, and retrieval tasks.

Language: Python - Size: 896 KB - Last synced at: 9 days ago - Pushed at: 6 months ago - Stars: 309 - Forks: 30