GitHub topics: multimodal-learning

Repositories

fmenat/MultiviewCropClassification

Public repository of our IGARSS 2023 submission

Language: Python - Size: 132 MB - Last synced at: 29 minutes ago - Pushed at: about 1 hour ago - Stars: 16 - Forks: 1

fmenat/multiviewRS-models

List of deep learning models proposed for remote sensing (RS) multi-view data

Size: 17.6 KB - Last synced at: 32 minutes ago - Pushed at: about 2 hours ago - Stars: 11 - Forks: 0

egeyavuzcan/diffusion-flow-models-research

A comprehensive collection of research papers and resources on diffusion- and flow-based models, systematically organized by application and architecture. It highlights cutting-edge advances in flow-guided diffusion techniques for image, video, and multimodal generation.

Size: 1.95 KB - Last synced at: about 4 hours ago - Pushed at: about 6 hours ago - Stars: 0 - Forks: 0

friedrichor/Awesome-Multimodal-Papers

A curated list of awesome Multimodal studies.

Language: HTML - Size: 63.3 MB - Last synced at: about 7 hours ago - Pushed at: about 9 hours ago - Stars: 192 - Forks: 19

lll6gg/UI-R1

Code for "UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning"

Language: Python - Size: 1.08 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 91 - Forks: 6

willxxy/awesome-mmps

Corpus of resources for multimodal machine learning with physiological signals (mmps).

Size: 1.17 MB - Last synced at: about 5 hours ago - Pushed at: 4 days ago - Stars: 80 - Forks: 2

Eurus-Holmes/Awesome-Multimodal-Research

A curated list of Multimodal Related Research.

Language: Python - Size: 903 MB - Last synced at: 2 days ago - Pushed at: almost 2 years ago - Stars: 1,348 - Forks: 150

ytunprovoke/image-optimization-guide

Best practices for image optimization without losing quality. Improve your website speed and performance.

Size: 4.88 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 1 - Forks: 0

pliang279/MultiBench

[NeurIPS 2021] Multiscale Benchmarks for Multimodal Representation Learning

Language: HTML - Size: 49.9 MB - Last synced at: 3 days ago - Pushed at: over 1 year ago - Stars: 541 - Forks: 80

AdityaLab/MM4TSA

A professional list of papers and resources on Multi-Modalities For Time Series Analysis (MM4TSA).

Size: 457 KB - Last synced at: 4 days ago - Pushed at: about 1 month ago - Stars: 35 - Forks: 1

haomo-ai/EcoDatum

The official implementation of [Quality over Quantity: Boosting Data Efficiency Through Ensembled Multimodal Data Curation] in AAAI2025.

Language: Python - Size: 3.58 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 4 - Forks: 0

pliang279/awesome-multimodal-ml

Reading list for research topics in multimodal machine learning

Size: 459 KB - Last synced at: 4 days ago - Pushed at: 9 months ago - Stars: 6,421 - Forks: 879

The-Martyr/Awesome-Multimodal-Reasoning

Latest Advances on (RL based) Multimodal Reasoning and Generation in Multimodal Large Language Models

Size: 107 KB - Last synced at: 4 days ago - Pushed at: 5 days ago - Stars: 20 - Forks: 0

JoshD898/caretMultimodal

Multimodal model training in R

Language: R - Size: 2.29 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 0 - Forks: 0

akusayudodograu/Agentic-RAG-Story-Generation-with-Multimodal-GenAI

Multimodal Agentic GenAI Workflow – Seamlessly blends retrieval and generation for intelligent storytelling

Size: 1000 Bytes - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 8 - Forks: 1

kyegomez/NaViT

My implementation of "Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution"

Language: Python - Size: 210 KB - Last synced at: 1 day ago - Pushed at: about 1 month ago - Stars: 230 - Forks: 11

DmitryRyumin/ICASSP-2023-24-Papers

ICASSP 2023-2024 Papers: A complete collection of influential and exciting research papers from the ICASSP 2023-24 conferences. Explore the latest advancements in acoustics, speech and signal processing. Code included. Star the repository to support the advancement of audio and signal processing!

Language: Python - Size: 9.11 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 459 - Forks: 18

MusadiqPasha/Multi-Modal-Meme-Virality-Prediction-using-HGNNs

A multimodal meme virality predictor that integrates image, text, and metadata through ensemble learning and Hypergraph Neural Networks to classify memes as viral or non-viral.

Language: Jupyter Notebook - Size: 13.7 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 1 - Forks: 0

richard-peng-xia/awesome-multimodal-in-medical-imaging

A collection of resources on applications of multi-modal learning in medical imaging.

Size: 234 KB - Last synced at: 7 days ago - Pushed at: about 1 month ago - Stars: 730 - Forks: 65

Xovee/skapp (fork of YifanZhang-git/SKAPP)

AAAI '25. Retrieval-Augmented Multimodal Social Media Popularity Prediction

Language: Python - Size: 89.8 KB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 17 - Forks: 0

willxxy/ECG-Bench

A Unified Framework for Benchmarking Generative Electrocardiogram-Language Models (ELMs)

Language: Python - Size: 6.67 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 9 - Forks: 2

sangminwoo/awesome-vision-and-language

A curated list of awesome vision and language resources (still under construction... stay tuned!)

Size: 127 KB - Last synced at: 4 days ago - Pushed at: 6 months ago - Stars: 535 - Forks: 41

SuperBruceJia/Awesome-Mixture-of-Experts

Awesome Mixture of Experts (MoE): A Curated List of Mixture of Experts (MoE) and Mixture of Multimodal Experts (MoME)

Size: 438 KB - Last synced at: 10 days ago - Pushed at: 4 months ago - Stars: 27 - Forks: 3

mims-harvard/COMPASS-web

Web Companion for Generalizable AI predicts immunotherapy outcomes across cancers and treatments

Language: Jupyter Notebook - Size: 53 MB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 1 - Forks: 0

mims-harvard/COMPASS

Generalizable AI predicts immunotherapy outcomes across cancers and treatments

Language: Jupyter Notebook - Size: 676 MB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 2 - Forks: 0

Khamies/comparative-study-brain-registration

Comparative study of classical and deep learning methods for multi-modal brain image registration using the RIRE dataset.

Language: Jupyter Notebook - Size: 23.5 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 1 - Forks: 0

thubZ09/All-Things-Multimodal

Hub for researchers exploring VLMs and Multimodal Learning:)

Size: 48 MB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 26 - Forks: 1

aclai-lab/MultiData.jl

Multimodal datasets for Machine-Learning

Language: Julia - Size: 389 KB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 3 - Forks: 0

pykale/pykale

Knowledge-Aware machine LEarning (KALE): accessible machine learning from multiple sources for interdisciplinary research, part of the 🔥PyTorch ecosystem. ⭐ Star to support our work!

Language: Python - Size: 46.3 MB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 457 - Forks: 66

lucaswychan/quant-lvlm

Easy-to-use large vision language model pipeline for quantitative analysis

Language: Python - Size: 953 KB - Last synced at: 5 days ago - Pushed at: 16 days ago - Stars: 2 - Forks: 0

ParitoshParmar/Piano-Skills-Assessment

Piano Skills Assessment [IEEE MMSP 2021]

Language: Python - Size: 854 KB - Last synced at: 17 days ago - Pushed at: 17 days ago - Stars: 17 - Forks: 2

alipay/Ant-Multi-Modal-Framework

Research Code for Multimodal-Cognition Team in Ant Group

Language: Python - Size: 17 MB - Last synced at: 17 days ago - Pushed at: 10 months ago - Stars: 143 - Forks: 5

olivier-bernard-creatis/olivier-bernard-creatis.github.io (fork of academicpages/academicpages.github.io)

Website of Olivier Bernard, Professor at the University of Lyon (INSA) and Deputy Director of the CREATIS research laboratory

Language: Jupyter Notebook - Size: 151 MB - Last synced at: 17 days ago - Pushed at: 17 days ago - Stars: 0 - Forks: 0

ys-zong/awesome-self-supervised-multimodal-learning

[T-PAMI] A curated list of self-supervised multimodal learning resources.

Size: 5.32 MB - Last synced at: 2 days ago - Pushed at: 9 months ago - Stars: 252 - Forks: 8

ZaneBrackley/VIZMed

Thesis Project | Vision-Integrated Zero-Shot Medical AI

Language: Python - Size: 88.9 KB - Last synced at: 19 days ago - Pushed at: 19 days ago - Stars: 0 - Forks: 0

ilaria-manco/multimodal-ml-music

List of academic resources on Multimodal ML for Music

Language: TeX - Size: 268 KB - Last synced at: 2 days ago - Pushed at: about 2 years ago - Stars: 295 - Forks: 11

mims-harvard/AIM2

Artificial Intelligence in Medicine II

Language: HTML - Size: 341 MB - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 3 - Forks: 0

PreferredAI/cornac

A Comparative Framework for Multimodal Recommender Systems

Language: Python - Size: 24.3 MB - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 949 - Forks: 152

ChocoWu/SeTok

Code for the paper: Towards Semantic Equivalence of Tokenization in Multimodal LLM

Language: Python - Size: 2.1 MB - Last synced at: 22 days ago - Pushed at: 22 days ago - Stars: 54 - Forks: 0

HenryHZY/Awesome-Multimodal-LLM

Research Trends in LLM-guided Multimodal Learning.

Size: 17.6 KB - Last synced at: 9 days ago - Pushed at: over 1 year ago - Stars: 358 - Forks: 16

microsoft/XPretrain

Multi-modality pre-training

Language: Python - Size: 3.59 MB - Last synced at: 4 days ago - Pushed at: about 1 year ago - Stars: 491 - Forks: 37

mbzuai-oryx/Camel-Bench

[NAACL 2025 🔥] CAMEL-Bench is an Arabic benchmark for evaluating multimodal models across eight domains with 29,000 questions.

Language: Python - Size: 14 MB - Last synced at: 10 days ago - Pushed at: 24 days ago - Stars: 31 - Forks: 1

IRVLUTD/Proto-CLIP

Code release for Proto-CLIP: Vision-Language Prototypical Network for Few-Shot Learning

Language: Python - Size: 69.1 MB - Last synced at: 4 days ago - Pushed at: 4 months ago - Stars: 41 - Forks: 6

IBM/AdaMML 📦

Official implementation of AdaMML. https://arxiv.org/abs/2105.05165.

Language: Python - Size: 113 KB - Last synced at: 5 days ago - Pushed at: about 3 years ago - Stars: 51 - Forks: 9

machine-intelligence-laboratory/TopicNet

Interface for easier topic modelling.

Language: Python - Size: 10.5 MB - Last synced at: 8 days ago - Pushed at: 10 months ago - Stars: 139 - Forks: 17

Hoar012/TDC-Video

Size: 3.05 MB - Last synced at: 27 days ago - Pushed at: 27 days ago - Stars: 0 - Forks: 0

MMMU-Benchmark/MMMU

This repo contains evaluation code for the paper "MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI"

Language: Python - Size: 186 MB - Last synced at: 27 days ago - Pushed at: 27 days ago - Stars: 413 - Forks: 34

VectorInstitute/mmlearn

A toolkit for research on multimodal representation learning

Language: Python - Size: 4.92 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 15 - Forks: 3

Hyeongkeun/LAVCap

Official PyTorch implementation of 'LAVCap: LLM-based Audio-Visual Captioning using Optimal Transport' (ICASSP 2025)

Language: Python - Size: 3.58 MB - Last synced at: 27 days ago - Pushed at: 27 days ago - Stars: 3 - Forks: 0

JanneHonkonen/ideas

My AI based ideas, designs and whatnot

Size: 22.5 KB - Last synced at: 27 days ago - Pushed at: 28 days ago - Stars: 0 - Forks: 0

ChiShengChen/MUSE_EEG

The official implementation of "Mind's eye: image recognition by EEG via multimodal similarity-keeping contrastive learning".

Language: Python - Size: 20.8 MB - Last synced at: 28 days ago - Pushed at: 28 days ago - Stars: 30 - Forks: 0

HUANGLIZI/LViT

[IEEE Transactions on Medical Imaging/TMI] This repo is the official implementation of "LViT: Language meets Vision Transformer in Medical Image Segmentation"

Language: Python - Size: 90 MB - Last synced at: 29 days ago - Pushed at: 2 months ago - Stars: 338 - Forks: 32

pliang279/MFN

[AAAI 2018] Memory Fusion Network for Multi-view Sequential Learning

Language: Python - Size: 56.7 MB - Last synced at: 6 days ago - Pushed at: almost 5 years ago - Stars: 115 - Forks: 30

Haoyu-ha/LNLN

Towards Robust Multimodal Sentiment Analysis with Incomplete Data

Language: Python - Size: 29.3 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 53 - Forks: 4

amariucaitheodor/acquiring-linguistic-knowledge

Master's thesis of Theodor Amariucai, supervised by Alexander Warstadt and Prof. Ryan Cotterell.

Language: Python - Size: 5.14 MB - Last synced at: 29 days ago - Pushed at: over 1 year ago - Stars: 3 - Forks: 1

mlfoundations/open_flamingo

An open-source framework for training large multimodal models.

Language: Python - Size: 7.36 MB - Last synced at: about 1 month ago - Pushed at: 8 months ago - Stars: 3,882 - Forks: 301

VectorInstitute/shared-encoder

Codebase for the paper titled 'A Shared Encoder Approach to Multimodal Representation Learning'

Language: Python - Size: 141 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 7 - Forks: 1

praveena2j/Cross-Attentional-AV-Fusion

FG2021: Cross Attentional AV Fusion for Dimensional Emotion Recognition

Language: Python - Size: 92.8 KB - Last synced at: 29 days ago - Pushed at: 5 months ago - Stars: 28 - Forks: 5

praveena2j/Joint-Cross-Attention-for-Audio-Visual-Fusion

IEEE T-BIOM : "Audio-Visual Fusion for Emotion Recognition in the Valence-Arousal Space Using Joint Cross-Attention"

Language: Python - Size: 290 KB - Last synced at: 29 days ago - Pushed at: 5 months ago - Stars: 38 - Forks: 11

KaiyangZhou/CoOp

Prompt Learning for Vision-Language Models (IJCV'22, CVPR'22)

Language: Python - Size: 1.38 MB - Last synced at: about 1 month ago - Pushed at: 12 months ago - Stars: 1,926 - Forks: 214

willxxy/ECG-Byte

[arxiv 2024] ECG-Byte: A Tokenizer for End-to-End Generative Electrocardiogram Language Modeling

Language: Python - Size: 27.5 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 14 - Forks: 0

MingliangLiang3/GLIP

Centered Masking for Language-Image Pre-training

Language: Jupyter Notebook - Size: 15.9 MB - Last synced at: 29 days ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

t0gae/AI-Dementia-Diagnosis

AI-Driven Multimodal Dementia Diagnosis: combines 3D MRI morphometry and sensor data using cross-modal attention (LSTM + 3D-ResNet + Transformer). Aims to reduce late-stage diagnosis by 60% through early detection.

Language: Jupyter Notebook - Size: 13.7 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

AILab-CVC/UniRepLKNet

[CVPR'24] UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition

Language: Python - Size: 4.82 MB - Last synced at: 28 days ago - Pushed at: 7 months ago - Stars: 980 - Forks: 57

aiishwarrya/VisualLanguageModel

A custom Vision-Language Model (VLM) built from scratch, using SigLip for contrastive learning and a ViT-based encoder to generate meaningful image captions and semantic descriptions.

Size: 2.49 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

mmaaz60/mvits_for_class_agnostic_od

[ECCV'22] Official repository of paper titled "Class-agnostic Object Detection with Multi-modal Transformer".

Language: Python - Size: 34.1 MB - Last synced at: about 1 month ago - Pushed at: about 2 years ago - Stars: 308 - Forks: 25

mbaqer/V2X-mmWave-Beamforming

PyTorch implementation of multi-modality sensing in 60 GHz mmWave beamforming for connected vehicles.

Language: Jupyter Notebook - Size: 5.09 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 2 - Forks: 0

3dlg-hcvc/tricolo

[WACV 2024] TriCoLo: Trimodal Contrastive Loss for Text to Shape Retrieval

Language: Python - Size: 7.17 MB - Last synced at: about 1 month ago - Pushed at: about 1 year ago - Stars: 25 - Forks: 1

declare-lab/LLM-PuzzleTest

This repository is maintained to release datasets and models for multimodal puzzle reasoning.

Language: Python - Size: 131 MB - Last synced at: about 1 month ago - Pushed at: 2 months ago - Stars: 78 - Forks: 7

pej0918/Prompt-The-Missing

[CVPR 2025 Workshop] Prompt The Missing : Efficient and Robust Audio-Visual Classification under Uncertain Modalities

Language: Python - Size: 3.44 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

Pointcept/GPT4Point

[CVPR'24 Highlight] GPT4Point: A Unified Framework for Point-Language Understanding and Generation.

Language: Python - Size: 114 MB - Last synced at: about 1 month ago - Pushed at: about 1 year ago - Stars: 381 - Forks: 24

ArrowLuo/CLIP4Clip

An official implementation for "CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval"

Language: Python - Size: 1.61 MB - Last synced at: about 1 month ago - Pushed at: about 1 year ago - Stars: 929 - Forks: 126

henghuiding/ReLA

[CVPR2023 Highlight] GRES: Generalized Referring Expression Segmentation

Language: Python - Size: 2.06 MB - Last synced at: about 1 month ago - Pushed at: over 1 year ago - Stars: 693 - Forks: 19

aehrc/cxrmate

CXRMate: Longitudinal Data and a Semantic Similarity Reward for Chest X-Ray Report Generation

Language: Python - Size: 4.03 MB - Last synced at: 30 days ago - Pushed at: 3 months ago - Stars: 15 - Forks: 3

DmitryRyumin/ICCV-2023-Papers

ICCV 2023 Papers: Discover cutting-edge research from ICCV 2023, the leading computer vision conference. Stay updated on the latest in computer vision and deep learning, with code included. ⭐ support visual intelligence development!

Language: Python - Size: 16.8 MB - Last synced at: about 1 month ago - Pushed at: 8 months ago - Stars: 954 - Forks: 43

xieh97/language-based-audio-retrieval

List of academic resources on Language-Based Audio Retrieval

Size: 7.81 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

henghuiding/MeViS

[ICCV 2023] MeViS: A Large-scale Benchmark for Video Segmentation with Motion Expressions

Language: Python - Size: 52.2 MB - Last synced at: about 1 month ago - Pushed at: 11 months ago - Stars: 521 - Forks: 22

mhw32/multimodal-vae-public

A PyTorch implementation of "Multimodal Generative Models for Scalable Weakly-Supervised Learning" (https://arxiv.org/abs/1802.05335)

Language: Python - Size: 3.9 MB - Last synced at: 21 days ago - Pushed at: over 6 years ago - Stars: 158 - Forks: 36

miccunifi/SEARLE

[ICCV 2023] - Zero-shot Composed Image Retrieval with Textual Inversion

Language: Python - Size: 20.1 MB - Last synced at: about 1 month ago - Pushed at: about 1 year ago - Stars: 170 - Forks: 10

TencentARC/ViT-Lens

[CVPR 2024] ViT-Lens: Towards Omni-modal Representations

Language: Python - Size: 132 MB - Last synced at: about 1 month ago - Pushed at: 3 months ago - Stars: 174 - Forks: 10

pliang279/MultiViz

[ICLR 2023] MultiViz: Towards Visualizing and Understanding Multimodal Models

Language: Python - Size: 790 MB - Last synced at: 29 days ago - Pushed at: 9 months ago - Stars: 96 - Forks: 5

sbelharbi/feature-vs-text-compound-emotion

Textualized and Feature-based Models for Compound Multimodal Emotion Recognition in the Wild, ABAW 7th - Challenge - Compound Expression (CE) Recognition Challenge

Language: Python - Size: 1.41 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 4 - Forks: 0

ksm26/Open-Source-Models-with-Hugging-Face

"Open Source Models with Hugging Face" course empowers you with the skills to leverage open-source models from the Hugging Face Hub for various tasks in NLP, audio, image, and multimodal domains.

Language: Jupyter Notebook - Size: 21 MB - Last synced at: about 1 month ago - Pushed at: about 1 year ago - Stars: 24 - Forks: 19

merveenoyan/siglip

Projects based on SigLIP (Zhai et al., 2023) and Hugging Face transformers integration 🤗

Language: Jupyter Notebook - Size: 1.66 MB - Last synced at: about 1 month ago - Pushed at: 3 months ago - Stars: 224 - Forks: 12

mims-harvard/Madrigal

Madrigal: Multimodal AI predicts clinical outcomes of drug combinations from preclinical data

Language: Jupyter Notebook - Size: 18.3 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 20 - Forks: 6

fork123aniket/Multi-Round-VLM-powered-Multimodal-Conversational-AI-Navigation-Bot

Streamlit App Combining Vision, Language, and Audio AI Models

Language: Python - Size: 18.6 KB - Last synced at: 19 days ago - Pushed at: 3 months ago - Stars: 3 - Forks: 0

Hoar012/RAP-MLLM

[CVPR 2025] RAP: Retrieval-Augmented Personalization

Language: Python - Size: 57.7 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 30 - Forks: 0

breezedeus/Coin-CLIP

Coin-CLIP: a CLIP model fine-tuned on a large collection of coin images using contrastive learning. It enhances feature extraction for coins, boosting image search accuracy. The model combines a Vision Transformer (ViT) with CLIP's multimodal learning and is optimized for numismatic applications.

Language: Python - Size: 50.3 MB - Last synced at: 9 days ago - Pushed at: over 1 year ago - Stars: 20 - Forks: 3

Related Keywords

multimodal-learning (301), deep-learning (69), computer-vision (46), multimodal (45), pytorch (44), machine-learning (40), multimodal-deep-learning (38), natural-language-processing (18), large-language-models (17), nlp (16), multimodal-large-language-models (15), multimodality (14), representation-learning (14), transformer (13), visual-question-answering (13), clip (13), self-supervised-learning (12), image-captioning (12), llm (12), artificial-intelligence (12), vision-and-language (11), emotion-recognition (11), attention-model (10), contrastive-learning (10), multimodal-sentiment-analysis (10), attention-mechanism (10), generative-ai (9), video-understanding (9), vision-language-model (9), vision-language (9), multimodal-data (8), foundation-models (8), audio-visual-learning (8), affective-computing (8), ai (7), llms (7), transfer-learning (7), remote-sensing (6), sentiment-analysis (6), speech-processing (6), vision-language-transformer (6), robotics (6), deep-neural-networks (6), weakly-supervised-learning (6), bert (6), python (6), attention (6), awesome-list (5), multimodal-datasets (5), dataset (5), convolutional-neural-networks (5), prompt-learning (5), diffusion-models (5), large-multimodal-models (5), classification (5), pre-training (5), reinforcement-learning (5), language-model (5), multimodal-fusion (5), medical-imaging (5), pytorch-lightning (4), domain-adaptation (4), multitask-learning (4), cross-modal-retrieval (4), gpt4 (4), biosignals (4), vqa (4), video-grounding (4), knowledge-distillation (4), regression (4), multimodal-representation (4), zero-shot-learning (4), generative-model (4), multisensor-fusion (4), zero-shot-classification (4), multi-modal-learning (4), tensorflow (4), video-analysis (4), vision-language-learning (4), medical-ai (3), music-information-retrieval (3), action-recognition (3), image-text-retrieval (3), data-science (3), emotion (3), benchmark (3), therapeutics (3), few-shot-learning (3), python3 (3), tabular-data (3), speech-recognition (3), text-classification (3), acmmm2024 (3), question-answering (3), audio (3), videoqa (3), pytorch-implementation (3), segmentation (3), image-retrieval (3), ecg (3)