An open API service providing repository metadata for many open source software ecosystems.

GitHub topics: multimodal-learning

thubZ09/All-Things-Multimodal

Hub for researchers exploring VLMs and Multimodal Learning:)

Size: 47.9 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 25 - Forks: 1

ChocoWu/SeTok

Codes for Paper: Towards Semantic Equivalence of Tokenization in Multimodal LLM

Language: Python - Size: 2.1 MB - Last synced at: 1 day ago - Pushed at: 2 days ago - Stars: 54 - Forks: 0

HenryHZY/Awesome-Multimodal-LLM

Research Trends in LLM-guided Multimodal Learning.

Size: 17.6 KB - Last synced at: about 23 hours ago - Pushed at: over 1 year ago - Stars: 358 - Forks: 16

JoshD898/caretMultimodal

Multimodal model training in R

Language: R - Size: 2.28 MB - Last synced at: 2 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0

SuperBruceJia/Awesome-Mixture-of-Experts

Awesome Mixture of Experts (MoE): A Curated List of Mixture of Experts (MoE) and Mixture of Multimodal Experts (MoME)

Size: 438 KB - Last synced at: 2 days ago - Pushed at: 3 months ago - Stars: 24 - Forks: 3

AdityaLab/MM4TSA

A professional list of Multi-Modalities For Time Series Analysis (MM4TSA) Papers and Resources.

Size: 457 KB - Last synced at: 1 day ago - Pushed at: 18 days ago - Stars: 27 - Forks: 0

microsoft/XPretrain

Multi-modality pre-training

Language: Python - Size: 3.59 MB - Last synced at: 1 day ago - Pushed at: 12 months ago - Stars: 491 - Forks: 37

ytunprovoke/image-optimization-guide

Best practices for image optimization without losing quality. Improve your website speed and performance.

Size: 4.88 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0

mbzuai-oryx/Camel-Bench

[NAACL 2025 🔥] CAMEL-Bench is an Arabic benchmark for evaluating multimodal models across eight domains with 29,000 questions.

Language: Python - Size: 14 MB - Last synced at: 3 days ago - Pushed at: 4 days ago - Stars: 31 - Forks: 1

Eurus-Holmes/Awesome-Multimodal-Research

A curated list of Multimodal Related Research.

Language: Python - Size: 903 MB - Last synced at: 3 days ago - Pushed at: over 1 year ago - Stars: 1,346 - Forks: 149

The-Martyr/Awesome-Multimodal-Reasoning

Latest Advances on (RL based) Multimodal Reasoning and Generation in Multimodal Large Language Models

Size: 60.5 KB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 19 - Forks: 0

Hoar012/TDC-Video

Size: 3.05 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 0 - Forks: 0

MMMU-Benchmark/MMMU

This repo contains evaluation code for the paper "MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI"

Language: Python - Size: 186 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 413 - Forks: 34

willxxy/ECG-Bench

A Unified Framework for Benchmarking Generative Electrocardiogram-Language Models (ELMs)

Language: Python - Size: 6.68 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 7 - Forks: 1

akusayudodograu/Agentic-RAG-Story-Generation-with-Multimodal-GenAI

Multimodal Agentic GenAI Workflow – Seamlessly blends retrieval and generation for intelligent storytelling

Size: 1000 Bytes - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 8 - Forks: 1

PreferredAI/cornac

A Comparative Framework for Multimodal Recommender Systems

Language: Python - Size: 24.2 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 945 - Forks: 153

Hyeongkeun/LAVCap

Official Pytorch Implementation of 'LAVCap: LLM-based Audio-Visual Captioning using Optimal Transport' (ICASSP2025)

Language: Python - Size: 3.58 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 3 - Forks: 0

ys-zong/awesome-self-supervised-multimodal-learning

[T-PAMI] A curated list of self-supervised multimodal learning resources.

Size: 5.32 MB - Last synced at: 7 days ago - Pushed at: 8 months ago - Stars: 251 - Forks: 8

JanneHonkonen/ideas

My AI based ideas, designs and whatnot

Size: 22.5 KB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 0 - Forks: 0

ChiShengChen/MUSE_EEG

The official implementation of Mind's eye: image recognition by EEG via multimodal similarity-keeping contrastive learning.

Language: Python - Size: 20.8 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 30 - Forks: 0

HUANGLIZI/LViT

[IEEE Transactions on Medical Imaging/TMI] This repo is the official implementation of "LViT: Language meets Vision Transformer in Medical Image Segmentation"

Language: Python - Size: 90 MB - Last synced at: 8 days ago - Pushed at: about 1 month ago - Stars: 338 - Forks: 32

mims-harvard/AIM2

Artificial Intelligence in Medicine II

Language: HTML - Size: 336 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 3 - Forks: 0

pliang279/awesome-multimodal-ml

Reading list for research topics in multimodal machine learning

Size: 459 KB - Last synced at: 9 days ago - Pushed at: 8 months ago - Stars: 6,381 - Forks: 875

willxxy/awesome-mmps

Corpus of resources for multimodal machine learning with physiological signals (mmps).

Size: 1.03 MB - Last synced at: 10 days ago - Pushed at: 17 days ago - Stars: 70 - Forks: 2

richard-peng-xia/awesome-multimodal-in-medical-imaging

A collection of resources on applications of multi-modal learning in medical imaging.

Size: 234 KB - Last synced at: 12 days ago - Pushed at: 19 days ago - Stars: 714 - Forks: 68

Haoyu-ha/LNLN

Towards Robust Multimodal Sentiment Analysis with Incomplete Data

Language: Python - Size: 29.3 KB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 53 - Forks: 4

amariucaitheodor/acquiring-linguistic-knowledge

Master's thesis of Theodor Amariucai, supervised by Alexander Warstadt and Prof. Ryan Cotterell.

Language: Python - Size: 5.14 MB - Last synced at: 8 days ago - Pushed at: over 1 year ago - Stars: 3 - Forks: 1

mlfoundations/open_flamingo

An open-source framework for training large multimodal models.

Language: Python - Size: 7.36 MB - Last synced at: 12 days ago - Pushed at: 8 months ago - Stars: 3,882 - Forks: 301

VectorInstitute/shared-encoder

Codebase for the paper titled 'A Shared Encoder Approach to Multimodal Representation Learning'

Language: Python - Size: 141 KB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 7 - Forks: 1

ilaria-manco/multimodal-ml-music

List of academic resources on Multimodal ML for Music

Language: TeX - Size: 268 KB - Last synced at: 7 days ago - Pushed at: about 2 years ago - Stars: 293 - Forks: 11

praveena2j/Cross-Attentional-AV-Fusion

FG2021: Cross Attentional AV Fusion for Dimensional Emotion Recognition

Language: Python - Size: 92.8 KB - Last synced at: 9 days ago - Pushed at: 5 months ago - Stars: 28 - Forks: 5

praveena2j/Joint-Cross-Attention-for-Audio-Visual-Fusion

IEEE T-BIOM : "Audio-Visual Fusion for Emotion Recognition in the Valence-Arousal Space Using Joint Cross-Attention"

Language: Python - Size: 290 KB - Last synced at: 9 days ago - Pushed at: 5 months ago - Stars: 38 - Forks: 11

VectorInstitute/mmlearn

A toolkit for research on multimodal representation learning

Language: Python - Size: 4.91 MB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 14 - Forks: 3

kyegomez/NaViT

My implementation of "Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution"

Language: Python - Size: 210 KB - Last synced at: 13 days ago - Pushed at: 17 days ago - Stars: 226 - Forks: 10

KaiyangZhou/CoOp

Prompt Learning for Vision-Language Models (IJCV'22, CVPR'22)

Language: Python - Size: 1.38 MB - Last synced at: 14 days ago - Pushed at: 11 months ago - Stars: 1,926 - Forks: 214

willxxy/ECG-Byte

[arxiv 2024] ECG-Byte: A Tokenizer for End-to-End Generative Electrocardiogram Language Modeling

Language: Python - Size: 27.5 MB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 14 - Forks: 0

MingliangLiang3/GLIP

Centered Masking for Language-Image Pre-training

Language: Jupyter Notebook - Size: 15.9 MB - Last synced at: 8 days ago - Pushed at: 15 days ago - Stars: 1 - Forks: 0

t0gae/AI-Dementia-Diagnosis

AI-Driven Multimodal Dementia Diagnosis: 3D MRI morphometry, and sensor data using cross-modal attention (LSTM + 3D-ResNet + Transformer). Aims to reduce late-stage diagnosis by 60% through early detection.

Language: Jupyter Notebook - Size: 13.7 KB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 0 - Forks: 0

sangminwoo/awesome-vision-and-language

A curated list of awesome vision and language resources (still under construction... stay tuned!)

Size: 127 KB - Last synced at: 9 days ago - Pushed at: 6 months ago - Stars: 532 - Forks: 41

friedrichor/Awesome-Multimodal-Papers

A curated list of awesome Multimodal studies.

Language: HTML - Size: 63.2 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 172 - Forks: 16

AILab-CVC/UniRepLKNet

[CVPR'24] UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition

Language: Python - Size: 4.82 MB - Last synced at: 7 days ago - Pushed at: 6 months ago - Stars: 980 - Forks: 57

aiishwarrya/VisualLanguageModel

A custom Vision-Language Model (VLM) built from scratch, using SigLip for contrastive learning and a ViT-based encoder to generate meaningful image captions and semantic descriptions.

Size: 2.49 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 0 - Forks: 0

mmaaz60/mvits_for_class_agnostic_od

[ECCV'22] Official repository of paper titled "Class-agnostic Object Detection with Multi-modal Transformer".

Language: Python - Size: 34.1 MB - Last synced at: 14 days ago - Pushed at: almost 2 years ago - Stars: 308 - Forks: 25

mbaqer/V2X-mmWave-Beamforming

PyTorch implementation of multi-modality sensing in 60 GHz mmWave beamforming for connected vehicles.

Language: Jupyter Notebook - Size: 5.09 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 2 - Forks: 0

3dlg-hcvc/tricolo

[WACV 2024] TriCoLo: Trimodal Contrastive Loss for Text to Shape Retrieval

Language: Python - Size: 7.17 MB - Last synced at: 16 days ago - Pushed at: about 1 year ago - Stars: 25 - Forks: 1

pliang279/MFN

[AAAI 2018] Memory Fusion Network for Multi-view Sequential Learning

Language: Python - Size: 56.7 MB - Last synced at: 12 days ago - Pushed at: over 4 years ago - Stars: 114 - Forks: 30

declare-lab/LLM-PuzzleTest

This repository is maintained to release dataset and models for multimodal puzzle reasoning.

Language: Python - Size: 131 MB - Last synced at: 13 days ago - Pushed at: about 2 months ago - Stars: 78 - Forks: 7

pej0918/Prompt-The-Missing

[CVPR 2025 Workshop] Prompt The Missing : Efficient and Robust Audio-Visual Classification under Uncertain Modalities

Language: Python - Size: 3.44 MB - Last synced at: 18 days ago - Pushed at: 18 days ago - Stars: 1 - Forks: 0

Pointcept/GPT4Point

[CVPR'24 Highlight] GPT4Point: A Unified Framework for Point-Language Understanding and Generation.

Language: Python - Size: 114 MB - Last synced at: 16 days ago - Pushed at: 12 months ago - Stars: 381 - Forks: 24

ArrowLuo/CLIP4Clip

An official implementation for "CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval"

Language: Python - Size: 1.61 MB - Last synced at: 18 days ago - Pushed at: about 1 year ago - Stars: 929 - Forks: 126

henghuiding/ReLA

[CVPR2023 Highlight] GRES: Generalized Referring Expression Segmentation

Language: Python - Size: 2.06 MB - Last synced at: 17 days ago - Pushed at: over 1 year ago - Stars: 693 - Forks: 19

aehrc/cxrmate

CXRMate: Longitudinal Data and a Semantic Similarity Reward for Chest X-Ray Report Generation

Language: Python - Size: 4.03 MB - Last synced at: 9 days ago - Pushed at: 2 months ago - Stars: 15 - Forks: 3

DmitryRyumin/ICCV-2023-Papers

ICCV 2023 Papers: Discover cutting-edge research from ICCV 2023, the leading computer vision conference. Stay updated on the latest in computer vision and deep learning, with code included. ⭐ support visual intelligence development!

Language: Python - Size: 16.8 MB - Last synced at: 13 days ago - Pushed at: 8 months ago - Stars: 954 - Forks: 43

xieh97/language-based-audio-retrieval

List of academic resources on Language-Based Audio Retrieval

Size: 7.81 KB - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 1 - Forks: 0

henghuiding/MeViS

[ICCV 2023] MeViS: A Large-scale Benchmark for Video Segmentation with Motion Expressions

Language: Python - Size: 52.2 MB - Last synced at: 16 days ago - Pushed at: 10 months ago - Stars: 521 - Forks: 22

Daming-W/EcoDatum

The official implementation of [Quality over Quantity: Boosting Data Efficiency Through Ensembled Multimodal Data Curation] in AAAI2025.

Size: 6.84 KB - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 1 - Forks: 0

mhw32/multimodal-vae-public

A PyTorch implementation of "Multimodal Generative Models for Scalable Weakly-Supervised Learning" (https://arxiv.org/abs/1802.05335)

Language: Python - Size: 3.9 MB - Last synced at: about 9 hours ago - Pushed at: over 6 years ago - Stars: 158 - Forks: 36

miccunifi/SEARLE

[ICCV 2023] - Zero-shot Composed Image Retrieval with Textual Inversion

Language: Python - Size: 20.1 MB - Last synced at: 18 days ago - Pushed at: 12 months ago - Stars: 170 - Forks: 10

TencentARC/ViT-Lens

[CVPR 2024] ViT-Lens: Towards Omni-modal Representations

Language: Python - Size: 132 MB - Last synced at: 17 days ago - Pushed at: 3 months ago - Stars: 174 - Forks: 10

pliang279/MultiViz

[ICLR 2023] MultiViz: Towards Visualizing and Understanding Multimodal Models

Language: Python - Size: 790 MB - Last synced at: 9 days ago - Pushed at: 8 months ago - Stars: 96 - Forks: 5

sbelharbi/feature-vs-text-compound-emotion

Textualized and Feature-based Models for Compound Multimodal Emotion Recognition in the Wild, ABAW 7th - Challenge - Compound Expression (CE) Recognition Challenge

Language: Python - Size: 1.41 MB - Last synced at: 25 days ago - Pushed at: 25 days ago - Stars: 4 - Forks: 0

ksm26/Open-Source-Models-with-Hugging-Face

"Open Source Models with Hugging Face" course empowers you with the skills to leverage open-source models from the Hugging Face Hub for various tasks in NLP, audio, image, and multimodal domains.

Language: Jupyter Notebook - Size: 21 MB - Last synced at: 24 days ago - Pushed at: about 1 year ago - Stars: 24 - Forks: 19

alipay/Ant-Multi-Modal-Framework

Research Code for Multimodal-Cognition Team in Ant Group

Language: Python - Size: 17 MB - Last synced at: 25 days ago - Pushed at: 9 months ago - Stars: 138 - Forks: 5

merveenoyan/siglip

Projects based on SigLIP (Zhai et al., 2023) and Hugging Face transformers integration 🤗

Language: Jupyter Notebook - Size: 1.66 MB - Last synced at: 17 days ago - Pushed at: about 2 months ago - Stars: 224 - Forks: 12

mims-harvard/Madrigal

Madrigal: Multimodal AI predicts clinical outcomes of drug combinations from preclinical data

Language: Jupyter Notebook - Size: 18.3 MB - Last synced at: 27 days ago - Pushed at: 27 days ago - Stars: 20 - Forks: 6

fork123aniket/Multi-Round-VLM-powered-Multimodal-Conversational-AI-Navigation-Bot

Streamlit App Combining Vision, Language, and Audio AI Models

Language: Python - Size: 18.6 KB - Last synced at: 23 days ago - Pushed at: 3 months ago - Stars: 3 - Forks: 0

Hoar012/RAP-MLLM

[CVPR 2025] RAP: Retrieval-Augmented Personalization

Language: Python - Size: 57.7 MB - Last synced at: 27 days ago - Pushed at: 27 days ago - Stars: 30 - Forks: 0

breezedeus/Coin-CLIP

Coin-CLIP: fine-tuned with a vast collection of coin images from CLIP using contrastive learning. It enhances feature extraction for coins, boosting image search accuracy. This model merges Visual Transformer (ViT) with CLIP's multimodal learning, optimized for numismatic applications.

Language: Python - Size: 50.3 MB - Last synced at: 19 days ago - Pushed at: over 1 year ago - Stars: 20 - Forks: 3

ai4ce/EgoPAT3D

[CVPR 2022] Egocentric Action Target Prediction in 3D

Language: Jupyter Notebook - Size: 93.3 MB - Last synced at: 10 days ago - Pushed at: over 1 year ago - Stars: 32 - Forks: 3

ZaneBrackley/VIZMed

Thesis Project | Vision-Integrated Zero-Shot Medical AI

Language: Python - Size: 85.9 KB - Last synced at: 24 days ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

Jorffy/DAIE

Code for "Dual-Level Adaptive Incongruity-Enhanced Model for Multimodal Sarcasm Detection".

Language: Python - Size: 183 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 20 - Forks: 0

praveena2j/RJCMA

ABAW6 (CVPR-W) We achieved second place in the valence arousal challenge of ABAW6

Language: Python - Size: 171 KB - Last synced at: 9 days ago - Pushed at: 11 months ago - Stars: 18 - Forks: 3

kyegomez/CM3Leon

An open source implementation of "Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning", an all-new multi modal AI that uses just a decoder to generate both text and images

Language: Python - Size: 754 KB - Last synced at: 15 days ago - Pushed at: over 1 year ago - Stars: 359 - Forks: 18

pykale/pykale

Knowledge-Aware machine LEarning (KALE): accessible machine learning from multiple sources for interdisciplinary research, part of the 🔥PyTorch ecosystem. ⭐ Star to support our work!

Language: Python - Size: 46.4 MB - Last synced at: 8 days ago - Pushed at: 11 days ago - Stars: 455 - Forks: 64

taco-group/DecAlign

A novel cross-modal decoupling and alignment framework for multimodal representation learning.

Language: JavaScript - Size: 13.8 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

praveena2j/RJCAforSpeakerVerification

[FG 2024] "Audio-Visual Person Verification based on Recursive Fusion of Joint Cross-Attention"

Language: Python - Size: 1 MB - Last synced at: 15 days ago - Pushed at: 5 months ago - Stars: 4 - Forks: 0

pengfei-luo/multimodal-knowledge-graph

A collection of resources on multimodal knowledge graph, including datasets, papers and contests.

Size: 50.8 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 162 - Forks: 17

snap-research/MMVID

[CVPR 2022] Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning

Language: Python - Size: 77.5 MB - Last synced at: 13 days ago - Pushed at: almost 3 years ago - Stars: 192 - Forks: 23

zjunlp/HVPNeT

[NAACL 2022 Findings] Good Visual Guidance Makes A Better Extractor: Hierarchical Visual Prefix for Multimodal Entity and Relation Extraction

Language: Python - Size: 1.88 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 108 - Forks: 11

OFA-Sys/OFASys

OFASys: A Multi-Modal Multi-Task Learning System for Building Generalist Models

Language: Python - Size: 20.3 MB - Last synced at: 14 days ago - Pushed at: over 2 years ago - Stars: 147 - Forks: 13

jyrao/UniSoccer

[CVPR 2025] "Towards Universal Soccer Video Understanding".

Language: Python - Size: 80.6 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 106 - Forks: 5

kyegomez/AutoRT

Implementation of AutoRT: "AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents"

Language: Python - Size: 2.49 MB - Last synced at: 11 days ago - Pushed at: 5 months ago - Stars: 39 - Forks: 3

HuaizhengZhang/Awsome-Deep-Learning-for-Video-Analysis

Papers, code and datasets about deep learning and multi-modal learning for video analysis

Size: 98.6 KB - Last synced at: about 1 month ago - Pushed at: over 3 years ago - Stars: 786 - Forks: 171

praveena2j/JointCrossAttentional-AV-Fusion

ABAW3 (CVPRW): A Joint Cross-Attention Model for Audio-Visual Fusion in Dimensional Emotion Recognition

Language: Python - Size: 148 KB - Last synced at: 9 days ago - Pushed at: over 1 year ago - Stars: 43 - Forks: 9

dlcjfgmlnasa/SynthSleepNet

[Arxiv] Toward Foundational Model for Sleep Analysis Using a Multimodal Hybrid Self-Supervised Learning Framework

Language: Python - Size: 521 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 7 - Forks: 2

RyanJJP/CHARMS

The code repository for ICML24 paper "Tabular Insights, Visual Impacts: Transferring Expertise from Tables to Images"

Language: Python - Size: 973 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 13 - Forks: 1

Xovee/skapp Fork of YifanZhang-git/SKAPP

AAAI '25. Retrieval-Augmented Multimodal Social Media Popularity Prediction

Language: Python - Size: 84 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 16 - Forks: 0

BUAADreamer/SPN4CIR

[ACM MM 2024] Improving Composed Image Retrieval via Contrastive Learning with Scaling Positives and Negatives

Language: Python - Size: 4.2 MB - Last synced at: 8 days ago - Pushed at: 6 months ago - Stars: 30 - Forks: 3

marcomistretta/marcomistretta

Welcome to my GitHub page!

Size: 6 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 1 - Forks: 0

rajibrhasan/modality_gap

A repository for visualization of modality gap in VLMs

Language: Python - Size: 14.6 KB - Last synced at: about 1 month ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

pliang279/factorized

[ICLR 2019] Learning Factorized Multimodal Representations

Language: Python - Size: 45.9 KB - Last synced at: 12 days ago - Pushed at: over 4 years ago - Stars: 67 - Forks: 10

rabiulcste/vismin

[NeurIPS24] VisMin: Visual Minimal-Change Understanding

Language: Python - Size: 66.1 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 12 - Forks: 1

rajibrhasan/LLaVA

This repository contains the implementation of a modified LLaVA architecture designed to address information imbalance between modalities in multimodal learning.

Language: Python - Size: 14.8 MB - Last synced at: 13 days ago - Pushed at: 27 days ago - Stars: 0 - Forks: 0

bryanbocao/vitag

Repository of the paper ViTag in SECON 2022🚀 and demo (Best Demo Award🏆).

Language: Python - Size: 401 KB - Last synced at: 16 days ago - Pushed at: about 1 year ago - Stars: 2 - Forks: 1

pliang279/MultiBench

[NeurIPS 2021] Multiscale Benchmarks for Multimodal Representation Learning

Language: HTML - Size: 49.9 MB - Last synced at: about 2 months ago - Pushed at: about 1 year ago - Stars: 519 - Forks: 75

peirong26/UNA

[CVPR 2025] Unraveling Normal Anatomy via Fluid-Driven Anomaly Randomization

Language: Python - Size: 1.25 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 4 - Forks: 0

OmniTitanAI/OmniTitan-RL-AI

A universal RL engine transcending modality barriers, empowering cross-industry intelligence with superhuman decision efficiency. Created by @sudip_royedu

Language: Python - Size: 0 Bytes - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 1 - Forks: 0

kevalmorabia97/CoVA-Web-Object-Detection

A Context-aware Visual Attention-based training pipeline for Object Detection from a Webpage screenshot!

Language: Python - Size: 1.4 MB - Last synced at: 14 days ago - Pushed at: about 2 months ago - Stars: 92 - Forks: 14

IRVLUTD/Proto-CLIP

Code release for Proto-CLIP: Vision-Language Prototypical Network for Few-Shot Learning

Language: Python - Size: 69.1 MB - Last synced at: 15 days ago - Pushed at: 4 months ago - Stars: 39 - Forks: 6

minjoong507/BM-DETR

[WACV 2025] Official Pytorch code for "Background-aware Moment Detection for Video Moment Retrieval"

Language: Python - Size: 3.07 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 14 - Forks: 0

Related Keywords
multimodal-learning 295, deep-learning 68, computer-vision 46, multimodal 45, pytorch 44, machine-learning 40, multimodal-deep-learning 38, natural-language-processing 18, large-language-models 17, nlp 16, representation-learning 14, multimodal-large-language-models 14, multimodality 14, visual-question-answering 13, transformer 13, image-captioning 12, llm 12, clip 12, artificial-intelligence 12, self-supervised-learning 12, vision-and-language 11, emotion-recognition 11, attention-mechanism 10, contrastive-learning 10, multimodal-sentiment-analysis 10, attention-model 10, vision-language-model 9, vision-language 9, video-understanding 9, generative-ai 8, multimodal-data 8, affective-computing 8, audio-visual-learning 8, foundation-models 8, ai 7, llms 7, transfer-learning 7, weakly-supervised-learning 6, speech-processing 6, attention 6, robotics 6, deep-neural-networks 6, remote-sensing 6, python 6, vision-language-transformer 6, sentiment-analysis 6, dataset 5, bert 5, classification 5, multimodal-fusion 5, medical-imaging 5, large-multimodal-models 5, language-model 5, pre-training 5, multimodal-datasets 5, prompt-learning 5, awesome-list 5, convolutional-neural-networks 5, regression 4, biosignals 4, tensorflow 4, video-analysis 4, gpt4 4, zero-shot-learning 4, zero-shot-classification 4, multimodal-representation 4, reinforcement-learning 4, generative-model 4, video-grounding 4, cross-modal-retrieval 4, vision-language-learning 4, multisensor-fusion 4, domain-adaptation 4, pytorch-lightning 4, knowledge-distillation 4, diffusion-models 4, multi-modal-learning 4, multitask-learning 4, vqa 4, ensemble-learning 3, internvl2 3, action-recognition 3, attention-is-all-you-need 3, few-shot-learning 3, music-information-retrieval 3, object-detection 3, signal-processing 3, audio 3, neural-network 3, emotion 3, data-fusion 3, transformer-architecture 3, speech-recognition 3, information-extraction 3, videoqa 3, keras 3, python3 3, segmentation 3, eeg 3, medical-ai 3
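
The per-repository metadata lines in the listing above ("Language: ... - Size: ... - Stars: ... - Forks: ...") follow a regular "Key: value" pattern separated by " - ". As a rough sketch only, assuming that pattern holds for every entry, such a line could be split into fields as shown below. The helper names `parse_metadata` and `to_int` are hypothetical and are not part of the service's API.

```python
def parse_metadata(line: str) -> dict:
    """Split a 'Key: value - Key: value' line into a dict of strings.

    Assumption: fields are separated by ' - ' and no value contains ' - '.
    """
    fields = {}
    for part in line.split(" - "):
        key, sep, value = part.partition(": ")
        if sep:  # skip fragments that are not 'Key: value'
            fields[key.strip()] = value.strip()
    return fields


def to_int(text: str) -> int:
    """Convert a count such as '1,346' to an integer."""
    return int(text.replace(",", ""))


# Example line copied from the listing above.
example = (
    "Language: Python - Size: 903 MB - Last synced at: 3 days ago"
    " - Pushed at: over 1 year ago - Stars: 1,346 - Forks: 149"
)
meta = parse_metadata(example)
print(meta["Language"], to_int(meta["Stars"]), to_int(meta["Forks"]))
```

Entries without a "Language" field simply produce a dict without that key, so callers may want to use `meta.get("Language")` rather than indexing directly.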