GitHub topics: multimodal-deep-learning

Repositories

kyegomez/the-compiler

Seed, Code, Harvest: Grow Your Own App with Tree of Thoughts!

Language: Python - Size: 10.4 MB - Last synced at: about 1 month ago - Pushed at: almost 2 years ago - Stars: 145 - Forks: 17

ashutosh1919/data2vec-pytorch

Ready-to-run PyTorch implementation of Data2Vec 2.0: Highly efficient self-supervised representation learning for vision, speech and text.

Language: Python - Size: 116 KB - Last synced at: 4 days ago - Pushed at: about 2 years ago - Stars: 14 - Forks: 2

nlp-unibo/multimodal-am-fallacy

Multimodal Fallacy Classification in Political Debates: Dataset and Experiments.

Language: Python - Size: 11.2 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 4 - Forks: 0

AlfredoBaione/Music_to_figurative_art

A project for generating artistic images semantically related to music inputs.

Language: Python - Size: 5.65 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

MohamedTharwat21/MemexQA

MemexQA is a project designed to tackle the challenge of real-life multimodal question answering by leveraging both visual and textual data from personal photo albums.

Language: Python - Size: 5.12 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 1 - Forks: 0

kyegomez/Kosmos2.5

My implementation of Kosmos2.5 from the paper: "KOSMOS-2.5: A Multimodal Literate Model"

Language: Python - Size: 231 KB - Last synced at: 1 day ago - Pushed at: about 1 month ago - Stars: 73 - Forks: 6

Avir-AI/handimage_mamba

[IKT 2024] A Multi-Task Framework Using Mamba for Identity, Age, and Gender Classification from Hand Images

Language: Python - Size: 73.2 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

codezakh/DataEnvGym

A testbed for agents and environments that can automatically improve models through data generation.

Language: Python - Size: 9.16 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 19 - Forks: 5

kyegomez/swarms-pytorch

Swarming algorithms like PSO, Ant Colony, Sakana, and more in PyTorch 😊

Language: Python - Size: 58.2 MB - Last synced at: 8 days ago - Pushed at: about 1 month ago - Stars: 121 - Forks: 10

DWCTOD/ECCV2022-Papers-with-Code-Demo

A collection of the latest ECCV results, including papers, code, and demo videos. Recommendations are welcome!

Size: 170 KB - Last synced at: about 2 months ago - Pushed at: over 2 years ago - Stars: 286 - Forks: 23

chikap421/videosam

This repository accompanies the paper "VideoSAM: A Large Vision Foundation Model for High-Speed Video Segmentation"

Language: Jupyter Notebook - Size: 160 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 5 - Forks: 1

jianghaojun/Awesome-3D-Vision-and-Language

A collection of 3D vision and language (e.g., 3D Visual Grounding, 3D Question Answering and 3D Dense Caption) papers and datasets.

Size: 33.2 KB - Last synced at: 5 days ago - Pushed at: about 2 years ago - Stars: 97 - Forks: 5

discover-Austin/multimodal-emotion-recognition

A deep learning system for real-time emotion recognition from both text and images using transformers.

Language: Python - Size: 0 Bytes - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

deepmancer/deepmancer

"When in doubt, use brute force." - Ken Thompson

Size: 431 KB - Last synced at: 5 days ago - Pushed at: 3 months ago - Stars: 3 - Forks: 0

declare-lab/multimodal-deep-learning

This repository contains various models targeting multimodal representation learning and multimodal fusion for downstream tasks such as multimodal sentiment analysis.

Language: OpenEdge ABL - Size: 181 MB - Last synced at: 3 months ago - Pushed at: about 2 years ago - Stars: 801 - Forks: 157

Computational-social-science/Skew-pair_Fusion

We propose a holistic framework that formalizes a dual interpretable mechanism, comprising universal skew-layer alignment and bootstrapping sparsity, to enhance fusion gain in hybrid neural networks.

Language: Jupyter Notebook - Size: 6.78 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

kyegomez/PALI3

Implementation of PALI3 from the paper "PALI-3 VISION LANGUAGE MODELS: SMALLER, FASTER, STRONGER"

Language: Python - Size: 2.61 MB - Last synced at: 30 days ago - Pushed at: about 1 month ago - Stars: 145 - Forks: 4

omeregev/click2mask

[AAAI 2025] Official Implementation for "Click2Mask: Local Editing with Dynamic Mask Generation" Paper.

Language: Python - Size: 61.9 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 14 - Forks: 2

vvvb-github/AVSegFormer

[AAAI 2024] AVSegFormer: Audio-Visual Segmentation with Transformer

Language: Python - Size: 486 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 62 - Forks: 5

drprojects/DeepViewAgg

[CVPR'22 Best Paper Finalist] Official PyTorch implementation of the method presented in "Learning Multi-View Aggregation In the Wild for Large-Scale 3D Semantic Segmentation"

Language: Python - Size: 302 MB - Last synced at: about 1 month ago - Pushed at: 9 months ago - Stars: 228 - Forks: 25

naver/artemis

Official code release for ARTEMIS: Attention-based Retrieval with Text-Explicit Matching and Implicit Similarity (published at ICLR 2022)

Language: Python - Size: 1.26 MB - Last synced at: 28 days ago - Pushed at: over 2 years ago - Stars: 48 - Forks: 4

first-coding/Multimodal-Assistant

Language: Python - Size: 1.29 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

JanTeichertKluge/DMLSim

This library provides packages on DoubleML / Causal Machine Learning and Neural Networks in Python for Simulation and Case Studies.

Language: Python - Size: 145 KB - Last synced at: 29 days ago - Pushed at: almost 2 years ago - Stars: 6 - Forks: 0

jahez07/Multimodal-Fusion-Strategy-to-Classify-Malware

This work proposes a novel approach to classifying malware binaries by extracting visual features from malware executables.

Language: Jupyter Notebook - Size: 257 MB - Last synced at: about 1 month ago - Pushed at: 10 months ago - Stars: 4 - Forks: 0

sisinflab/Ducho

Python framework to extract multimodal features for multimodal recommendation in a highly-customizable way.

Language: Python - Size: 3.62 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 22 - Forks: 5

kritiksoman/Multimodal

Listen. Write. Speak. Read. Think.

Language: Jupyter Notebook - Size: 15.2 MB - Last synced at: 10 days ago - Pushed at: about 3 years ago - Stars: 11 - Forks: 0

yuhui-zh15/drml

Official Code Release for "Diagnosing and Rectifying Vision Models using Language" (ICLR 2023)

Language: Jupyter Notebook - Size: 19.2 MB - Last synced at: 28 days ago - Pushed at: almost 2 years ago - Stars: 33 - Forks: 0

Yuco-Z/Awesome-Multi-Modal-Dialog

[Paperlist] Awesome paper list of multimodal dialog, including methods, datasets and metrics

Size: 169 KB - Last synced at: 6 days ago - Pushed at: 4 months ago - Stars: 39 - Forks: 4

AI4Patents/IMPACT

IMPACT: A Large-scale Integrated Multimodal Patent Analysis and Creation Dataset for Design Patents (NeurIPS 2024)

Language: Jupyter Notebook - Size: 23.6 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 7 - Forks: 1

westlake-repl/IDvs.MoRec

End-to-end Training for Multimodal Recommendation Systems

Language: Python - Size: 57.6 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 139 - Forks: 18

fork123aniket/Agentic-RAG-Story-Generation-with-Multimodal-GenAI

Multimodal Agentic GenAI Workflow – Seamlessly blends retrieval and generation for intelligent storytelling

Language: Python - Size: 94.7 KB - Last synced at: about 1 month ago - Pushed at: 3 months ago - Stars: 2 - Forks: 0

IDEA-Research/ChatRex

Code for ChatRex: Taming Multimodal LLM for Joint Perception and Understanding

Language: Python - Size: 8.8 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 124 - Forks: 3

JunweiLiang/FVTA_MemexQA

Real-world photo sequence question answering system (MemexQA). CVPR'18 and TPAMI'19

Language: Python - Size: 723 KB - Last synced at: 23 days ago - Pushed at: almost 6 years ago - Stars: 32 - Forks: 15

This repository contains the development of SynthAVSR, the first Audiovisual Speech Recognition (AVSR) system tailored for the Spanish and Catalan languages. Based on the AV-HuBERT (Audio-Visual Hidden Unit BERT) model, SynthAVSR leverages synthetic audiovisual data to bridge the gap in speech recognition technology for these languages.

Language: Python - Size: 290 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

Adm-2005/DeMorph

Deepfake Detection Solution using Multimodal Approach.

Language: Python - Size: 10.9 MB - Last synced at: about 1 month ago - Pushed at: 5 months ago - Stars: 2 - Forks: 0

ThomasHelfer/multimodal-supernovae

A codebase dedicated to exploring multimodal learning approaches by integrating images of host galaxies of supernovae and their corresponding light-curves and spectra.

Language: Jupyter Notebook - Size: 1.66 GB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 12 - Forks: 2

ZhaoPeiduo/BLIP2-Japanese

Modifying LAVIS' BLIP2 Q-former with models pretrained on Japanese datasets.

Language: Python - Size: 75.9 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 12 - Forks: 1

mbappeenjoyer/GIF-QA

Documentation of the approach employed to tackle the task of GIF Question Answering

Language: Jupyter Notebook - Size: 2.72 MB - Last synced at: 3 months ago - Pushed at: 5 months ago - Stars: 1 - Forks: 0

theislab/scarches

Reference mapping for single-cell genomics

Language: Jupyter Notebook - Size: 825 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 347 - Forks: 52

fevieira27/ImageRecognitionAI-R

R Script for AI Image and Location Recognition that can also generate an automated prompt for AI text-generation of a social media post.

Language: R - Size: 879 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 1 - Forks: 0

kyegomez/Pegasus

PegasusX: The Future of Multimodal Embeddings 🦄 🦄

Language: Python - Size: 37.5 MB - Last synced at: 27 days ago - Pushed at: 7 months ago - Stars: 16 - Forks: 5

ksm26/Introducing-Multimodal-Llama-3.2

This repository focuses on the cutting-edge features of Llama 3.2, including multimodal capabilities, advanced tokenization, and tool calling for building next-gen AI applications. It highlights Llama's enhanced image reasoning, multilingual support, and the Llama Stack API for seamless customization and orchestration.

Language: Jupyter Notebook - Size: 3.79 MB - Last synced at: about 2 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

FuxiaoLiu/DocumentCLIP

[ICPRAI 2024] DocumentCLIP: Linking Figures and Main Body Text in Reflowed Documents

Language: Python - Size: 2.49 MB - Last synced at: 1 day ago - Pushed at: about 1 year ago - Stars: 16 - Forks: 0

busraoguzoglu/Image-Similarity-Search

Using CLIP/Titan/ALIGN for Multimodal Image Search: Searching images with a keyword or with a sample image

Language: Jupyter Notebook - Size: 11.2 MB - Last synced at: 3 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

attarmau/Multimodal-Misinformation-Detection

Multimodal deep learning model for fake news classification.

Size: 7.81 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

kassy11/daicwoz_voice

Preprocessing and feature extraction for raw voice data of DAIC-WOZ

Language: Python - Size: 1.95 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

FIVEYOUNGWOO/WiFiMobNet

WiFi-Camera multimodal learning-based object detection and pose estimation.

Language: Python - Size: 560 MB - Last synced at: 3 days ago - Pushed at: 5 months ago - Stars: 2 - Forks: 1

sutdcv/SUTD-TrafficQA

[CVPR2021] SUTD-TrafficQA: A Question Answering Benchmark and an Efficient Network for Video Reasoning over Traffic Events

Language: JavaScript - Size: 6 MB - Last synced at: about 2 months ago - Pushed at: 9 months ago - Stars: 53 - Forks: 2

Vasugi2003/Fusion-AI---MultiModal-Persuvasiveness-Prediction

Developed a system to predict persuasiveness using multi-modal data (text, images, audio). Utilized BERT for text embeddings, ResNet for image features, and Librosa for audio analysis. Fused data from all modalities for enhanced prediction accuracy.

Language: Jupyter Notebook - Size: 770 KB - Last synced at: about 1 month ago - Pushed at: 7 months ago - Stars: 1 - Forks: 0

haamoon/mmtm

Implementation of CVPR 2020 paper "MMTM: Multimodal Transfer Module for CNN Fusion"

Language: Python - Size: 47.9 KB - Last synced at: 4 months ago - Pushed at: almost 5 years ago - Stars: 112 - Forks: 21

Eva-Kaushik/EMKGCN-MultiModal-Music-Recommender

The `MKGCN` class, coupled with the Spotify API, orchestrates a multi-modal knowledge graph convolutional network to enhance music recommendation systems by integrating user interaction data and diverse music modalities.

Language: Jupyter Notebook - Size: 11 MB - Last synced at: about 1 month ago - Pushed at: about 1 year ago - Stars: 4 - Forks: 1

UmarIgan/Machine-Learning

A set of Jupyter notebooks

Language: Jupyter Notebook - Size: 16.7 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 23 - Forks: 8

kelechi-c/ripple_net

image retrieval/tagging with CLIP

Language: Python - Size: 416 KB - Last synced at: 20 days ago - Pushed at: 10 months ago - Stars: 13 - Forks: 1

deepur71/InstructPix2Pix

Implementation of InstructPix2Pix from scratch

Language: Python - Size: 0 Bytes - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 1 - Forks: 0

ahmdtaha/distributed_sigmoid_loss

Unofficial implementation for Sigmoid Loss for Language Image Pre-Training

Language: Python - Size: 62.5 KB - Last synced at: 18 days ago - Pushed at: over 1 year ago - Stars: 10 - Forks: 0

firojalam/multimodal_social_media

multimodal social media content (text, image) classification

Language: Python - Size: 3.54 MB - Last synced at: about 1 month ago - Pushed at: almost 3 years ago - Stars: 50 - Forks: 14

ilaria-manco/muscaps

Source code for "MusCaps: Generating Captions for Music Audio" (IJCNN 2021)

Language: Jupyter Notebook - Size: 91.9 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 77 - Forks: 7

zhongshsh/MoExtend

ACL 2024 (SRW), Official Codebase of our Paper: "MoExtend: Tuning New Experts for Modality and Task Extension"

Language: Python - Size: 542 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 10 - Forks: 0

meysam-safarzadeh/multimodal

This project is a multimodal transformer-based model that fuses RGB, thermal, and depth modalities to predict pain intensity in 5 classes.

Language: Python - Size: 111 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

chikap421/mseg_vcuq

This repository accompanies the paper "MSEG-VCUQ: Multimodal SEGmentation with Enhanced Vision Foundation Models, Convolutional Neural Networks, and Uncertainty Quantification for High-Speed Video Phase Detection Data"

Language: MATLAB - Size: 1.48 GB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 1 - Forks: 0

Gaoqiandi/MultiT2

MultiT2 is an algorithm that connects disparate data from bacterial aromatic polyketides through multimodal learning. It specifically focuses on integrating protein sequences (CLFs) and chemical structures (SMILES) to predict and discover type II polyketide (T2PK) natural products.

Language: Jupyter Notebook - Size: 63 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0

BorgwardtLab/DeepEST

Language: Python - Size: 396 KB - Last synced at: 28 days ago - Pushed at: over 1 year ago - Stars: 4 - Forks: 0

eftekhar-hossain/Bengali-Aggression-Memes (fork of shawlyahsan/Bengali-Aggression-Memes)

[EACL'24] A Multimodal Framework to Detect Target Aware Aggression Memes

Language: Python - Size: 535 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0

unitaryai/VTC

VTC: Improving Video-Text Retrieval with User Comments

Language: Python - Size: 5.45 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 11 - Forks: 0

yanganYNU/AFFGCN

Attention Feature Fusion based on Spatial-Temporal Graph Convolutional Network (AFFGCN)

Language: Python - Size: 144 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 36 - Forks: 1

ManifoldRG/NEKO

In-progress implementation of a GATO-style generalist multimodal model capable of image, text, RL, and robotics tasks

Language: Python - Size: 515 KB - Last synced at: 7 months ago - Pushed at: 11 months ago - Stars: 46 - Forks: 10

yongfanbeta/awesome-multimodal-healthcare

Reading list for multimodal learning in healthcare

Size: 8.79 KB - Last synced at: 24 days ago - Pushed at: over 1 year ago - Stars: 11 - Forks: 2

anamabo/SegmentWater

Tools to create output for PaliGemma to segment water in satellite images.

Language: Jupyter Notebook - Size: 27.3 KB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 1 - Forks: 0

JerryX1110/awesome-rvos

Referring Video Object Segmentation / Multi-Object Tracking Repo

Language: Python - Size: 79.1 KB - Last synced at: 2 days ago - Pushed at: almost 2 years ago - Stars: 87 - Forks: 4

phellonchen/awesome-visual-dialog

Recent Advances in Visual Dialog

Size: 36.1 KB - Last synced at: 3 days ago - Pushed at: over 2 years ago - Stars: 30 - Forks: 1

choyingw/CFCNet

NeurIPS 2019: Deep RGB-D Canonical Correlation Analysis For Sparse Depth Completion

Language: Python - Size: 31.8 MB - Last synced at: about 1 month ago - Pushed at: about 4 years ago - Stars: 37 - Forks: 4

katerynaCh/MMA-DFER

This repository provides the code for MMA-DFER, a multimodal (audiovisual) emotion recognition method. It is the official implementation of the paper "MMA-DFER: MultiModal Adaptation of unimodal models for Dynamic Facial Expression Recognition in-the-wild".

Language: Python - Size: 1.77 MB - Last synced at: 8 months ago - Pushed at: 8 months ago - Stars: 10 - Forks: 1