GitHub topics: multi-modality
jina-ai/clip-as-service
🏄 Scalable embedding, reasoning, ranking for images and sentences with CLIP
Language: Python - Size: 27.4 MB - Last synced at: about 4 hours ago - Pushed at: about 1 year ago - Stars: 12,639 - Forks: 2,075

BradyFU/Awesome-Multimodal-Large-Language-Models
✨✨ Latest Advances on Multimodal Large Language Models
Size: 82.8 MB - Last synced at: 1 day ago - Pushed at: 5 days ago - Stars: 14,772 - Forks: 943

LSXI7/MINIMA
[CVPR 2025] MINIMA: Modality Invariant Image Matching
Language: Python - Size: 44.3 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 347 - Forks: 24

kyegomez/VortexFusion
Transformers + Mambas + LSTMs All in One Model
Language: Python - Size: 2.16 MB - Last synced at: about 23 hours ago - Pushed at: 1 day ago - Stars: 8 - Forks: 1

GLUS-video/GLUS
[CVPR 2025] Official PyTorch Implementation of GLUS: Global-Local Reasoning Unified into A Single Large Language Model for Video Segmentation
Language: Jupyter Notebook - Size: 66.4 MB - Last synced at: 2 days ago - Pushed at: 3 days ago - Stars: 31 - Forks: 2

kyegomez/MultiModal-ToT
Multi-Modal Tree of Thoughts for DALL-E 3-like automatic self-improvement
Language: Python - Size: 81.2 MB - Last synced at: about 23 hours ago - Pushed at: 5 months ago - Stars: 16 - Forks: 2

kyegomez/swarms
The Enterprise-Grade Production-Ready Multi-Agent Orchestration Framework. Website: https://swarms.ai
Language: Python - Size: 103 MB - Last synced at: 5 days ago - Pushed at: 6 days ago - Stars: 4,793 - Forks: 547

haotian-liu/LLaVA
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
Language: Python - Size: 13.4 MB - Last synced at: 5 days ago - Pushed at: 8 months ago - Stars: 22,209 - Forks: 2,442

EvolvingLMMs-Lab/Otter
🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability.
Language: Python - Size: 7.39 MB - Last synced at: 2 days ago - Pushed at: about 1 year ago - Stars: 3,247 - Forks: 213

Orlando-CS/Awesome-VLA
✨✨ Latest advancements in VLA (Vision Language Action) models
Size: 0 Bytes - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 1 - Forks: 0

The-Martyr/CausalMM
[ICLR 2025] Mitigating Modality Prior-Induced Hallucinations in Multimodal Large Language Models via Deciphering Attention Causality
Language: Python - Size: 2.99 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 17 - Forks: 1

RLHF-V/RLHF-V
[CVPR'24] RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
Language: Python - Size: 70.6 MB - Last synced at: 1 day ago - Pushed at: 7 months ago - Stars: 276 - Forks: 8

InternLM/InternLM-XComposer
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
Language: Python - Size: 199 MB - Last synced at: 10 days ago - Pushed at: 3 months ago - Stars: 2,805 - Forks: 171

yuanze-lin/Olympus
[CVPR 2025 Highlight] Official code for "Olympus: A Universal Task Router for Computer Vision Tasks"
Language: Python - Size: 3.49 MB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 172 - Forks: 35

kyegomez/TinyGPTV
Simple Implementation of TinyGPTV in super simple Zeta lego blocks
Language: Python - Size: 2.17 MB - Last synced at: about 23 hours ago - Pushed at: 5 months ago - Stars: 16 - Forks: 0

skit-ai/SpeechLLM
This repository contains the training, inference, evaluation code for SpeechLLM models and details about the model releases on huggingface.
Language: Python - Size: 3.88 MB - Last synced at: 9 days ago - Pushed at: 10 months ago - Stars: 98 - Forks: 9

kyegomez/MoE-Mamba
Implementation of MoE Mamba from the paper: "MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts" in Pytorch and Zeta
Language: Python - Size: 2.17 MB - Last synced at: about 23 hours ago - Pushed at: 14 days ago - Stars: 102 - Forks: 5

StaRainJ/MINIMA Fork of LSXI7/MINIMA
[CVPR 2025] MINIMA: Modality Invariant Image Matching
Language: Python - Size: 42.2 MB - Last synced at: 14 days ago - Pushed at: 15 days ago - Stars: 4 - Forks: 0

DLR-RM/3DObjectTracking
Algorithms and Publications on 3D Object Tracking
Language: C++ - Size: 201 MB - Last synced at: 12 days ago - Pushed at: 11 months ago - Stars: 860 - Forks: 150

kyegomez/qformer
Implementation of Qformer from BLIP2 in Zeta Lego blocks.
Language: Python - Size: 2.19 MB - Last synced at: 12 days ago - Pushed at: 5 months ago - Stars: 38 - Forks: 0

lucidrains/deep-daze
Simple command line tool for text to image generation using OpenAI's CLIP and Siren (Implicit neural representation network). Technique was originally created by https://twitter.com/advadnoun
Language: Python - Size: 6.68 MB - Last synced at: 10 days ago - Pushed at: about 3 years ago - Stars: 4,365 - Forks: 319

vbdi/divprune
[CVPR 2025] DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models
Language: Python - Size: 11 MB - Last synced at: 19 days ago - Pushed at: 19 days ago - Stars: 7 - Forks: 0

OpenGVLab/Multi-Modality-Arena
Chatbot Arena meets multi-modality! Multi-Modality Arena allows you to benchmark vision-language models side-by-side while providing images as inputs. Supports MiniGPT-4, LLaMA-Adapter V2, LLaVA, BLIP-2, and many more!
Language: Python - Size: 21.5 MB - Last synced at: 17 days ago - Pushed at: 12 months ago - Stars: 511 - Forks: 37

sshh12/multi_token
Embed arbitrary modalities (images, audio, documents, etc) into large language models.
Language: Python - Size: 1.22 MB - Last synced at: 20 days ago - Pushed at: about 1 year ago - Stars: 183 - Forks: 14

dvlab-research/VisionZip
Official repository for VisionZip (CVPR 2025)
Language: Python - Size: 18.2 MB - Last synced at: 24 days ago - Pushed at: about 2 months ago - Stars: 260 - Forks: 10

ziqihuangg/Collaborative-Diffusion
[CVPR 2023] Collaborative Diffusion
Language: Python - Size: 4.25 MB - Last synced at: 24 days ago - Pushed at: over 1 year ago - Stars: 424 - Forks: 34

trendscenter/fit
Fusion ICA Toolbox (MATLAB)
Language: MATLAB - Size: 17 MB - Last synced at: 11 days ago - Pushed at: about 2 months ago - Stars: 31 - Forks: 6

kyegomez/MambaByte
Implementation of MambaByte in "MambaByte: Token-free Selective State Space Model" in Pytorch and Zeta
Language: Python - Size: 2.16 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 115 - Forks: 7

dvlab-research/UVTR
Unifying Voxel-based Representation with Transformer for 3D Object Detection (NeurIPS 2022)
Language: Python - Size: 621 KB - Last synced at: about 1 month ago - Pushed at: over 2 years ago - Stars: 233 - Forks: 17

ZwwWayne/mmMOT
[ICCV2019] Robust Multi-Modality Multi-Object Tracking
Language: Python - Size: 2.53 MB - Last synced at: 20 days ago - Pushed at: over 5 years ago - Stars: 254 - Forks: 24

kyegomez/Andromeda
An all-new Language Model That Processes Ultra-Long Sequences of 100,000+ Ultra-Fast
Language: Python - Size: 66 MB - Last synced at: 14 days ago - Pushed at: 8 months ago - Stars: 146 - Forks: 23

kyegomez/the-compiler
Seed, Code, Harvest: Grow Your Own App with Tree of Thoughts!
Language: Python - Size: 10.4 MB - Last synced at: 11 days ago - Pushed at: over 1 year ago - Stars: 145 - Forks: 17

kyegomez/MC-ViT
Implementation of the model: "(MC-ViT)" from the paper: "Memory Consolidation Enables Long-Context Video Understanding"
Language: Python - Size: 2.17 MB - Last synced at: 4 days ago - Pushed at: 16 days ago - Stars: 21 - Forks: 1

jonathanjsjsc/Swarm
🦟 Interactive swarm simulation where pointer swarms follow your cursor - WebGL / threejs
Language: HTML - Size: 37.1 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

kyegomez/HRTX
Multi-Modal Multi-Embodied Hivemind-like Iteration of RTX-2
Language: Python - Size: 2.19 MB - Last synced at: 4 days ago - Pushed at: 5 months ago - Stars: 16 - Forks: 3

Skyline-9/Shotluck-Holmes
[ACM MMGR '24] 🔍 Shotluck Holmes: A family of small-scale LLVMs for shot-level video understanding
Language: Python - Size: 26.3 MB - Last synced at: 24 days ago - Pushed at: 6 months ago - Stars: 11 - Forks: 0

kyegomez/Kosmos2.5
My implementation of Kosmos2.5 from the paper: "KOSMOS-2.5: A Multimodal Literate Model"
Language: Python - Size: 231 KB - Last synced at: 15 days ago - Pushed at: 16 days ago - Stars: 73 - Forks: 6

amazon-science/crossmodal-contrastive-learning
CrossCLR: Cross-modal Contrastive Learning For Multi-modal Video Representations, ICCV 2021
Language: Python - Size: 766 KB - Last synced at: 13 days ago - Pushed at: about 3 years ago - Stars: 62 - Forks: 11

kyegomez/Athena-for-Search
The World's First AI-Enabled Multi-Modality Native Search Engine
Language: TypeScript - Size: 5.58 MB - Last synced at: about 23 hours ago - Pushed at: about 1 year ago - Stars: 24 - Forks: 6

kyegomez/MLXTransformer
Simple Implementation of a Transformer in the new framework MLX by Apple
Language: Python - Size: 2.18 MB - Last synced at: about 23 hours ago - Pushed at: 5 months ago - Stars: 20 - Forks: 1

voidful/MMLM
Toward Multi Modality Language Model - implementation of GPT-4o/Project Astra
Language: Python - Size: 688 KB - Last synced at: 16 days ago - Pushed at: 4 months ago - Stars: 14 - Forks: 4

chenshuang-zhang/imagenet_d
[CVPR 2024 Highlight] ImageNet-D
Language: Python - Size: 49.3 MB - Last synced at: 20 days ago - Pushed at: 6 months ago - Stars: 41 - Forks: 5

kyegomez/forest-of-thoughts
A forest of autonomous agents.
Language: Python - Size: 224 KB - Last synced at: about 23 hours ago - Pushed at: 3 months ago - Stars: 19 - Forks: 1

ChenHongruixuan/BRIGHT
[IEEE GRSS DFC 2025 Track II] BRIGHT: A globally distributed multimodal VHR dataset for all-weather disaster response
Language: Python - Size: 159 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 88 - Forks: 11

yangcaoai/CoDA_NeurIPS2023
Official code for NeurIPS2023 paper: CoDA: Collaborative Novel Box Discovery and Cross-modal Alignment for Open-vocabulary 3D Object Detection
Language: Jupyter Notebook - Size: 71.2 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 188 - Forks: 17

kyegomez/Fuyu
Implementation of Adept's all-new Fuyu multi-modality model in PyTorch
Language: Python - Size: 393 KB - Last synced at: about 23 hours ago - Pushed at: 5 months ago - Stars: 25 - Forks: 3

jina-ai/rungpt
An open-source, cloud-native serving framework for large multi-modal models (LMMs).
Language: Python - Size: 5.29 MB - Last synced at: 1 day ago - Pushed at: over 1 year ago - Stars: 161 - Forks: 22

aws-samples/multi-modal-examples-for-amazon-sagemaker
A workshop for collections of multi-modal LLM examples, samples, reference architecture and demos on Amazon SageMaker.
Language: Jupyter Notebook - Size: 33.7 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 4 - Forks: 2

han-liu/awesome-missing-modality-for-medical-images
A comprehensive review of techniques to address the missing-modality problem for medical images
Size: 75.2 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 31 - Forks: 2

ChennyDeng/MM-APE
Towards Multi-Modal Animal Pose Estimation: An In-Depth Analysis
Size: 135 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 23 - Forks: 0

BubbleWang-wly/EIEA
Explicit-Implicit Entity Alignment Method in Multi-modal Knowledge Graphs
Language: Python - Size: 755 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 1 - Forks: 0

rentainhe/TRAR-VQA
[ICCV 2021] Official implementation of the paper "TRAR: Routing the Attention Spans in Transformers for Visual Question Answering"
Language: Python - Size: 927 KB - Last synced at: 8 days ago - Pushed at: over 3 years ago - Stars: 66 - Forks: 18

manyaafonso/ultrasound_denoising_GAN
Demo to use GANs for denoising and synthesising ultrasound images
Language: Jupyter Notebook - Size: 4.58 MB - Last synced at: 22 days ago - Pushed at: 7 months ago - Stars: 2 - Forks: 1

kyegomez/MM1
PyTorch Implementation of the paper "MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training"
Language: Python - Size: 2.17 MB - Last synced at: 5 days ago - Pushed at: 8 days ago - Stars: 23 - Forks: 1

kyegomez/GATS
Implementation of GATS from the paper: "GATS: Gather-Attend-Scatter" in pytorch and zeta
Language: Python - Size: 2.17 MB - Last synced at: 1 day ago - Pushed at: 5 months ago - Stars: 8 - Forks: 0

rsy6318/CorrI2P
[TCSVT] CorrI2P: Deep Image-to-Point Cloud Registration via Dense Correspondence. The code of CorrI2P.
Language: Python - Size: 1.27 MB - Last synced at: 7 months ago - Pushed at: 7 months ago - Stars: 68 - Forks: 9

mida-project/prototype-multi-modality-assistant
[IJHCS] An assistant prototype for breast cancer diagnosis prepared with a multimodality strategy. The work was published in the International Journal of Human-Computer Studies.
Language: JavaScript - Size: 4.04 MB - Last synced at: 7 months ago - Pushed at: 7 months ago - Stars: 2 - Forks: 1

kyegomez/SwarmOS
An all-new OS that orchestrates autonomous agents as workers to execute tasks.
Language: Shell - Size: 2.21 MB - Last synced at: about 23 hours ago - Pushed at: 5 months ago - Stars: 17 - Forks: 2

dvlab-research/Prompt-Highlighter
[CVPR 2024] Prompt Highlighter: Interactive Control for Multi-Modal LLMs
Language: Python - Size: 17.4 MB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 110 - Forks: 2

kyegomez/VisionDatasets
Open source scripts to create large scale datasets with rich detail for multi-modal models
Language: Python - Size: 34.9 MB - Last synced at: about 23 hours ago - Pushed at: about 1 year ago - Stars: 11 - Forks: 0

kyegomez/HSSS
Implementation of a Hierarchical Mamba as described in the paper: "Hierarchical State Space Models for Continuous Sequence-to-Sequence Modeling"
Language: Python - Size: 2.19 MB - Last synced at: about 23 hours ago - Pushed at: 5 months ago - Stars: 13 - Forks: 2

kyegomez/AoA-torch
Implementation of Attention on Attention in Zeta
Language: Python - Size: 2.19 MB - Last synced at: about 23 hours ago - Pushed at: 1 day ago - Stars: 4 - Forks: 0

nagababumo/Open-Source-Models-with-Hugging-Face
Language: Jupyter Notebook - Size: 17 MB - Last synced at: about 2 months ago - Pushed at: 10 months ago - Stars: 0 - Forks: 0

DerrickWang005/CRIS.pytorch
An official PyTorch implementation of the CRIS paper
Language: Python - Size: 23.1 MB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 232 - Forks: 35

researchmm/MM-Diffusion
[CVPR'23] MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation
Language: Python - Size: 4.18 MB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 346 - Forks: 23

Oztobuzz/Vista
This is the official repository for the Vista dataset, a Vietnamese multimodal dataset containing more than 700,000 samples of conversations and images
Language: Python - Size: 1.79 MB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 13 - Forks: 0

SsGood/MMGL
Multi-modal Graph learning for Disease Prediction (IEEE Trans. on Medical imaging, TMI2022)
Language: Jupyter Notebook - Size: 10.7 MB - Last synced at: 12 months ago - Pushed at: about 1 year ago - Stars: 81 - Forks: 13

kyegomez/Gemini
The open source implementation of Gemini, the model by Google that will "eclipse ChatGPT"
Language: Python - Size: 679 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 354 - Forks: 40

mit-acl/deep_panther
Language: C++ - Size: 174 MB - Last synced at: 12 months ago - Pushed at: about 1 year ago - Stars: 44 - Forks: 10

kyegomez/Sophia
Effortless plug-and-play optimizer to cut model training costs by 50%. New optimizer that is 2x faster than Adam on LLMs.
Language: Python - Size: 536 KB - Last synced at: about 1 year ago - Pushed at: over 1 year ago - Stars: 359 - Forks: 25

OpenGVLab/LORIS
Long-Term Rhythmic Video Soundtracker, ICML2023
Language: Python - Size: 857 KB - Last synced at: about 1 year ago - Pushed at: over 1 year ago - Stars: 50 - Forks: 1

zjukg/AdaMF-MAT
[Paper][LREC-COLING 2024] Unleashing the Power of Imbalanced Modality Information for Multi-modal Knowledge Graph Completion
Language: Python - Size: 1.91 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 11 - Forks: 1

Lee-Gihun/MEDIAR
(NeurIPS 2022 CellSeg Challenge - 1st Winner) Open source code for "MEDIAR: Harmony of Data-Centric and Model-Centric for Multi-Modality Microscopy"
Language: Python - Size: 15 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 112 - Forks: 21

Ravi-Teja-konda/TunedLlavaDelights
Explore the rich flavors of Indian desserts with TunedLlavaDelights. Using LLaVA fine-tuning, our project unveils detailed nutritional profiles, taste notes, and optimal consumption times for beloved sweets. Dive into a fusion of AI innovation and culinary tradition.
Language: Python - Size: 43.3 MB - Last synced at: about 2 months ago - Pushed at: about 1 year ago - Stars: 1 - Forks: 0

Messi-Q/Cross-Modality-Bug-Detection
Cross-Modality Mutual Learning for Smart Contract Vulnerability Detection
Language: Python - Size: 54.8 MB - Last synced at: about 1 year ago - Pushed at: about 2 years ago - Stars: 18 - Forks: 2

anondo1969/FedSemiCovidDetector
Repository for the journal article, 'Federated Semi-Supervised Multi-Task Learning to Detect COVID-19 and Lungs Segmentation Marking Using Chest Radiography Images and Raspberry Pi Devices: An Internet of Medical Things Application', Mahbub Ul Alam, Rahim Rahmani. Sensors 21, no. 15: 5025, https://doi.org/10.3390/s21155025.
Language: Python - Size: 4.24 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

xufangzhi/MoCA
[Pattern Recognition] The implementation of MoCA
Language: Python - Size: 31.3 KB - Last synced at: 13 days ago - Pushed at: about 2 years ago - Stars: 12 - Forks: 1

wwweiwei/Pre-CoFactv2-AAAI-2023
Official Implementation for Pre-CoFactv2 (AAAI-23 DeFactify2.0 Workshop 1st Place)
Language: Python - Size: 10.8 MB - Last synced at: about 1 year ago - Pushed at: about 2 years ago - Stars: 5 - Forks: 1

jackyjsy/CVPR21Chal-SLR
This repo contains the official code of our work SAM-SLR which won the CVPR 2021 Challenge on Large Scale Signer Independent Isolated Sign Language Recognition.
Language: Python - Size: 51.9 MB - Last synced at: over 1 year ago - Pushed at: over 2 years ago - Stars: 174 - Forks: 45

anondo1969/thremaltimodal-covidetector
COVID-19 detection from thermal image and tabular medical data utilizing multi-modal machine learning, Mahbub Ul Alam, Jaakko Hollmén and Rahim Rahmani. IEEE 36th International Symposium on Computer-Based Medical Systems (CBMS), 2023, pp. 646-653.
Language: Python - Size: 12.8 MB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 0 - Forks: 0

anondo1969/FedSepsis
Repository for the article, 'FedSepsis: A Federated Multi-Modal Deep Learning-Based Internet of Medical Things Application for Early Detection of Sepsis from Electronic Health Records Using Raspberry Pi and Jetson Nano Devices', Mahbub Ul Alam, Rahim Rahmani. Sensors 23, no. 2: 970, https://doi.org/10.3390/s23020970.
Language: Jupyter Notebook - Size: 265 KB - Last synced at: almost 2 years ago - Pushed at: about 2 years ago - Stars: 1 - Forks: 1

ecom-research/ComposeAE 📦
Official code for WACV 2021 paper - Compositional Learning of Image-Text Query for Image Retrieval
Language: Python - Size: 672 KB - Last synced at: about 2 years ago - Pushed at: over 3 years ago - Stars: 49 - Forks: 13

sagahansson/aics-project
Final project for the course LT2318 Artificial Intelligence: Cognitive Systems. The project concerns multimodal hate speech detection in memes.
Language: TeX - Size: 9.6 MB - Last synced at: about 2 years ago - Pushed at: almost 4 years ago - Stars: 0 - Forks: 0
