GitHub topics: vision-language-models
zli12321/Vision-Language-Models-Overview
A collection and survey of vision-language model papers and model GitHub repositories
Size: 555 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 239 - Forks: 12

oadamharoon/text2nav
Minimalist framework for language-guided robot navigation using frozen vision-language embeddings. Achieves 74% success rate without fine-tuning. RSS 2025 Workshop paper.
Language: Python - Size: 182 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 1 - Forks: 0

billpsomas/efficient-probing
This repo contains the official implementation of the paper "Attention, Please! Revisiting Attentive Probing for Masked Image Modeling"
Language: Python - Size: 123 KB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 5 - Forks: 0

jusiro/FCA
[IPMI'25] Full conformal adaptation. Adapting medical vision-language models with reliability guarantees.
Language: Python - Size: 2.76 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 0 - Forks: 0

mbzuai-oryx/GeoPixel
GeoPixel: A Pixel Grounding Large Multimodal Model for Remote Sensing is specifically developed for high-resolution remote sensing image analysis, offering advanced multi-target pixel grounding capabilities.
Language: Python - Size: 29.8 MB - Last synced at: 10 days ago - Pushed at: about 1 month ago - Stars: 96 - Forks: 9

prism-visual-spatial-intelligence/Awesome-Visual-Spatial-Reasoning
This is a project about visual spatial reasoning.
Language: Shell - Size: 116 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 16 - Forks: 0

BAAI-Agents/GPA-LM
This repo is a live list of papers on game playing and large multimodal models - "A Survey on Game Playing Agents and Large Models: Methods, Applications, and Challenges".
Size: 3.81 MB - Last synced at: 2 days ago - Pushed at: 10 months ago - Stars: 148 - Forks: 7

lezhang7/SAIL
[CVPR 2025 Highlight] Official PyTorch codebase for the paper "Assessing and Learning Alignment of Unimodal Vision and Language Models"
Language: Jupyter Notebook - Size: 64.7 MB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 39 - Forks: 3

robosense2025/track2
Track 2: Social Navigation
Size: 4.88 KB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 1 - Forks: 0

D2I-Group/awesome-vision-time-series
This is an official repository for "Harnessing Vision Models for Time Series Analysis: A Survey".
Language: Python - Size: 3.71 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 26 - Forks: 1

zohaibterminator/MediSense
An AI medical assistant
Language: TypeScript - Size: 498 KB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 0 - Forks: 0

OpenGVLab/PIIP
[NeurIPS 2024 Spotlight ⭐️] Parameter-Inverted Image Pyramid Networks (PIIP)
Language: Python - Size: 11.7 MB - Last synced at: 22 days ago - Pushed at: about 1 month ago - Stars: 91 - Forks: 2

Pradeep9167/Spatial-MLLM
Spatial-MLLM enhances multi-language learning models by integrating visual-based spatial intelligence. This project aims to improve understanding and processing of spatial data, making it a valuable resource for researchers and developers. 🌍🚀
Language: Python - Size: 18.4 MB - Last synced at: 25 days ago - Pushed at: 25 days ago - Stars: 0 - Forks: 0

danelpeng/Awesome-Continual-Leaning-with-PTMs
This is a curated list of "Continual Learning with Pretrained Models" research.
Size: 351 KB - Last synced at: 11 days ago - Pushed at: about 1 month ago - Stars: 18 - Forks: 0

elkhouryk/RS-TransCLIP
[ICASSP 2025] Open-source code for the paper "Enhancing Remote Sensing Vision-Language Models for Zero-Shot Scene Classification"
Language: Python - Size: 171 MB - Last synced at: 26 days ago - Pushed at: 26 days ago - Stars: 54 - Forks: 2

chensy618/SuperpixelCUB
Automated key point identification and description for Vision Transformers using vision-language models
Language: Jupyter Notebook - Size: 96.6 MB - Last synced at: 28 days ago - Pushed at: 28 days ago - Stars: 0 - Forks: 0

AI-14/micar-vl-moe
[IJCNN 2025] [Official code] - MicarVLMoE: A modern gated cross-aligned vision-language mixture of experts model for medical image captioning and report generation
Language: Python - Size: 1.17 MB - Last synced at: 28 days ago - Pushed at: 28 days ago - Stars: 2 - Forks: 0

VectorInstitute/VLDBench
VLDBench: A large-scale benchmark for evaluating Vision-Language Models (VLMs) and Large Language Models (LLMs) on multimodal disinformation detection.
Language: Python - Size: 259 MB - Last synced at: 30 days ago - Pushed at: 30 days ago - Stars: 3 - Forks: 0

HeathSun/AdaSeg4MR
An innovative mixed reality (MR) pipeline that integrates real-time instance segmentation and speech-guided natural language interaction. It aims to create a more intuitive and immersive experience for users interacting with virtual and real-world environments.
Language: Python - Size: 151 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

lmb-freiburg/understanding-clip-ood
Official code for the paper: "When and How Does CLIP Enable Domain and Compositional Generalization?" (ICML 2025 Spotlight)
Language: Python - Size: 13.2 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

yu-rp/apiprompting
[ECCV 2024] API: Attention Prompting on Image for Large Vision-Language Models
Language: Python - Size: 8.63 MB - Last synced at: 23 days ago - Pushed at: 9 months ago - Stars: 88 - Forks: 5

amirivojdan/DSE-697
DSE 697 - Large Language Modeling & Gen AI
Language: Jupyter Notebook - Size: 66.7 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

baaivision/EVE
EVE Series: Encoder-Free Vision-Language Models from BAAI
Language: Python - Size: 6.95 MB - Last synced at: about 1 month ago - Pushed at: 4 months ago - Stars: 326 - Forks: 8

NishilBalar/Awesome-LVLM-Hallucination
Up-to-date curated list of state-of-the-art research work, papers, and resources on hallucination in large vision-language models
Size: 189 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 125 - Forks: 6

baaivision/DenseFusion
DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception
Language: Python - Size: 18.1 MB - Last synced at: about 1 month ago - Pushed at: 7 months ago - Stars: 145 - Forks: 1

alexdrnd/micar-vl-moe
[IJCNN 2025] [Official code] - MicarVLMoE: A modern gated cross-aligned vision-language mixture of experts model for medical image captioning and report generation
Language: Python - Size: 935 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

HySonLab/Design2Code
A large language model combined with a large vision model for generating code from a design sketch.
Language: Python - Size: 270 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 3 - Forks: 0

drive-bench/toolkit
Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives
Language: Python - Size: 14.4 MB - Last synced at: 2 months ago - Pushed at: 4 months ago - Stars: 66 - Forks: 1

jiayuww/SpatialEval
[NeurIPS'24] SpatialEval: a benchmark to evaluate spatial reasoning abilities of MLLMs and LLMs
Language: Python - Size: 3.95 MB - Last synced at: 3 months ago - Pushed at: 5 months ago - Stars: 23 - Forks: 0

snap-research/MyVLM
Official Implementation for "MyVLM: Personalizing VLMs for User-Specific Queries" (ECCV 2024)
Language: Python - Size: 47.3 MB - Last synced at: 3 months ago - Pushed at: 12 months ago - Stars: 167 - Forks: 11

fork123aniket/Multi-Round-VLM-powered-Multimodal-Conversational-AI-Navigation-Bot
Streamlit App Combining Vision, Language, and Audio AI Models
Language: Python - Size: 18.6 KB - Last synced at: 2 months ago - Pushed at: 5 months ago - Stars: 3 - Forks: 0

AaltoML/BayesVLM
Code for Post-hoc Probabilistic Vision-Language Models
Language: Python - Size: 70.5 MB - Last synced at: 3 months ago - Pushed at: 4 months ago - Stars: 5 - Forks: 1

auniquesun/Point-Cache
[CVPR 2025] Official implementation of the paper "Point-Cache: Test-time Dynamic and Hierarchical Cache for Robust and Generalizable Point Cloud Analysis"
Language: Jupyter Notebook - Size: 21.7 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 1 - Forks: 0

s-vco/s-vco
Symmetrical Visual Contrastive Optimization: Aligning Vision-Language Models with Minimal Contrastive Images
Language: Python - Size: 24.1 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 5 - Forks: 0

Alchemist-Aloha/screengpt
ScreenGPT is a project that uses an LLM to understand screen content. It responds based on user-defined prompts and the screen content. You need an OpenAI-compatible API key to use this software.
Language: Python - Size: 1.99 MB - Last synced at: 3 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

sitamgithub-MSIT/paligemma2-mix-litserve
Leverage PaliGemma 2 mix model variant capabilities using LitServe.
Language: Python - Size: 768 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

sitamgithub-MSIT/paligemma2-docci-litserve
Leverage PaliGemma 2's DOCCI fine-tuned variant capabilities using LitServe.
Language: Python - Size: 468 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

akskuchi/dHM-visual-storytelling
Not (yet) the whole story: Evaluating Visual Storytelling Requires More than Measuring Coherence, Grounding, and Repetition – EMNLP 2024 (Findings)
Language: Python - Size: 5.05 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 2 - Forks: 0

XiangshengGu/ActionVLM
This project explores the use of large foundational vision-language models in reinforcement learning, where the models function as agents, reward functions, or reward function code generators in unseen environments given a state and a goal.
Size: 22.3 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 1 - Forks: 0

Roni7128/NTU-2024Fall-DLCV
CommE5052: Deep Learning for Computer Vision (Prof. Frank Wang)
Size: 1.95 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

ytaek-oh/awesome-vl-compositionality
Awesome Vision-Language Compositionality, a comprehensive curated list of research papers from the literature.
Size: 126 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 13 - Forks: 1

sitamgithub-MSIT/paligemma-docci
Image Captioning with PaliGemma 2 Vision Language Model.
Language: Python - Size: 1.26 MB - Last synced at: about 7 hours ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

sitamgithub-MSIT/BioMedAI
Leverage Dragonfly-Med's capabilities using LitServe.
Language: Jupyter Notebook - Size: 9.84 MB - Last synced at: 3 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

berlin0308/NTU-2024Fall-DLCV
CommE5052: Deep Learning for Computer Vision (Prof. Frank Wang)
Language: Jupyter Notebook - Size: 41.6 MB - Last synced at: 3 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

chu0802/SnD
This is an official implementation of our work, Select and Distill: Selective Dual-Teacher Knowledge Transfer for Continual Learning on Vision-Language Models, accepted to ECCV'24
Language: Python - Size: 26.4 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 8 - Forks: 1

sitamgithub-MSIT/PicQ
PicQ: Demo for MiniCPM-o 2.6 to answer questions about images using natural language.
Language: Python - Size: 4.74 MB - Last synced at: about 2 months ago - Pushed at: 5 months ago - Stars: 4 - Forks: 1

sukanyabag/Finetuning-Qwen2-7B-VQA-on-Radiology-Scans
This repository fine-tunes the Qwen2 7B VLM for visual question answering (VQA) on various kinds of patient radiology and medical scans.
Language: Jupyter Notebook - Size: 286 KB - Last synced at: 4 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0

sitamgithub-MSIT/VidiQA
VidiQA: Demo for MiniCPM-V 2.6 to answer questions about videos using natural language.
Language: Python - Size: 6.99 MB - Last synced at: about 2 months ago - Pushed at: 7 months ago - Stars: 2 - Forks: 0

Shengwei-Peng/TOCFL-MultiBench
TOCFL-MultiBench: A multimodal benchmark for evaluating Chinese language proficiency using text, audio, and visual data with deep learning. Features Selective Token Constraint Mechanism (STCM) for enhanced decoding stability.
Language: Python - Size: 170 KB - Last synced at: 6 days ago - Pushed at: 6 months ago - Stars: 3 - Forks: 0

YuweiYin/Code4Chart
C4C: Does Visualization Code Improve Chart Understanding for Vision-Language Models?
Language: Python - Size: 13 MB - Last synced at: 7 months ago - Pushed at: 7 months ago - Stars: 0 - Forks: 0

andrewliao11/Q-Spatial-Bench-code
Official repo of the paper "Reasoning Paths with Reference Objects Elicit Quantitative Spatial Reasoning in Large Vision-Language Models"
Language: Python - Size: 184 KB - Last synced at: 8 months ago - Pushed at: 8 months ago - Stars: 0 - Forks: 0

sled-group/COMFORT
Repo for the paper "Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference Under Ambiguities"
Language: Python - Size: 33.8 MB - Last synced at: 8 months ago - Pushed at: 8 months ago - Stars: 4 - Forks: 0

vanillaer/CPL-ICML2024
[ICML 2024] Official code repo for the ICML 2024 paper "Candidate Pseudolabel Learning: Enhancing Vision-Language Models by Prompt Tuning with Unlabeled Data"
Language: Python - Size: 5.07 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 5 - Forks: 0

erfanshayegani/Jailbreak-In-Pieces
Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models - 🔥 ICLR 2024 Spotlight - 🏆 Best Paper Award SoCal NLP 2023
Language: Python - Size: 1.9 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

Ibtissam-SAADI/CLIVP-FER
Facial Expression Recognition using vision language models (VLMs)
Language: Python - Size: 550 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0
