GitHub topics: vision-language-models

Repositories

chensy618/SuperpixelCUB

Automated key point identification and description for Vision Transformers using vision-language models

Language: Jupyter Notebook - Size: 106 MB - Last synced at: about 9 hours ago - Pushed at: about 11 hours ago - Stars: 1 - Forks: 0

NishilBalar/Awesome-LVLM-Hallucination

Up-to-date curated list of state-of-the-art research work, papers & resources on hallucination in large vision-language models

Size: 238 KB - Last synced at: about 19 hours ago - Pushed at: about 21 hours ago - Stars: 147 - Forks: 8

drive-bench/toolkit

[ICCV 2025] Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives

Language: Python - Size: 14.5 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 93 - Forks: 3

OpenGVLab/PIIP

[NeurIPS 2024 Spotlight ⭐️ & TPAMI 2025] Parameter-Inverted Image Pyramid Networks (PIIP)

Language: Python - Size: 11.4 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 94 - Forks: 2

elkhouryk/RS-TransCLIP

[ICASSP 2025] Open-source code for the paper "Enhancing Remote Sensing Vision-Language Models for Zero-Shot Scene Classification"

Language: Python - Size: 171 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 57 - Forks: 2

robosense2025/track2

Track 2: Social Navigation

Size: 2.23 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 18 - Forks: 0

zli12321/Vision-Language-Models-Overview

A collection and survey of the latest vision-language model papers and model GitHub repositories. Continuously updated.

Size: 656 KB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 275 - Forks: 14

rahisenpai/CSE344-CV

Course Assignments for Computer Vision (CSE344) at IIITD, Winter'25

Language: Jupyter Notebook - Size: 312 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 0 - Forks: 0

stogiannidis/srbench

Source code for the Paper "Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models"

Language: Python - Size: 446 KB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 13 - Forks: 0

mvrl/RCME

[ICCV'25] Radial Cross-Modal embeddings

Language: Python - Size: 4.49 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 2 - Forks: 1

LYL1015/JarvisArt

JarvisArt: Liberating Human Artistic Creativity via an Intelligent Photo Retouching Agent

Language: JavaScript - Size: 49.6 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 569 - Forks: 20

taco-group/4KAgent

4KAgent: Agentic Any Image to 4K Super-Resolution

Size: 5.2 MB - Last synced at: 11 days ago - Pushed at: 12 days ago - Stars: 320 - Forks: 11

mghiasvand1/Awesome-VLM-Synthetic-Data

🔥 The first survey on bridging VLMs and synthetic data, for which I completed the entire process of reading 125 papers and writing the research paper in just 10 days.

Size: 78.1 KB - Last synced at: 17 days ago - Pushed at: 18 days ago - Stars: 9 - Forks: 0

hrlics/HoPE

HoPE: Hybrid of Position Embedding for Length Generalization in Vision-Language Models

Size: 1.98 MB - Last synced at: 18 days ago - Pushed at: 18 days ago - Stars: 11 - Forks: 1

BAAI-Agents/GPA-LM

This repo is a live list of papers on game playing and large multimodal models - "A Survey on Game Playing Agents and Large Models: Methods, Applications, and Challenges".

Size: 3.81 MB - Last synced at: 6 days ago - Pushed at: 11 months ago - Stars: 148 - Forks: 7

baaivision/EVE

EVE Series: Encoder-Free Vision-Language Models from BAAI

Language: Python - Size: 7.29 MB - Last synced at: 19 days ago - Pushed at: 19 days ago - Stars: 332 - Forks: 8

yu-rp/apiprompting

[ECCV 2024] API: Attention Prompting on Image for Large Vision-Language Models

Language: Python - Size: 8.63 MB - Last synced at: 20 days ago - Pushed at: 10 months ago - Stars: 96 - Forks: 5

oadamharoon/text2nav

Minimalist framework for language-guided robot navigation using frozen vision-language embeddings. Achieves 74% success rate without fine-tuning. RSS 2025 Workshop paper.

Language: Python - Size: 182 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

billpsomas/efficient-probing

This repo contains the official implementation of the paper "Attention, Please! Revisiting Attentive Probing for Masked Image Modeling"

Language: Python - Size: 123 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 5 - Forks: 0

jusiro/FCA

[IPMI'25] Full conformal adaptation. Adapting medical vision-language models with reliability guarantees.

Language: Python - Size: 2.76 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

GeoPixel: A Pixel Grounding Large Multimodal Model for Remote Sensing is specifically developed for high-resolution remote sensing image analysis, offering advanced multi-target pixel grounding capabilities.

Language: Python - Size: 29.8 MB - Last synced at: about 1 month ago - Pushed at: 2 months ago - Stars: 96 - Forks: 9

prism-visual-spatial-intelligence/Awesome-Visual-Spatial-Reasoning

This is a project about visual spatial reasoning.

Language: Shell - Size: 116 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 16 - Forks: 0

lezhang7/SAIL

[CVPR 2025 Highlight] Official PyTorch codebase for the paper "Assessing and Learning Alignment of Unimodal Vision and Language Models"

Language: Jupyter Notebook - Size: 64.7 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 39 - Forks: 3

D2I-Group/awesome-vision-time-series

This is an official repository for "Harnessing Vision Models for Time Series Analysis: A Survey".

Language: Python - Size: 3.71 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 26 - Forks: 1

zohaibterminator/MediSense

An AI medical assistant

Language: TypeScript - Size: 498 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

danelpeng/Awesome-Continual-Leaning-with-PTMs

This is a curated list of "Continual Learning with Pretrained Models" research.

Size: 351 KB - Last synced at: 27 days ago - Pushed at: about 2 months ago - Stars: 18 - Forks: 0

fork123aniket/Multi-Round-VLM-powered-Multimodal-Conversational-AI-Navigation-Bot

Streamlit App Combining Vision, Language, and Audio AI Models

Language: Python - Size: 18.6 KB - Last synced at: 22 days ago - Pushed at: 6 months ago - Stars: 3 - Forks: 0

AI-14/micar-vl-moe

[IJCNN 2025] [Official code] - MicarVLMoE: A modern gated cross-aligned vision-language mixture of experts model for medical image captioning and report generation

Language: Python - Size: 1.17 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 2 - Forks: 0

VectorInstitute/VLDBench

VLDBench: A large-scale benchmark for evaluating Vision-Language Models (VLMs) and Large Language Models (LLMs) on multimodal disinformation detection.

Language: Python - Size: 259 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 3 - Forks: 0

HeathSun/AdaSeg4MR

An innovative mixed reality (MR) pipeline that integrates real-time instance segmentation and speech-guided natural language interaction. It aims to create a more intuitive and immersive experience for users interacting with virtual and real-world environments.

Language: Python - Size: 151 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

lmb-freiburg/understanding-clip-ood

Official code for the paper: "When and How Does CLIP Enable Domain and Compositional Generalization?" (ICML 2025 Spotlight)

Language: Python - Size: 13.2 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

amirivojdan/DSE-697

DSE 697 - Large Language Modeling & Gen AI

Language: Jupyter Notebook - Size: 66.7 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

baaivision/DenseFusion

DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception

Language: Python - Size: 18.1 MB - Last synced at: 2 months ago - Pushed at: 8 months ago - Stars: 145 - Forks: 1

alexdrnd/micar-vl-moe

[IJCNN 2025] [Official code] - MicarVLMoE: A modern gated cross-aligned vision-language mixture of experts model for medical image captioning and report generation

Language: Python - Size: 935 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

HySonLab/Design2Code

A large language model combined with a large vision model for generating code from a design sketch.

Language: Python - Size: 270 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 3 - Forks: 0

jiayuww/SpatialEval

[NeurIPS'24] SpatialEval: a benchmark to evaluate spatial reasoning abilities of MLLMs and LLMs

Language: Python - Size: 3.95 MB - Last synced at: 4 months ago - Pushed at: 6 months ago - Stars: 23 - Forks: 0

snap-research/MyVLM

Official Implementation for "MyVLM: Personalizing VLMs for User-Specific Queries" (ECCV 2024)

Language: Python - Size: 47.3 MB - Last synced at: 3 months ago - Pushed at: about 1 year ago - Stars: 167 - Forks: 11

AaltoML/BayesVLM

Code for Post-hoc Probabilistic Vision-Language Models

Language: Python - Size: 70.5 MB - Last synced at: 4 months ago - Pushed at: 5 months ago - Stars: 5 - Forks: 1

auniquesun/Point-Cache

[CVPR 2025] Official implementation of the paper "Point-Cache: Test-time Dynamic and Hierarchical Cache for Robust and Generalizable Point Cloud Analysis"

Language: Jupyter Notebook - Size: 21.7 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 1 - Forks: 0

s-vco/s-vco

Symmetrical Visual Contrastive Optimization: Aligning Vision-Language Models with Minimal Contrastive Images

Language: Python - Size: 24.1 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 5 - Forks: 0

Alchemist-Aloha/screengpt

ScreenGPT is a project that leverages an LLM to understand screen content. It provides responses based on user-defined prompts and the screen content. You need an OpenAI-compatible API key to use this software.

Language: Python - Size: 1.99 MB - Last synced at: 4 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

sitamgithub-MSIT/paligemma2-mix-litserve

Leverage PaliGemma 2 mix model variant capabilities using LitServe.

Language: Python - Size: 768 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

sitamgithub-MSIT/paligemma2-docci-litserve

Leverage PaliGemma 2's DOCCI fine-tuned variant capabilities using LitServe.

Language: Python - Size: 468 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

akskuchi/dHM-visual-storytelling

Not (yet) the whole story: Evaluating Visual Storytelling Requires More than Measuring Coherence, Grounding, and Repetition – EMNLP 2024 (Findings)

Language: Python - Size: 5.05 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 2 - Forks: 0

XiangshengGu/ActionVLM

This project explores the use of large foundational vision-language models in reinforcement learning, where the models function as agents, reward functions, or reward function code generators in unseen environments given a state and a goal.

Size: 22.3 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 1 - Forks: 0

Roni7128/NTU-2024Fall-DLCV

CommE5052: Deep Learning for Computer Vision (Prof. Frank Wang)

Size: 1.95 KB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0

ytaek-oh/awesome-vl-compositionality

Awesome Vision-Language Compositionality, a comprehensive curation of research papers in literature.

Size: 126 KB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 13 - Forks: 1

sitamgithub-MSIT/paligemma-docci

Image Captioning with PaliGemma 2 Vision Language Model.

Language: Python - Size: 1.26 MB - Last synced at: 30 days ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0

sitamgithub-MSIT/BioMedAI

Leverage Dragonfly-Med's capabilities using LitServe.

Language: Jupyter Notebook - Size: 9.84 MB - Last synced at: 4 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0

berlin0308/NTU-2024Fall-DLCV

CommE5052: Deep Learning for Computer Vision (Prof. Frank Wang)

Language: Jupyter Notebook - Size: 41.6 MB - Last synced at: 4 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0

chu0802/SnD

This is an official implementation of our work, Select and Distill: Selective Dual-Teacher Knowledge Transfer for Continual Learning on Vision-Language Models, accepted to ECCV'24

Language: Python - Size: 26.4 KB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 8 - Forks: 1

sitamgithub-MSIT/PicQ

PicQ: Demo for MiniCPM-o 2.6 to answer questions about images using natural language.

Language: Python - Size: 4.74 MB - Last synced at: 3 months ago - Pushed at: 6 months ago - Stars: 4 - Forks: 1

sukanyabag/Finetuning-Qwen2-7B-VQA-on-Radiology-Scans

This repository fine-tunes the Qwen2 7B VLM for visual question answering (VQA) on various kinds of patient radiology images and medical scans.

Language: Jupyter Notebook - Size: 286 KB - Last synced at: 5 months ago - Pushed at: 7 months ago - Stars: 0 - Forks: 0

sitamgithub-MSIT/VidiQA

VidiQA: Demo for MiniCPM-V 2.6 to answer questions about videos using natural language.

Language: Python - Size: 6.99 MB - Last synced at: 3 months ago - Pushed at: 8 months ago - Stars: 2 - Forks: 0

Shengwei-Peng/TOCFL-MultiBench

TOCFL-MultiBench: A multimodal benchmark for evaluating Chinese language proficiency using text, audio, and visual data with deep learning. Features Selective Token Constraint Mechanism (STCM) for enhanced decoding stability.

Language: Python - Size: 170 KB - Last synced at: 9 days ago - Pushed at: 7 months ago - Stars: 3 - Forks: 0

YuweiYin/Code4Chart

C4C: Does Visualization Code Improve Chart Understanding for Vision-Language Models?

Language: Python - Size: 13 MB - Last synced at: 8 months ago - Pushed at: 8 months ago - Stars: 0 - Forks: 0

andrewliao11/Q-Spatial-Bench-code

Official repo of the paper "Reasoning Paths with Reference Objects Elicit Quantitative Spatial Reasoning in Large Vision-Language Models"

Language: Python - Size: 184 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

sled-group/COMFORT

Repo for the paper "Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference Under Ambiguities"

Language: Python - Size: 33.8 MB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 4 - Forks: 0

vanillaer/CPL-ICML2024

[ICML 2024] Official code repo for the ICML 2024 paper "Candidate Pseudolabel Learning: Enhancing Vision-Language Models by Prompt Tuning with Unlabeled Data"

Language: Python - Size: 5.07 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 5 - Forks: 0

erfanshayegani/Jailbreak-In-Pieces

Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models - 🔥 ICLR 2024 Spotlight - 🏆 Best Paper Award SoCal NLP 2023

Language: Python - Size: 1.9 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

Ibtissam-SAADI/CLIVP-FER

Facial Expression Recognition using vision language models (VLMs)

Language: Python - Size: 550 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

Related Keywords

vision-language-models (61), deep-learning (12), large-language-models (11), computer-vision (8), generative-ai (7), multimodal-large-language-models (7), clip (6), transformers (6), llm (6), python (6), vlm (5), spatial-reasoning (5), multimodal (4), image-captioning (4), mllm (4), fastapi (4), large-multimodal-models (3), lightning-ai (3), foundation-models (3), question-answering (3), litserve (3), machine-learning (3), gradio (3), llava (3), mixture-of-experts (3), huggingface-transformers (3), diffusion-models (3), large-vision-language-models (3), huggingface-spaces (3), paligemma (3), image-processing (2), multimodal-deep-learning (2), agent (2), gaussian-splatting (2), multi-concept-personalization (2), alignment-strategies (2), chest-xrays (2), ct-scans (2), feature-pyramid-network (2), continual-learning (2), medical-image-captioning (2), medical-imaging (2), mri-images (2), radiology-report-generation (2), ai-safety (2), nlp (2), embodied-ai (2), natural-language-processing (2), vision-language-model (2), multimodal-ai (2), awesome (2), artificial-intelligence (2), textual-inversion (2), internvl (2), instance-segmentation (2), gradio-interface (2), gpt-4v (2), claude (2), minicpm-v (2), object-detection (2), vision-transformer (2), image-classification (2), remote-sensing (2), multilingual-models (2), mixed-reality (1), llama (1), lora (1), groq-api (1), ai-agent (1), vlms (1), quantization-aware-training (1), facial-expression-recognition (1), healthcare (1), text-to-speech (1), voice-control (1), interpretability (1), mechanistic-interpretability (1), ood-generalization (1), robustness (1), llm-finetuning (1), image-descriptions (1), visual-perception (1), code-generation (1), multimodal-learning (1), vision-language (1), vision-language-learning (1), vision-language-navigation (1), vision-language-transformer (1), driver-s-emotion (1), contrastive-language--image-pretraining (1), multi-modal-models (1), cross-modality-safety-alignment (1), alignment (1), unlabeled-data (1), pseudolabels (1), code-language-model (1), chart-understanding (1), automated-data-analysis (1), chinese-language (1), benchmark (1)