GitHub topics: vision-language-model
haotian-liu/LLaVA
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
Language: Python - Size: 13.4 MB - Last synced at: 34 minutes ago - Pushed at: 9 months ago - Stars: 22,386 - Forks: 2,467

Blaizzy/mlx-vlm
MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX.
Language: Python - Size: 33.8 MB - Last synced at: about 1 hour ago - Pushed at: about 1 hour ago - Stars: 1,232 - Forks: 117
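
A minimal sketch of running local VLM inference with MLX-VLM on Apple silicon. The checkpoint id and image path are illustrative, and the exact function signatures can differ between mlx-vlm releases, so treat the details as assumptions rather than the package's definitive API.

```python
# Minimal sketch, assuming a recent mlx-vlm release; names/argument order may vary.
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"  # illustrative MLX checkpoint
model, processor = load(model_path)
config = load_config(model_path)

images = ["example.jpg"]  # hypothetical local image
prompt = apply_chat_template(processor, config, "Describe this image.", num_images=len(images))

# Return type (plain string vs. result object) depends on the release.
output = generate(model, processor, prompt, images)
print(output)
```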

Blacksujit/Deep-Learning-Specialization-Repo
This repo contains my neural network learning work with TensorFlow, covering the high-level deep learning concepts I am studying, with project implementations.
Language: Jupyter Notebook - Size: 20.2 MB - Last synced at: about 2 hours ago - Pushed at: about 3 hours ago - Stars: 0 - Forks: 0

Rajadhopiya/Gender-Classifier-Mini
Gender-Classifier-Mini is an image classification vision-language encoder model fine-tuned from google/siglip2-base-patch16-224 for a single-label classification task. It is designed to classify images based on gender using the SiglipForImageClassification architecture.
Language: Python - Size: 12.7 KB - Last synced at: about 11 hours ago - Pushed at: about 12 hours ago - Stars: 0 - Forks: 0
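
This entry (and the similar SiglipForImageClassification fine-tunes listed further down) follows the standard transformers image-classification pattern. A minimal sketch, assuming the fine-tuned weights are published as a Hugging Face checkpoint; the model id below is a placeholder (the base SigLIP2 encoder), not the actual classifier.

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, SiglipForImageClassification

# Placeholder id: substitute the actual fine-tuned classifier checkpoint.
model_id = "google/siglip2-base-patch16-224"
processor = AutoImageProcessor.from_pretrained(model_id)
model = SiglipForImageClassification.from_pretrained(model_id)

image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

pred = logits.argmax(dim=-1).item()
print(model.config.id2label.get(pred, pred))
```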

StarlightSearch/EmbedAnything
Production-ready Inference, Ingestion and Indexing built in Rust 🦀
Language: Rust - Size: 36.7 MB - Last synced at: about 14 hours ago - Pushed at: about 14 hours ago - Stars: 551 - Forks: 49

dvlab-research/MGM
Official repo for "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models"
Language: Python - Size: 57.1 MB - Last synced at: about 2 hours ago - Pushed at: about 1 year ago - Stars: 3,272 - Forks: 281

TongUI-agent/TongUI-agent
Release of code, datasets and model for our work TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials
Language: HTML - Size: 3.69 MB - Last synced at: about 19 hours ago - Pushed at: about 20 hours ago - Stars: 7 - Forks: 2

MrAlonso9/Hand-Gesture-2-Robot
Hand-Gesture-2-Robot is an image classification vision-language encoder model fine-tuned from google/siglip2-base-patch16-224 for a single-label classification task. It is designed to recognize hand gestures and map them to specific robot commands using the SiglipForImageClassification architecture.
Language: Python - Size: 12.7 KB - Last synced at: about 23 hours ago - Pushed at: about 24 hours ago - Stars: 0 - Forks: 0

akusayudodograu/Agentic-RAG-Story-Generation-with-Multimodal-GenAI
Multimodal Agentic GenAI Workflow – Seamlessly blends retrieval and generation for intelligent storytelling
Size: 1000 Bytes - Last synced at: about 23 hours ago - Pushed at: 1 day ago - Stars: 8 - Forks: 1

Wisimaji/qwe
QWE is a lightweight and efficient command-line tool designed for quick and easy text manipulation tasks. It offers a variety of functions such as searching, replacing, and formatting text, making it a versatile tool for developers and data analysts alike.
Size: 1000 Bytes - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 0 - Forks: 0

manufactai/finetuning-cookbook
A collection of practical examples and tutorials for fine-tuning large language models using Factory. Includes Docker images, Jupyter notebooks, and utility scripts for easy model training and deployment.
Language: Jupyter Notebook - Size: 1.54 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 1 - Forks: 0

liupei101/VLSA
[ICLR 2025] Interpretable Vision-Language Survival Analysis with Ordinal Inductive Bias for Computational Pathology
Language: Jupyter Notebook - Size: 12.2 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 36 - Forks: 3

Jl-wei/guing
A mobile GUI search engine using a vision-language model
Language: Python - Size: 16 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 12 - Forks: 1

illuin-tech/colpali
The code used to train and run inference with the ColVision models, e.g. ColPali, ColQwen2, and ColSmol.
Language: Python - Size: 796 KB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 1,795 - Forks: 153
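
A minimal sketch of the ColVision retrieval flow (embed document page images and text queries separately, then score with late interaction). Class and method names follow the colpali-engine package; the checkpoint id and file names are illustrative.

```python
import torch
from PIL import Image
from colpali_engine.models import ColPali, ColPaliProcessor

model_name = "vidore/colpali-v1.2"  # illustrative checkpoint
model = ColPali.from_pretrained(model_name).eval()
processor = ColPaliProcessor.from_pretrained(model_name)

pages = [Image.open("page_1.png"), Image.open("page_2.png")]  # document page images
queries = ["What was the 2023 revenue?"]

with torch.no_grad():
    page_emb = model(**processor.process_images(pages))
    query_emb = model(**processor.process_queries(queries))

# Late-interaction (MaxSim) scoring: one relevance score per (query, page) pair.
scores = processor.score_multi_vector(query_emb, page_emb)
print(scores)
```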

yu-rp/apiprompting
[ECCV 2024] API: Attention Prompting on Image for Large Vision-Language Models
Language: Python - Size: 8.63 MB - Last synced at: 2 days ago - Pushed at: 7 months ago - Stars: 87 - Forks: 5

runjtu/vpr-arxiv-daily Fork of Vincentqyw/cv-arxiv-daily
Automatically update Visual Place Recognition papers daily using GitHub Actions (updated every 12 hours)
Language: Python - Size: 25 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 2 - Forks: 0

SuyogKamble/simpleVLM
Building a simple VLM: implementing LLaMA-SmolLM2 from scratch plus the SigLIP2 vision model. KV-caching is supported and implemented from scratch as well.
Language: Jupyter Notebook - Size: 7.33 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0

gokayfem/awesome-vlm-architectures
Famous Vision Language Models and Their Architectures
Language: Markdown - Size: 2.26 MB - Last synced at: 3 days ago - Pushed at: 2 months ago - Stars: 804 - Forks: 42

linzhiqiu/t2v_metrics
Evaluating text-to-image/video/3D models with VQAScore
Language: Python - Size: 197 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 293 - Forks: 21
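
A minimal sketch of scoring a generated image against its prompt with VQAScore; the constructor and call pattern mirror the repo's documented usage, and the model name is one it advertises, so treat the specifics as assumptions.

```python
import t2v_metrics

# Model name follows the repo's examples; smaller backbones may also be available.
vqa_score = t2v_metrics.VQAScore(model="clip-flant5-xxl")

# Higher scores mean the image is judged more faithful to the text prompt.
scores = vqa_score(images=["generated.png"], texts=["a photo of a dog chasing a ball"])
print(scores)
```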

thisisiron/LLaVA-Pool
🌋 A flexible framework for training and configuring Vision-Language Models
Language: Python - Size: 3.09 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 1 - Forks: 0

Event-AHU/Medical_Image_Analysis
Foundation-model-based medical image analysis
Language: Python - Size: 28.3 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 129 - Forks: 4

YanNeu/RePOPE
Relabeling of the POPE benchmark
Language: Python - Size: 10.2 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 4 - Forks: 0

jagennath-hari/SpatialFusion-LM
SpatialFusion-LM is a real-time spatial reasoning framework that combines neural depth, 3D reconstruction, and language-driven scene understanding.
Language: Python - Size: 84.9 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 1 - Forks: 0

AIDC-AI/Parrot
🎉 The code repository for "Parrot: Multilingual Visual Instruction Tuning" in PyTorch.
Language: Python - Size: 25.2 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 36 - Forks: 1

zubair-irshad/Awesome-Robotics-3D
A curated list of 3D Vision papers relating to Robotics domain in the era of large models i.e. LLMs/VLMs, inspired by awesome-computer-vision, including papers, codes, and related websites
Size: 730 KB - Last synced at: 4 days ago - Pushed at: 6 months ago - Stars: 687 - Forks: 35

Gumpest/SparseVLMs
[ICML'25] Official implementation of paper "SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference".
Language: Python - Size: 5.25 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 95 - Forks: 7

PRITHIVSAKTHIUR/SigLIP2-MultiDomain-App
SigLIP2 is a vision-language encoder model fine-tuned from google/siglip2-base-patch16-224
Language: Python - Size: 0 Bytes - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 1 - Forks: 0

2U1/Gemma3-Finetune
An open-source implementation for the Gemma3 series by Google.
Language: Python - Size: 88.9 KB - Last synced at: 1 day ago - Pushed at: 7 days ago - Stars: 22 - Forks: 4

NVlabs/describe-anything
Implementation for Describe Anything: Detailed Localized Image and Video Captioning
Language: Python - Size: 66 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 839 - Forks: 37

KejiaZhang-Robust/VAP
Poison as Cure: Visual Noise for Mitigating Object Hallucinations in LVMs
Language: Python - Size: 33.5 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 22 - Forks: 0

SkalskiP/vlms-zero-to-hero
This series will take you on a journey from the fundamentals of NLP and Computer Vision to the cutting edge of Vision-Language Models.
Language: Jupyter Notebook - Size: 338 KB - Last synced at: 5 days ago - Pushed at: 3 months ago - Stars: 1,062 - Forks: 97

llm-jp/awesome-japanese-llm
Overview of Japanese LLMs (日本語LLMまとめ)
Language: TypeScript - Size: 11.7 MB - Last synced at: 5 days ago - Pushed at: 7 days ago - Stars: 1,154 - Forks: 33

DAILtech/LLaVA-Deploy-Guide
💻 Tutorial for deploying LLaVA (Large Language & Vision Assistant) on Ubuntu + CUDA – step-by-step guide with CLI & web UI.
Language: Python - Size: 167 KB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 11 - Forks: 4

MrGiovanni/RadGPT
AbdomenAtlas 3.0 (9,262 CT volumes + medical reports). These “superhuman” reports are more accurate, detailed, standardized, and generated faster than traditional human-made reports.
Language: Python - Size: 11.5 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 67 - Forks: 1

jingyi0000/R1-VL
R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization
Language: Python - Size: 2.3 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 276 - Forks: 0

jingyi0000/VLM_survey
Collection of AWESOME vision-language models for vision tasks
Size: 370 KB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 2,698 - Forks: 205

2U1/SmolVLM-Finetune
An open-source implementation for fine-tuning SmolVLM.
Language: Python - Size: 85 KB - Last synced at: 1 day ago - Pushed at: 7 days ago - Stars: 26 - Forks: 5

MiniMax-AI/MiniMax-01
The official repo of MiniMax-Text-01 and MiniMax-VL-01, large-language-model & vision-language-model based on Linear Attention
Language: Python - Size: 9.17 MB - Last synced at: 7 days ago - Pushed at: 27 days ago - Stars: 2,565 - Forks: 192

yshinya6/clip-refine
Code repository for "Post-pre-training for Modality Alignment in Vision-Language Foundation Models" (CVPR2025)
Language: Python - Size: 49.8 KB - Last synced at: 1 day ago - Pushed at: 7 days ago - Stars: 12 - Forks: 0

PKU-Alignment/align-anything
Align Anything: Training All-modality Model with Feedback
Language: Jupyter Notebook - Size: 108 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 3,520 - Forks: 408

NVlabs/PS3
Scaling Vision Pre-Training to 4K Resolution
Size: 5.07 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 154 - Forks: 6

2U1/Llama3.2-Vision-Finetune
An open-source implementation for fine-tuning the Llama3.2-Vision series by Meta.
Language: Python - Size: 89.8 KB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 155 - Forks: 23

IrohXu/Awesome-Multimodal-LLM-Autonomous-Driving
[WACV 2024 Survey Paper] Multimodal Large Language Models for Autonomous Driving
Size: 15 MB - Last synced at: 4 days ago - Pushed at: about 1 year ago - Stars: 286 - Forks: 12

AIDC-AI/Ovis
A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings.
Language: Python - Size: 5.56 MB - Last synced at: 7 days ago - Pushed at: about 1 month ago - Stars: 899 - Forks: 56

dvlab-research/VisionZip
Official repository for VisionZip (CVPR 2025)
Language: Python - Size: 18.2 MB - Last synced at: 4 days ago - Pushed at: 2 months ago - Stars: 274 - Forks: 12

eliaskempf/ideal_words
A PyTorch implementation of ideal word computation.
Language: Python - Size: 48.8 KB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 4 - Forks: 0

katha-ai/VELOCITI
VELOCITI Benchmark Evaluation and Visualisation Code
Language: Python - Size: 186 KB - Last synced at: 8 days ago - Pushed at: 18 days ago - Stars: 6 - Forks: 0

ApplyU-ai/ColorBlindnessEval
ColorBlindnessEval: Can Vision Language Models Pass Color Blindness Tests?
Size: 4.18 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 2 - Forks: 0

jusiro/DLILP
[IPMI'25] A reality check of vision-language pre-training for radiology. DLILP: a disentangled language-image-label pre-training criterion for VLMs.
Language: Python - Size: 107 KB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 1 - Forks: 0

jusiro/CLAP
[CVPR'24] Validation-free few-shot adaptation of CLIP, using a well-initialized Linear Probe (ZSLP) and class-adaptive constraints (CLAP).
Language: Python - Size: 1.46 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 71 - Forks: 3

mvish7/AlignVLM
This repository contains an implementation of the AlignVLM paper, which proposes a novel method for vision-language alignment.
Language: Python - Size: 14.6 KB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 0 - Forks: 0

zjlucam/VisionAssistant
Parameter Efficient Multi-Model Vision Assistant for Polymer Solvation Behaviour Inference
Language: Python - Size: 184 KB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 4 - Forks: 1

X-iZhang/Libra
Code for the paper "Libra: Leveraging Temporal Images for Biomedical Radiology Analysis"
Language: Python - Size: 13.4 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 4 - Forks: 1

jingyaogong/minimind-v
🚀 Train a 26M-parameter vision-language multimodal VLM from scratch in just 1 hour! 🌏
Language: Python - Size: 21.4 MB - Last synced at: 9 days ago - Pushed at: 10 days ago - Stars: 3,209 - Forks: 305

zytx121/Awesome-VLGFM
A Survey on Vision-Language Geo-Foundation Models (VLGFMs)
Size: 3.33 MB - Last synced at: 7 days ago - Pushed at: 3 months ago - Stars: 164 - Forks: 8

PJLab-ADG/awesome-knowledge-driven-AD
A curated list of awesome knowledge-driven autonomous driving (continually updated)
Size: 912 KB - Last synced at: 5 days ago - Pushed at: 11 months ago - Stars: 462 - Forks: 24

2U1/Qwen2-VL-Finetune
An open-source implementation for fine-tuning the Qwen2-VL and Qwen2.5-VL series by Alibaba Cloud.
Language: Python - Size: 157 KB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 668 - Forks: 79
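
After fine-tuning with a repo like this, the resulting checkpoint can typically still be loaded through the standard transformers Qwen2-VL classes. A minimal inference sketch, assuming the stock `Qwen/Qwen2-VL-2B-Instruct` layout (swap in your fine-tuned output directory) and the optional `qwen_vl_utils` helper.

```python
import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # helper published alongside Qwen2-VL

model_id = "Qwen/Qwen2-VL-2B-Instruct"  # or the fine-tuned output directory
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{"role": "user", "content": [
    {"type": "image", "image": "example.jpg"},
    {"type": "text", "text": "Describe this image."},
]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:],
                             skip_special_tokens=True)[0])
```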

InternLM/InternLM-XComposer
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
Language: Python - Size: 199 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 2,815 - Forks: 173

2U1/Phi3-Vision-Finetune Fork of GaiZhenbiao/Phi3V-Finetuning
An open-source implementation for fine-tuning Phi3-Vision and Phi3.5-Vision by Microsoft.
Language: Python - Size: 926 KB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 92 - Forks: 16

dheeraj7000/reflect
🔍 Reflect - Personal Journal & Sentiment Analysis App
Language: Python - Size: 15.6 KB - Last synced at: 12 days ago - Pushed at: 13 days ago - Stars: 0 - Forks: 0

zhengli97/Awesome-Prompt-Adapter-Learning-for-VLMs
A curated list of awesome prompt/adapter learning methods for vision-language models like CLIP.
Size: 189 KB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 509 - Forks: 22

ammarlodhi255/Chest-xray-report-generation-app-with-chatbot-end-to-end-implementation
AI-powered Chest X-ray report generation app using VLM (Swin-T5) and LLM (LLaMA-3) for multilingual Q&A and medical education support.
Language: Jupyter Notebook - Size: 25.1 MB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 0 - Forks: 0

HVision-NKU/MaskCLIPpp
Official repository of the paper "High-Quality Mask Tuning Matters for Open-Vocabulary Segmentation"
Language: Python - Size: 17.8 MB - Last synced at: 10 days ago - Pushed at: about 1 month ago - Stars: 24 - Forks: 1

OpenGVLab/InternVL
[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal dialogue model approaching GPT-4o performance.
Language: Python - Size: 38.5 MB - Last synced at: 13 days ago - Pushed at: 19 days ago - Stars: 7,817 - Forks: 591

taco-group/LangCoop
Official implementation of LangCoop: Collaborative Driving with Natural Language
Language: Python - Size: 53.9 MB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 7 - Forks: 0

FoundationVision/Groma
[ECCV2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization
Language: Python - Size: 13.5 MB - Last synced at: 7 days ago - Pushed at: 11 months ago - Stars: 563 - Forks: 43

OpenDriveLab/ELM
[ECCV 2024] Embodied Understanding of Driving Scenarios
Language: Python - Size: 5.36 MB - Last synced at: 5 days ago - Pushed at: 4 months ago - Stars: 191 - Forks: 15

shreydan/simpleVLM
Building a simple VLM: implementing LLaMA-SmolLM2 from scratch plus the SigLIP2 vision model. KV-caching is supported and implemented from scratch as well.
Language: Jupyter Notebook - Size: 7.33 MB - Last synced at: 13 days ago - Pushed at: 16 days ago - Stars: 2 - Forks: 0

princeton-nlp/CharXiv
[NeurIPS 2024] CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
Language: Python - Size: 831 KB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 108 - Forks: 12

yasserben/FLOSS
FLOSS: Plug-in Training-free and label-free text template selection that boosts OVSS methods
Language: Python - Size: 4.22 MB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 17 - Forks: 1

taco-group/Re-Align
A novel alignment framework that leverages image retrieval to mitigate hallucinations in Vision Language Models.
Language: Python - Size: 18.6 MB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 40 - Forks: 1

X-iZhang/RRG-BioNLP-ACL2024
Code for the paper "Gla-AI4BioMed at RRG24: Visual Instruction-tuned Adaptation for Radiology Report Generation" (BioNLP ACL'24)
Size: 581 KB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 0 - Forks: 1

billbillbilly/urbanworm
Urban-Worm is a Python library that integrates remote sensing imagery, street view data, and multimodal models to assess environments and urban units.
Language: Python - Size: 6.84 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 0 - Forks: 3

OpenGVLab/Multi-Modality-Arena
Chatbot Arena meets multi-modality! Multi-Modality Arena allows you to benchmark vision-language models side-by-side while providing images as inputs. Supports MiniGPT-4, LLaMA-Adapter V2, LLaVA, BLIP-2, and many more!
Language: Python - Size: 21.5 MB - Last synced at: 16 days ago - Pushed at: about 1 year ago - Stars: 517 - Forks: 38

shikiw/OPERA
[CVPR 2024 Highlight] OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation
Language: Python - Size: 15.7 MB - Last synced at: 15 days ago - Pushed at: 9 months ago - Stars: 332 - Forks: 28

ndurner/oai_chat
Multi-modal Chatbot based on OpenAI
Language: Python - Size: 128 KB - Last synced at: 17 days ago - Pushed at: 17 days ago - Stars: 4 - Forks: 0

mixpeek/awesome-multimodal-search
A collection of multimodal search libraries, services, and research papers
Size: 3.12 MB - Last synced at: 18 days ago - Pushed at: 18 days ago - Stars: 7 - Forks: 0

PRITHIVSAKTHIUR/Image-Captioning-Florence2
This application utilizes the powerful Florence-2 vision-language model from Microsoft to generate comprehensive captions for images. The model is capable of understanding visual content and expressing it in natural language.
Language: Python - Size: 8.79 KB - Last synced at: 18 days ago - Pushed at: 18 days ago - Stars: 0 - Forks: 0
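
A minimal sketch of the Florence-2 captioning step behind an app like this; the model id and task token follow the public Florence-2 model cards, but the details (variant, prompt, image path) are illustrative.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-base"  # illustrative; the app may use the large variant
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("example.jpg").convert("RGB")
task = "<DETAILED_CAPTION>"  # Florence-2 selects tasks via special prompt tokens
inputs = processor(text=task, images=image, return_tensors="pt")

with torch.no_grad():
    ids = model.generate(input_ids=inputs["input_ids"],
                         pixel_values=inputs["pixel_values"],
                         max_new_tokens=256)

raw = processor.batch_decode(ids, skip_special_tokens=False)[0]
print(processor.post_process_generation(raw, task=task, image_size=image.size))
```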

sherlockchou86/PyLangPipe
A simple, lightweight large language model pipeline framework.
Language: Python - Size: 790 KB - Last synced at: 18 days ago - Pushed at: 19 days ago - Stars: 16 - Forks: 2

tsunghan-wu/reverse_vlm
🔥 Official implementation of "Generate, but Verify: Reducing Visual Hallucination in Vision-Language Models with Retrospective Resampling"
Language: Python - Size: 380 KB - Last synced at: 18 days ago - Pushed at: 19 days ago - Stars: 3 - Forks: 0

LAMDA-CL/PROOF
Learning without Forgetting for Vision-Language Models (TPAMI 2025)
Language: Python - Size: 581 KB - Last synced at: 18 days ago - Pushed at: 2 months ago - Stars: 34 - Forks: 2

RaptorMai/MLLM-CompBench
[NeurIPS'25] MLLM-CompBench evaluates the comparative reasoning of MLLMs with 40K image pairs and questions across 8 dimensions of relative comparison: visual attribute, existence, state, emotion, temporality, spatiality, quantity, and quality. CompBench covers diverse visual domains, including animals, fashion, sports, and scenes.
Language: Jupyter Notebook - Size: 10.7 MB - Last synced at: 18 days ago - Pushed at: 19 days ago - Stars: 38 - Forks: 2

illuin-tech/vidore-benchmark
Vision Document Retrieval (ViDoRe): Benchmark. Evaluation code for the ColPali paper.
Language: Python - Size: 2.97 MB - Last synced at: 17 days ago - Pushed at: 26 days ago - Stars: 197 - Forks: 24

nhussein/promptsmooth
Official implementation of the paper "PromptSmooth: Certifying Robustness of Medical Vision-Language Models via Prompt Learning"
Language: Python - Size: 6.85 MB - Last synced at: 18 days ago - Pushed at: 19 days ago - Stars: 22 - Forks: 1

miccunifi/Cross-the-Gap
[ICLR 2025] - Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion
Language: Python - Size: 23.2 MB - Last synced at: 19 days ago - Pushed at: 19 days ago - Stars: 36 - Forks: 0

LAMDA-CL/LAMDA-PILOT
🎉 PILOT: A Pre-trained Model-Based Continual Learning Toolbox
Language: Python - Size: 7.13 MB - Last synced at: 18 days ago - Pushed at: about 1 month ago - Stars: 407 - Forks: 45

astra-vision/ProLIP
An extremely simple method for validation-free efficient adaptation of CLIP-like VLMs that is robust to the learning rate.
Language: Shell - Size: 3.9 MB - Last synced at: 19 days ago - Pushed at: 20 days ago - Stars: 24 - Forks: 2

wadeKeith/Awesome-Embodied-AI
An Introduction to Embodied Intelligence (A Quick Guide to Embodied AI) (updating)
Size: 2.2 MB - Last synced at: 19 days ago - Pushed at: about 1 month ago - Stars: 85 - Forks: 6

tonywu71/colpali-cookbooks
Recipes for learning, fine-tuning, and adapting ColPali to your multimodal RAG use cases. 👨🏻‍🍳
Size: 10.4 MB - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 269 - Forks: 17

can-can-ya/QPMIL-VL
✨ [AAAI 2025] Queryable Prototype Multiple Instance Learning with Vision-Language Models for Incremental Whole Slide Image Classification
Language: Python - Size: 2.42 MB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 39 - Forks: 1

anishmadan23/foundational_fsod
This repository contains the implementation for the paper "Revisiting Few Shot Object Detection with Vision-Language Models"
Language: Python - Size: 12.2 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 59 - Forks: 3

iAnisdev/assistive-ai
Zero-shot object detection system for visually impaired users using CLIP, OWL-ViT, and real-time audio feedback.
Size: 3.91 KB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 0 - Forks: 0
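
Zero-shot detection of arbitrary text-described objects, as used in projects like this one, is commonly done with OWL-ViT through transformers. A minimal sketch with illustrative text queries and image path.

```python
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("street.jpg").convert("RGB")
queries = [["a traffic light", "a crosswalk", "a car"]]  # free-form text classes
inputs = processor(text=queries, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
detections = processor.post_process_object_detection(
    outputs, threshold=0.3, target_sizes=target_sizes
)[0]
for score, label, box in zip(detections["scores"], detections["labels"], detections["boxes"]):
    print(queries[0][label], round(score.item(), 3), [round(v, 1) for v in box.tolist()])
```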

Pavansomisetty21/Image-Caption-Generation-using-LLMs-GEMINI-
We generate captions for user-provided images using prompt engineering and generative AI.
Language: Jupyter Notebook - Size: 366 KB - Last synced at: 6 days ago - Pushed at: 9 months ago - Stars: 9 - Forks: 1

corentin-ryr/MultiMedEval
A Python tool to evaluate the performance of VLMs in the medical domain.
Language: Python - Size: 10.9 MB - Last synced at: 18 days ago - Pushed at: 18 days ago - Stars: 59 - Forks: 3

PKU-YuanGroup/Chat-UniVi
[CVPR 2024 Highlight🔥] Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
Language: Python - Size: 38.2 MB - Last synced at: 22 days ago - Pushed at: 7 months ago - Stars: 931 - Forks: 46

mbzuai-oryx/VideoGLaMM
[CVPR 2025 🔥]A Large Multimodal Model for Pixel-Level Visual Grounding in Videos
Language: Python - Size: 40 MB - Last synced at: 22 days ago - Pushed at: 23 days ago - Stars: 57 - Forks: 1

deepseek-ai/DeepSeek-VL
DeepSeek-VL: Towards Real-World Vision-Language Understanding
Language: Python - Size: 12.2 MB - Last synced at: 23 days ago - Pushed at: about 1 year ago - Stars: 3,769 - Forks: 558

OpenGVLab/MM-NIAH
[NeurIPS 2024] Needle In A Multimodal Haystack (MM-NIAH): A comprehensive benchmark designed to systematically evaluate the capability of existing MLLMs to comprehend long multimodal documents.
Language: Python - Size: 2.83 MB - Last synced at: 19 days ago - Pushed at: 5 months ago - Stars: 115 - Forks: 6

sitamgithub-MSIT/TextSnap
TextSnap: Demo for Florence 2 model used in OCR tasks to extract and visualize text from images.
Language: Python - Size: 3.34 MB - Last synced at: 24 days ago - Pushed at: 24 days ago - Stars: 4 - Forks: 2

mbodiai/embodied-agents
Seamlessly integrate state-of-the-art transformer models into robotics stacks
Language: Python - Size: 75.3 MB - Last synced at: 22 days ago - Pushed at: 22 days ago - Stars: 200 - Forks: 22
