GitHub topics: vision-language-model

Repositories

yuanze-lin/Olympus

[CVPR 2025 Highlight] Official code for "Olympus: A Universal Task Router for Computer Vision Tasks"

Language: Python - Size: 3.49 MB - Last synced at: 25 days ago - Pushed at: 25 days ago - Stars: 172 - Forks: 35

KumaarBalbir/droneAnalyzer

FlytBase Assignment - Building Drone Security Analyst Agent for a docked drone that monitors a fixed property daily.

Language: Python - Size: 29.3 MB - Last synced at: 25 days ago - Pushed at: 25 days ago - Stars: 0 - Forks: 0

QwenLM/Qwen-VL

The official repo of Qwen-VL (通义千问-VL) chat & pretrained large vision language model proposed by Alibaba Cloud.

Language: Python - Size: 26 MB - Last synced at: 26 days ago - Pushed at: 9 months ago - Stars: 5,750 - Forks: 439

CraftJarvis/ROCKET-1

Official implementation of paper "ROCKET-1: Mastering Open-World Interaction with Visual-Temporal Context Prompting" (CVPR 2025)

Language: Java - Size: 75.2 MB - Last synced at: 22 days ago - Pushed at: 22 days ago - Stars: 38 - Forks: 0

tongnie/awesome-llm4tr

Exploring the Roles of Large Language Models in Reshaping Transportation Systems: A Survey, Framework, and Roadmap

Size: 1.31 MB - Last synced at: 26 days ago - Pushed at: 26 days ago - Stars: 11 - Forks: 0

ictnlp/LLaVA-Mini

LLaVA-Mini is a unified large multimodal model (LMM) that can support the understanding of images, high-resolution images, and videos in an efficient manner.

Language: Python - Size: 54.6 MB - Last synced at: 26 days ago - Pushed at: 4 months ago - Stars: 441 - Forks: 19

Ruiyang-061X/VL-Uncertainty

🔎Official code for our paper: "VL-Uncertainty: Detecting Hallucination in Large Vision-Language Model via Uncertainty Estimation".

Language: Python - Size: 7.12 MB - Last synced at: 25 days ago - Pushed at: about 2 months ago - Stars: 31 - Forks: 2

neonwatty/meme-search

The open source Meme Search Engine and Finder. Free and built to self-host locally with Python, Ruby, and Docker.

Language: Ruby - Size: 14.8 MB - Last synced at: 27 days ago - Pushed at: about 1 month ago - Stars: 534 - Forks: 23

shikiw/Modality-Integration-Rate

The official code of the paper "Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate".

Language: Python - Size: 17.7 MB - Last synced at: 28 days ago - Pushed at: 5 months ago - Stars: 97 - Forks: 3

aiishwarrya/VisualLanguageModel

A custom Vision-Language Model (VLM) built from scratch, using SigLip for contrastive learning and a ViT-based encoder to generate meaningful image captions and semantic descriptions.

Size: 2.49 MB - Last synced at: 30 days ago - Pushed at: 30 days ago - Stars: 0 - Forks: 0

Zackriya-Solutions/diagram2graph

An AI Vision Language Model System for extracting structured knowledge graph information (JSON) from images of process diagrams

Language: Jupyter Notebook - Size: 15.6 MB - Last synced at: 30 days ago - Pushed at: 30 days ago - Stars: 3 - Forks: 0

kingsdigitallab/kdl-vqa

Python tool for batch visual question answering (BVQA).

Language: Python - Size: 44.2 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 4 - Forks: 0

PRITHIVSAKTHIUR/Hand-Gesture-2-Robot

Hand-Gesture-2-Robot is an image classification vision-language encoder model fine-tuned from google/siglip2-base-patch16-224 for a single-label classification task. It is designed to recognize hand gestures and map them to specific robot commands using the SiglipForImageClassification architecture.

Language: Python - Size: 12.7 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

ShareGPT4Omni/ShareGPT4V

[ECCV 2024] ShareGPT4V: Improving Large Multi-modal Models with Better Captions

Language: Python - Size: 644 KB - Last synced at: 27 days ago - Pushed at: 10 months ago - Stars: 211 - Forks: 5

vbdi/divprune

[CVPR 2025] DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models

Language: Python - Size: 11 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 7 - Forks: 0

YanNeu/DASH-B

Object Hallucination Benchmark for Vision Language Models

Language: Python - Size: 512 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

YanNeu/DASH

DASH: Detection and Assessment of Systematic Hallucinations of VLMs

Language: Python - Size: 4.3 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 2 - Forks: 0

Davidlequnchen/VLM-CADFeatureRecognition

This repository provides code and resources for automating manufacturing feature recognition in CAD designs using vision-language models.

Language: Jupyter Notebook - Size: 166 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

SeoBuAs/Advanced_Anomaly_Detection_in_CCTV_Systems_with_VLM

CCTV Anomaly Detection and Logging System

Size: 0 Bytes - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

reidbarber/webmarker

Mark web pages for use with vision-language models

Language: TypeScript - Size: 677 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 30 - Forks: 3

tian1327/SWAT

[CVPR 2025] Few-shot Recognition via Stage-Wise Retrieval-Augmented Finetuning

Language: Python - Size: 22.3 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 14 - Forks: 1

IDEA-FinAI/ChartMoE

[ICLR2025 Oral] ChartMoE: Mixture of Diversely Aligned Expert Connector for Chart Understanding

Language: Jupyter Notebook - Size: 9.76 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 50 - Forks: 1

PRITHIVSAKTHIUR/Gender-Classifier-Mini

Gender-Classifier-Mini is an image classification vision-language encoder model fine-tuned from google/siglip2-base-patch16-224 for a single-label classification task. It is designed to classify images based on gender using the SiglipForImageClassification architecture.

Language: Python - Size: 0 Bytes - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

tanmaypatil/study-notes

Sample application for searching students' study notes and creating quizzes from them.

Language: Python - Size: 1.17 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

sshh12/multi_token

Embed arbitrary modalities (images, audio, documents, etc) into large language models.

Language: Python - Size: 1.22 MB - Last synced at: about 1 month ago - Pushed at: about 1 year ago - Stars: 183 - Forks: 14

psunlpgroup/VisOnlyQA

This repository contains the code and data for the paper "VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information"

Language: Python - Size: 174 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 22 - Forks: 1

dvlab-research/LSDBench

A benchmark that focuses on the sampling dilemma in long-video tasks. Through well-designed tasks, it evaluates the sampling efficiency of long-video VLMs.

Language: Python - Size: 2.57 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 3 - Forks: 0

YyzHarry/vlm-fairness

[Science Advances] Demographic Bias of Vision-Language Foundation Models in Medical Imaging

Language: Python - Size: 1.3 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 16 - Forks: 3

s-emanuilov/litepali

LitePali is a minimal, efficient implementation of ColPali for image retrieval and indexing, optimized for cloud deployment.

Language: Python - Size: 691 KB - Last synced at: 22 days ago - Pushed at: 7 months ago - Stars: 46 - Forks: 1

FreedomIntelligence/TRIM

We introduce a new approach, Token Reduction using CLIP Metric (TRIM), aimed at improving the efficiency of MLLMs without sacrificing their performance.

Language: Python - Size: 26.9 MB - Last synced at: 4 days ago - Pushed at: 5 months ago - Stars: 12 - Forks: 0

Tanveer81/ReVisionLLM

This is the official implementation of ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos

Language: Python - Size: 5.58 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 11 - Forks: 0

ZQuang2202/PromptGD

PromptGD - A simple baseline for Language-driven Grasp Detection task.

Language: Jupyter Notebook - Size: 1.91 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 5 - Forks: 0

VITA-MLLM/Long-VITA

✨✨Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy

Language: Python - Size: 3.85 MB - Last synced at: about 1 month ago - Pushed at: about 2 months ago - Stars: 265 - Forks: 29

fork123aniket/Multi-Round-VLM-powered-Multimodal-Conversational-AI-Navigation-Bot

Streamlit App Combining Vision, Language, and Audio AI Models

Language: Python - Size: 18.6 KB - Last synced at: 13 days ago - Pushed at: 3 months ago - Stars: 3 - Forks: 0

Albi1999/multi-agent-anpr

Side group project for the Vision and Cognitive Systems Course of the MSc in Data Science @ UniPD 2024/2025

Language: Jupyter Notebook - Size: 431 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

jacobmarks/fiftyone_florence2_plugin

Run SOTA Vision-Language Model Florence-2 on your data!

Language: Jupyter Notebook - Size: 5.86 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 9 - Forks: 1

VoxAct-B/voxactb

VoxAct-B: Voxel-Based Acting and Stabilizing Policy for Bimanual Manipulation (CoRL 2024)

Language: Python - Size: 400 MB - Last synced at: about 1 month ago - Pushed at: 6 months ago - Stars: 39 - Forks: 2

Flame-Code-VLM/Flame-Code-VLM

Flame is an open-source multimodal AI system designed to translate UI design mockups into high-quality React code. It leverages vision-language modeling, automated data synthesis, and structured training workflows to bridge the gap between design and front-end development.

Language: Python - Size: 7.24 MB - Last synced at: about 1 month ago - Pushed at: 2 months ago - Stars: 470 - Forks: 28

jmoral4/aigamer

AIGamer Testbed for the game WarSim (and maybe others). Support for Ollama, Claude, and easily extensible for OpenAI, Gemini

Language: C# - Size: 36.1 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

hustvl/AlphaDrive

Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning

Size: 2.55 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 132 - Forks: 6

icon-lab/MedTrim

Official implementation of "Meta-Entity Driven Triplet Mining for Aligning Medical Vision-Language Models"

Language: Python - Size: 40 KB - Last synced at: 20 days ago - Pushed at: about 2 months ago - Stars: 2 - Forks: 1

Pavansomisetty21/Qwen2-Vision-Finetuning-Unsloth---Maths-OCR-Formulae-Extraction-

We fine-tune an Unsloth Llama model to extract mathematical formulas from images with optical character recognition (OCR).

Language: Jupyter Notebook - Size: 43 KB - Last synced at: about 1 month ago - Pushed at: 4 months ago - Stars: 4 - Forks: 0

Jl-wei/guing

A mobile GUI search engine using a vision-language model

Language: Python - Size: 15.6 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 11 - Forks: 1

2U1/Pixtral-Finetune

An open-source implementation for fine-tuning Pixtral by MistralAI.

Language: Python - Size: 58.6 KB - Last synced at: 24 days ago - Pushed at: 3 months ago - Stars: 13 - Forks: 1

zhengli97/PromptKD

[CVPR 2024] Official PyTorch Code for "PromptKD: Unsupervised Prompt Distillation for Vision-Language Models"

Language: Python - Size: 11.2 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 281 - Forks: 3

yuhui-zh15/AutoConverter

Official implementation of "Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation" (CVPR 2025)

Language: Python - Size: 46.7 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 24 - Forks: 2

sitamgithub-MSIT/aya-vision-litserve

Leverage Aya Vision's capabilities using LitServe.

Language: Python - Size: 1.84 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

NVlabs/prismer

The implementation of "Prismer: A Vision-Language Model with Multi-Task Experts".

Language: Python - Size: 4.25 MB - Last synced at: 26 days ago - Pushed at: over 1 year ago - Stars: 1,310 - Forks: 73

Xza85hrf/Arachne-Picrawler (fork of sunfounder/picrawler)

ChatGPT 4o integration for the PiCrawler robot from SunFounder

Language: Python - Size: 110 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

Derekkk/VIVA_EMNLP24

[EMNLP'24] VIVA: A Benchmark for Vision-Grounded Decision-Making with Human Values

Language: Python - Size: 3.28 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 4 - Forks: 0

ybendou/ProKeR

[CVPR 2025] This repository is the official implementation of "ProKeR: A Kernel Perspective on Few-Shot Adaptation of Large Vision-Language Models"

Language: Python - Size: 40.6 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 10 - Forks: 0

ALucek/multimodal-llm-breakdown

Outlining and demonstrating how language models are able to understand image, video, and text content.

Language: Jupyter Notebook - Size: 14.7 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 1 - Forks: 0

anasster/diploma-thesis-repo

Repository for my Diploma Thesis code (Python)

Language: Python - Size: 52.7 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

feuler/openwebui-visual-retrieval

ColQwen2 local Vespa DB deploy and feed and Open-Webui retrieval function

Language: Python - Size: 34.2 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 1 - Forks: 0

Traffic-Alpha/VLM-TSC

Language: Python - Size: 703 KB - Last synced at: about 2 months ago - Pushed at: 3 months ago - Stars: 1 - Forks: 0

Pavansomisetty21/Visual-Question-Answering-using-Gemini-LLM

Explores visual question answering using the Gemini LLM, with the image supplied as a URL or as a file of any other extension.

Language: Jupyter Notebook - Size: 6.7 MB - Last synced at: 5 days ago - Pushed at: 4 months ago - Stars: 4 - Forks: 0

shafieiali42/PromptAD-Robustness

Evaluation of PromptAD’s robustness under various image corruptions for few-shot anomaly detection.

Language: Jupyter Notebook - Size: 315 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

ddw2AIGROUP2CQUPT/PA-LLaVA

A Large Language-Vision Assistant for Pathology Image Understanding (BIBM-2024)

Language: Python - Size: 1.27 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 41 - Forks: 3

thomas-yanxin/KarmaVLM

🧘🏻‍♂️KarmaVLM (相生): A family of high-efficiency, powerful visual language models.

Language: Python - Size: 2.68 MB - Last synced at: 5 days ago - Pushed at: about 1 year ago - Stars: 88 - Forks: 3

PRITHIVSAKTHIUR/Agent-Dino

Dino: The Minimalist Multipurpose Chat System

Language: Python - Size: 363 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 1 - Forks: 0

showlab/ShowUI

[CVPR 2025] Open-source, End-to-end, Vision-Language-Action model for GUI Agent & Computer Use.

Language: Python - Size: 27.4 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 1,064 - Forks: 64

hzjian123/VLArena

Closed-loop evaluation for end-to-end VLM autonomous driving agent

Language: Python - Size: 367 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 10 - Forks: 3

eljandoubi/PaliGemma

Coding PaliGemma from scratch using PyTorch for inference.

Language: Python - Size: 1.11 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 1 - Forks: 0

rajibrhasan/modality_gap

A repository for visualization of modality gap in VLMs

Language: Python - Size: 14.6 KB - Last synced at: about 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

ai4ce/LLM4VPR

Can multimodal LLM help visual place recognition?

Language: Python - Size: 7.92 MB - Last synced at: 23 days ago - Pushed at: 10 months ago - Stars: 37 - Forks: 1

Skyline-9/Shotluck-Holmes

[ACM MMGR '24] 🔍 Shotluck Holmes: A family of small-scale LLVMs for shot-level video understanding

Language: Python - Size: 26.3 MB - Last synced at: about 1 month ago - Pushed at: 6 months ago - Stars: 11 - Forks: 0

ChristianLin0420/elsa-vla

A simple and scalable codebase for training and fine-tuning vision-language-action models (VLAs) for generalist robotic manipulation.

Language: Python - Size: 314 KB - Last synced at: about 1 month ago - Pushed at: 2 months ago - Stars: 2 - Forks: 0

rajibrhasan/LLaVA

This repository contains the implementation of a modified LLaVA architecture designed to address information imbalance between modalities in multimodal learning.

Language: Python - Size: 14.8 MB - Last synced at: 26 days ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

Luis355/qw

qw is a lightweight text editor designed for quick and efficient editing tasks. It offers a simple yet powerful interface for users to easily manipulate text files.

Size: 0 Bytes - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

abbasjoyia99/DeepEraseKit

DeepEraseKit is a universal Swift package for iOS and macOS that removes backgrounds in real time while capturing video. Powered by Apple's Vision framework, it supports multiple background options: none, blur, color, and image, making it ideal for virtual backgrounds and augmented reality applications.

Language: Swift - Size: 26.4 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0