An open API service providing repository metadata for many open source software ecosystems.

GitHub topics: vision-language-model

yuanze-lin/Olympus

[CVPR 2025 Highlight] Official code for "Olympus: A Universal Task Router for Computer Vision Tasks"

Language: Python - Size: 3.49 MB - Last synced at: 25 days ago - Pushed at: 25 days ago - Stars: 172 - Forks: 35

KumaarBalbir/droneAnalyzer

FlytBase Assignment - Building Drone Security Analyst Agent for a docked drone that monitors a fixed property daily.

Language: Python - Size: 29.3 MB - Last synced at: 25 days ago - Pushed at: 25 days ago - Stars: 0 - Forks: 0

QwenLM/Qwen-VL

The official repo of Qwen-VL (通义千问-VL) chat & pretrained large vision language model proposed by Alibaba Cloud.

Language: Python - Size: 26 MB - Last synced at: 26 days ago - Pushed at: 9 months ago - Stars: 5,750 - Forks: 439

CraftJarvis/ROCKET-1

Official implementation of paper "ROCKET-1: Mastering Open-World Interaction with Visual-Temporal Context Prompting" (CVPR 2025)

Language: Java - Size: 75.2 MB - Last synced at: 22 days ago - Pushed at: 22 days ago - Stars: 38 - Forks: 0

tongnie/awesome-llm4tr

Exploring the Roles of Large Language Models in Reshaping Transportation Systems: A Survey, Framework, and Roadmap

Size: 1.31 MB - Last synced at: 26 days ago - Pushed at: 26 days ago - Stars: 11 - Forks: 0

ictnlp/LLaVA-Mini

LLaVA-Mini is a unified large multimodal model (LMM) that can support the understanding of images, high-resolution images, and videos in an efficient manner.

Language: Python - Size: 54.6 MB - Last synced at: 26 days ago - Pushed at: 4 months ago - Stars: 441 - Forks: 19

Ruiyang-061X/VL-Uncertainty

🔎Official code for our paper: "VL-Uncertainty: Detecting Hallucination in Large Vision-Language Model via Uncertainty Estimation".

Language: Python - Size: 7.12 MB - Last synced at: 25 days ago - Pushed at: about 2 months ago - Stars: 31 - Forks: 2

neonwatty/meme-search

The open source Meme Search Engine and Finder. Free and built to self-host locally with Python, Ruby, and Docker.

Language: Ruby - Size: 14.8 MB - Last synced at: 27 days ago - Pushed at: about 1 month ago - Stars: 534 - Forks: 23

shikiw/Modality-Integration-Rate

The official code of the paper "Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate".

Language: Python - Size: 17.7 MB - Last synced at: 28 days ago - Pushed at: 5 months ago - Stars: 97 - Forks: 3

aiishwarrya/VisualLanguageModel

A custom Vision-Language Model (VLM) built from scratch, using SigLip for contrastive learning and a ViT-based encoder to generate meaningful image captions and semantic descriptions.

Size: 2.49 MB - Last synced at: 30 days ago - Pushed at: 30 days ago - Stars: 0 - Forks: 0

Zackriya-Solutions/diagram2graph

An AI Vision Language Model System for extracting structured knowledge graph information (JSON) from images of process diagrams

Language: Jupyter Notebook - Size: 15.6 MB - Last synced at: 30 days ago - Pushed at: 30 days ago - Stars: 3 - Forks: 0

kingsdigitallab/kdl-vqa

Python tool for batch visual question answering (BVQA).

Language: Python - Size: 44.2 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 4 - Forks: 0

PRITHIVSAKTHIUR/Hand-Gesture-2-Robot

Hand-Gesture-2-Robot is an image classification vision-language encoder model fine-tuned from google/siglip2-base-patch16-224 for a single-label classification task. It is designed to recognize hand gestures and map them to specific robot commands using the SiglipForImageClassification architecture.

Language: Python - Size: 12.7 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

ShareGPT4Omni/ShareGPT4V

[ECCV 2024] ShareGPT4V: Improving Large Multi-modal Models with Better Captions

Language: Python - Size: 644 KB - Last synced at: 27 days ago - Pushed at: 10 months ago - Stars: 211 - Forks: 5

vbdi/divprune

[CVPR 2025] DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models

Language: Python - Size: 11 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 7 - Forks: 0

YanNeu/DASH-B

Object Hallucination Benchmark for Vision Language Models

Language: Python - Size: 512 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

YanNeu/DASH

DASH: Detection and Assessment of Systematic Hallucinations of VLMs

Language: Python - Size: 4.3 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 2 - Forks: 0

Davidlequnchen/VLM-CADFeatureRecognition

This repository provides code and resources for automating manufacturing feature recognition in CAD designs using vision-language models.

Language: Jupyter Notebook - Size: 166 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

SeoBuAs/Advanced_Anomaly_Detection_in_CCTV_Systems_with_VLM

CCTV Anomaly Detection and Logging System

Size: 0 Bytes - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

reidbarber/webmarker

Mark web pages for use with vision-language models

Language: TypeScript - Size: 677 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 30 - Forks: 3

tian1327/SWAT

[CVPR 2025] Few-shot Recognition via Stage-Wise Retrieval-Augmented Finetuning

Language: Python - Size: 22.3 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 14 - Forks: 1

IDEA-FinAI/ChartMoE

[ICLR2025 Oral] ChartMoE: Mixture of Diversely Aligned Expert Connector for Chart Understanding

Language: Jupyter Notebook - Size: 9.76 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 50 - Forks: 1

PRITHIVSAKTHIUR/Gender-Classifier-Mini

Gender-Classifier-Mini is an image classification vision-language encoder model fine-tuned from google/siglip2-base-patch16-224 for a single-label classification task. It is designed to classify images based on gender using the SiglipForImageClassification architecture.

Language: Python - Size: 0 Bytes - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

tanmaypatil/study-notes

Sample application for searching students' study notes and creating quizzes from them.

Language: Python - Size: 1.17 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

sshh12/multi_token

Embed arbitrary modalities (images, audio, documents, etc) into large language models.

Language: Python - Size: 1.22 MB - Last synced at: about 1 month ago - Pushed at: about 1 year ago - Stars: 183 - Forks: 14

psunlpgroup/VisOnlyQA

This repository contains the code and data for the paper "VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information"

Language: Python - Size: 174 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 22 - Forks: 1

dvlab-research/LSDBench

A benchmark that focuses on the sampling dilemma in long-video tasks. Through well-designed tasks, it evaluates the sampling efficiency of long-video VLMs.

Language: Python - Size: 2.57 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 3 - Forks: 0

YyzHarry/vlm-fairness

[Science Advances] Demographic Bias of Vision-Language Foundation Models in Medical Imaging

Language: Python - Size: 1.3 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 16 - Forks: 3

s-emanuilov/litepali

LitePali is a minimal, efficient implementation of ColPali for image retrieval and indexing, optimized for cloud deployment.

Language: Python - Size: 691 KB - Last synced at: 22 days ago - Pushed at: 7 months ago - Stars: 46 - Forks: 1

FreedomIntelligence/TRIM

We introduce a new approach, Token Reduction using CLIP Metric (TRIM), aimed at improving the efficiency of MLLMs without sacrificing their performance.

Language: Python - Size: 26.9 MB - Last synced at: 4 days ago - Pushed at: 5 months ago - Stars: 12 - Forks: 0

Tanveer81/ReVisionLLM

This is the official implementation of ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos

Language: Python - Size: 5.58 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 11 - Forks: 0

ZQuang2202/PromptGD

PromptGD - A simple baseline for Language-driven Grasp Detection task.

Language: Jupyter Notebook - Size: 1.91 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 5 - Forks: 0

VITA-MLLM/Long-VITA

✨✨Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy

Language: Python - Size: 3.85 MB - Last synced at: about 1 month ago - Pushed at: about 2 months ago - Stars: 265 - Forks: 29

fork123aniket/Multi-Round-VLM-powered-Multimodal-Conversational-AI-Navigation-Bot

Streamlit App Combining Vision, Language, and Audio AI Models

Language: Python - Size: 18.6 KB - Last synced at: 13 days ago - Pushed at: 3 months ago - Stars: 3 - Forks: 0

Albi1999/multi-agent-anpr

Side group project for the Vision and Cognitive Systems Course of the MSc in Data Science @ UniPD 2024/2025

Language: Jupyter Notebook - Size: 431 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

jacobmarks/fiftyone_florence2_plugin

Run SOTA Vision-Language Model Florence-2 on your data!

Language: Jupyter Notebook - Size: 5.86 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 9 - Forks: 1

VoxAct-B/voxactb

VoxAct-B: Voxel-Based Acting and Stabilizing Policy for Bimanual Manipulation (CoRL 2024)

Language: Python - Size: 400 MB - Last synced at: about 1 month ago - Pushed at: 6 months ago - Stars: 39 - Forks: 2

Flame-Code-VLM/Flame-Code-VLM

Flame is an open-source multimodal AI system designed to translate UI design mockups into high-quality React code. It leverages vision-language modeling, automated data synthesis, and structured training workflows to bridge the gap between design and front-end development.

Language: Python - Size: 7.24 MB - Last synced at: about 1 month ago - Pushed at: 2 months ago - Stars: 470 - Forks: 28

jmoral4/aigamer

AIGamer Testbed for the game WarSim (and maybe others). Supports Ollama and Claude, and is easily extensible to OpenAI and Gemini.

Language: C# - Size: 36.1 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

hustvl/AlphaDrive

Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning

Size: 2.55 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 132 - Forks: 6

icon-lab/MedTrim

Official implementation of "Meta-Entity Driven Triplet Mining for Aligning Medical Vision-Language Models"

Language: Python - Size: 40 KB - Last synced at: 20 days ago - Pushed at: about 2 months ago - Stars: 2 - Forks: 1

Pavansomisetty21/Qwen2-Vision-Finetuning-Unsloth---Maths-OCR-Formulae-Extraction-

We fine-tune an Unsloth Llama model to extract mathematical formulas from images using optical character recognition (OCR).

Language: Jupyter Notebook - Size: 43 KB - Last synced at: about 1 month ago - Pushed at: 4 months ago - Stars: 4 - Forks: 0

Jl-wei/guing

A mobile GUI search engine using a vision-language model

Language: Python - Size: 15.6 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 11 - Forks: 1

2U1/Pixtral-Finetune

An open-source implementation for fine-tuning Pixtral by MistralAI.

Language: Python - Size: 58.6 KB - Last synced at: 24 days ago - Pushed at: 3 months ago - Stars: 13 - Forks: 1

zhengli97/PromptKD

[CVPR 2024] Official PyTorch Code for "PromptKD: Unsupervised Prompt Distillation for Vision-Language Models"

Language: Python - Size: 11.2 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 281 - Forks: 3

yuhui-zh15/AutoConverter

Official implementation of "Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation" (CVPR 2025)

Language: Python - Size: 46.7 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 24 - Forks: 2

sitamgithub-MSIT/aya-vision-litserve

Leverage Aya Vision's capabilities using LitServe.

Language: Python - Size: 1.84 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

NVlabs/prismer

The implementation of "Prismer: A Vision-Language Model with Multi-Task Experts".

Language: Python - Size: 4.25 MB - Last synced at: 26 days ago - Pushed at: over 1 year ago - Stars: 1,310 - Forks: 73

Xza85hrf/Arachne-Picrawler (fork of sunfounder/picrawler)

ChatGPT 4o integration for PiCrawler robot from SunFounder

Language: Python - Size: 110 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

Derekkk/VIVA_EMNLP24

[EMNLP'24] VIVA: A Benchmark for Vision-Grounded Decision-Making with Human Values

Language: Python - Size: 3.28 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 4 - Forks: 0

ybendou/ProKeR

[CVPR 2025] This repository is the official implementation of "ProKeR: A Kernel Perspective on Few-Shot Adaptation of Large Vision-Language Models"

Language: Python - Size: 40.6 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 10 - Forks: 0

ALucek/multimodal-llm-breakdown

Outlining and demonstrating how language models are able to understand image, video, and text content.

Language: Jupyter Notebook - Size: 14.7 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 1 - Forks: 0

anasster/diploma-thesis-repo

Repository for my Diploma Thesis code (Python)

Language: Python - Size: 52.7 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

feuler/openwebui-visual-retrieval

ColQwen2 local Vespa DB deploy and feed and Open-Webui retrieval function

Language: Python - Size: 34.2 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 1 - Forks: 0

Traffic-Alpha/VLM-TSC

Language: Python - Size: 703 KB - Last synced at: about 2 months ago - Pushed at: 3 months ago - Stars: 1 - Forks: 0

Pavansomisetty21/Visual-Question-Answering-using-Gemini-LLM

Explores visual question answering using the Gemini LLM, with images supplied by URL or in other formats.

Language: Jupyter Notebook - Size: 6.7 MB - Last synced at: 5 days ago - Pushed at: 4 months ago - Stars: 4 - Forks: 0

shafieiali42/PromptAD-Robustness

Evaluation of PromptAD’s robustness under various image corruptions for few-shot anomaly detection.

Language: Jupyter Notebook - Size: 315 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

ddw2AIGROUP2CQUPT/PA-LLaVA

A Large Language-Vision Assistant for Pathology Image Understanding (BIBM-2024)

Language: Python - Size: 1.27 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 41 - Forks: 3

thomas-yanxin/KarmaVLM

🧘🏻‍♂️KarmaVLM (相生): A family of highly efficient and powerful visual language models.

Language: Python - Size: 2.68 MB - Last synced at: 5 days ago - Pushed at: about 1 year ago - Stars: 88 - Forks: 3

PRITHIVSAKTHIUR/Agent-Dino

Dino: The Minimalist Multipurpose Chat System

Language: Python - Size: 363 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 1 - Forks: 0

showlab/ShowUI

[CVPR 2025] Open-source, End-to-end, Vision-Language-Action model for GUI Agent & Computer Use.

Language: Python - Size: 27.4 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 1,064 - Forks: 64

hzjian123/VLArena

Closed-loop evaluation for end-to-end VLM autonomous driving agent

Language: Python - Size: 367 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 10 - Forks: 3

eljandoubi/PaliGemma

Coding PaliGemma from scratch using pytorch for inference.

Language: Python - Size: 1.11 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 1 - Forks: 0

rajibrhasan/modality_gap

A repository for visualization of modality gap in VLMs

Language: Python - Size: 14.6 KB - Last synced at: about 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

ai4ce/LLM4VPR

Can multimodal LLM help visual place recognition?

Language: Python - Size: 7.92 MB - Last synced at: 23 days ago - Pushed at: 10 months ago - Stars: 37 - Forks: 1

Skyline-9/Shotluck-Holmes

[ACM MMGR '24] 🔍 Shotluck Holmes: A family of small-scale LLVMs for shot-level video understanding

Language: Python - Size: 26.3 MB - Last synced at: about 1 month ago - Pushed at: 6 months ago - Stars: 11 - Forks: 0

ChristianLin0420/elsa-vla

A simple and scalable codebase for training and fine-tuning vision-language-action models (VLAs) for generalist robotic manipulation.

Language: Python - Size: 314 KB - Last synced at: about 1 month ago - Pushed at: 2 months ago - Stars: 2 - Forks: 0

rajibrhasan/LLaVA

This repository contains the implementation of a modified LLaVA architecture designed to address information imbalance between modalities in multimodal learning.

Language: Python - Size: 14.8 MB - Last synced at: 26 days ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

Luis355/qw

qw is a lightweight text editor designed for quick and efficient editing tasks. It offers a simple yet powerful interface for users to easily manipulate text files.

Size: 0 Bytes - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

abbasjoyia99/DeepEraseKit

DeepEraseKit is a universal Swift package for iOS and macOS that removes backgrounds in real time while capturing video. Powered by Apple's Vision framework, it supports multiple background options: none, blur, color, and image, making it ideal for virtual backgrounds and augmented reality applications.

Language: Swift - Size: 26.4 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

tyler-romero/seahorse

A small vision language model meant for research

Language: Python - Size: 3.65 MB - Last synced at: about 1 month ago - Pushed at: 7 months ago - Stars: 2 - Forks: 0

hustvl/MaskAdapter

[CVPR 2025] Official repository of the paper "Mask-Adapter: The Devil is in the Masks for Open-Vocabulary Segmentation"

Language: Python - Size: 15.4 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 50 - Forks: 0

FMXExpress/AI-Vision-Chat

Chat with large language models about the contents of an image via this native desktop client for Windows, macOS, and Linux.

Language: Pascal - Size: 3.48 MB - Last synced at: 2 days ago - Pushed at: over 1 year ago - Stars: 20 - Forks: 4

zihaosheng/VLM-RL

VLM-RL: A Unified Vision Language Models and Reinforcement Learning Framework for Safe Autonomous Driving

Language: Python - Size: 861 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 71 - Forks: 5

AstraZeneca/vlm

Official implementation for "Diffusion Instruction Tuning"

Size: 5.57 MB - Last synced at: 2 months ago - Pushed at: 3 months ago - Stars: 14 - Forks: 0

MaxLSB/mini-paligemma2

Minimalist implementation of PaliGemma 2 & PaliGemma VLM from scratch

Language: Python - Size: 4.22 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 2 - Forks: 0

xyz9911/FLAME

[AAAI-25 Oral] Official Implementation of "FLAME: Learning to Navigate with Multimodal LLM in Urban Environments"

Language: Python - Size: 8.57 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 30 - Forks: 3

sitamgithub-MSIT/align-anything-litserve

Leverage Align-DS-V's capabilities using LitServe.

Language: Python - Size: 1.14 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

wusize/CLIPSelf

[ICLR2024 Spotlight] Code Release of CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction

Language: Python - Size: 32 MB - Last synced at: about 1 month ago - Pushed at: about 1 year ago - Stars: 183 - Forks: 9

huangwl18/VoxPoser

VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

Language: Python - Size: 7.11 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 637 - Forks: 82

MiZhenxing/ThinkDiff

Codes for ThinkDiff

Size: 2.55 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

hasanar1f/PAINT

PAINT (Paying Attention to INformed Tokens) is a plug-and-play framework that intervenes in the self-attention of the LLM and selectively boosts the visual attention-informed tokens to mitigate hallucination in Vision Language Models.

Language: Python - Size: 59.9 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 4 - Forks: 0

spatial-comfort/spatial-comfort.github.io

Official website for "Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference under Ambiguities"

Language: JavaScript - Size: 9.95 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

xuyang-liu16/VGDiffZero

[ICASSP 2024] VGDiffZero: Text-to-image Diffusion Models Can Be Zero-shot Visual Grounders

Language: Python - Size: 1.07 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 14 - Forks: 1

nsidn98/LLaMAR

Code for our paper LLaMAR: LM-based Long-Horizon Planner for Multi-Agent Robotics

Language: Jupyter Notebook - Size: 58.2 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 1 - Forks: 0

ymrohit/openscenesense-ollama

OpenSceneSense Ollama is a Python library that harnesses AI for advanced local video analysis, offering customizable frame and audio insights for dynamic applications in media, education, and content moderation.

Language: Python - Size: 22.7 MB - Last synced at: 23 days ago - Pushed at: 6 months ago - Stars: 18 - Forks: 3

visual-haystacks/mirage

🔥 [ICLR 2025] Official PyTorch Model "Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark"

Language: Python - Size: 11.2 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 8 - Forks: 0

visual-haystacks/vhs_benchmark

🔥 [ICLR 2025] Official Benchmark Toolkits for "Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark"

Language: Python - Size: 5.22 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 24 - Forks: 1

liupei101/VLSA

[ICLR 2025] Interpretable Vision-Language Survival Analysis with Ordinal Inductive Bias for Computational Pathology

Language: Jupyter Notebook - Size: 12.2 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 24 - Forks: 2

SALT-NLP/PopupAttack

Code repo for the paper: Attacking Vision-Language Computer Agents via Pop-ups

Language: Python - Size: 195 MB - Last synced at: 2 months ago - Pushed at: 4 months ago - Stars: 26 - Forks: 1

ivonajdenkoska/tulip

[ICLR 2025] Official code repository for "TULIP: Token-length Upgraded CLIP"

Language: Python - Size: 27.3 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 11 - Forks: 0

Blacksujit/Deep-Learning-Specialization-Repo

This repo contains my neural network learning notes with TensorFlow, covering the high-level deep learning concepts I am learning, along with project implementations.

Language: Jupyter Notebook - Size: 20.2 MB - Last synced at: about 1 month ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

chenshuang-zhang/imagenet_d

[CVPR 2024 Highlight] ImageNet-D

Language: Python - Size: 49.3 MB - Last synced at: about 1 month ago - Pushed at: 7 months ago - Stars: 41 - Forks: 5

fork123aniket/Agentic-RAG-Story-Generation-with-Multimodal-GenAI

Multimodal Agentic GenAI Workflow – Seamlessly blends retrieval and generation for intelligent storytelling

Language: Python - Size: 94.7 KB - Last synced at: 28 days ago - Pushed at: 3 months ago - Stars: 2 - Forks: 0

astra-vision/LatteCLIP

[WACV 2025] LatteCLIP: Unsupervised CLIP Fine-Tuning via LMM-Synthetic Texts

Language: Jupyter Notebook - Size: 1.86 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 6 - Forks: 0

Pavansomisetty21/Visual-Question-Answering-Pixtral_Vision_Finetuning_Unsloth

Fine-tunes the Pixtral-12B-2409 model using Unsloth for visual question answering (NLP task).

Language: Jupyter Notebook - Size: 379 KB - Last synced at: 2 months ago - Pushed at: 4 months ago - Stars: 2 - Forks: 0

DataFog/vlm-api

REST API for computing cross-modal similarity between images and text using the ColPaLI vision-language model

Language: Python - Size: 2.53 MB - Last synced at: about 1 month ago - Pushed at: 6 months ago - Stars: 7 - Forks: 1

ys-zong/VLGuard

[ICML 2024] Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models.

Language: Python - Size: 1.97 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 51 - Forks: 2

Nerif-AI/Nerif

LLM-powered Python

Language: Python - Size: 185 KB - Last synced at: 9 days ago - Pushed at: 4 months ago - Stars: 14 - Forks: 5

Theia-4869/FasterVLM

Official code for paper: [CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster.

Language: Python - Size: 28.6 MB - Last synced at: 4 months ago - Pushed at: 5 months ago - Stars: 44 - Forks: 0

Related Keywords
vision-language-model 309 llm 50 large-language-models 46 multimodal 39 clip 30 vlm 29 computer-vision 24 llava 23 multimodal-large-language-models 23 foundation-models 21 deep-learning 20 vision-language 18 pytorch 17 vision-transformer 16 artificial-intelligence 16 chatbot 16 benchmark 14 ai 13 generative-ai 12 multi-modal 12 robotics 12 large-multimodal-models 12 python 12 chatgpt 12 large-language-model 11 multi-modality 11 machine-learning 10 vision-and-language 10 prompt-learning 9 llms 9 multimodal-learning 9 diffusion-models 8 llama 8 transformer 8 huggingface-transformers 8 mllm 8 nlp 8 visual-question-answering 7 multimodal-deep-learning 7 vision-language-learning 7 autonomous-driving 7 reinforcement-learning 7 rag 7 transformers 7 image-captioning 7 llama3 7 instruction-tuning 7 gradio 7 gpt-4 6 zero-shot-learning 6 llama2 6 dataset 6 zero-shot 6 colpali 6 multi-modal-learning 6 huggingface 6 vision-language-pretraining 6 natural-language-processing 6 few-shot-learning 6 zero-shot-classification 5 language-model 5 gpt 5 siglip2 5 anomaly-detection 5 image-classification 5 vision 5 medical-imaging 5 retrieval-augmented-generation 5 video-understanding 5 vision-language-transformer 5 contrastive-learning 5 object-detection 5 visual-language-learning 5 multimodal-llm 5 gemini 5 docker 5 trustworthy-ai 4 gpt-4o 4 modality-gap 4 openai 4 information-retrieval 4 agent 4 ocr 4 open-source 4 continual-learning 4 reasoning 4 knowledge-distillation 4 open-vocabulary 4 prompt-engineering 4 stable-diffusion 4 siglip 4 vit 4 gpt4v 4 large-vision-language-model 4 remote-sensing 4 object-hallucination 4 awesome-list 4 large-vision-language-models 4 medical-image-analysis 4 fastapi 3