GitHub topics: vision-language-model
yuanze-lin/Olympus
[CVPR 2025 Highlight] Official code for "Olympus: A Universal Task Router for Computer Vision Tasks"
Language: Python - Size: 3.49 MB - Last synced at: 25 days ago - Pushed at: 25 days ago - Stars: 172 - Forks: 35

KumaarBalbir/droneAnalyzer
FlytBase Assignment - Building Drone Security Analyst Agent for a docked drone that monitors a fixed property daily.
Language: Python - Size: 29.3 MB - Last synced at: 25 days ago - Pushed at: 25 days ago - Stars: 0 - Forks: 0

QwenLM/Qwen-VL
The official repo of Qwen-VL (通义千问-VL) chat & pretrained large vision language model proposed by Alibaba Cloud.
Language: Python - Size: 26 MB - Last synced at: 26 days ago - Pushed at: 9 months ago - Stars: 5,750 - Forks: 439

CraftJarvis/ROCKET-1
Official implementation of paper "ROCKET-1: Mastering Open-World Interaction with Visual-Temporal Context Prompting" (CVPR 2025)
Language: Java - Size: 75.2 MB - Last synced at: 22 days ago - Pushed at: 22 days ago - Stars: 38 - Forks: 0

tongnie/awesome-llm4tr
Exploring the Roles of Large Language Models in Reshaping Transportation Systems: A Survey, Framework, and Roadmap
Size: 1.31 MB - Last synced at: 26 days ago - Pushed at: 26 days ago - Stars: 11 - Forks: 0

ictnlp/LLaVA-Mini
LLaVA-Mini is a unified large multimodal model (LMM) that can support the understanding of images, high-resolution images, and videos in an efficient manner.
Language: Python - Size: 54.6 MB - Last synced at: 26 days ago - Pushed at: 4 months ago - Stars: 441 - Forks: 19

Ruiyang-061X/VL-Uncertainty
🔎Official code for our paper: "VL-Uncertainty: Detecting Hallucination in Large Vision-Language Model via Uncertainty Estimation".
Language: Python - Size: 7.12 MB - Last synced at: 25 days ago - Pushed at: about 2 months ago - Stars: 31 - Forks: 2

neonwatty/meme-search
The open source Meme Search Engine and Finder. Free and built to self-host locally with Python, Ruby, and Docker.
Language: Ruby - Size: 14.8 MB - Last synced at: 27 days ago - Pushed at: about 1 month ago - Stars: 534 - Forks: 23

shikiw/Modality-Integration-Rate
The official code of the paper "Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate".
Language: Python - Size: 17.7 MB - Last synced at: 28 days ago - Pushed at: 5 months ago - Stars: 97 - Forks: 3

aiishwarrya/VisualLanguageModel
A custom Vision-Language Model (VLM) built from scratch, using SigLip for contrastive learning and a ViT-based encoder to generate meaningful image captions and semantic descriptions.
Size: 2.49 MB - Last synced at: 30 days ago - Pushed at: 30 days ago - Stars: 0 - Forks: 0

Zackriya-Solutions/diagram2graph
An AI Vision Language Model System for extracting structured knowledge graph information (JSON) from images of process diagrams
Language: Jupyter Notebook - Size: 15.6 MB - Last synced at: 30 days ago - Pushed at: 30 days ago - Stars: 3 - Forks: 0

kingsdigitallab/kdl-vqa
Python tool for batch visual question answering (BVQA).
Language: Python - Size: 44.2 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 4 - Forks: 0

PRITHIVSAKTHIUR/Hand-Gesture-2-Robot
Hand-Gesture-2-Robot is an image classification vision-language encoder model fine-tuned from google/siglip2-base-patch16-224 for a single-label classification task. It is designed to recognize hand gestures and map them to specific robot commands using the SiglipForImageClassification architecture.
Language: Python - Size: 12.7 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

ShareGPT4Omni/ShareGPT4V
[ECCV 2024] ShareGPT4V: Improving Large Multi-modal Models with Better Captions
Language: Python - Size: 644 KB - Last synced at: 27 days ago - Pushed at: 10 months ago - Stars: 211 - Forks: 5

vbdi/divprune
[CVPR 2025] DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models
Language: Python - Size: 11 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 7 - Forks: 0

YanNeu/DASH-B
Object Hallucination Benchmark for Vision Language Models
Language: Python - Size: 512 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

YanNeu/DASH
DASH: Detection and Assessment of Systematic Hallucinations of VLMs
Language: Python - Size: 4.3 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 2 - Forks: 0

Davidlequnchen/VLM-CADFeatureRecognition
This repository provides code and resources for automating manufacturing feature recognition in CAD designs using vision-language models.
Language: Jupyter Notebook - Size: 166 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

SeoBuAs/Advanced_Anomaly_Detection_in_CCTV_Systems_with_VLM
CCTV Anomaly Detection and Logging System
Size: 0 Bytes - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

reidbarber/webmarker
Mark web pages for use with vision-language models
Language: TypeScript - Size: 677 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 30 - Forks: 3

tian1327/SWAT
[CVPR 2025] Few-shot Recognition via Stage-Wise Retrieval-Augmented Finetuning
Language: Python - Size: 22.3 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 14 - Forks: 1

IDEA-FinAI/ChartMoE
[ICLR2025 Oral] ChartMoE: Mixture of Diversely Aligned Expert Connector for Chart Understanding
Language: Jupyter Notebook - Size: 9.76 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 50 - Forks: 1

PRITHIVSAKTHIUR/Gender-Classifier-Mini
Gender-Classifier-Mini is an image classification vision-language encoder model fine-tuned from google/siglip2-base-patch16-224 for a single-label classification task. It is designed to classify images based on gender using the SiglipForImageClassification architecture.
Language: Python - Size: 0 Bytes - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

tanmaypatil/study-notes
Sample application for searching students' study notes and creating quizzes from them.
Language: Python - Size: 1.17 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

sshh12/multi_token
Embed arbitrary modalities (images, audio, documents, etc) into large language models.
Language: Python - Size: 1.22 MB - Last synced at: about 1 month ago - Pushed at: about 1 year ago - Stars: 183 - Forks: 14

psunlpgroup/VisOnlyQA
This repository contains the code and data for the paper "VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information"
Language: Python - Size: 174 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 22 - Forks: 1

dvlab-research/LSDBench
A benchmark that focuses on the sampling dilemma in long-video tasks. Through well-designed tasks, it evaluates the sampling efficiency of long-video VLMs.
Language: Python - Size: 2.57 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 3 - Forks: 0

YyzHarry/vlm-fairness
[Science Advances] Demographic Bias of Vision-Language Foundation Models in Medical Imaging
Language: Python - Size: 1.3 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 16 - Forks: 3

s-emanuilov/litepali
LitePali is a minimal, efficient implementation of ColPali for image retrieval and indexing, optimized for cloud deployment.
Language: Python - Size: 691 KB - Last synced at: 22 days ago - Pushed at: 7 months ago - Stars: 46 - Forks: 1

FreedomIntelligence/TRIM
We introduce a new approach, Token Reduction using CLIP Metric (TRIM), aimed at improving the efficiency of MLLMs without sacrificing their performance.
Language: Python - Size: 26.9 MB - Last synced at: 4 days ago - Pushed at: 5 months ago - Stars: 12 - Forks: 0

Tanveer81/ReVisionLLM
This is the official implementation of ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos
Language: Python - Size: 5.58 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 11 - Forks: 0

ZQuang2202/PromptGD
PromptGD - A simple baseline for Language-driven Grasp Detection task.
Language: Jupyter Notebook - Size: 1.91 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 5 - Forks: 0

VITA-MLLM/Long-VITA
✨✨Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy
Language: Python - Size: 3.85 MB - Last synced at: about 1 month ago - Pushed at: about 2 months ago - Stars: 265 - Forks: 29

fork123aniket/Multi-Round-VLM-powered-Multimodal-Conversational-AI-Navigation-Bot
Streamlit App Combining Vision, Language, and Audio AI Models
Language: Python - Size: 18.6 KB - Last synced at: 13 days ago - Pushed at: 3 months ago - Stars: 3 - Forks: 0

Albi1999/multi-agent-anpr
Side group project for the Vision and Cognitive Systems Course of the MSc in Data Science @ UniPD 2024/2025
Language: Jupyter Notebook - Size: 431 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

jacobmarks/fiftyone_florence2_plugin
Run SOTA Vision-Language Model Florence-2 on your data!
Language: Jupyter Notebook - Size: 5.86 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 9 - Forks: 1

VoxAct-B/voxactb
VoxAct-B: Voxel-Based Acting and Stabilizing Policy for Bimanual Manipulation (CoRL 2024)
Language: Python - Size: 400 MB - Last synced at: about 1 month ago - Pushed at: 6 months ago - Stars: 39 - Forks: 2

Flame-Code-VLM/Flame-Code-VLM
Flame is an open-source multimodal AI system designed to translate UI design mockups into high-quality React code. It leverages vision-language modeling, automated data synthesis, and structured training workflows to bridge the gap between design and front-end development.
Language: Python - Size: 7.24 MB - Last synced at: about 1 month ago - Pushed at: 2 months ago - Stars: 470 - Forks: 28

jmoral4/aigamer
AIGamer Testbed for the game WarSim (and maybe others). Support for Ollama, Claude, and easily extensible for OpenAI, Gemini
Language: C# - Size: 36.1 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

hustvl/AlphaDrive
Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning
Size: 2.55 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 132 - Forks: 6

icon-lab/MedTrim
Official implementation of "Meta-Entity Driven Triplet Mining for Aligning Medical Vision-Language Models"
Language: Python - Size: 40 KB - Last synced at: 20 days ago - Pushed at: about 2 months ago - Stars: 2 - Forks: 1

Pavansomisetty21/Qwen2-Vision-Finetuning-Unsloth---Maths-OCR-Formulae-Extraction-
We fine-tune an Unsloth Llama model to extract mathematical formulas from images with optical character recognition (OCR).
Language: Jupyter Notebook - Size: 43 KB - Last synced at: about 1 month ago - Pushed at: 4 months ago - Stars: 4 - Forks: 0

Jl-wei/guing
A mobile GUI search engine using a vision-language model
Language: Python - Size: 15.6 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 11 - Forks: 1

2U1/Pixtral-Finetune
An open-source implementation for fine-tuning Pixtral by MistralAI.
Language: Python - Size: 58.6 KB - Last synced at: 24 days ago - Pushed at: 3 months ago - Stars: 13 - Forks: 1

zhengli97/PromptKD
[CVPR 2024] Official PyTorch Code for "PromptKD: Unsupervised Prompt Distillation for Vision-Language Models"
Language: Python - Size: 11.2 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 281 - Forks: 3

yuhui-zh15/AutoConverter
Official implementation of "Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation" (CVPR 2025)
Language: Python - Size: 46.7 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 24 - Forks: 2

sitamgithub-MSIT/aya-vision-litserve
Leverage Aya Vision's capabilities using LitServe.
Language: Python - Size: 1.84 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

NVlabs/prismer
The implementation of "Prismer: A Vision-Language Model with Multi-Task Experts".
Language: Python - Size: 4.25 MB - Last synced at: 26 days ago - Pushed at: over 1 year ago - Stars: 1,310 - Forks: 73

Xza85hrf/Arachne-Picrawler (fork of sunfounder/picrawler)
ChatGPT 4o integration for PiCrawler robot from SunFounder
Language: Python - Size: 110 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

Derekkk/VIVA_EMNLP24
[EMNLP'24] VIVA: A Benchmark for Vision-Grounded Decision-Making with Human Values
Language: Python - Size: 3.28 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 4 - Forks: 0

ybendou/ProKeR
[CVPR 2025] This repository is the official implementation of "ProKeR: A Kernel Perspective on Few-Shot Adaptation of Large Vision-Language Models"
Language: Python - Size: 40.6 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 10 - Forks: 0

ALucek/multimodal-llm-breakdown
Outlining and demonstrating how language models are able to understand image, video, and text content.
Language: Jupyter Notebook - Size: 14.7 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 1 - Forks: 0

anasster/diploma-thesis-repo
Repository for my Diploma Thesis code (Python)
Language: Python - Size: 52.7 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

feuler/openwebui-visual-retrieval
ColQwen2 with a local Vespa DB (deploy and feed) and an Open-Webui retrieval function
Language: Python - Size: 34.2 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 1 - Forks: 0

Traffic-Alpha/VLM-TSC
Language: Python - Size: 703 KB - Last synced at: about 2 months ago - Pushed at: 3 months ago - Stars: 1 - Forks: 0

Pavansomisetty21/Visual-Question-Answering-using-Gemini-LLM
This repository explores visual question answering using Gemini LLM, with the image supplied as a URL or in any other form.
Language: Jupyter Notebook - Size: 6.7 MB - Last synced at: 5 days ago - Pushed at: 4 months ago - Stars: 4 - Forks: 0

shafieiali42/PromptAD-Robustness
Evaluation of PromptAD’s robustness under various image corruptions for few-shot anomaly detection.
Language: Jupyter Notebook - Size: 315 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

ddw2AIGROUP2CQUPT/PA-LLaVA
A Large Language-Vision Assistant for Pathology Image Understanding (BIBM-2024)
Language: Python - Size: 1.27 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 41 - Forks: 3

thomas-yanxin/KarmaVLM
🧘🏻‍♂️ KarmaVLM (相生): A family of highly efficient and powerful visual language models.
Language: Python - Size: 2.68 MB - Last synced at: 5 days ago - Pushed at: about 1 year ago - Stars: 88 - Forks: 3

PRITHIVSAKTHIUR/Agent-Dino
Dino: The Minimalist Multipurpose Chat System
Language: Python - Size: 363 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 1 - Forks: 0

showlab/ShowUI
[CVPR 2025] Open-source, End-to-end, Vision-Language-Action model for GUI Agent & Computer Use.
Language: Python - Size: 27.4 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 1,064 - Forks: 64

hzjian123/VLArena
Closed-loop evaluation for end-to-end VLM autonomous driving agent
Language: Python - Size: 367 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 10 - Forks: 3

eljandoubi/PaliGemma
Coding PaliGemma from scratch using PyTorch for inference.
Language: Python - Size: 1.11 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 1 - Forks: 0

rajibrhasan/modality_gap
A repository for visualization of modality gap in VLMs
Language: Python - Size: 14.6 KB - Last synced at: about 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

ai4ce/LLM4VPR
Can multimodal LLM help visual place recognition?
Language: Python - Size: 7.92 MB - Last synced at: 23 days ago - Pushed at: 10 months ago - Stars: 37 - Forks: 1

Skyline-9/Shotluck-Holmes
[ACM MMGR '24] 🔍 Shotluck Holmes: A family of small-scale LLVMs for shot-level video understanding
Language: Python - Size: 26.3 MB - Last synced at: about 1 month ago - Pushed at: 6 months ago - Stars: 11 - Forks: 0

ChristianLin0420/elsa-vla
A simple and scalable codebase for training and fine-tuning vision-language-action models (VLAs) for generalist robotic manipulation.
Language: Python - Size: 314 KB - Last synced at: about 1 month ago - Pushed at: 2 months ago - Stars: 2 - Forks: 0

rajibrhasan/LLaVA
This repository contains the implementation of a modified LLaVA architecture designed to address information imbalance between modalities in multimodal learning.
Language: Python - Size: 14.8 MB - Last synced at: 26 days ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

Luis355/qw
qw is a lightweight text editor designed for quick and efficient editing tasks. It offers a simple yet powerful interface for users to easily manipulate text files.
Size: 0 Bytes - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

abbasjoyia99/DeepEraseKit
DeepEraseKit is a universal Swift package for iOS and macOS that removes backgrounds in real time while capturing video. Powered by Apple's Vision framework, it supports multiple background options: none, blur, color, and image, making it ideal for virtual backgrounds and augmented reality applications.
Language: Swift - Size: 26.4 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

tyler-romero/seahorse
A small vision language model meant for research
Language: Python - Size: 3.65 MB - Last synced at: about 1 month ago - Pushed at: 7 months ago - Stars: 2 - Forks: 0

hustvl/MaskAdapter
[CVPR 2025] Official repository of the paper "Mask-Adapter: The Devil is in the Masks for Open-Vocabulary Segmentation"
Language: Python - Size: 15.4 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 50 - Forks: 0

FMXExpress/AI-Vision-Chat
Chat with large language models about the contents of an image via this native desktop client for Windows, macOS, and Linux.
Language: Pascal - Size: 3.48 MB - Last synced at: 2 days ago - Pushed at: over 1 year ago - Stars: 20 - Forks: 4

zihaosheng/VLM-RL
VLM-RL: A Unified Vision Language Models and Reinforcement Learning Framework for Safe Autonomous Driving
Language: Python - Size: 861 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 71 - Forks: 5

AstraZeneca/vlm
Official implementation for "Diffusion Instruction Tuning"
Size: 5.57 MB - Last synced at: 2 months ago - Pushed at: 3 months ago - Stars: 14 - Forks: 0

MaxLSB/mini-paligemma2
Minimalist implementation of PaliGemma 2 & PaliGemma VLM from scratch
Language: Python - Size: 4.22 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 2 - Forks: 0

xyz9911/FLAME
[AAAI-25 Oral] Official Implementation of "FLAME: Learning to Navigate with Multimodal LLM in Urban Environments"
Language: Python - Size: 8.57 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 30 - Forks: 3

sitamgithub-MSIT/align-anything-litserve
Leverage Align-DS-V's capabilities using LitServe.
Language: Python - Size: 1.14 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

wusize/CLIPSelf
[ICLR2024 Spotlight] Code Release of CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction
Language: Python - Size: 32 MB - Last synced at: about 1 month ago - Pushed at: about 1 year ago - Stars: 183 - Forks: 9

huangwl18/VoxPoser
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
Language: Python - Size: 7.11 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 637 - Forks: 82

MiZhenxing/ThinkDiff
Code for ThinkDiff
Size: 2.55 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

hasanar1f/PAINT
PAINT (Paying Attention to INformed Tokens) is a plug-and-play framework that intervenes in the self-attention of the LLM and selectively boosts the attention to visually informed tokens to mitigate hallucination in Vision Language Models.
Language: Python - Size: 59.9 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 4 - Forks: 0

spatial-comfort/spatial-comfort.github.io
Official website for "Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference under Ambiguities"
Language: JavaScript - Size: 9.95 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

xuyang-liu16/VGDiffZero
[ICASSP 2024] VGDiffZero: Text-to-image Diffusion Models Can Be Zero-shot Visual Grounders
Language: Python - Size: 1.07 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 14 - Forks: 1

nsidn98/LLaMAR
Code for our paper LLaMAR: LM-based Long-Horizon Planner for Multi-Agent Robotics
Language: Jupyter Notebook - Size: 58.2 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 1 - Forks: 0

ymrohit/openscenesense-ollama
OpenSceneSense Ollama is a Python library that harnesses AI for advanced local video analysis, offering customizable frame and audio insights for dynamic applications in media, education, and content moderation.
Language: Python - Size: 22.7 MB - Last synced at: 23 days ago - Pushed at: 6 months ago - Stars: 18 - Forks: 3

visual-haystacks/mirage
🔥 [ICLR 2025] Official PyTorch Model "Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark"
Language: Python - Size: 11.2 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 8 - Forks: 0

visual-haystacks/vhs_benchmark
🔥 [ICLR 2025] Official Benchmark Toolkits for "Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark"
Language: Python - Size: 5.22 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 24 - Forks: 1

liupei101/VLSA
[ICLR 2025] Interpretable Vision-Language Survival Analysis with Ordinal Inductive Bias for Computational Pathology
Language: Jupyter Notebook - Size: 12.2 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 24 - Forks: 2

SALT-NLP/PopupAttack
Code repo for the paper: Attacking Vision-Language Computer Agents via Pop-ups
Language: Python - Size: 195 MB - Last synced at: 2 months ago - Pushed at: 4 months ago - Stars: 26 - Forks: 1

ivonajdenkoska/tulip
[ICLR 2025] Official code repository for "TULIP: Token-length Upgraded CLIP"
Language: Python - Size: 27.3 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 11 - Forks: 0

Blacksujit/Deep-Learning-Specialization-Repo
This repo contains my neural network learnings with TensorFlow, covering the high-level deep learning concepts I am studying, with project implementations.
Language: Jupyter Notebook - Size: 20.2 MB - Last synced at: about 1 month ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

chenshuang-zhang/imagenet_d
[CVPR 2024 Highlight] ImageNet-D
Language: Python - Size: 49.3 MB - Last synced at: about 1 month ago - Pushed at: 7 months ago - Stars: 41 - Forks: 5

fork123aniket/Agentic-RAG-Story-Generation-with-Multimodal-GenAI
Multimodal Agentic GenAI Workflow – Seamlessly blends retrieval and generation for intelligent storytelling
Language: Python - Size: 94.7 KB - Last synced at: 28 days ago - Pushed at: 3 months ago - Stars: 2 - Forks: 0

astra-vision/LatteCLIP
[WACV 2025] LatteCLIP: Unsupervised CLIP Fine-Tuning via LMM-Synthetic Texts
Language: Jupyter Notebook - Size: 1.86 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 6 - Forks: 0

Pavansomisetty21/Visual-Question-Answering-Pixtral_Vision_Finetuning_Unsloth
We fine-tune the Pixtral-12B-2409 model using Unsloth for visual question answering (an NLP task).
Language: Jupyter Notebook - Size: 379 KB - Last synced at: 2 months ago - Pushed at: 4 months ago - Stars: 2 - Forks: 0

DataFog/vlm-api
REST API for computing cross-modal similarity between images and text using the ColPali vision-language model
Language: Python - Size: 2.53 MB - Last synced at: about 1 month ago - Pushed at: 6 months ago - Stars: 7 - Forks: 1

ys-zong/VLGuard
[ICML 2024] Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models.
Language: Python - Size: 1.97 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 51 - Forks: 2

Nerif-AI/Nerif
LLM-powered Python
Language: Python - Size: 185 KB - Last synced at: 9 days ago - Pushed at: 4 months ago - Stars: 14 - Forks: 5

Theia-4869/FasterVLM
Official code for paper: [CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster.
Language: Python - Size: 28.6 MB - Last synced at: 4 months ago - Pushed at: 5 months ago - Stars: 44 - Forks: 0
