GitHub topics: vision-language-model
haotian-liu/LLaVA
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
Language: Python - Size: 13.4 MB - Last synced at: 34 minutes ago - Pushed at: 9 months ago - Stars: 22,386 - Forks: 2,467

Blaizzy/mlx-vlm
MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX.
Language: Python - Size: 33.8 MB - Last synced at: about 1 hour ago - Pushed at: about 1 hour ago - Stars: 1,232 - Forks: 117
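
A minimal sketch of running local VLM inference with MLX-VLM on Apple silicon. The checkpoint id and image path are illustrative, and the exact function signatures can differ between mlx-vlm releases, so treat the details as assumptions rather than the package's definitive API.

```python
# Minimal sketch, assuming a recent mlx-vlm release; names/argument order may vary.
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"  # illustrative MLX checkpoint
model, processor = load(model_path)
config = load_config(model_path)

images = ["example.jpg"]  # hypothetical local image
prompt = apply_chat_template(processor, config, "Describe this image.", num_images=len(images))

# Return type (plain string vs. result object) depends on the release.
output = generate(model, processor, prompt, images)
print(output)
```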

Blacksujit/Deep-Learning-Specialization-Repo
This repo contains my neural network learning work with TensorFlow, covering the high-level deep learning concepts I am studying, with project implementations.
Language: Jupyter Notebook - Size: 20.2 MB - Last synced at: about 2 hours ago - Pushed at: about 3 hours ago - Stars: 0 - Forks: 0

Rajadhopiya/Gender-Classifier-Mini
Gender-Classifier-Mini is an image classification vision-language encoder model fine-tuned from google/siglip2-base-patch16-224 for a single-label classification task. It is designed to classify images based on gender using the SiglipForImageClassification architecture.
Language: Python - Size: 12.7 KB - Last synced at: about 11 hours ago - Pushed at: about 12 hours ago - Stars: 0 - Forks: 0
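
This entry (and the similar SiglipForImageClassification fine-tunes listed further down) follows the standard transformers image-classification pattern. A minimal sketch, assuming the fine-tuned weights are published as a Hugging Face checkpoint; the model id below is a placeholder (the base SigLIP2 encoder), not the actual classifier.

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, SiglipForImageClassification

# Placeholder id: substitute the actual fine-tuned classifier checkpoint.
model_id = "google/siglip2-base-patch16-224"
processor = AutoImageProcessor.from_pretrained(model_id)
model = SiglipForImageClassification.from_pretrained(model_id)

image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

pred = logits.argmax(dim=-1).item()
print(model.config.id2label.get(pred, pred))
```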

StarlightSearch/EmbedAnything
Production-ready Inference, Ingestion and Indexing built in Rust 🦀
Language: Rust - Size: 36.7 MB - Last synced at: about 14 hours ago - Pushed at: about 14 hours ago - Stars: 551 - Forks: 49

dvlab-research/MGM
Official repo for "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models"
Language: Python - Size: 57.1 MB - Last synced at: about 2 hours ago - Pushed at: about 1 year ago - Stars: 3,272 - Forks: 281

TongUI-agent/TongUI-agent
Release of code, datasets and model for our work TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials
Language: HTML - Size: 3.69 MB - Last synced at: about 19 hours ago - Pushed at: about 20 hours ago - Stars: 7 - Forks: 2

MrAlonso9/Hand-Gesture-2-Robot
Hand-Gesture-2-Robot is an image classification vision-language encoder model fine-tuned from google/siglip2-base-patch16-224 for a single-label classification task. It is designed to recognize hand gestures and map them to specific robot commands using the SiglipForImageClassification architecture.
Language: Python - Size: 12.7 KB - Last synced at: about 23 hours ago - Pushed at: about 24 hours ago - Stars: 0 - Forks: 0

akusayudodograu/Agentic-RAG-Story-Generation-with-Multimodal-GenAI
Multimodal Agentic GenAI Workflow – Seamlessly blends retrieval and generation for intelligent storytelling
Size: 1000 Bytes - Last synced at: about 23 hours ago - Pushed at: 1 day ago - Stars: 8 - Forks: 1

Wisimaji/qwe
QWE is a lightweight and efficient command-line tool designed for quick and easy text manipulation tasks. It offers a variety of functions such as searching, replacing, and formatting text, making it a versatile tool for developers and data analysts alike.
Size: 1000 Bytes - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 0 - Forks: 0

manufactai/finetuning-cookbook
A collection of practical examples and tutorials for fine-tuning large language models using Factory. Includes Docker images, Jupyter notebooks, and utility scripts for easy model training and deployment.
Language: Jupyter Notebook - Size: 1.54 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 1 - Forks: 0

liupei101/VLSA
[ICLR 2025] Interpretable Vision-Language Survival Analysis with Ordinal Inductive Bias for Computational Pathology
Language: Jupyter Notebook - Size: 12.2 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 36 - Forks: 3

Jl-wei/guing
A mobile GUI search engine using a vision-language model
Language: Python - Size: 16 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 12 - Forks: 1

illuin-tech/colpali
The code used to train and run inference with the ColVision models, e.g. ColPali, ColQwen2, and ColSmol.
Language: Python - Size: 796 KB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 1,795 - Forks: 153
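
A minimal sketch of the ColVision retrieval flow (embed document page images and text queries separately, then score with late interaction). Class and method names follow the colpali-engine package; the checkpoint id and file names are illustrative.

```python
import torch
from PIL import Image
from colpali_engine.models import ColPali, ColPaliProcessor

model_name = "vidore/colpali-v1.2"  # illustrative checkpoint
model = ColPali.from_pretrained(model_name).eval()
processor = ColPaliProcessor.from_pretrained(model_name)

pages = [Image.open("page_1.png"), Image.open("page_2.png")]  # document page images
queries = ["What was the 2023 revenue?"]

with torch.no_grad():
    page_emb = model(**processor.process_images(pages))
    query_emb = model(**processor.process_queries(queries))

# Late-interaction (MaxSim) scoring: one relevance score per (query, page) pair.
scores = processor.score_multi_vector(query_emb, page_emb)
print(scores)
```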

yu-rp/apiprompting
[ECCV 2024] API: Attention Prompting on Image for Large Vision-Language Models
Language: Python - Size: 8.63 MB - Last synced at: 2 days ago - Pushed at: 7 months ago - Stars: 87 - Forks: 5

runjtu/vpr-arxiv-daily Fork of Vincentqyw/cv-arxiv-daily
Automatically update Visual Place Recognition papers daily using GitHub Actions (updated every 12 hours)
Language: Python - Size: 25 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 2 - Forks: 0

SuyogKamble/simpleVLM
Building a simple VLM: implementing LLaMA-SmolLM2 from scratch plus the SigLIP2 vision model. KV-caching is supported and implemented from scratch as well.
Language: Jupyter Notebook - Size: 7.33 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0

gokayfem/awesome-vlm-architectures
Famous Vision Language Models and Their Architectures
Language: Markdown - Size: 2.26 MB - Last synced at: 3 days ago - Pushed at: 2 months ago - Stars: 804 - Forks: 42

linzhiqiu/t2v_metrics
Evaluating text-to-image/video/3D models with VQAScore
Language: Python - Size: 197 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 293 - Forks: 21
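
A minimal sketch of scoring a generated image against its prompt with VQAScore; the constructor and call pattern mirror the repo's documented usage, and the model name is one it advertises, so treat the specifics as assumptions.

```python
import t2v_metrics

# Model name follows the repo's examples; smaller backbones may also be available.
vqa_score = t2v_metrics.VQAScore(model="clip-flant5-xxl")

# Higher scores mean the image is judged more faithful to the text prompt.
scores = vqa_score(images=["generated.png"], texts=["a photo of a dog chasing a ball"])
print(scores)
```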

thisisiron/LLaVA-Pool
🌋 A flexible framework for training and configuring Vision-Language Models
Language: Python - Size: 3.09 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 1 - Forks: 0

Event-AHU/Medical_Image_Analysis
Foundation-model-based medical image analysis
Language: Python - Size: 28.3 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 129 - Forks: 4

YanNeu/RePOPE
Relabeling of the POPE benchmark
Language: Python - Size: 10.2 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 4 - Forks: 0

jagennath-hari/SpatialFusion-LM
SpatialFusion-LM is a real-time spatial reasoning framework that combines neural depth, 3D reconstruction, and language-driven scene understanding.
Language: Python - Size: 84.9 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 1 - Forks: 0

AIDC-AI/Parrot
🎉 The code repository for "Parrot: Multilingual Visual Instruction Tuning" in PyTorch.
Language: Python - Size: 25.2 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 36 - Forks: 1

zubair-irshad/Awesome-Robotics-3D
A curated list of 3D Vision papers relating to Robotics domain in the era of large models i.e. LLMs/VLMs, inspired by awesome-computer-vision, including papers, codes, and related websites
Size: 730 KB - Last synced at: 4 days ago - Pushed at: 6 months ago - Stars: 687 - Forks: 35

Gumpest/SparseVLMs
[ICML'25] Official implementation of paper "SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference".
Language: Python - Size: 5.25 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 95 - Forks: 7

PRITHIVSAKTHIUR/SigLIP2-MultiDomain-App
SigLIP2 is a vision-language encoder model fine-tuned from google/siglip2-base-patch16-224
Language: Python - Size: 0 Bytes - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 1 - Forks: 0

2U1/Gemma3-Finetune
An open-source implementation for the Gemma3 series by Google.
Language: Python - Size: 88.9 KB - Last synced at: 1 day ago - Pushed at: 7 days ago - Stars: 22 - Forks: 4

NVlabs/describe-anything
Implementation for Describe Anything: Detailed Localized Image and Video Captioning
Language: Python - Size: 66 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 839 - Forks: 37

KejiaZhang-Robust/VAP
Poison as Cure: Visual Noise for Mitigating Object Hallucinations in LVMs
Language: Python - Size: 33.5 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 22 - Forks: 0

SkalskiP/vlms-zero-to-hero
This series will take you on a journey from the fundamentals of NLP and Computer Vision to the cutting edge of Vision-Language Models.
Language: Jupyter Notebook - Size: 338 KB - Last synced at: 5 days ago - Pushed at: 3 months ago - Stars: 1,062 - Forks: 97

llm-jp/awesome-japanese-llm
Overview of Japanese LLMs (日本語LLMまとめ)
Language: TypeScript - Size: 11.7 MB - Last synced at: 5 days ago - Pushed at: 7 days ago - Stars: 1,154 - Forks: 33

DAILtech/LLaVA-Deploy-Guide
💻 Tutorial for deploying LLaVA (Large Language & Vision Assistant) on Ubuntu + CUDA – step-by-step guide with CLI & web UI.
Language: Python - Size: 167 KB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 11 - Forks: 4

MrGiovanni/RadGPT
AbdomenAtlas 3.0 (9,262 CT volumes + medical reports). These “superhuman” reports are more accurate, detailed, standardized, and generated faster than traditional human-made reports.
Language: Python - Size: 11.5 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 67 - Forks: 1

jingyi0000/R1-VL
R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization
Language: Python - Size: 2.3 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 276 - Forks: 0

jingyi0000/VLM_survey
Collection of AWESOME vision-language models for vision tasks
Size: 370 KB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 2,698 - Forks: 205

2U1/SmolVLM-Finetune
An open-source implementation for fine-tuning SmolVLM.
Language: Python - Size: 85 KB - Last synced at: 1 day ago - Pushed at: 7 days ago - Stars: 26 - Forks: 5

MiniMax-AI/MiniMax-01
The official repo of MiniMax-Text-01 and MiniMax-VL-01, large-language-model & vision-language-model based on Linear Attention
Language: Python - Size: 9.17 MB - Last synced at: 7 days ago - Pushed at: 27 days ago - Stars: 2,565 - Forks: 192

yshinya6/clip-refine
Code repository for "Post-pre-training for Modality Alignment in Vision-Language Foundation Models" (CVPR2025)
Language: Python - Size: 49.8 KB - Last synced at: 1 day ago - Pushed at: 7 days ago - Stars: 12 - Forks: 0

PKU-Alignment/align-anything
Align Anything: Training All-modality Model with Feedback
Language: Jupyter Notebook - Size: 108 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 3,520 - Forks: 408

NVlabs/PS3
Scaling Vision Pre-Training to 4K Resolution
Size: 5.07 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 154 - Forks: 6

2U1/Llama3.2-Vision-Finetune
An open-source implementation for fine-tuning the Llama3.2-Vision series by Meta.
Language: Python - Size: 89.8 KB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 155 - Forks: 23

IrohXu/Awesome-Multimodal-LLM-Autonomous-Driving
[WACV 2024 Survey Paper] Multimodal Large Language Models for Autonomous Driving
Size: 15 MB - Last synced at: 4 days ago - Pushed at: about 1 year ago - Stars: 286 - Forks: 12

AIDC-AI/Ovis
A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings.
Language: Python - Size: 5.56 MB - Last synced at: 7 days ago - Pushed at: about 1 month ago - Stars: 899 - Forks: 56

dvlab-research/VisionZip
Official repository for VisionZip (CVPR 2025)
Language: Python - Size: 18.2 MB - Last synced at: 4 days ago - Pushed at: 2 months ago - Stars: 274 - Forks: 12

eliaskempf/ideal_words
A PyTorch implementation of ideal word computation.
Language: Python - Size: 48.8 KB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 4 - Forks: 0

katha-ai/VELOCITI
VELOCITI Benchmark Evaluation and Visualisation Code
Language: Python - Size: 186 KB - Last synced at: 8 days ago - Pushed at: 18 days ago - Stars: 6 - Forks: 0

ApplyU-ai/ColorBlindnessEval
ColorBlindnessEval: Can Vision Language Models Pass Color Blindness Tests?
Size: 4.18 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 2 - Forks: 0

jusiro/DLILP
[IPMI'25] A reality check of vision-language pre-training for radiology. DLILP: a disentangled language-image-label pre-training criterion for VLMs.
Language: Python - Size: 107 KB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 1 - Forks: 0

jusiro/CLAP
[CVPR'24] Validation-free few-shot adaptation of CLIP, using a well-initialized Linear Probe (ZSLP) and class-adaptive constraints (CLAP).
Language: Python - Size: 1.46 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 71 - Forks: 3

mvish7/AlignVLM
This repository contains an implementation of the AlignVLM paper, which proposes a novel method for vision-language alignment.
Language: Python - Size: 14.6 KB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 0 - Forks: 0

zjlucam/VisionAssistant
Parameter Efficient Multi-Model Vision Assistant for Polymer Solvation Behaviour Inference
Language: Python - Size: 184 KB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 4 - Forks: 1

X-iZhang/Libra
Code for the paper "Libra: Leveraging Temporal Images for Biomedical Radiology Analysis"
Language: Python - Size: 13.4 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 4 - Forks: 1

jingyaogong/minimind-v
🚀 Train a 26M-parameter vision-language multimodal VLM from scratch in just 1 hour! 🌏
Language: Python - Size: 21.4 MB - Last synced at: 9 days ago - Pushed at: 10 days ago - Stars: 3,209 - Forks: 305

zytx121/Awesome-VLGFM
A Survey on Vision-Language Geo-Foundation Models (VLGFMs)
Size: 3.33 MB - Last synced at: 7 days ago - Pushed at: 3 months ago - Stars: 164 - Forks: 8

PJLab-ADG/awesome-knowledge-driven-AD
A curated list of awesome knowledge-driven autonomous driving (continually updated)
Size: 912 KB - Last synced at: 5 days ago - Pushed at: 11 months ago - Stars: 462 - Forks: 24

2U1/Qwen2-VL-Finetune
An open-source implementation for fine-tuning the Qwen2-VL and Qwen2.5-VL series by Alibaba Cloud.
Language: Python - Size: 157 KB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 668 - Forks: 79
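
After fine-tuning with a repo like this, the resulting checkpoint can typically still be loaded through the standard transformers Qwen2-VL classes. A minimal inference sketch, assuming the stock `Qwen/Qwen2-VL-2B-Instruct` layout (swap in your fine-tuned output directory) and the optional `qwen_vl_utils` helper.

```python
import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # helper published alongside Qwen2-VL

model_id = "Qwen/Qwen2-VL-2B-Instruct"  # or the fine-tuned output directory
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{"role": "user", "content": [
    {"type": "image", "image": "example.jpg"},
    {"type": "text", "text": "Describe this image."},
]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:],
                             skip_special_tokens=True)[0])
```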

InternLM/InternLM-XComposer
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
Language: Python - Size: 199 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 2,815 - Forks: 173

2U1/Phi3-Vision-Finetune Fork of GaiZhenbiao/Phi3V-Finetuning
An open-source implementation for fine-tuning Phi3-Vision and Phi3.5-Vision by Microsoft.
Language: Python - Size: 926 KB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 92 - Forks: 16

dheeraj7000/reflect
🔍 Reflect - Personal Journal & Sentiment Analysis App
Language: Python - Size: 15.6 KB - Last synced at: 12 days ago - Pushed at: 13 days ago - Stars: 0 - Forks: 0

zhengli97/Awesome-Prompt-Adapter-Learning-for-VLMs
A curated list of awesome prompt/adapter learning methods for vision-language models like CLIP.
Size: 189 KB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 509 - Forks: 22

ammarlodhi255/Chest-xray-report-generation-app-with-chatbot-end-to-end-implementation
AI-powered Chest X-ray report generation app using VLM (Swin-T5) and LLM (LLaMA-3) for multilingual Q&A and medical education support.
Language: Jupyter Notebook - Size: 25.1 MB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 0 - Forks: 0

HVision-NKU/MaskCLIPpp
Official repository of the paper "High-Quality Mask Tuning Matters for Open-Vocabulary Segmentation"
Language: Python - Size: 17.8 MB - Last synced at: 10 days ago - Pushed at: about 1 month ago - Stars: 24 - Forks: 1

OpenGVLab/InternVL
[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal dialogue model approaching GPT-4o performance.
Language: Python - Size: 38.5 MB - Last synced at: 13 days ago - Pushed at: 19 days ago - Stars: 7,817 - Forks: 591

taco-group/LangCoop
Official implementation of LangCoop: Collaborative Driving with Natural Language
Language: Python - Size: 53.9 MB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 7 - Forks: 0

FoundationVision/Groma
[ECCV2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization
Language: Python - Size: 13.5 MB - Last synced at: 7 days ago - Pushed at: 11 months ago - Stars: 563 - Forks: 43

OpenDriveLab/ELM
[ECCV 2024] Embodied Understanding of Driving Scenarios
Language: Python - Size: 5.36 MB - Last synced at: 5 days ago - Pushed at: 4 months ago - Stars: 191 - Forks: 15

shreydan/simpleVLM
Building a simple VLM: implementing LLaMA-SmolLM2 from scratch plus the SigLIP2 vision model. KV-caching is supported and implemented from scratch as well.
Language: Jupyter Notebook - Size: 7.33 MB - Last synced at: 13 days ago - Pushed at: 16 days ago - Stars: 2 - Forks: 0

princeton-nlp/CharXiv
[NeurIPS 2024] CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
Language: Python - Size: 831 KB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 108 - Forks: 12

yasserben/FLOSS
FLOSS: Plug-in Training-free and label-free text template selection that boosts OVSS methods
Language: Python - Size: 4.22 MB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 17 - Forks: 1

taco-group/Re-Align
A novel alignment framework that leverages image retrieval to mitigate hallucinations in Vision Language Models.
Language: Python - Size: 18.6 MB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 40 - Forks: 1

X-iZhang/RRG-BioNLP-ACL2024
Code for the paper "Gla-AI4BioMed at RRG24: Visual Instruction-tuned Adaptation for Radiology Report Generation" (BioNLP ACL'24)
Size: 581 KB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 0 - Forks: 1

billbillbilly/urbanworm
Urban-Worm is a Python library that integrates remote sensing imagery, street view data, and multimodal models to assess environments and urban units.
Language: Python - Size: 6.84 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 0 - Forks: 3

OpenGVLab/Multi-Modality-Arena
Chatbot Arena meets multi-modality! Multi-Modality Arena allows you to benchmark vision-language models side-by-side while providing images as inputs. Supports MiniGPT-4, LLaMA-Adapter V2, LLaVA, BLIP-2, and many more!
Language: Python - Size: 21.5 MB - Last synced at: 16 days ago - Pushed at: about 1 year ago - Stars: 517 - Forks: 38

shikiw/OPERA
[CVPR 2024 Highlight] OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation
Language: Python - Size: 15.7 MB - Last synced at: 15 days ago - Pushed at: 9 months ago - Stars: 332 - Forks: 28

ndurner/oai_chat
Multi-modal Chatbot based on OpenAI
Language: Python - Size: 128 KB - Last synced at: 17 days ago - Pushed at: 17 days ago - Stars: 4 - Forks: 0

mixpeek/awesome-multimodal-search
A collection of multimodal search libraries, services, and research papers
Size: 3.12 MB - Last synced at: 18 days ago - Pushed at: 18 days ago - Stars: 7 - Forks: 0

PRITHIVSAKTHIUR/Image-Captioning-Florence2
This application utilizes the powerful Florence-2 vision-language model from Microsoft to generate comprehensive captions for images. The model is capable of understanding visual content and expressing it in natural language.
Language: Python - Size: 8.79 KB - Last synced at: 18 days ago - Pushed at: 18 days ago - Stars: 0 - Forks: 0
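
A minimal sketch of the Florence-2 captioning step behind an app like this; the model id and task token follow the public Florence-2 model cards, but the details (variant, prompt, image path) are illustrative.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-base"  # illustrative; the app may use the large variant
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("example.jpg").convert("RGB")
task = "<DETAILED_CAPTION>"  # Florence-2 selects tasks via special prompt tokens
inputs = processor(text=task, images=image, return_tensors="pt")

with torch.no_grad():
    ids = model.generate(input_ids=inputs["input_ids"],
                         pixel_values=inputs["pixel_values"],
                         max_new_tokens=256)

raw = processor.batch_decode(ids, skip_special_tokens=False)[0]
print(processor.post_process_generation(raw, task=task, image_size=image.size))
```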

sherlockchou86/PyLangPipe
A simple, lightweight large language model pipeline framework.
Language: Python - Size: 790 KB - Last synced at: 18 days ago - Pushed at: 19 days ago - Stars: 16 - Forks: 2

tsunghan-wu/reverse_vlm
🔥 Official implementation of "Generate, but Verify: Reducing Visual Hallucination in Vision-Language Models with Retrospective Resampling"
Language: Python - Size: 380 KB - Last synced at: 18 days ago - Pushed at: 19 days ago - Stars: 3 - Forks: 0

LAMDA-CL/PROOF
Learning without Forgetting for Vision-Language Models (TPAMI 2025)
Language: Python - Size: 581 KB - Last synced at: 18 days ago - Pushed at: 2 months ago - Stars: 34 - Forks: 2

RaptorMai/MLLM-CompBench
[NeurIPS'25] MLLM-CompBench evaluates the comparative reasoning of MLLMs with 40K image pairs and questions across 8 dimensions of relative comparison: visual attribute, existence, state, emotion, temporality, spatiality, quantity, and quality. CompBench covers diverse visual domains, including animals, fashion, sports, and scenes.
Language: Jupyter Notebook - Size: 10.7 MB - Last synced at: 18 days ago - Pushed at: 19 days ago - Stars: 38 - Forks: 2

illuin-tech/vidore-benchmark
Vision Document Retrieval (ViDoRe): Benchmark. Evaluation code for the ColPali paper.
Language: Python - Size: 2.97 MB - Last synced at: 17 days ago - Pushed at: 26 days ago - Stars: 197 - Forks: 24

nhussein/promptsmooth
Official implementation of the paper "PromptSmooth: Certifying Robustness of Medical Vision-Language Models via Prompt Learning"
Language: Python - Size: 6.85 MB - Last synced at: 18 days ago - Pushed at: 19 days ago - Stars: 22 - Forks: 1

miccunifi/Cross-the-Gap
[ICLR 2025] - Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion
Language: Python - Size: 23.2 MB - Last synced at: 19 days ago - Pushed at: 19 days ago - Stars: 36 - Forks: 0

LAMDA-CL/LAMDA-PILOT
🎉 PILOT: A Pre-trained Model-Based Continual Learning Toolbox
Language: Python - Size: 7.13 MB - Last synced at: 18 days ago - Pushed at: about 1 month ago - Stars: 407 - Forks: 45

astra-vision/ProLIP
An extremely simple method for validation-free efficient adaptation of CLIP-like VLMs that is robust to the learning rate.
Language: Shell - Size: 3.9 MB - Last synced at: 19 days ago - Pushed at: 20 days ago - Stars: 24 - Forks: 2

wadeKeith/Awesome-Embodied-AI
An Introduction to Embodied Intelligence (A Quick Guide to Embodied AI) (updating)
Size: 2.2 MB - Last synced at: 19 days ago - Pushed at: about 1 month ago - Stars: 85 - Forks: 6

tonywu71/colpali-cookbooks
Recipes for learning, fine-tuning, and adapting ColPali to your multimodal RAG use cases. 👨🏻‍🍳
Size: 10.4 MB - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 269 - Forks: 17

can-can-ya/QPMIL-VL
✨ [AAAI 2025] Queryable Prototype Multiple Instance Learning with Vision-Language Models for Incremental Whole Slide Image Classification
Language: Python - Size: 2.42 MB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 39 - Forks: 1

anishmadan23/foundational_fsod
This repository contains the implementation for the paper "Revisiting Few Shot Object Detection with Vision-Language Models"
Language: Python - Size: 12.2 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 59 - Forks: 3

iAnisdev/assistive-ai
Zero-shot object detection system for visually impaired users using CLIP, OWL-ViT, and real-time audio feedback.
Size: 3.91 KB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 0 - Forks: 0
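
Zero-shot detection of arbitrary text-described objects, as used in projects like this one, is commonly done with OWL-ViT through transformers. A minimal sketch with illustrative text queries and image path.

```python
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("street.jpg").convert("RGB")
queries = [["a traffic light", "a crosswalk", "a car"]]  # free-form text classes
inputs = processor(text=queries, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
detections = processor.post_process_object_detection(
    outputs, threshold=0.3, target_sizes=target_sizes
)[0]
for score, label, box in zip(detections["scores"], detections["labels"], detections["boxes"]):
    print(queries[0][label], round(score.item(), 3), [round(v, 1) for v in box.tolist()])
```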

Pavansomisetty21/Image-Caption-Generation-using-LLMs-GEMINI-
We generate captions for user-provided images using prompt engineering and generative AI.
Language: Jupyter Notebook - Size: 366 KB - Last synced at: 6 days ago - Pushed at: 9 months ago - Stars: 9 - Forks: 1

corentin-ryr/MultiMedEval
A Python tool to evaluate the performance of VLMs in the medical domain.
Language: Python - Size: 10.9 MB - Last synced at: 18 days ago - Pushed at: 18 days ago - Stars: 59 - Forks: 3

PKU-YuanGroup/Chat-UniVi
[CVPR 2024 Highlight🔥] Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
Language: Python - Size: 38.2 MB - Last synced at: 22 days ago - Pushed at: 7 months ago - Stars: 931 - Forks: 46

mbzuai-oryx/VideoGLaMM
[CVPR 2025 🔥]A Large Multimodal Model for Pixel-Level Visual Grounding in Videos
Language: Python - Size: 40 MB - Last synced at: 22 days ago - Pushed at: 23 days ago - Stars: 57 - Forks: 1

deepseek-ai/DeepSeek-VL
DeepSeek-VL: Towards Real-World Vision-Language Understanding
Language: Python - Size: 12.2 MB - Last synced at: 23 days ago - Pushed at: about 1 year ago - Stars: 3,769 - Forks: 558

OpenGVLab/MM-NIAH
[NeurIPS 2024] Needle In A Multimodal Haystack (MM-NIAH): A comprehensive benchmark designed to systematically evaluate the capability of existing MLLMs to comprehend long multimodal documents.
Language: Python - Size: 2.83 MB - Last synced at: 19 days ago - Pushed at: 5 months ago - Stars: 115 - Forks: 6

sitamgithub-MSIT/TextSnap
TextSnap: Demo for Florence 2 model used in OCR tasks to extract and visualize text from images.
Language: Python - Size: 3.34 MB - Last synced at: 24 days ago - Pushed at: 24 days ago - Stars: 4 - Forks: 2

mbodiai/embodied-agents
Seamlessly integrate state-of-the-art transformer models into robotics stacks
Language: Python - Size: 75.3 MB - Last synced at: 22 days ago - Pushed at: 22 days ago - Stars: 200 - Forks: 22
