Topic: "vision-language-learning"
AIDC-AI/Ovis
A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings.
Language: Python - Size: 5.56 MB - Last synced at: 12 days ago - Pushed at: about 1 month ago - Stars: 886 - Forks: 57

RLHF-V/RLAIF-V
[CVPR'25] RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness
Language: Python - Size: 60 MB - Last synced at: 24 days ago - Pushed at: about 2 months ago - Stars: 335 - Forks: 13

shikiw/OPERA
[CVPR 2024 Highlight] OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation
Language: Python - Size: 15.7 MB - Last synced at: 5 days ago - Pushed at: 8 months ago - Stars: 332 - Forks: 28

shikiw/Modality-Integration-Rate
The official code of the paper "Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate".
Language: Python - Size: 17.7 MB - Last synced at: 19 days ago - Pushed at: 5 months ago - Stars: 97 - Forks: 3

YunzeMan/Situation3D
[CVPR 2024] Situational Awareness Matters in 3D Vision Language Reasoning
Language: Python - Size: 63.3 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 28 - Forks: 2

LooperXX/ManagerTower
Code for ACL 2023 Oral Paper: ManagerTower: Aggregating the Insights of Uni-Modal Experts for Vision-Language Representation Learning
Language: Python - Size: 6.71 MB - Last synced at: 12 months ago - Pushed at: over 1 year ago - Stars: 9 - Forks: 1

yubin1219/CrossVLT
Cross-aware Early Fusion with Stage-divided Vision and Language Transformer Encoders for Referring Image Segmentation (Published in IEEE TMM 2023)
Language: Python - Size: 5.29 MB - Last synced at: 8 months ago - Pushed at: 9 months ago - Stars: 5 - Forks: 0

fork123aniket/Multi-Round-VLM-powered-Multimodal-Conversational-AI-Navigation-Bot
Streamlit App Combining Vision, Language, and Audio AI Models
Language: Python - Size: 18.6 KB - Last synced at: 4 days ago - Pushed at: 3 months ago - Stars: 3 - Forks: 0

lyuchenyang/Dialogue-to-Video-Retrieval
Code for ECIR 2023 paper "Dialogue-to-Video Retrieval"
Language: Python - Size: 34.2 KB - Last synced at: 20 days ago - Pushed at: almost 2 years ago - Stars: 3 - Forks: 1

fork123aniket/Agentic-RAG-Story-Generation-with-Multimodal-GenAI
Multimodal Agentic GenAI Workflow – Seamlessly blends retrieval and generation for intelligent storytelling
Language: Python - Size: 94.7 KB - Last synced at: 19 days ago - Pushed at: 3 months ago - Stars: 2 - Forks: 0

Ravi-Teja-konda/TunedLlavaDelights
Explore the rich flavors of Indian desserts with TunedLlavaDelights. Using Llava fine-tuning, our project unveils detailed nutritional profiles, taste notes, and optimal consumption times for beloved sweets. Dive into a fusion of AI innovation and culinary tradition.
Language: Python - Size: 43.3 MB - Last synced at: about 2 months ago - Pushed at: about 1 year ago - Stars: 1 - Forks: 0

abhinav-neil/socratic-models (fork of milenakapralova/socraticmodels)
Socratic models for multimodal reasoning & image captioning
Language: Jupyter Notebook - Size: 48.8 MB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 0 - Forks: 0
