An open API service providing repository metadata for many open source software ecosystems.

GitHub topics: vision-language-transformer

salesforce/LAVIS

LAVIS - A One-stop Library for Language-Vision Intelligence

Language: Jupyter Notebook - Size: 79.3 MB - Last synced at: about 4 hours ago - Pushed at: 6 months ago - Stars: 10,534 - Forks: 1,026

IDEA-Research/GroundingDINO

[ECCV 2024] Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"

Language: Python - Size: 12.5 MB - Last synced at: 4 days ago - Pushed at: 9 months ago - Stars: 7,994 - Forks: 800

akusayudodograu/Agentic-RAG-Story-Generation-with-Multimodal-GenAI

Multimodal Agentic GenAI Workflow – Seamlessly blends retrieval and generation for intelligent storytelling

Size: 1000 Bytes - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 8 - Forks: 1

AlibabaResearch/AdvancedLiterateMachinery

A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.

Language: C++ - Size: 104 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1,685 - Forks: 191

salesforce/BLIP

PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

Language: Jupyter Notebook - Size: 6.34 MB - Last synced at: about 1 month ago - Pushed at: 9 months ago - Stars: 5,165 - Forks: 681

henghuiding/ReLA

[CVPR2023 Highlight] GRES: Generalized Referring Expression Segmentation

Language: Python - Size: 2.06 MB - Last synced at: about 1 month ago - Pushed at: over 1 year ago - Stars: 693 - Forks: 19

fork123aniket/Multi-Round-VLM-powered-Multimodal-Conversational-AI-Navigation-Bot

Streamlit App Combining Vision, Language, and Audio AI Models

Language: Python - Size: 18.6 KB - Last synced at: 20 days ago - Pushed at: 3 months ago - Stars: 3 - Forks: 0

henghuiding/Vision-Language-Transformer

[ICCV2021 & TPAMI2023] Vision-Language Transformer and Query Generation for Referring Segmentation

Language: Python - Size: 322 KB - Last synced at: about 1 month ago - Pushed at: over 3 years ago - Stars: 352 - Forks: 23

ThomasVonWu/Awesome-VLMs-Strawberry

A collection of VLMs papers, blogs, and projects, with a focus on VLMs in Autonomous Driving and related reasoning techniques.

Size: 760 KB - Last synced at: 2 days ago - Pushed at: 6 months ago - Stars: 10 - Forks: 1

fork123aniket/Agentic-RAG-Story-Generation-with-Multimodal-GenAI

Multimodal Agentic GenAI Workflow – Seamlessly blends retrieval and generation for intelligent storytelling

Language: Python - Size: 94.7 KB - Last synced at: about 1 month ago - Pushed at: 3 months ago - Stars: 2 - Forks: 0

sdc17/CrossGET

[ICML 2024] CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers.

Language: Python - Size: 11.6 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 26 - Forks: 0

shenyunhang/APE

[CVPR 2024] Aligning and Prompting Everything All at Once for Universal Visual Perception

Language: Python - Size: 49.3 MB - Last synced at: 5 months ago - Pushed at: about 1 year ago - Stars: 490 - Forks: 29

haoliuhl/instructrl

Instruction Following Agents with Multimodal Transformers

Language: Python - Size: 191 KB - Last synced at: about 1 month ago - Pushed at: over 2 years ago - Stars: 52 - Forks: 5

PrateekJannu/Vision-GPT

Coding a Multi-Modal vision model like GPT-4o from scratch, inspired by @hkproj and PaliGemma

Language: Python - Size: 591 KB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 1 - Forks: 0

unitaryai/VTC

VTC: Improving Video-Text Retrieval with User Comments

Language: Python - Size: 5.45 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 11 - Forks: 0

sMamooler/CLIP_Explainability

Code for studying the explainability of OpenAI's CLIP

Language: Jupyter Notebook - Size: 470 MB - Last synced at: 10 months ago - Pushed at: over 3 years ago - Stars: 23 - Forks: 3

yiren-jian/BLIText

[NeurIPS 2023] Bootstrapping Vision-Language Learning with Decoupled Language Pre-training

Language: Python - Size: 34.4 MB - Last synced at: about 1 year ago - Pushed at: over 1 year ago - Stars: 17 - Forks: 1

atharva-naik/MMML-TermProject-VizWiz-VQA-Challenge

VizWiz Challenge Term Project for Multi Modal Machine Learning @ CMU (11777)

Language: Python - Size: 90.6 MB - Last synced at: about 1 month ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

marialymperaiou/knowledge-enhanced-multimodal-learning

A list of research papers on knowledge-enhanced multimodal learning

Size: 20.5 KB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 5 - Forks: 0

Related Keywords
vision-language-transformer (19), multimodal-deep-learning (8), vision-language (8), multimodal-learning (6), vision-language-model (5), multimodal (4), visual-question-answering (4), internvl2 (3), multimodal-large-language-models (3), generative-ai (3), visual-reasoning (3), computer-vision (3), vision-language-pretraining (3), machine-learning (3), transformer (3), image-captioning (3), image-text-retrieval (3), artificial-intelligence (2), vision-and-language-pre-training (2), story-generation (2), llm (2), multimodal-data (2), generative-ai-model (2), agentic-workflow (2), agentic-rag (2), agentic-ai (2), open-world (2), object-detection (2), open-source (2), vision-and-language (2), referring-expression-comprehension (2), vision-language-learning (2), comments (1), vision-transformer (1), video-text-retrieval (1), video-understanding (1), gradcam-visualization (1), model-explainability (1), transformer-models (1), transformer-architecture (1), large-language-models (1), large-language-model (1), gpt-4o (1), google (1), gemini (1), reinforcement-learning (1), jax (1), instructions (1), instruction-following (1), referring-expression-segmentation (1), visual-storytelling (1), visual-grounding (1), visual-dialog (1), visual-commonsense-reasoning (1), vision-and-language-navigation (1), story-visualization (1), multimodal-retrieval (1), multi-task-learning (1), knowledge-graph (1), knowledge-enhanced-vision-language (1), knowledge-enhanced-multimodal-learning (1), image-text-matching (1), conditional-image-generation (1), vizwiz-vqa (1), vizwiz (1), term-project (1), question-answering (1), pytorch (1), opencv (1), open-source-project (1), natural-language-processing (1), image-processing (1), carnegie-mellon-university (1), openai-clip (1), cvpr2023 (1), text-recognition (1), text-detection (1), scene-text-recognition (1), scene-text-detection-recognition (1), scene-text-detection (1), ocr (1), end-to-end-ocr (1), documentai (1), document-understanding (1), document-recognition (1), document-intelligence (1), document-analysis (1), document (1), open-world-detection (1), visual-question-anwsering (1), vision-framework (1), salesforce (1), multimodal-datasets (1), deep-learning-library (1), deep-learning (1), flax (1), image-segmentation (1), token-matching (1), token-ensemble (1), text-image-retrieval (1)