Topic: "vision-and-language"
aishwaryanr/awesome-generative-ai-guide
A one-stop repository for generative AI research updates, interview resources, notebooks, and much more!
Size: 29.3 MB - Last synced at: 6 days ago - Pushed at: about 1 month ago - Stars: 11,883 - Forks: 2,457

salesforce/LAVIS
LAVIS - A One-stop Library for Language-Vision Intelligence
Language: Jupyter Notebook - Size: 79.3 MB - Last synced at: 3 days ago - Pushed at: 6 months ago - Stars: 10,509 - Forks: 1,022
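A minimal usage sketch of LAVIS's model-loading interface, closely following the project's README; the image path "example.jpg" is a placeholder.

    import torch
    from PIL import Image
    from lavis.models import load_model_and_preprocess

    # Use a GPU if available; CPU works too, just more slowly.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Load a BLIP captioning model together with its matching image preprocessor.
    model, vis_processors, _ = load_model_and_preprocess(
        name="blip_caption", model_type="base_coco", is_eval=True, device=device
    )

    # "example.jpg" is a placeholder path to any local image.
    raw_image = Image.open("example.jpg").convert("RGB")
    image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

    # Generate and print a caption for the image.
    print(model.generate({"image": image}))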

roboflow/maestro
Streamline the fine-tuning process for multimodal models: PaliGemma 2, Florence-2, and Qwen2.5-VL
Language: Python - Size: 10.6 MB - Last synced at: 5 days ago - Pushed at: 11 days ago - Stars: 2,551 - Forks: 203

om-ai-lab/OmAgent
Build multimodal language agents for fast prototyping and production
Language: Python - Size: 11.4 MB - Last synced at: 21 days ago - Pushed at: about 2 months ago - Stars: 2,465 - Forks: 271

salesforce/ALBEF
Code for ALBEF: a new vision-language pre-training method
Language: Python - Size: 69.9 MB - Last synced at: 25 days ago - Pushed at: over 2 years ago - Stars: 1,627 - Forks: 205

open-mmlab/Multimodal-GPT
Multimodal-GPT
Language: Python - Size: 109 KB - Last synced at: 3 days ago - Pushed at: almost 2 years ago - Stars: 1,498 - Forks: 130

dandelin/ViLT
Code for the ICML 2021 (long talk) paper: "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision"
Language: Python - Size: 1.38 MB - Last synced at: 25 days ago - Pushed at: about 1 year ago - Stars: 1,448 - Forks: 216

om-ai-lab/OmDet
Real-time and accurate open-vocabulary end-to-end object detection
Language: Python - Size: 9.75 MB - Last synced at: 20 days ago - Pushed at: 5 months ago - Stars: 1,313 - Forks: 111

NVlabs/prismer
The implementation of "Prismer: A Vision-Language Model with Multi-Task Experts".
Language: Python - Size: 4.25 MB - Last synced at: 25 days ago - Pushed at: over 1 year ago - Stars: 1,310 - Forks: 73

llm-jp/awesome-japanese-llm
Overview of Japanese LLMs (日本語LLMまとめ)
Language: TypeScript - Size: 11.7 MB - Last synced at: 2 days ago - Pushed at: 4 days ago - Stars: 1,154 - Forks: 33

yuewang-cuhk/awesome-vision-language-pretraining-papers
Recent Advances in Vision and Language PreTrained Models (VL-PTMs)
Size: 104 KB - Last synced at: 15 days ago - Pushed at: over 2 years ago - Stars: 1,152 - Forks: 104

microsoft/Oscar 📦
Oscar and VinVL
Language: Python - Size: 715 KB - Last synced at: 3 days ago - Pushed at: over 1 year ago - Stars: 1,048 - Forks: 252

rhymes-ai/Aria
Codebase for Aria - an Open Multimodal Native MoE
Language: Jupyter Notebook - Size: 120 MB - Last synced at: 2 months ago - Pushed at: 3 months ago - Stars: 995 - Forks: 83

OFA-Sys/ONE-PEACE
A general representation model across vision, audio, and language modalities. Paper: ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
Language: Python - Size: 29.9 MB - Last synced at: 5 months ago - Pushed at: 7 months ago - Stars: 977 - Forks: 63

YehLi/xmodaler
X-modaler is a versatile and high-performance codebase for cross-modal analytics (e.g., image captioning, video captioning, vision-language pre-training, visual question answering, visual commonsense reasoning, and cross-modal retrieval).
Language: Python - Size: 12.2 MB - Last synced at: 21 days ago - Pushed at: about 2 years ago - Stars: 970 - Forks: 105

26hzhang/DL-NLP-Readings
My Reading Lists of Deep Learning and Natural Language Processing
Language: TeX - Size: 364 MB - Last synced at: about 1 month ago - Pushed at: about 3 years ago - Stars: 850 - Forks: 264

SunzeY/AlphaCLIP
[CVPR 2024] Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
Language: Jupyter Notebook - Size: 173 MB - Last synced at: 21 days ago - Pushed at: 9 months ago - Stars: 802 - Forks: 55

ChenRocks/UNITER
Research code for ECCV 2020 paper "UNITER: UNiversal Image-TExt Representation Learning"
Language: Python - Size: 172 KB - Last synced at: 5 months ago - Pushed at: almost 4 years ago - Stars: 786 - Forks: 109

OpenRobotLab/PointLLM
[ECCV 2024 Best Paper Candidate] PointLLM: Empowering Large Language Models to Understand Point Clouds
Language: Python - Size: 3.42 MB - Last synced at: 3 days ago - Pushed at: 12 days ago - Stars: 774 - Forks: 37

jackroos/VL-BERT
Code for ICLR 2020 paper "VL-BERT: Pre-training of Generic Visual-Linguistic Representations".
Language: Jupyter Notebook - Size: 5.41 MB - Last synced at: about 1 month ago - Pushed at: almost 2 years ago - Stars: 740 - Forks: 111

jayleicn/ClipBERT
[CVPR 2021 Best Student Paper Honorable Mention, Oral] Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning on image-text and video-text tasks.
Language: Python - Size: 73.2 KB - Last synced at: 29 days ago - Pushed at: over 1 year ago - Stars: 718 - Forks: 86

SkalskiP/top-cvpr-2024-papers
This repository is a curated collection of the most exciting and influential CVPR 2024 papers. 🔥 [Paper + Code + Demo]
Language: Python - Size: 58.6 KB - Last synced at: 1 day ago - Pushed at: 10 months ago - Stars: 715 - Forks: 59

NVlabs/DoRA
[ICML2024 (Oral)] Official PyTorch implementation of DoRA: Weight-Decomposed Low-Rank Adaptation
Language: Python - Size: 3.06 MB - Last synced at: 4 months ago - Pushed at: 7 months ago - Stars: 668 - Forks: 44
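Beyond this reference implementation, DoRA is also exposed in Hugging Face PEFT through the use_dora flag on LoraConfig; a hedged sketch under that assumption (the base model and target modules are illustrative choices, not taken from this repo).

    # Assumes Hugging Face PEFT >= 0.9, where LoraConfig exposes use_dora.
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    # Placeholder base model; any causal LM with known target modules works.
    base = AutoModelForCausalLM.from_pretrained("gpt2")

    config = LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=["c_attn"],  # GPT-2's fused attention projection
        lora_dropout=0.05,
        use_dora=True,              # decompose weights into magnitude + direction (DoRA)
    )

    model = get_peft_model(base, config)
    model.print_trainable_parameters()  # only the adapter parameters are trainable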

SkalskiP/top-cvpr-2023-papers
This repository is a curated collection of the most exciting and influential CVPR 2023 papers. 🔥 [Paper + Code]
Language: Python - Size: 37.1 KB - Last synced at: 1 day ago - Pushed at: 10 months ago - Stars: 646 - Forks: 64

vardanagarwal/Proctoring-AI
Software for automatic monitoring in online proctoring
Language: Python - Size: 383 MB - Last synced at: 19 days ago - Pushed at: 5 months ago - Stars: 573 - Forks: 339

peteanderson80/Matterport3DSimulator
AI Research Platform for Reinforcement Learning from Real Panoramic Images.
Language: C++ - Size: 13.2 MB - Last synced at: 29 days ago - Pushed at: 10 months ago - Stars: 566 - Forks: 132

mbzuai-oryx/groundingLMM
Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses that are seamlessly integrated with object segmentation masks [CVPR 2024].
Language: Python - Size: 109 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 536 - Forks: 25

sangminwoo/awesome-vision-and-language
A curated list of awesome vision and language resources (still under construction... stay tuned!)
Size: 127 KB - Last synced at: 8 days ago - Pushed at: 6 months ago - Stars: 534 - Forks: 41

eric-ai-lab/awesome-vision-language-navigation
A curated list for vision-and-language navigation. ACL 2022 paper "Vision-and-Language Navigation: A Survey of Tasks, Methods, and Future Directions"
Size: 79.1 KB - Last synced at: 6 days ago - Pushed at: about 1 year ago - Stars: 477 - Forks: 23

mees/calvin
CALVIN - A benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks
Language: Python - Size: 1.61 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 469 - Forks: 67

zengyan-97/X-VLM
X-VLM: Multi-Grained Vision Language Pre-Training (ICML 2022)
Language: Python - Size: 13.5 MB - Last synced at: 3 months ago - Pushed at: over 2 years ago - Stars: 462 - Forks: 51

JindongGu/Awesome-Prompting-on-Vision-Language-Model
This repo lists relevant papers summarized in our survey paper: A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models.
Size: 1.48 MB - Last synced at: 11 days ago - Pushed at: about 2 months ago - Stars: 453 - Forks: 34

Paranioar/Awesome_Matching_Pretraining_Transfering
A paper list covering large multi-modality models (perception, generation, unification), parameter-efficient finetuning, vision-language pretraining, and conventional image-text matching, for preliminary insight.
Size: 369 KB - Last synced at: 9 days ago - Pushed at: 5 months ago - Stars: 425 - Forks: 48

google-research-datasets/conceptual-12m
Conceptual 12M is a dataset containing (image-URL, caption) pairs collected for vision-and-language pre-training.
Size: 97.7 KB - Last synced at: 2 months ago - Pushed at: about 2 years ago - Stars: 380 - Forks: 20

j-min/VL-T5
PyTorch code for "Unifying Vision-and-Language Tasks via Text Generation" (ICML 2021)
Language: Python - Size: 847 KB - Last synced at: 28 days ago - Pushed at: almost 2 years ago - Stars: 369 - Forks: 58

Haiyang-W/GiT
[ECCV2024 Oral🔥] Official Implementation of "GiT: Towards Generalist Vision Transformer through Universal Language Interface"
Language: Python - Size: 12.5 MB - Last synced at: 20 days ago - Pushed at: 4 months ago - Stars: 345 - Forks: 15

HyperGAI/HPT
HPT - Open Multimodal LLMs from HyperGAI
Language: Python - Size: 2.47 MB - Last synced at: 3 months ago - Pushed at: 11 months ago - Stars: 313 - Forks: 20

tsujuifu/pytorch_mgie
A Gradio demo of MGIE
Language: Python - Size: 32.5 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 307 - Forks: 24

phellonchen/awesome-Vision-and-Language-Pre-training
Recent Advances in Vision and Language Pre-training (VLP)
Size: 81.1 KB - Last synced at: 11 days ago - Pushed at: almost 2 years ago - Stars: 292 - Forks: 16

MarSaKi/ETPNav
[TPAMI 2024] Official repo of "ETPNav: Evolving Topological Planning for Vision-Language Navigation in Continuous Environments"
Language: Python - Size: 8.53 MB - Last synced at: 28 days ago - Pushed at: 28 days ago - Stars: 290 - Forks: 24

Olney1/ChatGPT-OpenAI-Smart-Speaker
This AI smart speaker uses speech-to-text (STT) and text-to-speech (TTS) to enable voice- and vision-driven conversations, with additional web search capabilities via OpenAI and LangChain agents.
Language: Python - Size: 145 MB - Last synced at: 1 day ago - Pushed at: 6 months ago - Stars: 279 - Forks: 31

FuxiaoLiu/LRV-Instruction
[ICLR'24] Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
Language: Python - Size: 23.9 MB - Last synced at: 3 days ago - Pushed at: about 1 year ago - Stars: 276 - Forks: 13

JDAI-CV/image-captioning
Implementation of 'X-Linear Attention Networks for Image Captioning' [CVPR 2020]
Language: Python - Size: 733 KB - Last synced at: 24 days ago - Pushed at: almost 4 years ago - Stars: 273 - Forks: 54

om-ai-lab/RS5M
RS5M: a large-scale vision language dataset for remote sensing [TGRS]
Language: Python - Size: 44.8 MB - Last synced at: 9 days ago - Pushed at: about 2 months ago - Stars: 252 - Forks: 11

SALT-NLP/LLaVAR
Code/Data for the paper: "LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding"
Language: Python - Size: 19.4 MB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 245 - Forks: 12

j-min/CLIP-Caption-Reward
PyTorch code for "Fine-grained Image Captioning with CLIP Reward" (Findings of NAACL 2022)
Language: Python - Size: 2.64 MB - Last synced at: 23 days ago - Pushed at: over 2 years ago - Stars: 241 - Forks: 26

uta-smile/TCL
code for TCL: Vision-Language Pre-Training with Triple Contrastive Learning, CVPR 2022
Language: Python - Size: 11.1 MB - Last synced at: over 1 year ago - Pushed at: over 2 years ago - Stars: 233 - Forks: 33

linjieli222/HERO
Research code for EMNLP 2020 paper "HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training"
Language: Python - Size: 248 KB - Last synced at: 3 months ago - Pushed at: over 3 years ago - Stars: 232 - Forks: 35

haofanwang/Awesome-Computer-Vision
Awesome Resources for Advanced Computer Vision Topics
Size: 93.8 KB - Last synced at: 8 days ago - Pushed at: over 2 years ago - Stars: 228 - Forks: 42

geoaigroup/awesome-vision-language-models-for-earth-observation
A curated list of awesome vision and language resources for earth observation.
Size: 470 KB - Last synced at: 15 days ago - Pushed at: about 1 month ago - Stars: 219 - Forks: 17

daqingliu/awesome-vln 📦
A curated list of research papers in Vision-Language Navigation (VLN)
Size: 40 KB - Last synced at: 8 days ago - Pushed at: about 1 year ago - Stars: 206 - Forks: 32

salesforce/ALPRO 📦
Align and Prompt: Video-and-Language Pre-training with Entity Prompts
Language: Python - Size: 311 KB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 187 - Forks: 17

ylsung/VL_adapter
PyTorch code for "VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks" (CVPR2022)
Language: Python - Size: 2.28 MB - Last synced at: over 1 year ago - Pushed at: over 2 years ago - Stars: 187 - Forks: 15

clin1223/VLDet
[ICLR 2023] PyTorch implementation of VLDet (https://arxiv.org/abs/2211.14843)
Language: Python - Size: 1.56 MB - Last synced at: 29 days ago - Pushed at: about 1 year ago - Stars: 186 - Forks: 11

amanchadha/stanford-cs231n-assignments-2020
This repository contains my solutions to the assignments for Stanford's CS231n "Convolutional Neural Networks for Visual Recognition" (Spring 2020).
Language: Jupyter Notebook - Size: 200 MB - Last synced at: 28 days ago - Pushed at: almost 4 years ago - Stars: 170 - Forks: 68

kevinzakka/clip_playground
An ever-growing playground of notebooks showcasing CLIP's impressive zero-shot capabilities
Language: Jupyter Notebook - Size: 11.3 MB - Last synced at: 20 days ago - Pushed at: almost 3 years ago - Stars: 165 - Forks: 13
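For reference, a small zero-shot classification sketch with the openai/CLIP package, in the spirit of these notebooks; the image path and label set are placeholders.

    import torch
    import clip  # pip install git+https://github.com/openai/CLIP.git
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    # "photo.jpg" and the label set below are placeholders.
    image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
    labels = ["a dog", "a cat", "a car"]
    text = clip.tokenize([f"a photo of {label}" for label in labels]).to(device)

    with torch.no_grad():
        logits_per_image, _ = model(image, text)
        probs = logits_per_image.softmax(dim=-1).cpu().numpy()

    for label, p in zip(labels, probs[0]):
        print(f"{label}: {p:.3f}")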

SunzeY/X-Prompt
Official implementation of X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models
Size: 4.25 MB - Last synced at: about 2 months ago - Pushed at: 5 months ago - Stars: 152 - Forks: 1

LeapLabTHU/Pseudo-Q
[CVPR 2022] Pseudo-Q: Generating Pseudo Language Queries for Visual Grounding
Language: Python - Size: 22.9 MB - Last synced at: 30 days ago - Pushed at: 10 months ago - Stars: 148 - Forks: 10

OFA-Sys/OFASys
OFASys: A Multi-Modal Multi-Task Learning System for Building Generalist Models
Language: Python - Size: 20.3 MB - Last synced at: 26 days ago - Pushed at: over 2 years ago - Stars: 147 - Forks: 13

j-min/DallEval
DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generation Models (ICCV 2023)
Language: Jupyter Notebook - Size: 66.2 MB - Last synced at: about 1 month ago - Pushed at: over 1 year ago - Stars: 140 - Forks: 6

eric-xw/AREL
Code for the ACL paper "No Metrics Are Perfect: Adversarial Reward Learning for Visual Storytelling"
Language: Python - Size: 441 MB - Last synced at: over 1 year ago - Pushed at: over 4 years ago - Stars: 137 - Forks: 35

YicongHong/Recurrent-VLN-BERT
Code of the CVPR 2021 Oral paper: A Recurrent Vision-and-Language BERT for Navigation
Language: Python - Size: 780 KB - Last synced at: over 1 year ago - Pushed at: over 2 years ago - Stars: 134 - Forks: 26

aimagelab/LLaVA-MORE
LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning
Language: Python - Size: 2.62 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 131 - Forks: 8

tsujuifu/pytorch_violet
A PyTorch implementation of VIOLET
Language: Python - Size: 115 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 130 - Forks: 7

om-ai-lab/VL-CheckList
Evaluating Vision & Language Pretraining Models with Objects, Attributes and Relations. [EMNLP 2022]
Language: Python - Size: 26.6 MB - Last synced at: 14 days ago - Pushed at: 7 months ago - Stars: 129 - Forks: 5

antoyang/TubeDETR
[CVPR 2022 Oral] TubeDETR: Spatio-Temporal Video Grounding with Transformers
Language: Python - Size: 93.8 KB - Last synced at: over 1 year ago - Pushed at: about 2 years ago - Stars: 127 - Forks: 8

chihyaoma/regretful-agent
PyTorch code for CVPR 2019 paper: The Regretful Agent: Heuristic-Aided Navigation through Progress Estimation
Language: C++ - Size: 5.73 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 124 - Forks: 24

zinengtang/TVLT
PyTorch code for “TVLT: Textless Vision-Language Transformer” (NeurIPS 2022 Oral)
Language: Jupyter Notebook - Size: 5.09 MB - Last synced at: 23 days ago - Pushed at: about 2 years ago - Stars: 124 - Forks: 13

patrick-tssn/Awesome-Colorful-LLM
Recent advancements propelled by large language models (LLMs), spanning domains including vision, audio, agents, robotics, and fundamental sciences such as mathematics.
Size: 935 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 121 - Forks: 8

zhegan27/VILLA
Research Code for NeurIPS 2020 Spotlight paper "Large-Scale Adversarial Training for Vision-and-Language Representation Learning": UNITER adversarial training part
Language: Python - Size: 849 KB - Last synced at: 5 months ago - Pushed at: over 4 years ago - Stars: 119 - Forks: 14

cambridgeltl/visual-spatial-reasoning
[TACL'23] VSR: A probing benchmark for spatial understanding of vision-language models.
Language: Python - Size: 10.7 MB - Last synced at: 30 days ago - Pushed at: about 2 years ago - Stars: 118 - Forks: 9

chihyaoma/selfmonitoring-agent
PyTorch code for ICLR 2019 paper: Self-Monitoring Navigation Agent via Auxiliary Progress Estimation
Language: C++ - Size: 3.03 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 115 - Forks: 18

Ravi-Teja-konda/Surveillance_Video_Summarizer
VLM-driven tool that processes surveillance videos, extracts frames, and generates insightful annotations using a fine-tuned Florence-2 vision-language model. Includes a Gradio-based interface for querying and analyzing video footage.
Language: Python - Size: 1.72 MB - Last synced at: 24 days ago - Pushed at: 8 months ago - Stars: 106 - Forks: 13

zhuang-li/FactualSceneGraph
The FACTUAL benchmark dataset and a pre-trained textual scene graph parser trained on FACTUAL.
Language: Python - Size: 11.4 MB - Last synced at: 28 days ago - Pushed at: 3 months ago - Stars: 105 - Forks: 12

j-min/HiREST
Hierarchical Video-Moment Retrieval and Step-Captioning (CVPR 2023)
Language: Python - Size: 3.64 MB - Last synced at: 27 days ago - Pushed at: 3 months ago - Stars: 100 - Forks: 9

antoyang/just-ask
[ICCV 2021 Oral + TPAMI] Just Ask: Learning to Answer Questions from Millions of Narrated Videos
Language: Jupyter Notebook - Size: 917 KB - Last synced at: over 1 year ago - Pushed at: over 2 years ago - Stars: 98 - Forks: 13

FuxiaoLiu/VisualNews-Repository
[EMNLP'21] Visual News: Benchmark and Challenges in News Image Captioning
Language: Jupyter Notebook - Size: 6.94 MB - Last synced at: 3 days ago - Pushed at: 10 months ago - Stars: 94 - Forks: 9

xiaofeng94/VL-PLM
Exploiting unlabeled data with vision and language models for object detection, ECCV 2022
Language: Python - Size: 12 MB - Last synced at: 8 days ago - Pushed at: over 1 year ago - Stars: 93 - Forks: 7

antoyang/VidChapters
[NeurIPS 2023 D&B] VidChapters-7M: Video Chapters at Scale
Language: Jupyter Notebook - Size: 34.3 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 93 - Forks: 5

ChenyunWu/PhraseCutDataset
Dataset API for "PhraseCut: Language-based Image Segmentation in the Wild"
Language: Jupyter Notebook - Size: 15.5 MB - Last synced at: over 1 year ago - Pushed at: almost 5 years ago - Stars: 91 - Forks: 10

Muennighoff/vilio
🥶Vilio: State-of-the-art VL models in PyTorch & PaddlePaddle
Language: Python - Size: 10.4 MB - Last synced at: 27 days ago - Pushed at: almost 2 years ago - Stars: 88 - Forks: 28

ch3cook-fdu/Vote2Cap-DETR
[CVPR 2023] Vote2Cap-DETR and [T-PAMI 2024] Vote2Cap-DETR++: a set-to-set perspective on 3D dense captioning; state-of-the-art 3D dense captioning methods
Language: Python - Size: 308 MB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 82 - Forks: 5

showlab/ShowAnything
Language: Jupyter Notebook - Size: 43 MB - Last synced at: 12 days ago - Pushed at: almost 2 years ago - Stars: 82 - Forks: 3

multimodal/multimodal
A collection of multimodal datasets and visual features for VQA and captioning in PyTorch. Just run "pip install multimodal".
Language: Python - Size: 2.21 MB - Last synced at: 11 days ago - Pushed at: about 3 years ago - Stars: 82 - Forks: 7

zengyan-97/X2-VLM
All-In-One VLM: Image + Video + Transfer to Other Languages / Domains
Language: Python - Size: 12.1 MB - Last synced at: over 1 year ago - Pushed at: almost 2 years ago - Stars: 80 - Forks: 6

GT-RIPL/robo-vln
PyTorch code for the ICRA'21 paper: "Hierarchical Cross-Modal Agent for Robotics Vision-and-Language Navigation"
Language: Python - Size: 33.3 MB - Last synced at: 11 days ago - Pushed at: 10 months ago - Stars: 78 - Forks: 8

MarSaKi/NvEM
[ACM MM 2021 Oral] Official repo of "Neighbor-view Enhanced Model for Vision and Language Navigation"
Language: C++ - Size: 2.74 MB - Last synced at: 12 months ago - Pushed at: over 2 years ago - Stars: 77 - Forks: 2

yanmin-wu/EDA
[CVPR 2023] EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual Grounding
Language: Python - Size: 2.35 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 76 - Forks: 2

zeyofu/BLINK_Benchmark
This repo contains evaluation code for the paper "BLINK: Multimodal Large Language Models Can See but Not Perceive". https://arxiv.org/abs/2404.12390
Language: Python - Size: 12.2 MB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 73 - Forks: 2

antoyang/FrozenBiLM
[NeurIPS 2022] Zero-Shot Video Question Answering via Frozen Bidirectional Language Models
Language: Python - Size: 88.9 KB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 72 - Forks: 13

intersun/LightningDOT
Source code and pre-trained/fine-tuned checkpoints for the NAACL 2021 paper LightningDOT
Language: Python - Size: 435 KB - Last synced at: over 1 year ago - Pushed at: over 2 years ago - Stars: 72 - Forks: 9

MultimodalGeo/GeoText-1652
An official repo for the ECCV 2024 paper "Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching"
Language: Python - Size: 41 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 67 - Forks: 2

TheShadow29/vognet-pytorch
[CVPR20] Video Object Grounding using Semantic Roles in Language Description (https://arxiv.org/abs/2003.10606)
Language: Python - Size: 3.45 MB - Last synced at: about 2 years ago - Pushed at: almost 5 years ago - Stars: 67 - Forks: 7

rentainhe/TRAR-VQA
[ICCV 2021] Official implementation of the paper "TRAR: Routing the Attention Spans in Transformers for Visual Question Answering"
Language: Python - Size: 927 KB - Last synced at: 21 days ago - Pushed at: over 3 years ago - Stars: 66 - Forks: 18

PathologyFoundation/plip
Pathology Language and Image Pre-Training (PLIP) is the first vision and language foundation model for pathology AI. PLIP is a large-scale pre-trained model that can be used to extract visual and language features from pathology images and text descriptions. The model is a fine-tuned version of the original CLIP model.
Language: Python - Size: 677 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 65 - Forks: 7
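Since PLIP follows the CLIP architecture, it can in principle be loaded with the standard Hugging Face CLIP classes; a hedged sketch in which the checkpoint id "vinid/plip" and the image path are assumptions to verify against the repo.

    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    # Checkpoint id is an assumption; confirm the published weights in the repo.
    model = CLIPModel.from_pretrained("vinid/plip")
    processor = CLIPProcessor.from_pretrained("vinid/plip")

    image = Image.open("tile.png")  # placeholder pathology image tile
    texts = ["an H&E image of tumor tissue", "an H&E image of normal tissue"]

    inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
    outputs = model(**inputs)

    # Image-to-text similarity scores over the candidate descriptions.
    print(outputs.logits_per_image.softmax(dim=-1))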

aimagelab/open-fashion-clip
This is the official repository for the paper "OpenFashionCLIP: Vision-and-Language Contrastive Learning with Open-Source Fashion Data" (ICIAP 2023).
Language: Python - Size: 213 KB - Last synced at: 22 days ago - Pushed at: 12 months ago - Stars: 64 - Forks: 4

om-ai-lab/GroundVLP
GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection (AAAI 2024)
Language: Jupyter Notebook - Size: 33.7 MB - Last synced at: 25 days ago - Pushed at: over 1 year ago - Stars: 64 - Forks: 4

yuleiniu/rva
Code for CVPR'19 "Recursive Visual Attention in Visual Dialog"
Language: Python - Size: 39.1 KB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 64 - Forks: 11

fenglinliu98/MIA
Code for "Aligning Visual Regions and Textual Concepts for Semantic-Grounded Image Representations" (NeurIPS 2019)
Language: Python - Size: 971 KB - Last synced at: over 1 year ago - Pushed at: over 4 years ago - Stars: 63 - Forks: 14

aimagelab/pacscore
Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation (CVPR 2023)
Language: Python - Size: 7.15 MB - Last synced at: 27 days ago - Pushed at: about 2 months ago - Stars: 61 - Forks: 5
