Topic: "vision-and-language"
aishwaryanr/awesome-generative-ai-guide
A one-stop repository for generative AI research updates, interview resources, notebooks, and much more!
Size: 29.3 MB - Last synced at: 6 days ago - Pushed at: about 1 month ago - Stars: 11,883 - Forks: 2,457

salesforce/LAVIS
LAVIS - A One-stop Library for Language-Vision Intelligence
Language: Jupyter Notebook - Size: 79.3 MB - Last synced at: 3 days ago - Pushed at: 6 months ago - Stars: 10,509 - Forks: 1,022
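A minimal usage sketch of LAVIS's model-loading interface, closely following the project's README; the image path "example.jpg" is a placeholder.

    import torch
    from PIL import Image
    from lavis.models import load_model_and_preprocess

    # Use a GPU if available; CPU works too, just more slowly.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Load a BLIP captioning model together with its matching image preprocessor.
    model, vis_processors, _ = load_model_and_preprocess(
        name="blip_caption", model_type="base_coco", is_eval=True, device=device
    )

    # "example.jpg" is a placeholder path to any local image.
    raw_image = Image.open("example.jpg").convert("RGB")
    image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

    # Generate and print a caption for the image.
    print(model.generate({"image": image}))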

roboflow/maestro
Streamline the fine-tuning process for multimodal models: PaliGemma 2, Florence-2, and Qwen2.5-VL
Language: Python - Size: 10.6 MB - Last synced at: 5 days ago - Pushed at: 11 days ago - Stars: 2,551 - Forks: 203

om-ai-lab/OmAgent
Build multimodal language agents for fast prototyping and production
Language: Python - Size: 11.4 MB - Last synced at: 21 days ago - Pushed at: about 2 months ago - Stars: 2,465 - Forks: 271

salesforce/ALBEF
Code for ALBEF: a new vision-language pre-training method
Language: Python - Size: 69.9 MB - Last synced at: 25 days ago - Pushed at: over 2 years ago - Stars: 1,627 - Forks: 205

open-mmlab/Multimodal-GPT
Multimodal-GPT
Language: Python - Size: 109 KB - Last synced at: 3 days ago - Pushed at: almost 2 years ago - Stars: 1,498 - Forks: 130

dandelin/ViLT
Code for the ICML 2021 (long talk) paper: "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision"
Language: Python - Size: 1.38 MB - Last synced at: 25 days ago - Pushed at: about 1 year ago - Stars: 1,448 - Forks: 216

om-ai-lab/OmDet
Real-time and accurate open-vocabulary end-to-end object detection
Language: Python - Size: 9.75 MB - Last synced at: 20 days ago - Pushed at: 5 months ago - Stars: 1,313 - Forks: 111

NVlabs/prismer
The implementation of "Prismer: A Vision-Language Model with Multi-Task Experts".
Language: Python - Size: 4.25 MB - Last synced at: 25 days ago - Pushed at: over 1 year ago - Stars: 1,310 - Forks: 73

llm-jp/awesome-japanese-llm
Overview of Japanese LLMs (日本語LLMまとめ)
Language: TypeScript - Size: 11.7 MB - Last synced at: 2 days ago - Pushed at: 4 days ago - Stars: 1,154 - Forks: 33

yuewang-cuhk/awesome-vision-language-pretraining-papers
Recent Advances in Vision and Language PreTrained Models (VL-PTMs)
Size: 104 KB - Last synced at: 15 days ago - Pushed at: over 2 years ago - Stars: 1,152 - Forks: 104

microsoft/Oscar 📦
Oscar and VinVL
Language: Python - Size: 715 KB - Last synced at: 3 days ago - Pushed at: over 1 year ago - Stars: 1,048 - Forks: 252

rhymes-ai/Aria
Codebase for Aria - an Open Multimodal Native MoE
Language: Jupyter Notebook - Size: 120 MB - Last synced at: 2 months ago - Pushed at: 3 months ago - Stars: 995 - Forks: 83

OFA-Sys/ONE-PEACE
A general representation model across vision, audio, and language modalities. Paper: ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
Language: Python - Size: 29.9 MB - Last synced at: 5 months ago - Pushed at: 7 months ago - Stars: 977 - Forks: 63

YehLi/xmodaler
X-modaler is a versatile and high-performance codebase for cross-modal analytics (e.g., image captioning, video captioning, vision-language pre-training, visual question answering, visual commonsense reasoning, and cross-modal retrieval).
Language: Python - Size: 12.2 MB - Last synced at: 21 days ago - Pushed at: about 2 years ago - Stars: 970 - Forks: 105

26hzhang/DL-NLP-Readings
My Reading Lists of Deep Learning and Natural Language Processing
Language: TeX - Size: 364 MB - Last synced at: about 1 month ago - Pushed at: about 3 years ago - Stars: 850 - Forks: 264

SunzeY/AlphaCLIP
[CVPR 2024] Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
Language: Jupyter Notebook - Size: 173 MB - Last synced at: 21 days ago - Pushed at: 9 months ago - Stars: 802 - Forks: 55

ChenRocks/UNITER
Research code for ECCV 2020 paper "UNITER: UNiversal Image-TExt Representation Learning"
Language: Python - Size: 172 KB - Last synced at: 5 months ago - Pushed at: almost 4 years ago - Stars: 786 - Forks: 109

OpenRobotLab/PointLLM
[ECCV 2024 Best Paper Candidate] PointLLM: Empowering Large Language Models to Understand Point Clouds
Language: Python - Size: 3.42 MB - Last synced at: 3 days ago - Pushed at: 12 days ago - Stars: 774 - Forks: 37

jackroos/VL-BERT
Code for ICLR 2020 paper "VL-BERT: Pre-training of Generic Visual-Linguistic Representations".
Language: Jupyter Notebook - Size: 5.41 MB - Last synced at: about 1 month ago - Pushed at: almost 2 years ago - Stars: 740 - Forks: 111

jayleicn/ClipBERT
[CVPR 2021 Best Student Paper Honorable Mention, Oral] Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning on image-text and video-text tasks.
Language: Python - Size: 73.2 KB - Last synced at: 29 days ago - Pushed at: over 1 year ago - Stars: 718 - Forks: 86

SkalskiP/top-cvpr-2024-papers
This repository is a curated collection of the most exciting and influential CVPR 2024 papers. 🔥 [Paper + Code + Demo]
Language: Python - Size: 58.6 KB - Last synced at: 1 day ago - Pushed at: 10 months ago - Stars: 715 - Forks: 59

NVlabs/DoRA
[ICML2024 (Oral)] Official PyTorch implementation of DoRA: Weight-Decomposed Low-Rank Adaptation
Language: Python - Size: 3.06 MB - Last synced at: 4 months ago - Pushed at: 7 months ago - Stars: 668 - Forks: 44
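Beyond this reference implementation, DoRA is also exposed in Hugging Face PEFT through the use_dora flag on LoraConfig; a hedged sketch under that assumption (the base model and target modules are illustrative choices, not taken from this repo).

    # Assumes Hugging Face PEFT >= 0.9, where LoraConfig exposes use_dora.
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    # Placeholder base model; any causal LM with known target modules works.
    base = AutoModelForCausalLM.from_pretrained("gpt2")

    config = LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=["c_attn"],  # GPT-2's fused attention projection
        lora_dropout=0.05,
        use_dora=True,              # decompose weights into magnitude + direction (DoRA)
    )

    model = get_peft_model(base, config)
    model.print_trainable_parameters()  # only the adapter parameters are trainable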

SkalskiP/top-cvpr-2023-papers
This repository is a curated collection of the most exciting and influential CVPR 2023 papers. 🔥 [Paper + Code]
Language: Python - Size: 37.1 KB - Last synced at: 1 day ago - Pushed at: 10 months ago - Stars: 646 - Forks: 64

vardanagarwal/Proctoring-AI
Software for automatic monitoring in online proctoring
Language: Python - Size: 383 MB - Last synced at: 19 days ago - Pushed at: 5 months ago - Stars: 573 - Forks: 339

peteanderson80/Matterport3DSimulator
AI Research Platform for Reinforcement Learning from Real Panoramic Images.
Language: C++ - Size: 13.2 MB - Last synced at: 29 days ago - Pushed at: 10 months ago - Stars: 566 - Forks: 132

mbzuai-oryx/groundingLMM
Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses that are seamlessly integrated with object segmentation masks [CVPR 2024].
Language: Python - Size: 109 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 536 - Forks: 25

sangminwoo/awesome-vision-and-language
A curated list of awesome vision and language resources (still under construction... stay tuned!)
Size: 127 KB - Last synced at: 8 days ago - Pushed at: 6 months ago - Stars: 534 - Forks: 41

eric-ai-lab/awesome-vision-language-navigation
A curated list for vision-and-language navigation. ACL 2022 paper "Vision-and-Language Navigation: A Survey of Tasks, Methods, and Future Directions"
Size: 79.1 KB - Last synced at: 6 days ago - Pushed at: about 1 year ago - Stars: 477 - Forks: 23

mees/calvin
CALVIN - A benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks
Language: Python - Size: 1.61 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 469 - Forks: 67

zengyan-97/X-VLM
X-VLM: Multi-Grained Vision Language Pre-Training (ICML 2022)
Language: Python - Size: 13.5 MB - Last synced at: 3 months ago - Pushed at: over 2 years ago - Stars: 462 - Forks: 51

JindongGu/Awesome-Prompting-on-Vision-Language-Model
This repo lists relevant papers summarized in our survey paper: A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models.
Size: 1.48 MB - Last synced at: 11 days ago - Pushed at: about 2 months ago - Stars: 453 - Forks: 34

Paranioar/Awesome_Matching_Pretraining_Transfering
A paper list covering large multi-modality models (perception, generation, unification), parameter-efficient finetuning, vision-language pretraining, and conventional image-text matching, for preliminary insight.
Size: 369 KB - Last synced at: 9 days ago - Pushed at: 5 months ago - Stars: 425 - Forks: 48

google-research-datasets/conceptual-12m
Conceptual 12M is a dataset containing (image-URL, caption) pairs collected for vision-and-language pre-training.
Size: 97.7 KB - Last synced at: 2 months ago - Pushed at: about 2 years ago - Stars: 380 - Forks: 20

j-min/VL-T5
PyTorch code for "Unifying Vision-and-Language Tasks via Text Generation" (ICML 2021)
Language: Python - Size: 847 KB - Last synced at: 28 days ago - Pushed at: almost 2 years ago - Stars: 369 - Forks: 58

Haiyang-W/GiT
[ECCV2024 Oral🔥] Official Implementation of "GiT: Towards Generalist Vision Transformer through Universal Language Interface"
Language: Python - Size: 12.5 MB - Last synced at: 20 days ago - Pushed at: 4 months ago - Stars: 345 - Forks: 15

HyperGAI/HPT
HPT - Open Multimodal LLMs from HyperGAI
Language: Python - Size: 2.47 MB - Last synced at: 3 months ago - Pushed at: 11 months ago - Stars: 313 - Forks: 20

tsujuifu/pytorch_mgie
A Gradio demo of MGIE
Language: Python - Size: 32.5 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 307 - Forks: 24

phellonchen/awesome-Vision-and-Language-Pre-training
Recent Advances in Vision and Language Pre-training (VLP)
Size: 81.1 KB - Last synced at: 11 days ago - Pushed at: almost 2 years ago - Stars: 292 - Forks: 16

MarSaKi/ETPNav
[TPAMI 2024] Official repo of "ETPNav: Evolving Topological Planning for Vision-Language Navigation in Continuous Environments"
Language: Python - Size: 8.53 MB - Last synced at: 28 days ago - Pushed at: 28 days ago - Stars: 290 - Forks: 24

Olney1/ChatGPT-OpenAI-Smart-Speaker
This AI smart speaker uses speech-to-text (STT) and text-to-speech (TTS) to enable voice- and vision-driven conversations, with additional web search capabilities via OpenAI and LangChain agents.
Language: Python - Size: 145 MB - Last synced at: 1 day ago - Pushed at: 6 months ago - Stars: 279 - Forks: 31

FuxiaoLiu/LRV-Instruction
[ICLR'24] Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
Language: Python - Size: 23.9 MB - Last synced at: 3 days ago - Pushed at: about 1 year ago - Stars: 276 - Forks: 13

JDAI-CV/image-captioning
Implementation of 'X-Linear Attention Networks for Image Captioning' [CVPR 2020]
Language: Python - Size: 733 KB - Last synced at: 24 days ago - Pushed at: almost 4 years ago - Stars: 273 - Forks: 54

om-ai-lab/RS5M
RS5M: a large-scale vision language dataset for remote sensing [TGRS]
Language: Python - Size: 44.8 MB - Last synced at: 9 days ago - Pushed at: about 2 months ago - Stars: 252 - Forks: 11

SALT-NLP/LLaVAR
Code/Data for the paper: "LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding"
Language: Python - Size: 19.4 MB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 245 - Forks: 12

j-min/CLIP-Caption-Reward
PyTorch code for "Fine-grained Image Captioning with CLIP Reward" (Findings of NAACL 2022)
Language: Python - Size: 2.64 MB - Last synced at: 23 days ago - Pushed at: over 2 years ago - Stars: 241 - Forks: 26

uta-smile/TCL
code for TCL: Vision-Language Pre-Training with Triple Contrastive Learning, CVPR 2022
Language: Python - Size: 11.1 MB - Last synced at: over 1 year ago - Pushed at: over 2 years ago - Stars: 233 - Forks: 33

linjieli222/HERO
Research code for EMNLP 2020 paper "HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training"
Language: Python - Size: 248 KB - Last synced at: 3 months ago - Pushed at: over 3 years ago - Stars: 232 - Forks: 35

haofanwang/Awesome-Computer-Vision
Awesome Resources for Advanced Computer Vision Topics
Size: 93.8 KB - Last synced at: 8 days ago - Pushed at: over 2 years ago - Stars: 228 - Forks: 42

geoaigroup/awesome-vision-language-models-for-earth-observation
A curated list of awesome vision and language resources for earth observation.
Size: 470 KB - Last synced at: 15 days ago - Pushed at: about 1 month ago - Stars: 219 - Forks: 17

daqingliu/awesome-vln 📦
A curated list of research papers in Vision-Language Navigation (VLN)
Size: 40 KB - Last synced at: 8 days ago - Pushed at: about 1 year ago - Stars: 206 - Forks: 32

salesforce/ALPRO 📦
Align and Prompt: Video-and-Language Pre-training with Entity Prompts
Language: Python - Size: 311 KB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 187 - Forks: 17

ylsung/VL_adapter
PyTorch code for "VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks" (CVPR2022)
Language: Python - Size: 2.28 MB - Last synced at: over 1 year ago - Pushed at: over 2 years ago - Stars: 187 - Forks: 15

clin1223/VLDet
[ICLR 2023] PyTorch implementation of VLDet (https://arxiv.org/abs/2211.14843)
Language: Python - Size: 1.56 MB - Last synced at: 29 days ago - Pushed at: about 1 year ago - Stars: 186 - Forks: 11

amanchadha/stanford-cs231n-assignments-2020
This repository contains my solutions to the assignments for Stanford's CS231n "Convolutional Neural Networks for Visual Recognition" (Spring 2020).
Language: Jupyter Notebook - Size: 200 MB - Last synced at: 28 days ago - Pushed at: almost 4 years ago - Stars: 170 - Forks: 68

kevinzakka/clip_playground
An ever-growing playground of notebooks showcasing CLIP's impressive zero-shot capabilities
Language: Jupyter Notebook - Size: 11.3 MB - Last synced at: 20 days ago - Pushed at: almost 3 years ago - Stars: 165 - Forks: 13
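For reference, a small zero-shot classification sketch with the openai/CLIP package, in the spirit of these notebooks; the image path and label set are placeholders.

    import torch
    import clip  # pip install git+https://github.com/openai/CLIP.git
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    # "photo.jpg" and the label set below are placeholders.
    image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
    labels = ["a dog", "a cat", "a car"]
    text = clip.tokenize([f"a photo of {label}" for label in labels]).to(device)

    with torch.no_grad():
        logits_per_image, _ = model(image, text)
        probs = logits_per_image.softmax(dim=-1).cpu().numpy()

    for label, p in zip(labels, probs[0]):
        print(f"{label}: {p:.3f}")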

SunzeY/X-Prompt
Official implementation of X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models
Size: 4.25 MB - Last synced at: about 2 months ago - Pushed at: 5 months ago - Stars: 152 - Forks: 1

LeapLabTHU/Pseudo-Q
[CVPR 2022] Pseudo-Q: Generating Pseudo Language Queries for Visual Grounding
Language: Python - Size: 22.9 MB - Last synced at: 30 days ago - Pushed at: 10 months ago - Stars: 148 - Forks: 10

OFA-Sys/OFASys
OFASys: A Multi-Modal Multi-Task Learning System for Building Generalist Models
Language: Python - Size: 20.3 MB - Last synced at: 26 days ago - Pushed at: over 2 years ago - Stars: 147 - Forks: 13

j-min/DallEval
DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generation Models (ICCV 2023)
Language: Jupyter Notebook - Size: 66.2 MB - Last synced at: about 1 month ago - Pushed at: over 1 year ago - Stars: 140 - Forks: 6

eric-xw/AREL
Code for the ACL paper "No Metrics Are Perfect: Adversarial Reward Learning for Visual Storytelling"
Language: Python - Size: 441 MB - Last synced at: over 1 year ago - Pushed at: over 4 years ago - Stars: 137 - Forks: 35

YicongHong/Recurrent-VLN-BERT
Code of the CVPR 2021 Oral paper: A Recurrent Vision-and-Language BERT for Navigation
Language: Python - Size: 780 KB - Last synced at: over 1 year ago - Pushed at: over 2 years ago - Stars: 134 - Forks: 26

aimagelab/LLaVA-MORE
LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning
Language: Python - Size: 2.62 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 131 - Forks: 8

tsujuifu/pytorch_violet
A PyTorch implementation of VIOLET
Language: Python - Size: 115 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 130 - Forks: 7

om-ai-lab/VL-CheckList
Evaluating Vision & Language Pretraining Models with Objects, Attributes and Relations. [EMNLP 2022]
Language: Python - Size: 26.6 MB - Last synced at: 14 days ago - Pushed at: 7 months ago - Stars: 129 - Forks: 5

antoyang/TubeDETR
[CVPR 2022 Oral] TubeDETR: Spatio-Temporal Video Grounding with Transformers
Language: Python - Size: 93.8 KB - Last synced at: over 1 year ago - Pushed at: about 2 years ago - Stars: 127 - Forks: 8

chihyaoma/regretful-agent
PyTorch code for CVPR 2019 paper: The Regretful Agent: Heuristic-Aided Navigation through Progress Estimation
Language: C++ - Size: 5.73 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 124 - Forks: 24

zinengtang/TVLT
PyTorch code for “TVLT: Textless Vision-Language Transformer” (NeurIPS 2022 Oral)
Language: Jupyter Notebook - Size: 5.09 MB - Last synced at: 23 days ago - Pushed at: about 2 years ago - Stars: 124 - Forks: 13

patrick-tssn/Awesome-Colorful-LLM
Recent advancements propelled by large language models (LLMs), spanning domains including vision, audio, agents, robotics, and fundamental sciences such as mathematics.
Size: 935 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 121 - Forks: 8

zhegan27/VILLA
Research Code for NeurIPS 2020 Spotlight paper "Large-Scale Adversarial Training for Vision-and-Language Representation Learning": UNITER adversarial training part
Language: Python - Size: 849 KB - Last synced at: 5 months ago - Pushed at: over 4 years ago - Stars: 119 - Forks: 14

cambridgeltl/visual-spatial-reasoning
[TACL'23] VSR: A probing benchmark for spatial understanding of vision-language models.
Language: Python - Size: 10.7 MB - Last synced at: 30 days ago - Pushed at: about 2 years ago - Stars: 118 - Forks: 9

chihyaoma/selfmonitoring-agent
PyTorch code for ICLR 2019 paper: Self-Monitoring Navigation Agent via Auxiliary Progress Estimation
Language: C++ - Size: 3.03 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 115 - Forks: 18

Ravi-Teja-konda/Surveillance_Video_Summarizer
VLM-driven tool that processes surveillance videos, extracts frames, and generates insightful annotations using a fine-tuned Florence-2 vision-language model. Includes a Gradio-based interface for querying and analyzing video footage.
Language: Python - Size: 1.72 MB - Last synced at: 24 days ago - Pushed at: 8 months ago - Stars: 106 - Forks: 13

zhuang-li/FactualSceneGraph
The FACTUAL benchmark dataset and a pre-trained textual scene graph parser trained on FACTUAL.
Language: Python - Size: 11.4 MB - Last synced at: 28 days ago - Pushed at: 3 months ago - Stars: 105 - Forks: 12

j-min/HiREST
Hierarchical Video-Moment Retrieval and Step-Captioning (CVPR 2023)
Language: Python - Size: 3.64 MB - Last synced at: 27 days ago - Pushed at: 3 months ago - Stars: 100 - Forks: 9

antoyang/just-ask
[ICCV 2021 Oral + TPAMI] Just Ask: Learning to Answer Questions from Millions of Narrated Videos
Language: Jupyter Notebook - Size: 917 KB - Last synced at: over 1 year ago - Pushed at: over 2 years ago - Stars: 98 - Forks: 13

FuxiaoLiu/VisualNews-Repository
[EMNLP'21] Visual News: Benchmark and Challenges in News Image Captioning
Language: Jupyter Notebook - Size: 6.94 MB - Last synced at: 3 days ago - Pushed at: 10 months ago - Stars: 94 - Forks: 9

xiaofeng94/VL-PLM
Exploiting unlabeled data with vision and language models for object detection, ECCV 2022
Language: Python - Size: 12 MB - Last synced at: 8 days ago - Pushed at: over 1 year ago - Stars: 93 - Forks: 7

antoyang/VidChapters
[NeurIPS 2023 D&B] VidChapters-7M: Video Chapters at Scale
Language: Jupyter Notebook - Size: 34.3 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 93 - Forks: 5

ChenyunWu/PhraseCutDataset
Dataset API for "PhraseCut: Language-based Image Segmentation in the Wild"
Language: Jupyter Notebook - Size: 15.5 MB - Last synced at: over 1 year ago - Pushed at: almost 5 years ago - Stars: 91 - Forks: 10

Muennighoff/vilio
🥶Vilio: State-of-the-art VL models in PyTorch & PaddlePaddle
Language: Python - Size: 10.4 MB - Last synced at: 27 days ago - Pushed at: almost 2 years ago - Stars: 88 - Forks: 28

ch3cook-fdu/Vote2Cap-DETR
[CVPR 2023] Vote2Cap-DETR and [T-PAMI 2024] Vote2Cap-DETR++: a set-to-set perspective on 3D dense captioning; state-of-the-art 3D dense captioning methods
Language: Python - Size: 308 MB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 82 - Forks: 5

showlab/ShowAnything
Language: Jupyter Notebook - Size: 43 MB - Last synced at: 12 days ago - Pushed at: almost 2 years ago - Stars: 82 - Forks: 3

multimodal/multimodal
A collection of multimodal datasets and visual features for VQA and captioning in PyTorch. Just run "pip install multimodal".
Language: Python - Size: 2.21 MB - Last synced at: 11 days ago - Pushed at: about 3 years ago - Stars: 82 - Forks: 7

zengyan-97/X2-VLM
All-In-One VLM: Image + Video + Transfer to Other Languages / Domains
Language: Python - Size: 12.1 MB - Last synced at: over 1 year ago - Pushed at: almost 2 years ago - Stars: 80 - Forks: 6

GT-RIPL/robo-vln
PyTorch code for the ICRA'21 paper: "Hierarchical Cross-Modal Agent for Robotics Vision-and-Language Navigation"
Language: Python - Size: 33.3 MB - Last synced at: 11 days ago - Pushed at: 10 months ago - Stars: 78 - Forks: 8

MarSaKi/NvEM
[ACM MM 2021 Oral] Official repo of "Neighbor-view Enhanced Model for Vision and Language Navigation"
Language: C++ - Size: 2.74 MB - Last synced at: 12 months ago - Pushed at: over 2 years ago - Stars: 77 - Forks: 2

yanmin-wu/EDA
[CVPR 2023] EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual Grounding
Language: Python - Size: 2.35 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 76 - Forks: 2

zeyofu/BLINK_Benchmark
This repo contains evaluation code for the paper "BLINK: Multimodal Large Language Models Can See but Not Perceive". https://arxiv.org/abs/2404.12390
Language: Python - Size: 12.2 MB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 73 - Forks: 2

antoyang/FrozenBiLM
[NeurIPS 2022] Zero-Shot Video Question Answering via Frozen Bidirectional Language Models
Language: Python - Size: 88.9 KB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 72 - Forks: 13

intersun/LightningDOT
Source code and pre-trained/fine-tuned checkpoints for the NAACL 2021 paper LightningDOT
Language: Python - Size: 435 KB - Last synced at: over 1 year ago - Pushed at: over 2 years ago - Stars: 72 - Forks: 9

MultimodalGeo/GeoText-1652
An official repo for the ECCV 2024 paper "Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching"
Language: Python - Size: 41 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 67 - Forks: 2

TheShadow29/vognet-pytorch
[CVPR20] Video Object Grounding using Semantic Roles in Language Description (https://arxiv.org/abs/2003.10606)
Language: Python - Size: 3.45 MB - Last synced at: about 2 years ago - Pushed at: almost 5 years ago - Stars: 67 - Forks: 7

rentainhe/TRAR-VQA
[ICCV 2021] Official implementation of the paper "TRAR: Routing the Attention Spans in Transformers for Visual Question Answering"
Language: Python - Size: 927 KB - Last synced at: 21 days ago - Pushed at: over 3 years ago - Stars: 66 - Forks: 18

PathologyFoundation/plip
Pathology Language and Image Pre-Training (PLIP) is the first vision and language foundation model for pathology AI. PLIP is a large-scale pre-trained model that can be used to extract visual and language features from pathology images and text descriptions. The model is a fine-tuned version of the original CLIP model.
Language: Python - Size: 677 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 65 - Forks: 7
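Since PLIP follows the CLIP architecture, it can in principle be loaded with the standard Hugging Face CLIP classes; a hedged sketch in which the checkpoint id "vinid/plip" and the image path are assumptions to verify against the repo.

    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    # Checkpoint id is an assumption; confirm the published weights in the repo.
    model = CLIPModel.from_pretrained("vinid/plip")
    processor = CLIPProcessor.from_pretrained("vinid/plip")

    image = Image.open("tile.png")  # placeholder pathology image tile
    texts = ["an H&E image of tumor tissue", "an H&E image of normal tissue"]

    inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
    outputs = model(**inputs)

    # Image-to-text similarity scores over the candidate descriptions.
    print(outputs.logits_per_image.softmax(dim=-1))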

aimagelab/open-fashion-clip
This is the official repository for the paper "OpenFashionCLIP: Vision-and-Language Contrastive Learning with Open-Source Fashion Data" (ICIAP 2023).
Language: Python - Size: 213 KB - Last synced at: 22 days ago - Pushed at: 12 months ago - Stars: 64 - Forks: 4

om-ai-lab/GroundVLP
GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection (AAAI 2024)
Language: Jupyter Notebook - Size: 33.7 MB - Last synced at: 25 days ago - Pushed at: over 1 year ago - Stars: 64 - Forks: 4

yuleiniu/rva
Code for CVPR'19 "Recursive Visual Attention in Visual Dialog"
Language: Python - Size: 39.1 KB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 64 - Forks: 11

fenglinliu98/MIA
Code for "Aligning Visual Regions and Textual Concepts for Semantic-Grounded Image Representations" (NeurIPS 2019)
Language: Python - Size: 971 KB - Last synced at: over 1 year ago - Pushed at: over 4 years ago - Stars: 63 - Forks: 14

aimagelab/pacscore
Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation (CVPR 2023)
Language: Python - Size: 7.15 MB - Last synced at: 27 days ago - Pushed at: about 2 months ago - Stars: 61 - Forks: 5
