An open API service providing repository metadata for many open source software ecosystems.

GitHub topics: video-language

microsoft/UniVL

An official implementation for "UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation"

Language: Python - Size: 219 KB - Last synced at: 6 days ago - Pushed at: about 1 year ago - Stars: 360 - Forks: 58

willyfh/awesome-video-text-datasets

A curated list of video-text datasets in a variety of languages. These datasets can be used for video captioning (video description) or video retrieval.

Size: 48.8 KB - Last synced at: 7 days ago - Pushed at: over 1 year ago - Stars: 38 - Forks: 3

showlab/UniVTG

[ICCV 2023] UniVTG: Towards Unified Video-Language Temporal Grounding

Language: Python - Size: 22.7 MB - Last synced at: about 2 months ago - Pushed at: over 1 year ago - Stars: 356 - Forks: 34

showlab/VideoGUI

[NeurIPS 2024 D&B] VideoGUI: A Benchmark for GUI Automation from Instructional Videos

Language: JavaScript - Size: 32.2 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 35 - Forks: 2

salesforce/ALPRO 📦

Align and Prompt: Video-and-Language Pre-training with Entity Prompts

Language: Python - Size: 311 KB - Last synced at: 19 days ago - Pushed at: 4 months ago - Stars: 188 - Forks: 17

showlab/EgoVLP

[NeurIPS 2022] Egocentric Video-Language Pretraining

Language: Python - Size: 1.97 MB - Last synced at: about 2 months ago - Pushed at: over 1 year ago - Stars: 242 - Forks: 20

showlab/VLog

[CVPR 2025] Video Narration as Vocabulary & Video as Long Document

Language: Python - Size: 10.8 MB - Last synced at: 4 months ago - Pushed at: 6 months ago - Stars: 567 - Forks: 28

showlab/all-in-one

[CVPR 2023] All in One: Exploring Unified Video-Language Pre-training

Language: Python - Size: 1.53 MB - Last synced at: 4 months ago - Pushed at: over 2 years ago - Stars: 282 - Forks: 19

bytedance/Shot2Story

A new multi-shot video understanding benchmark Shot2Story with comprehensive video summaries and detailed shot-level captions.

Language: Python - Size: 153 MB - Last synced at: 5 months ago - Pushed at: 7 months ago - Stars: 128 - Forks: 6

junchen14/Multi-Modal-Transformer

This repository collects a variety of multi-modal transformer architectures, including image transformers, video transformers, image-language transformers, video-language transformers, and self-supervised learning models. It also collects useful tutorials and tools in these related domains.

Size: 354 KB - Last synced at: 5 months ago - Pushed at: about 3 years ago - Stars: 225 - Forks: 31

wjn922/ReferFormer

[CVPR 2022] Official Implementation of ReferFormer

Language: Python - Size: 52 MB - Last synced at: 5 months ago - Pushed at: 7 months ago - Stars: 339 - Forks: 25

bigai-nlco/VideoTGB

[EMNLP 2024] A Video Chat Agent with Temporal Prior

Language: Python - Size: 51.6 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 29 - Forks: 2

zinengtang/Perceiver_VL

PyTorch code for "Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention" (WACV 2023)

Language: Python - Size: 1.09 MB - Last synced at: about 2 months ago - Pushed at: over 2 years ago - Stars: 33 - Forks: 4

JerryYLi/svitt

Code for CVPR 2023 paper "SViTT: Temporal Learning of Sparse Video-Text Transformers"

Language: Python - Size: 418 KB - Last synced at: 7 months ago - Pushed at: about 2 years ago - Stars: 18 - Forks: 1

SCZwangxiao/DEPICT

A multi-modal video caption dataset with richer annotations

Language: Python - Size: 1.48 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

patrick-tssn/MM-NIAVH

Pressure Testing Large Video-Language Models (LVLMs): performing multimodal retrieval from an LVLM at various video lengths to measure accuracy

Language: Python - Size: 29.7 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 3 - Forks: 1

MikeWangWZHL/VidIL

PyTorch code for "Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners"

Language: Python - Size: 109 MB - Last synced at: over 1 year ago - Pushed at: almost 3 years ago - Stars: 110 - Forks: 2

showlab/Region_Learner

The PyTorch implementation for "Video-Text Pre-training with Learned Regions"

Language: Python - Size: 14.7 MB - Last synced at: 5 months ago - Pushed at: about 3 years ago - Stars: 42 - Forks: 2

zinengtang/DeCEMBERT

PyTorch version of DeCEMBERT: Learning from Noisy Instructional Videos via Dense Captions and Entropy Minimization (NAACL 2021)

Language: Python - Size: 215 KB - Last synced at: 5 months ago - Pushed at: over 2 years ago - Stars: 17 - Forks: 1

jena-shreyas/Awesome-Video-Language-Resources

A repository of Video Language papers, code and datasets.

Size: 7.81 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

SCZwangxiao/RTQ-MM2023

ACM Multimedia 2023 (Oral) - RTQ: Rethinking Video-language Understanding Based on Image-text Model

Language: Python - Size: 7.96 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 6 - Forks: 0

zjr2000/GVL

Official implementation for the paper "Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos"

Language: Python - Size: 109 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 19 - Forks: 6

TheShadow29/VidSitu

[CVPR 2021] Visual Semantic Role Labeling for Video Understanding (https://arxiv.org/abs/2104.00990)

Language: Python - Size: 928 KB - Last synced at: about 2 years ago - Pushed at: about 4 years ago - Stars: 50 - Forks: 7

waybarrios/guidance-based-video-grounding

The official PyTorch implementation of the paper: "Localizing Moments in Long Video Via Multimodal Guidance"

Size: 54.7 KB - Last synced at: about 2 years ago - Pushed at: about 2 years ago - Stars: 6 - Forks: 0

Maddy12/SSL4VideoSurvey

The official GitHub page for the survey paper "Self-Supervised learning for Videos: A survey"

Size: 665 KB - Last synced at: about 2 years ago - Pushed at: about 2 years ago - Stars: 0 - Forks: 0

MCG-NJU/VLG

VLG: General Video Recognition with Web Textual Knowledge (https://arxiv.org/abs/2212.01638)

Language: Python - Size: 96.7 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 8 - Forks: 0

shufangxun/MAC

An end-to-end masked contrastive video-and-language pre-training framework

Size: 1.96 MB - Last synced at: over 2 years ago - Pushed at: over 2 years ago - Stars: 23 - Forks: 0

Related Keywords
video-language (27), vision-language (5), vision-and-language (5), pytorch (5), pretraining (4), video-language-understanding (3), video-text-retrieval (3), msrvtt (3), video (3), video-language-pretraining (3), dataset (2), video-captioning (2), multi-modal (2), video-question-answering (2), moment-retrieval (2), video-grounding (2), video-summarization (2), representation-learning (2), multimodal (2), action-recognition (2), video-text (2), clip (2), vision-transformer (2), llm (2), multimodal-large-language-models (2), long-video-understanding (2), pre-training (2), multimodal-deep-learning (1), deep-learning (1), foundational-models (1), machine-learning (1), youcook2 (1), vlep (1), vatex (1), msvd (1), gpt-3 (1), blip (1), pressure-testing (1), video-caption (1), scalability (1), retrieval (1), mae (1), end-to-end-learning (1), didemo (1), contrastive-learning (1), activitynet (1), open-set-recognition (1), long-tailed-recognition (1), few-shot-recognition (1), video-to-video (1), text-to-video (1), computer-vision (1), multimodal-learning (1), vision (1), srl (1), semantic-roles (1), nlp (1), grounding (1), event-relations (1), captioning-videos (1), captioning (1), temporal-localization (1), pytorch-implementation (1), dense-video-captioning (1), video-understanding (1), efficiency (1), chatgpt (1), egocentric-vision (1), prompt-learning (1), llm-agent (1), gui (1), highlight-detection (1), video-to-text (1), video-retrieval (1), video-description (1), youcookii (1), segmentation (1), retrieval-task (1), pretrain (1), multimodality (1), multimodal-sentiment-analysis (1), localization (1), joint (1), coin (1), caption-task (1), caption (1), alignment (1), visual-instruction-tuning (1), spatial-temporal (1), mllm (1), referring-video-object-segmentation (1), video-transformer (1), transformer-readling-list (1), multi-modal-cvpr2021 (1), mlp-mixer (1), language (1), image-transformer (1), efficiency-transformer (1), video-story-generation (1), video-story (1)