GitHub topics: video-language
microsoft/UniVL
An official implementation for "UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation"
Language: Python - Size: 219 KB - Last synced at: 6 days ago - Pushed at: about 1 year ago - Stars: 360 - Forks: 58

willyfh/awesome-video-text-datasets
A curated list of video-text datasets in a variety of languages. These datasets can be used for video captioning (video description) or video retrieval.
Size: 48.8 KB - Last synced at: 7 days ago - Pushed at: over 1 year ago - Stars: 38 - Forks: 3

showlab/UniVTG
[ICCV 2023] UniVTG: Towards Unified Video-Language Temporal Grounding
Language: Python - Size: 22.7 MB - Last synced at: about 2 months ago - Pushed at: over 1 year ago - Stars: 356 - Forks: 34

showlab/VideoGUI
[NeurIPS 2024 D&B] VideoGUI: A Benchmark for GUI Automation from Instructional Videos
Language: JavaScript - Size: 32.2 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 35 - Forks: 2

salesforce/ALPRO 📦
Align and Prompt: Video-and-Language Pre-training with Entity Prompts
Language: Python - Size: 311 KB - Last synced at: 19 days ago - Pushed at: 4 months ago - Stars: 188 - Forks: 17

showlab/EgoVLP
[NeurIPS 2022] Egocentric Video-Language Pretraining
Language: Python - Size: 1.97 MB - Last synced at: about 2 months ago - Pushed at: over 1 year ago - Stars: 242 - Forks: 20

showlab/VLog
[CVPR 2025] Video Narration as Vocabulary & Video as Long Document
Language: Python - Size: 10.8 MB - Last synced at: 4 months ago - Pushed at: 6 months ago - Stars: 567 - Forks: 28

showlab/all-in-one
[CVPR2023] All in One: Exploring Unified Video-Language Pre-training
Language: Python - Size: 1.53 MB - Last synced at: 4 months ago - Pushed at: over 2 years ago - Stars: 282 - Forks: 19

bytedance/Shot2Story
A new multi-shot video understanding benchmark Shot2Story with comprehensive video summaries and detailed shot-level captions.
Language: Python - Size: 153 MB - Last synced at: 5 months ago - Pushed at: 7 months ago - Stars: 128 - Forks: 6

junchen14/Multi-Modal-Transformer
This repository collects a variety of multi-modal transformer architectures, including image transformers, video transformers, image-language transformers, video-language transformers, and self-supervised learning models. It also collects useful tutorials and tools in these related domains.
Size: 354 KB - Last synced at: 5 months ago - Pushed at: about 3 years ago - Stars: 225 - Forks: 31

wjn922/ReferFormer
[CVPR2022] Official Implementation of ReferFormer
Language: Python - Size: 52 MB - Last synced at: 5 months ago - Pushed at: 7 months ago - Stars: 339 - Forks: 25

bigai-nlco/VideoTGB
[EMNLP 2024] A Video Chat Agent with Temporal Prior
Language: Python - Size: 51.6 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 29 - Forks: 2

zinengtang/Perceiver_VL
PyTorch code for "Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention" (WACV 2023)
Language: Python - Size: 1.09 MB - Last synced at: about 2 months ago - Pushed at: over 2 years ago - Stars: 33 - Forks: 4

JerryYLi/svitt
Code for CVPR 2023 paper "SViTT: Temporal Learning of Sparse Video-Text Transformers"
Language: Python - Size: 418 KB - Last synced at: 7 months ago - Pushed at: about 2 years ago - Stars: 18 - Forks: 1

SCZwangxiao/DEPICT
A multi-modal video caption dataset with richer annotations
Language: Python - Size: 1.48 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

patrick-tssn/MM-NIAVH
Pressure Testing Large Video-Language Models (LVLM): multimodal retrieval from an LVLM at various video lengths to measure accuracy
Language: Python - Size: 29.7 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 3 - Forks: 1

MikeWangWZHL/VidIL
PyTorch code for "Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners"
Language: Python - Size: 109 MB - Last synced at: over 1 year ago - Pushed at: almost 3 years ago - Stars: 110 - Forks: 2

showlab/Region_Learner
The PyTorch implementation for "Video-Text Pre-training with Learned Regions"
Language: Python - Size: 14.7 MB - Last synced at: 5 months ago - Pushed at: about 3 years ago - Stars: 42 - Forks: 2

zinengtang/DeCEMBERT
PyTorch version of DeCEMBERT: Learning from Noisy Instructional Videos via Dense Captions and Entropy Minimization (NAACL 2021)
Language: Python - Size: 215 KB - Last synced at: 5 months ago - Pushed at: over 2 years ago - Stars: 17 - Forks: 1

jena-shreyas/Awesome-Video-Language-Resources
A repository of Video Language papers, code and datasets.
Size: 7.81 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

SCZwangxiao/RTQ-MM2023
ACM Multimedia 2023 (Oral) - RTQ: Rethinking Video-language Understanding Based on Image-text Model
Language: Python - Size: 7.96 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 6 - Forks: 0

zjr2000/GVL
Official implementation for the paper "Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos"
Language: Python - Size: 109 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 19 - Forks: 6

TheShadow29/VidSitu
[CVPR21] Visual Semantic Role Labeling for Video Understanding (https://arxiv.org/abs/2104.00990)
Language: Python - Size: 928 KB - Last synced at: about 2 years ago - Pushed at: about 4 years ago - Stars: 50 - Forks: 7

waybarrios/guidance-based-video-grounding
The official PyTorch implementation of the paper: "Localizing Moments in Long Video Via Multimodal Guidance"
Size: 54.7 KB - Last synced at: about 2 years ago - Pushed at: about 2 years ago - Stars: 6 - Forks: 0

Maddy12/SSL4VideoSurvey
The official GitHub page for the survey paper "Self-Supervised Learning for Videos: A Survey"
Size: 665 KB - Last synced at: about 2 years ago - Pushed at: about 2 years ago - Stars: 0 - Forks: 0

MCG-NJU/VLG
VLG: General Video Recognition with Web Textual Knowledge (https://arxiv.org/abs/2212.01638)
Language: Python - Size: 96.7 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 8 - Forks: 0

shufangxun/MAC
An end-to-end masked contrastive video-and-language pre-training framework
Size: 1.96 MB - Last synced at: over 2 years ago - Pushed at: over 2 years ago - Stars: 23 - Forks: 0
