GitHub topics: image-captioning

Repositories

hsp-iit/embodied-captioning

Official repository of the preprint "Embodied Image Captioning: Self-supervised Learning Agents for Spatially Coherent Image Descriptions"

Language: Python - Size: 944 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 2 - Forks: 0

HanXinzi-AI/awesome-computer-vision-resources

A collection of computer vision projects and tools.

Size: 49.8 MB - Last synced at: 3 days ago - Pushed at: about 1 year ago - Stars: 273 - Forks: 34

MahmoudAdham6544/vision-speak

VisionSpeak: A deep learning pipeline that generates natural language captions from images using a Vision-Encoder and GPT-2 Decoder. Bridging vision and language with PyTorch and Transformers.

Language: Python - Size: 5.56 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 0 - Forks: 0

ZerolanCore integrates many open-source, locally deployable AI models, and aims to integrate a series of AI models such as large language model (LLM), automatic speech recognition (ASR), text-to-speech (TTS), image captioning, optical character recognition (OCR), video captioning, etc.

Language: Python - Size: 102 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 3 - Forks: 0

X-PLUG/mPLUG

mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections. (EMNLP 2022)

Language: Python - Size: 1.56 MB - Last synced at: 3 days ago - Pushed at: about 2 years ago - Stars: 93 - Forks: 8

cuixing158/Awesome-CV-MasterHub

🔥 🔥 🔥 A paper list of some recent Computer Vision (CV) works

Size: 43.5 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 419 - Forks: 29

SocAIty/socaity

SDK for generative AI.

Language: Python - Size: 26.2 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 1 - Forks: 0

huiteuros/generalt

FastAPI service for generating image ALT text using the BLIP model

Language: Python - Size: 0 Bytes - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 0 - Forks: 0

AI-14/pkatransnet

[IVC 2025] [Official code] - Enhancing radiology report generation: A prior knowledge-aware transformer network for effective alignment and fusion of multi-modal radiological data

Language: Python - Size: 4.42 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 0 - Forks: 1

iOPENCap/awesome-remote-image-captioning

A list of awesome remote sensing image captioning resources

Language: Python - Size: 198 KB - Last synced at: about 22 hours ago - Pushed at: 13 days ago - Stars: 110 - Forks: 1

PtiCalin/vault_image-description

Ollama powered image description

Language: JavaScript - Size: 64.5 KB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 1 - Forks: 0

OpenGVLab/InternGPT

InternGPT (iGPT) is an open source demo platform where you can easily showcase your AI models. Now it supports DragGAN, ChatGPT, ImageBind, multimodal chat like GPT-4, SAM, interactive image editing, etc. Try it at igpt.opengvlab.com (an online demo system supporting DragGAN, ChatGPT, ImageBind, and SAM)

Language: Python - Size: 41.9 MB - Last synced at: 6 days ago - Pushed at: 10 months ago - Stars: 3,214 - Forks: 231

xogie/Add_Tags-Titles-to-Images

A Python tool that auto-generates captions and keyword tags for JPG/PNG images using a local vision-language model like BakLLaVA. Captions and tags are embedded into EXIF metadata (Title + Tags) for native Windows Explorer visibility. Includes batch processing and GUI folder selection.

Language: Python - Size: 12.7 KB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 0 - Forks: 0

SkalskiP/awesome-foundation-and-multimodal-models

👁️ + 💬 + 🎧 = 🤖 Curated list of top foundation and multimodal models! [Paper + Code + Examples + Tutorials]

Language: Python - Size: 58.6 KB - Last synced at: 3 days ago - Pushed at: over 1 year ago - Stars: 621 - Forks: 45

cstsunfu/dlk

A PyTorch Based Deep Learning Quick Develop Framework. One-Stop for train/predict/server/demo

Language: Python - Size: 9.42 MB - Last synced at: 10 days ago - Pushed at: 5 months ago - Stars: 24 - Forks: 0

alasdairtran/transform-and-tell

[CVPR 2020] Transform and Tell: Entity-Aware News Image Captioning

Language: Python - Size: 14.2 MB - Last synced at: 6 days ago - Pushed at: about 1 year ago - Stars: 91 - Forks: 15

AkagawaTsurunaki/ZerolanLiveRobot

AI VTuber with LLM, ASR, TTS, OCR, CV and more technologies to live stream or play Minecraft with you.

Language: Python - Size: 2.48 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 29 - Forks: 3

claudaff/automatic-map-storytelling

An Efficient System for Automatic Map Storytelling using Generative Pre-trained Transformer (GPT) Models – A Case Study on Historical Maps

Language: Python - Size: 2 MB - Last synced at: 17 days ago - Pushed at: 17 days ago - Stars: 4 - Forks: 2

terry-r123/Awesome-Captioning

A curated list of multimodal captioning research (including image captioning, video captioning, and text captioning)

Size: 56.6 KB - Last synced at: 4 days ago - Pushed at: about 3 years ago - Stars: 109 - Forks: 10

ZhuoxuanCao/BLIP-Hugging-Face-Quickstart-Finetune-Lora

A modular, easy-to-use framework for fine-tuning BLIP-1 on custom image captioning tasks using LoRA and Hugging Face Transformers. Includes data preprocessing, training scripts, and inference demos — with custom patching on the vision backbone. Ideal for researchers, engineers, and AI enthusiasts building lightweight captioning systems.

Language: Python - Size: 178 MB - Last synced at: 19 days ago - Pushed at: 19 days ago - Stars: 0 - Forks: 0

Pavansomisetty21/Image-Caption-Generation-using-LLMs-GEMINI-

Generates captions for user-supplied images using prompt engineering and generative AI

Language: Jupyter Notebook - Size: 366 KB - Last synced at: about 23 hours ago - Pushed at: 10 months ago - Stars: 10 - Forks: 1

kuanghuei/SCAN

PyTorch source code for "Stacked Cross Attention for Image-Text Matching" (ECCV 2018)

Language: Python - Size: 34.2 KB - Last synced at: 5 days ago - Pushed at: about 2 years ago - Stars: 565 - Forks: 115

digitechvishal/Image-Caption-Generator-Using-AI-Azure

This project is a lightweight web application that leverages Microsoft Azure’s Computer Vision API to generate accurate captions for uploaded images. Designed using Python and Streamlit, it provides a clean and intuitive interface to interact with AI-powered image analysis.

Language: Python - Size: 337 KB - Last synced at: 19 days ago - Pushed at: 19 days ago - Stars: 0 - Forks: 0

dp-ops/Image_captioning

Image captioning model using ResNet34 and an attention LSTM. The project is implemented from scratch, using pretrained ImageNet weights for ResNet34 and fine-tuning the model on the Flickr8k and Flickr30k datasets. Reinforcement learning capabilities are available, but need fixing and a better GPU.

Language: Python - Size: 60.5 MB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 0 - Forks: 0

aakcay5656/image-captioning-pytorch

The project I did in the OBSS AI Intern Competition

Language: Jupyter Notebook - Size: 9.77 KB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 0 - Forks: 0

PrathameshPC77/ai_image_captioning

🖼️ AI Image Caption Generator — A simple and smart web app that generates descriptive captions for any image you upload using a pre-trained Vision Transformer (ViT) and GPT-2 model. Built with Python and Streamlit, powered by Hugging Face Transformers.

Language: Python - Size: 576 KB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 0 - Forks: 0

msamprovalaki/Exploring-Multimodal-Large-Language-Models-for-Medical-Image-Captioning

This repository includes the code for my Master Thesis, which investigates the application of Multimodal Large Language Models (MLLMs) for medical image captioning

Language: Python - Size: 5.45 MB - Last synced at: 24 days ago - Pushed at: 24 days ago - Stars: 6 - Forks: 0

TheoCoombes/ClipCap

Using pretrained encoder and language models to generate captions from multimedia inputs.

Language: Python - Size: 92.7 MB - Last synced at: 11 days ago - Pushed at: over 2 years ago - Stars: 97 - Forks: 13

sgrvinod/a-PyTorch-Tutorial-to-Image-Captioning

Show, Attend, and Tell | a PyTorch Tutorial to Image Captioning

Language: Python - Size: 12.6 MB - Last synced at: about 1 month ago - Pushed at: almost 3 years ago - Stars: 2,846 - Forks: 726

aimagelab/meshed-memory-transformer

Meshed-Memory Transformer for Image Captioning. CVPR 2020

Language: Python - Size: 7.07 MB - Last synced at: 28 days ago - Pushed at: over 2 years ago - Stars: 538 - Forks: 134

peteanderson80/Up-Down-Captioner

Automatic image captioning model based on Caffe, using features from bottom-up attention.

Language: Jupyter Notebook - Size: 2.6 MB - Last synced at: 9 days ago - Pushed at: over 2 years ago - Stars: 246 - Forks: 68

ejlnmusic/PaliGemma-flickr8k-finetuning

This repository provides a method to fine-tune the PaliGemma model on the Flickr8k dataset for improved image captioning. Explore the features and utilities designed for efficient training and testing. 🐙🌟

Language: Jupyter Notebook - Size: 375 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

STCTheRealNooby/Image-Captioning-with-ViT-and-BERT

This repository provides a straightforward image-captioning pipeline that combines a Vision Transformer (ViT) encoder with a BERT decoder. Use this setup to fine-tune your model on the Flickr8k dataset and generate captions for new images. 🖼️✨

Language: Jupyter Notebook - Size: 5.11 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

imaginary-cloud/CameraManager

Simple Swift class to provide all the configurations you need to create a custom camera view in your app

Language: Swift - Size: 4.7 MB - Last synced at: about 1 month ago - Pushed at: 12 months ago - Stars: 1,385 - Forks: 329

AHMEDSANA/PaliGemma-flickr8k-finetuning

This repository contains code for fine-tuning Google's PaliGemma vision-language model on the Flickr8k dataset for image captioning tasks

Language: Jupyter Notebook - Size: 401 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

AHMEDSANA/Image-Captioning-with-ViT-and-BERT

A concise image-captioning pipeline that fine-tunes a ViT encoder with a BERT decoder on Flickr8K for training, plus a standalone script to load the trained model and generate captions on new images.

Language: Jupyter Notebook - Size: 5.22 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

Markin-Wang/awesome_radiology_report_generation

Awesome radiology report generation and image captioning papers.

Size: 59.6 KB - Last synced at: 19 days ago - Pushed at: 9 months ago - Stars: 75 - Forks: 6

AnnikaLindh/Diverse_and_Specific_Image_Captioning

Unsupervised specificity-guided optimization of Image Captioning models to encourage meaningful diversity in the generated captions. Code for the paper Generating Diverse and Meaningful Captions: Unsupervised Specificity Optimization for Image Captioning (Lindh et al., 2018).

Language: Python - Size: 62.5 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 13 - Forks: 8

abhay-43/Internet-Memes-Classification-using-Multimodal-Learning-and-Image-Captioning

This project classifies internet memes using multimodal learning by combining textual and visual features. It performs offensive content detection and emotion classification leveraging the MultiOFF and Memotion-7k datasets. The model integrates ALBERT for text, VGG-11 for images, and BLIP-generated captions to improve understanding of meme sentimen

Language: Jupyter Notebook - Size: 6.01 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

stevan-milovanovic/LiteRT-for-Android

Image Classification with LiteRT

Language: Kotlin - Size: 171 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

ttengwang/Caption-Anything

Caption-Anything is a versatile tool combining image segmentation, visual captioning, and ChatGPT, generating tailored captions with diverse controls for user preferences. https://huggingface.co/spaces/TencentARC/Caption-Anything https://huggingface.co/spaces/VIPLab/Caption-Anything

Language: Python - Size: 51.9 MB - Last synced at: about 1 month ago - Pushed at: almost 2 years ago - Stars: 1,741 - Forks: 104

salesforce/BLIP

PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

Language: Jupyter Notebook - Size: 6.34 MB - Last synced at: about 1 month ago - Pushed at: 11 months ago - Stars: 5,265 - Forks: 688

jhc13/taggui

Tag manager and captioner for image datasets

Language: Python - Size: 22.1 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 994 - Forks: 46

tuanio/image2latex

Image to Latex using Encoder-Decoder architecture

Language: Jupyter Notebook - Size: 1.18 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 13 - Forks: 5

YehLi/xmodaler

X-modaler is a versatile and high-performance codebase for cross-modal analytics (e.g., image captioning, video captioning, vision-language pre-training, visual question answering, visual commonsense reasoning, and cross-modal retrieval).

Language: Python - Size: 12.2 MB - Last synced at: about 1 month ago - Pushed at: over 2 years ago - Stars: 970 - Forks: 105

OFA-Sys/OFA

Official repository of OFA (ICML 2022). Paper: OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

Language: Python - Size: 120 MB - Last synced at: about 1 month ago - Pushed at: about 1 year ago - Stars: 2,501 - Forks: 248

salesforce/LAVIS

LAVIS - A One-stop Library for Language-Vision Intelligence

Language: Jupyter Notebook - Size: 79.3 MB - Last synced at: about 1 month ago - Pushed at: 7 months ago - Stars: 10,558 - Forks: 1,031

Abhrankan-Chakrabarti/GeminiFusion

A versatile web application that leverages advanced AI models, including Gemini Flash, DALL-E 3, and Stable Diffusion XL, to provide three main features: Chatbot Interaction, Image Captioning, and Text-to-Image Generation.

Language: Python - Size: 43 KB - Last synced at: 12 days ago - Pushed at: about 2 months ago - Stars: 4 - Forks: 2

peteanderson80/bottom-up-attention

Bottom-up attention model for image captioning and VQA, based on Faster R-CNN and Visual Genome

Language: Jupyter Notebook - Size: 13.4 MB - Last synced at: about 1 month ago - Pushed at: over 2 years ago - Stars: 1,450 - Forks: 378

gokayfem/ComfyUI_VLM_nodes

Custom ComfyUI nodes for Vision Language Models, Large Language Models, Image to Music, Text to Music, Consistent and Random Creative Prompt Generation

Language: Python - Size: 359 KB - Last synced at: about 1 month ago - Pushed at: 5 months ago - Stars: 490 - Forks: 50

NVlabs/prismer

The implementation of "Prismer: A Vision-Language Model with Multi-Task Experts".

Language: Python - Size: 4.25 MB - Last synced at: 18 days ago - Pushed at: over 1 year ago - Stars: 1,308 - Forks: 73

milaan9/Deep_Learning_Algorithms_from_Scratch

This repository explores the variety of techniques and algorithms commonly used in deep learning and the implementation in MATLAB and PYTHON

Language: Jupyter Notebook - Size: 9.85 MB - Last synced at: about 1 month ago - Pushed at: over 2 years ago - Stars: 172 - Forks: 171

yashk2810/Image-Captioning

Image Captioning using InceptionV3 and beam search

Language: Jupyter Notebook - Size: 74.6 MB - Last synced at: about 1 month ago - Pushed at: almost 5 years ago - Stars: 329 - Forks: 123

symphl/blind-vision-assistant

An AI-powered embedded system that captures real-time images, generates descriptive captions using Qwen, and reads them out loud to assist the visually impaired.

Language: C++ - Size: 4.88 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

tanyuqian/redco

NAACL '24 (Best Demo Paper RunnerUp) / MlSys @ NeurIPS '23 - RedCoast: A Lightweight Tool to Automate Distributed Training and Inference

Language: Python - Size: 11.5 MB - Last synced at: 6 days ago - Pushed at: 7 months ago - Stars: 66 - Forks: 7

microsoft/Oscar 📦

Oscar and VinVL

Language: Python - Size: 715 KB - Last synced at: 3 days ago - Pushed at: almost 2 years ago - Stars: 1,049 - Forks: 251

Dewiin/blind-spot

CUNY Tech Prep 2025 Project

Language: JavaScript - Size: 3.63 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 1 - Forks: 0

kdexd/virtex

[CVPR 2021] VirTex: Learning Visual Representations from Textual Annotations

Language: Python - Size: 3.65 MB - Last synced at: about 1 month ago - Pushed at: over 1 year ago - Stars: 563 - Forks: 61

Narius2030/IMCP-Support-Blinders

This project focuses on image captioning by creating two primary models: DarkNetLM and DarkNetVG2. Both models leverage the CSP DarkNet53 architecture, the backbone of YOLOv8, for feature extraction from images, combined with Transformers or an LSTM to generate captions.

Language: Python - Size: 28.8 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

phachon/gis

gis (go image server): an image service implemented in Go, providing basic upload, download, storage, and proportional cropping features

Language: Go - Size: 1.84 MB - Last synced at: about 2 months ago - Pushed at: about 7 years ago - Stars: 123 - Forks: 36

nocaps-org/nocaps-org.github.io

Website for nocaps

Language: HTML - Size: 46.7 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 3 - Forks: 1

JHansiduYapa/CNN-LSTM-Image-Caption-Generator

This repository implements an image caption generator using a pretrained ResNet101 for feature extraction and an LSTM network for generating captions from images.

Language: Jupyter Notebook - Size: 9.89 MB - Last synced at: about 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

Belkinmix/Streamlit-Mini-AI-App

A streamlit-powered app that showcases multiple AI-powered tools: facial emotion detection, batch image captioning, text sentiment analysis, and a chaos-filled fun zone.

Language: Python - Size: 2.13 MB - Last synced at: 26 days ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

bhoomikaniranjan/Depiction-of-image-features-with-audio-to-aid-visually-impaired-persons

This project transforms visual content into vivid audio narratives for visually impaired individuals. Using advanced image recognition and text-to-speech technologies, it generates detailed captions and provides audio output in English, Kannada, and Hindi, fostering inclusivity and independence.

Language: Python - Size: 8.79 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 2 - Forks: 0

kalyaninguva/Image_Captioning

This project generates textual descriptions for images using deep learning. I

Language: Jupyter Notebook - Size: 962 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 1 - Forks: 0

clementfornes13/leyenda_project

Leyenda is a Deep Learning-based project focused on image classification, preprocessing, and automatic caption generation. It combines Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) to process visual data and describe it in natural language.

Language: Jupyter Notebook - Size: 172 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0