GitHub topics: multimodal

Repositories

xmed-lab/MultiEYE

[IEEE TMI 2024] MultiEYE: Dataset and Benchmark for OCT-Enhanced Retinal Disease Recognition from Fundus Images

Language: Python - Size: 692 KB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 18 - Forks: 2

FuxiaoLiu/LRV-Instruction

[ICLR'24] Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

Language: Python - Size: 23.9 MB - Last synced at: 2 days ago - Pushed at: about 1 year ago - Stars: 277 - Forks: 13

atfortes/Awesome-LLM-Reasoning

Reasoning in LLMs: Papers and Resources, including Chain-of-Thought, OpenAI o1, and DeepSeek-R1 🍓

Size: 460 KB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 3,038 - Forks: 173

TEN-framework/ten-framework

The world’s first real-time, distributed, cloud-edge collaborative multimodal AI Agent Framework that simultaneously supports C/C++/Go/Python/JS/TS

Language: C - Size: 94.4 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 5,788 - Forks: 676

HCPLab-SYSU/Book-of-MLM

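Book: "Multimodal Large Models: A New-Generation Paradigm of Artificial Intelligence Technology" (《多模态大模型：新一代人工智能技术范式》), by Yang Liu and Liang Lin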

Language: HTML - Size: 33.7 MB - Last synced at: 6 days ago - Pushed at: 5 months ago - Stars: 205 - Forks: 21

PaddlePaddle/PaddleMIX

Paddle Multimodal Integration and eXploration, supporting mainstream multi-modal tasks, including end-to-end large-scale multi-modal pretrain models and diffusion model toolbox. Equipped with high performance and flexibility.

Language: Python - Size: 177 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 628 - Forks: 210

rustic-ai/ui-components

React component library for crafting user-friendly and engaging conversational experiences

Language: JavaScript - Size: 20.5 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 59 - Forks: 12

1set-t/ai-model

Industrial-grade weather visualization system that transforms AI model predictions into professional meteorological plots, emphasizing operational forecasting capabilities.

Size: 1.95 KB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 0 - Forks: 0

JunyiYe/TextFlow

[NAACL 2025] Beyond End-to-End VLMs: Leveraging Intermediate Text Representations for Superior Flowchart Understanding

Language: Python - Size: 284 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 6 - Forks: 2

Yangyi-Chen/Multimodal-AND-Large-Language-Models

Paper list about multimodal and large language models, only used to record papers I read in the daily arxiv for personal needs.

Size: 3.86 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 621 - Forks: 41

microsoft/unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities

Language: Python - Size: 66.4 MB - Last synced at: 7 days ago - Pushed at: 2 months ago - Stars: 21,188 - Forks: 2,620

akshaysinhaaa/sentiment-analysis

Language: Python - Size: 26.4 KB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 0 - Forks: 0

rerun-io/rerun

Visualize streams of multimodal data. Free, fast, easy to use, and simple to integrate. Built in Rust.

Language: Rust - Size: 644 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 8,337 - Forks: 445

swyxio/ai-notes

Notes for software engineers getting up to speed on new AI developments. Serves as a datastore for https://latent.space writing and product brainstorming, with cleaned-up canonical references under the /Resources folder.

Language: HTML - Size: 2.14 MB - Last synced at: 7 days ago - Pushed at: 23 days ago - Stars: 5,643 - Forks: 470

ritzz-ai/GUI-R1

Official implementation of GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents

Language: Python - Size: 974 KB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 70 - Forks: 5

kyegomez/NaViT

My implementation of "Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution"

Language: Python - Size: 210 KB - Last synced at: 3 days ago - Pushed at: about 1 month ago - Stars: 230 - Forks: 11

Wangbiao2/R1-Track

R1-Track: Direct Application of MLLMs to Visual Object Tracking via Reinforcement Learning.

Language: Python - Size: 1.71 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 28 - Forks: 1

lxe/llavavision

A simple "Be My Eyes" web app with a llama.cpp/llava backend

Language: JavaScript - Size: 27.2 MB - Last synced at: 2 days ago - Pushed at: over 1 year ago - Stars: 489 - Forks: 32

tattle-made/feluda

A configurable engine for analysing multi-lingual and multi-modal content.

Language: Python - Size: 28.1 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 60 - Forks: 51

enoche/MultimodalRecSys

A curated list of awesome resources about multimodal recommender systems.

Size: 335 KB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 361 - Forks: 24

roboflow/maestro

Streamline the fine-tuning process for multimodal models: PaliGemma 2, Florence-2, and Qwen2.5-VL

Language: Python - Size: 10.6 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 2,555 - Forks: 203

oidlabs-com/Lexoid

Multimodal document parser for high quality data understanding and extraction

Language: Python - Size: 46.7 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 44 - Forks: 6

kdeps/kdeps

Kdeps is an all-in-one AI framework for building Dockerized full-stack AI applications (FE and BE) that includes open-source LLM models out-of-the-box.

Language: Go - Size: 4.26 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 21 - Forks: 1

ALEEEHU/World-Simulator

Simulating the Real World: Survey & Resources, which contains our survey "Simulating the Real World: A Unified Survey of Multimodal Generative Models" and Awesome-Text2X-Resources. Watch this repository for the latest updates! 🔥

Size: 18.1 MB - Last synced at: 8 days ago - Pushed at: 12 days ago - Stars: 246 - Forks: 14

The-Martyr/CausalMM

[ICLR 2025] Mitigating Modality Prior-Induced Hallucinations in Multimodal Large Language Models via Deciphering Attention Causality

Language: Python - Size: 7.1 MB - Last synced at: 8 days ago - Pushed at: 9 days ago - Stars: 25 - Forks: 2

rom1504/clip-retrieval

Easily compute CLIP embeddings and build a CLIP retrieval system with them

Language: Jupyter Notebook - Size: 3.75 MB - Last synced at: 5 days ago - Pushed at: about 1 year ago - Stars: 2,546 - Forks: 223

alishhde/ArtBuddy

ArtBuddy is an AI-powered creative companion that enhances your graphic design workflow. It combines multiple intelligent agents to help you brainstorm ideas, find design inspiration, and refine your creative concepts.

Language: Python - Size: 5.11 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 0 - Forks: 0

mbodiai/embodied-agents

Seamlessly integrate state-of-the-art transformer models into robotics stacks

Language: Python - Size: 75.2 MB - Last synced at: 2 days ago - Pushed at: 19 days ago - Stars: 207 - Forks: 22

shure-dev/Awesome-LLM-Papers-Comprehensive-Topics

Awesome LLM Papers and repos on very comprehensive topics.

Size: 450 KB - Last synced at: 8 days ago - Pushed at: 9 months ago - Stars: 217 - Forks: 22

tyler-romero/tyler-romero.github.io

Technical Blog + Personal Website

Language: Nunjucks - Size: 56.3 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 2 - Forks: 0

gokayfem/awesome-vlm-architectures

Famous Vision Language Models and Their Architectures

Language: Markdown - Size: 2.26 MB - Last synced at: 9 days ago - Pushed at: 3 months ago - Stars: 804 - Forks: 42

reasoning-survey/Awesome-Reasoning-Foundation-Models

✨✨Latest Papers and Benchmarks in Reasoning with Foundation Models

Size: 7.37 MB - Last synced at: 9 days ago - Pushed at: 18 days ago - Stars: 571 - Forks: 56

GaochangWu/FMF-Benchmark

This is a cross-modal benchmark for industrial anomaly detection.

Language: Python - Size: 6.82 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 9 - Forks: 1

mbzuai-oryx/VideoGPT-plus

Official Repository of paper VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding

Language: Python - Size: 16.5 MB - Last synced at: 2 days ago - Pushed at: about 1 month ago - Stars: 271 - Forks: 17

mahmoodlab/MCAT

Multimodal Co-Attention Transformer for Survival Prediction in Gigapixel Whole Slide Images - ICCV 2021

Language: Jupyter Notebook - Size: 540 MB - Last synced at: 6 days ago - Pushed at: about 3 years ago - Stars: 200 - Forks: 40

bumbelbee777/SillyAI

Complex-valued neuro-symbolic transformer using PyTorch.

Language: Python - Size: 102 KB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 0 - Forks: 0

glami/glami-1m

The largest multilingual image-text classification dataset. It contains fashion products.

Language: Jupyter Notebook - Size: 5.43 MB - Last synced at: 5 days ago - Pushed at: almost 2 years ago - Stars: 72 - Forks: 7

pdaicode/awesome-LLMs-finetuning

Collection of resources for finetuning Large Language Models (LLMs).

Size: 103 KB - Last synced at: 8 days ago - Pushed at: 4 months ago - Stars: 77 - Forks: 8

kyegomez/RT-X

PyTorch implementation of the models RT-1-X and RT-2-X from the paper: "Open X-Embodiment: Robotic Learning Datasets and RT-X Models"

Language: Python - Size: 940 KB - Last synced at: 6 days ago - Pushed at: 18 days ago - Stars: 205 - Forks: 22

jwu114/CAP

[NAACL Findings 2025] Code and data of "Mitigating Hallucinations in Multimodal Spatial Relations through Constraint-Aware Prompting"

Language: Python - Size: 88.9 KB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 3 - Forks: 0

willxxy/ECG-Bench

A Unified Framework for Benchmarking Generative Electrocardiogram-Language Models (ELMs)

Language: Python - Size: 6.67 MB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 9 - Forks: 2

Moha111-h/Qwen3

Qwen3 is the large language model series developed by Qwen team, Alibaba Cloud.

Language: Shell - Size: 3.07 MB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 0 - Forks: 0

rom1504/cc2dataset

Easily convert Common Crawl to a dataset of captions and documents: image/text, audio/text, video/text, ...

Language: Python - Size: 50.8 KB - Last synced at: 6 days ago - Pushed at: over 1 year ago - Stars: 318 - Forks: 27

C-W-D-Harshit/lume-ai

AI-powered multimodal chat app with real-time responses, file support, token tracking, and dark mode. Built with Next.js. Open source under MIT.

Language: TypeScript - Size: 1.3 MB - Last synced at: about 6 hours ago - Pushed at: 4 months ago - Stars: 9 - Forks: 2

cogmhear/avse_challenge (fork of claritychallenge/clarity)

COG-MHEAR Audio-Visual Speech Enhancement Challenge

Language: Python - Size: 774 KB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 40 - Forks: 11

Yutong-Zhou-cv/Awesome-Text-to-Image

(ෆ`꒳´ෆ) A Survey on Text-to-Image Generation/Synthesis.

Size: 69.2 MB - Last synced at: 11 days ago - Pushed at: 18 days ago - Stars: 2,330 - Forks: 200

wgcyeo/UniversalRAG

UniversalRAG: Retrieval-Augmented Generation over Multiple Corpora with Diverse Modalities and Granularities

Size: 623 KB - Last synced at: 11 days ago - Pushed at: 13 days ago - Stars: 34 - Forks: 2

GerrySant/multimodalhugs

MultimodalHugs is an extension of Hugging Face that offers a generalized framework for training, evaluating, and using multimodal AI models with minimal code differences, ensuring seamless compatibility with Hugging Face pipelines.

Language: Python - Size: 4.24 MB - Last synced at: 11 days ago - Pushed at: 12 days ago - Stars: 3 - Forks: 2

abhiverse01/hatespeech-multimodal-detection

Multi-Modal Hate Speech Detection using Deep Learning.

Language: Jupyter Notebook - Size: 8.32 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 1 - Forks: 0

vlm-run/vlmrun-hub

A hub for various industry-specific schemas to be used with VLMs.

Language: Python - Size: 352 KB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 501 - Forks: 23

Aisuko/notebooks

Implementations of different ML tasks on the Kaggle platform with GPUs.

Language: Jupyter Notebook - Size: 160 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 20 - Forks: 3

SiddhantBikram/MemeCLIP

MemeCLIP framework and PrideMM Dataset @ EMNLP 2024

Language: Python - Size: 249 KB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 11 - Forks: 0

Sinapsis-AI/sinapsis

Modular and Universal AI

Language: Python - Size: 374 KB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 35 - Forks: 10

Stability-AI/stability-sdk

SDK for interacting with stability.ai APIs (e.g. stable diffusion inference)

Language: Jupyter Notebook - Size: 447 MB - Last synced at: about 23 hours ago - Pushed at: 25 days ago - Stars: 2,440 - Forks: 344

AI4HealthUOL/MDS-ED

Repository for the paper 'MDS-ED: Multimodal Decision Support in the Emergency Department – a benchmark dataset based on MIMIC-IV'.

Language: Python - Size: 4 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 18 - Forks: 2

sofiamironbarroso/Multimodal-Cancer

An exploratory repository covering different modelling approaches for multimodal cancer type prediction.

Language: Jupyter Notebook - Size: 11.7 KB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 0 - Forks: 0

huggingface/OBELICS

Code used for the creation of OBELICS, an open, massive and curated collection of interleaved image-text web documents, containing 141M documents, 115B text tokens and 353M images.

Language: Python - Size: 512 KB - Last synced at: 5 days ago - Pushed at: 9 months ago - Stars: 202 - Forks: 10

bin123apple/InfantAgent

A multimodal agent that can interact with its own PC in a multimodal manner.

Language: Python - Size: 5.24 MB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 6 - Forks: 0

eliranwong/letmedoit

An advanced AI assistant that leverages the capabilities of ChatGPT API, Gemini Pro, AutoGen, and open-source LLMs, enabling it both to engage in conversations and to execute computing tasks on local devices.

Language: Python - Size: 126 MB - Last synced at: 2 days ago - Pushed at: 2 months ago - Stars: 127 - Forks: 25

monatis/clip.cpp

CLIP inference in plain C/C++ with no extra dependencies

Language: C++ - Size: 420 KB - Last synced at: 10 days ago - Pushed at: 9 months ago - Stars: 496 - Forks: 46

NetManAIOps/ChatTS

[VLDB '25] ChatTS: Understanding, Chat, Reasoning about Time Series with TS-MLLM

Language: Python - Size: 3.52 MB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 137 - Forks: 16

rom1504/img2dataset

Easily turn large sets of image URLs into an image dataset. Can download, resize and package 100M URLs in 20h on one machine.

Language: Python - Size: 3.11 MB - Last synced at: 13 days ago - Pushed at: 9 months ago - Stars: 4,016 - Forks: 353

X-PLUG/MobileAgent

Mobile-Agent: The Powerful Mobile Device Operation Assistant Family

Language: Python - Size: 383 MB - Last synced at: 13 days ago - Pushed at: about 1 month ago - Stars: 4,149 - Forks: 412

jermmy19998/MMM

Repository for Multi-modal Mutual Mixer

Language: Python - Size: 39.5 MB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 1 - Forks: 0

IrohXu/Awesome-Multimodal-LLM-Autonomous-Driving

[WACV 2024 Survey Paper] Multimodal Large Language Models for Autonomous Driving

Size: 15 MB - Last synced at: 3 days ago - Pushed at: about 1 year ago - Stars: 286 - Forks: 11

HySonLab/Design2Code

A large language model combined with a large vision model for generating code from a design sketch.

Language: Python - Size: 270 KB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 3 - Forks: 0

TIGER-AI-Lab/VL-Rethinker

The official code of "VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning"

Language: Python - Size: 4.92 MB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 75 - Forks: 1

umi-AIGC-saas/umi_ai_cms

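A dual-driven intelligent AI system that connects to the mainstream large AI models currently on the market and classifies them algorithmically according to their strengths and weaknesses. By combining the strengths of the various large AI models, Wuyou AI Brain (无忧AI智脑) can provide more accurate and more reliable information and answers.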

Language: Python - Size: 4.16 MB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 2 - Forks: 0

showlab/Show-o

[ICLR 2025] Repository for Show-o, One Single Transformer to Unify Multimodal Understanding and Generation.

Language: Python - Size: 169 MB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 1,362 - Forks: 58

xieyuquanxx/awesome-Large-MultiModal-Hallucination 📦

😎 curated list of awesome LMM hallucinations papers, methods & resources.

Size: 66.4 KB - Last synced at: about 5 hours ago - Pushed at: about 1 year ago - Stars: 149 - Forks: 14

open-mmlab/Multimodal-GPT

Multimodal-GPT

Language: Python - Size: 109 KB - Last synced at: 2 days ago - Pushed at: almost 2 years ago - Stars: 1,498 - Forks: 131

patrick-tssn/Awesome-Colorful-LLM

Recent advancements propelled by large language models (LLMs), encompassing an array of domains including Vision, Audio, Agent, Robotics, Fundamental Sciences such as Mathematics, and Ominous.

Size: 935 KB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 121 - Forks: 8

KarthikaRajagopal44/Text-to-voice-chatbot

Text-to-Speech (TTS) web application built with Gradio and powered by Microsoft Edge TTS voices

Language: Python - Size: 7.81 KB - Last synced at: 8 days ago - Pushed at: 15 days ago - Stars: 0 - Forks: 0

HICAI-ZJU/Scientific-LLM-Survey

Scientific Large Language Models: A Survey on Biological & Chemical Domains

Size: 523 KB - Last synced at: 12 days ago - Pushed at: 3 months ago - Stars: 304 - Forks: 30

YeonwooSung/MLOps

Miscellaneous code and writings for MLOps

Language: Jupyter Notebook - Size: 542 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 12 - Forks: 1

visionxiang/awesome-salient-object-detection

A curated list of awesome resources for salient object detection (SOD), focusing more on multi-modal SOD, such as RGB-D SOD.

Size: 82 KB - Last synced at: 7 days ago - Pushed at: 8 months ago - Stars: 118 - Forks: 6

ekonwang/VisuoThink

[Arxiv Paper 2504.09130]: VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search

Language: Python - Size: 15.7 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 8 - Forks: 1

video-db/videodb-chat

Frontend interface for building chat-based systems and connecting them with agent-driven workflows.

Language: Vue - Size: 1.02 MB - Last synced at: 10 days ago - Pushed at: 2 months ago - Stars: 13 - Forks: 7

krishnaura45/astro-pulse

Extracting Faint Exoplanetary Signals from Ariel Observations

Size: 4.88 KB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 1 - Forks: 0

nv78/Autonomous-Intelligence

Autonomous Intelligence is a framework for building collaborative, intelligent multi-agent AI systems. The framework provides a robust infrastructure for creating and managing multiple AI agents, and enables developers and organizations to build, deploy, and optimize AI agents that work well in dynamic, complex environments.

Language: HTML - Size: 123 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 18 - Forks: 6