GitHub topics: multi-modal

Repositories

open-compass/VLMEvalKit

Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks

Language: Python - Size: 4.5 MB - Last synced at: about 4 hours ago - Pushed at: about 5 hours ago - Stars: 2,239 - Forks: 336

2U1/Llama3.2-Vision-Finetune

An open-source implementation for fine-tuning the Llama3.2-Vision series by Meta.

Language: Python - Size: 90.8 KB - Last synced at: about 6 hours ago - Pushed at: about 7 hours ago - Stars: 152 - Forks: 21

Chiuqyan/arxiv-daily-audio-test Fork of beiyuouo/arxiv-daily

🎓 Automatically update papers in selected fields daily using GitHub Actions (every 12 hours)

Language: Python - Size: 46.9 MB - Last synced at: about 7 hours ago - Pushed at: about 8 hours ago - Stars: 0 - Forks: 0

THUDM/CogVLM

A state-of-the-art-level open visual language model | Multimodal pre-trained model

Language: Python - Size: 25.8 MB - Last synced at: about 1 hour ago - Pushed at: 11 months ago - Stars: 6,485 - Forks: 429

modelscope/modelscope

ModelScope: bring the notion of Model-as-a-Service to life.

Language: Python - Size: 53.7 MB - Last synced at: 1 day ago - Pushed at: 7 days ago - Stars: 7,725 - Forks: 797

dvlab-research/LISA

Project Page for "LISA: Reasoning Segmentation via Large Language Model"

Language: Python - Size: 28.9 MB - Last synced at: 2 days ago - Pushed at: 2 months ago - Stars: 2,156 - Forks: 152

harlanhong/ACTalker

ACTalker: an end-to-end video diffusion framework for talking head synthesis that supports both single and multi-signal control (e.g., audio, expression).

Size: 41.8 MB - Last synced at: 1 day ago - Pushed at: 2 days ago - Stars: 216 - Forks: 12

vercel/modelfusion

The TypeScript library for building AI applications.

Language: TypeScript - Size: 15.6 MB - Last synced at: 2 days ago - Pushed at: 9 months ago - Stars: 1,253 - Forks: 89

valhalla/valhalla

Open Source Routing Engine for OpenStreetMap

Language: C++ - Size: 115 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 4,780 - Forks: 726

modelscope/data-juicer

Data processing for and with foundation models! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷

Language: Python - Size: 169 MB - Last synced at: 2 days ago - Pushed at: 3 days ago - Stars: 4,222 - Forks: 227

kyegomez/zeta

Build high-performance AI models with modular building blocks

Language: Python - Size: 41.3 MB - Last synced at: 2 days ago - Pushed at: 7 days ago - Stars: 496 - Forks: 50

kyegomez/MultiModal-ToT

Multi-Modal Tree of thoughts for DALLE-3 like auto self improvement

Language: Python - Size: 81.2 MB - Last synced at: 1 day ago - Pushed at: 5 months ago - Stars: 16 - Forks: 2

mbzuai-oryx/ALM-Bench

[CVPR 2025 🔥] ALM-Bench is a multilingual, multi-modal, culturally diverse benchmark for 100 languages across 19 categories. It assesses the next generation of LMMs on cultural inclusivity.

Language: Python - Size: 26.6 MB - Last synced at: 3 days ago - Pushed at: 4 days ago - Stars: 36 - Forks: 2

SciSharp/LLamaSharp

A C#/.NET library to run LLM (🦙LLaMA/LLaVA) on your local device efficiently.

Language: C# - Size: 391 MB - Last synced at: 3 days ago - Pushed at: 4 days ago - Stars: 3,120 - Forks: 414

2U1/Gemma3-Finetune

An open-source implementation for the Gemma3 series by Google.

Language: Python - Size: 43 KB - Last synced at: 3 days ago - Pushed at: 4 days ago - Stars: 16 - Forks: 3

AnswerDotAI/byaldi

Use late-interaction multi-modal models such as ColPali in just a few lines of code.

Language: Python - Size: 1.94 MB - Last synced at: about 7 hours ago - Pushed at: 3 months ago - Stars: 774 - Forks: 81

THUDM/CogVLM2

GPT4V-level open-source multi-modal model based on Llama3-8B

Language: Python - Size: 13.9 MB - Last synced at: about 1 hour ago - Pushed at: about 2 months ago - Stars: 2,336 - Forks: 153

BrainLesion/preprocessing

preprocessing tools for multi-modal 3D brain MRI

Language: C - Size: 1.18 GB - Last synced at: 3 days ago - Pushed at: 4 days ago - Stars: 18 - Forks: 6

kyegomez/SwitchTransformers

Implementation of Switch Transformers from the paper: "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity"

Language: Python - Size: 2.42 MB - Last synced at: 2 days ago - Pushed at: 17 days ago - Stars: 97 - Forks: 13

presidio-oss/cline-based-code-generator

VS Code extension that streamlines development workflows through AI-powered task execution, intelligent file management, and automated code generation. Built on Cline, it integrates with various LLMs to enhance productivity and code quality while simplifying complex development tasks.

Language: TypeScript - Size: 15.5 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 39 - Forks: 30

WisconsinAIVision/ViP-LLaVA

[CVPR2024] ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts

Language: Python - Size: 17.4 MB - Last synced at: 2 days ago - Pushed at: 9 months ago - Stars: 319 - Forks: 23

restoreml/m3n-vc

Multi-Modality Multi-Node Vehicle Classification Dataset

Language: Python - Size: 51.8 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 2 - Forks: 0

TuGraph-family/chat2graph

Chat2Graph: Graph Native Agentic System.

Language: Python - Size: 1.59 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 25 - Forks: 10

activeloopai/deeplake

Database for AI. Store Vectors, Images, Texts, Videos, etc. Use with LLMs/LangChain. Store, query, version, & visualize any AI data. Stream data in real-time to PyTorch/TensorFlow. https://activeloop.ai

Language: Python - Size: 65.3 MB - Last synced at: 6 days ago - Pushed at: 21 days ago - Stars: 8,528 - Forks: 656

NetEase-Media/grps_trtllm

Higher performance OpenAI LLM service than vLLM serve: A pure C++ high-performance OpenAI LLM service implemented with GRPS+TensorRT-LLM+Tokenizers.cpp, supporting chat and function call, AI agents, distributed multi-GPU inference, multimodal capabilities, and a Gradio chat interface.

Language: Python - Size: 127 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 129 - Forks: 8

awslabs/rhubarb

A Python framework for multi-modal document understanding with Amazon Bedrock

Language: Python - Size: 31.7 MB - Last synced at: 2 days ago - Pushed at: 6 days ago - Stars: 81 - Forks: 6

IDEA-Research/RexSeek

Referring any person or objects given a natural language description. Code base for RexSeek and HumanRef Benchmark

Language: Python - Size: 9.55 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 112 - Forks: 8

InternLM/InternEvo

InternEvo is an open-source lightweight training framework that aims to support model pre-training without the need for extensive dependencies.

Language: Python - Size: 6.73 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 374 - Forks: 63

salesforce/UniControl

Unified Controllable Visual Generation Model

Language: Python - Size: 145 MB - Last synced at: 7 days ago - Pushed at: 3 months ago - Stars: 642 - Forks: 35

IntelLabs/fastRAG

Efficient Retrieval Augmentation and Generation Framework

Language: Python - Size: 20.4 MB - Last synced at: 7 days ago - Pushed at: 3 months ago - Stars: 1,513 - Forks: 139

tangxyw/RecSysPapers

A collection of industry classics and cutting-edge papers in the field of recommendation/advertising/search.

Language: Python - Size: 1.55 GB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 1,675 - Forks: 240

docarray/docarray

Represent, send, store and search multimodal data

Language: Python - Size: 242 MB - Last synced at: 1 day ago - Pushed at: 3 days ago - Stars: 3,042 - Forks: 233

bytedance/SALMONN

SALMONN: Speech Audio Language Music Open Neural Network

Language: Python - Size: 13 MB - Last synced at: 8 days ago - Pushed at: about 2 months ago - Stars: 1,202 - Forks: 96

zjunlp/DeepKE

[EMNLP 2022] An Open Toolkit for Knowledge Graph Extraction and Construction

Language: Python - Size: 121 MB - Last synced at: 9 days ago - Pushed at: about 1 month ago - Stars: 3,857 - Forks: 714

PKU-YuanGroup/LanguageBind

【ICLR 2024🔥】 Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

Language: Python - Size: 18.6 MB - Last synced at: 8 days ago - Pushed at: about 1 year ago - Stars: 801 - Forks: 54

PKU-YuanGroup/MoE-LLaVA

Mixture-of-Experts for Large Vision-Language Models

Language: Python - Size: 16.5 MB - Last synced at: 8 days ago - Pushed at: 5 months ago - Stars: 2,140 - Forks: 134

souradipp76/MM-PoE

Multiple Choice Reasoning via Process of Elimination using Multi-Modal Models

Language: Python - Size: 684 KB - Last synced at: 7 days ago - Pushed at: 9 days ago - Stars: 1 - Forks: 1

timbroed/MUSES

[ECCV 2024] SDK for MUSES: The Multi-Sensor Semantic Perception Dataset for Driving under Uncertainty

Language: Python - Size: 8.55 MB - Last synced at: 9 days ago - Pushed at: 10 days ago - Stars: 35 - Forks: 1

bustime-org/bustime Fork of norn/bustime

A complete full-stack solution for tracking public transport, designed for easy implementation and maximum accessibility.

Language: JavaScript - Size: 9.94 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 5 - Forks: 1

Kav-K/GPTDiscord

A robust, all-in-one GPT interface for Discord. ChatGPT-style conversations, image generation, AI-moderation, custom indexes/knowledgebase, youtube summarizer, and more!

Language: Python - Size: 1.76 MB - Last synced at: 6 days ago - Pushed at: 11 months ago - Stars: 1,842 - Forks: 293

OpenGVLab/InternVL

[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal dialogue model approaching GPT-4o performance

Language: Python - Size: 38.5 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 7,470 - Forks: 574

OpenBMB/MiniCPM-o

MiniCPM-o 2.6: A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone

Language: Python - Size: 326 MB - Last synced at: 10 days ago - Pushed at: about 2 months ago - Stars: 19,200 - Forks: 1,386

ika-rwth-aachen/MultiCorrupt

[IV2024] MultiCorrupt: A benchmark for robust multi-modal 3D object detection, evaluating LiDAR-Camera fusion models in autonomous driving. Includes diverse corruption types (e.g., misalignment, miscalibration, weather) and severity levels. Assess model performance under challenging conditions.

Language: Jupyter Notebook - Size: 135 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 61 - Forks: 6

lucidrains/transfusion-pytorch

Pytorch implementation of Transfusion, "Predict the Next Token and Diffuse Images with One Multi-Modal Model", from MetaAI

Language: Python - Size: 34.6 MB - Last synced at: 9 days ago - Pushed at: about 1 month ago - Stars: 1,044 - Forks: 46

xuyang-liu16/GlobalCom2

Global Compression Commander: Plug-and-Play Inference Acceleration for High-Resolution Large Vision-Language Models

Language: Python - Size: 5.98 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 17 - Forks: 0

modelscope/agentscope

Start building LLM-empowered multi-agent applications in an easier way.

Language: Python - Size: 287 MB - Last synced at: 11 days ago - Pushed at: 12 days ago - Stars: 6,951 - Forks: 400

THUDM/VisualGLM-6B

Chinese and English multimodal conversational language model

Language: Python - Size: 18.1 MB - Last synced at: 10 days ago - Pushed at: 8 months ago - Stars: 4,150 - Forks: 424

xieyuquanxx/awesome-Large-MultiModal-Hallucination 📦

😎 Curated list of awesome LMM hallucination papers, methods & resources.

Size: 66.4 KB - Last synced at: 3 days ago - Pushed at: about 1 year ago - Stars: 150 - Forks: 14

Tebmer/Awesome-Knowledge-Distillation-of-LLMs

This repository collects papers for "A Survey on Knowledge Distillation of Large Language Models". We break down KD into Knowledge Elicitation and Distillation Algorithms, and explore the Skill & Vertical Distillation of LLMs.

Size: 18.6 MB - Last synced at: 9 days ago - Pushed at: about 1 month ago - Stars: 979 - Forks: 56

marqo-ai/marqo

Unified embedding generation and search engine. Also available on cloud - cloud.marqo.ai

Language: Python - Size: 79.5 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 4,821 - Forks: 202

PKU-YuanGroup/Video-LLaVA

【EMNLP 2024🔥】Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

Language: Python - Size: 113 MB - Last synced at: 11 days ago - Pushed at: 5 months ago - Stars: 3,217 - Forks: 233

amazon-science/fmcore

Foundation Models at every scale, on every modality

Language: Python - Size: 1.95 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 3 - Forks: 0

VectorSpaceLab/OmniGen

OmniGen: Unified Image Generation. https://arxiv.org/pdf/2409.11340

Language: Jupyter Notebook - Size: 417 MB - Last synced at: 11 days ago - Pushed at: 2 months ago - Stars: 3,912 - Forks: 335

howard-hou/VisualRWKV

VisualRWKV is the visual-enhanced version of the RWKV language model, enabling RWKV to handle various visual tasks.

Language: Python - Size: 10.9 MB - Last synced at: 5 days ago - Pushed at: 24 days ago - Stars: 220 - Forks: 18

MedMNIST/MedMNIST

[pip install medmnist] 18x Standardized Datasets for 2D and 3D Biomedical Image Classification

Language: Python - Size: 13.6 MB - Last synced at: 11 days ago - Pushed at: 3 months ago - Stars: 1,168 - Forks: 177

kyegomez/TinyGPTV

Simple Implementation of TinyGPTV in super simple Zeta lego blocks

Language: Python - Size: 2.17 MB - Last synced at: 1 day ago - Pushed at: 5 months ago - Stars: 16 - Forks: 0

OFA-Sys/Chinese-CLIP

Chinese version of CLIP which achieves Chinese cross-modal retrieval and representation generation.

Language: Python - Size: 2.5 MB - Last synced at: 12 days ago - Pushed at: 9 months ago - Stars: 5,055 - Forks: 492

microsoft/farmvibes-ai

FarmVibes.AI: Multi-Modal GeoSpatial ML Models for Agriculture and Sustainability

Language: Jupyter Notebook - Size: 40 MB - Last synced at: 7 days ago - Pushed at: 2 months ago - Stars: 736 - Forks: 138

924973292/IDEA

【CVPR2025】IDEA: Inverted Text with Cooperative Deformable Aggregation for Multi-modal Object Re-Identification

Language: Python - Size: 34.9 MB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 17 - Forks: 3

Ruiyang-061X/Awesome-MLLM-Reasoning

📖 Curated list on the reasoning ability of MLLMs, including OpenAI o1, OpenAI o3-mini, and Slow-Thinking.

Size: 7.81 KB - Last synced at: 9 days ago - Pushed at: 2 months ago - Stars: 4 - Forks: 0

kyegomez/RT-2

Democratization of RT-2 "RT-2: New model translates vision and language into action"

Language: Python - Size: 2.59 MB - Last synced at: 9 days ago - Pushed at: 9 months ago - Stars: 441 - Forks: 63

OmniMMI/OpenOmniNexus

a fully open-source implementation of a GPT-4o-like speech-to-speech video understanding model.

Language: Python - Size: 39.6 MB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 3 - Forks: 0

OmniMMI/OmniMMI

[CVPR 2025] OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts

Language: Python - Size: 25.3 MB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 8 - Forks: 0

Ruiyang-061X/VL-Uncertainty

🔎Official code for our paper: "VL-Uncertainty: Detecting Hallucination in Large Vision-Language Model via Uncertainty Estimation".

Language: Python - Size: 7.12 MB - Last synced at: 11 days ago - Pushed at: about 1 month ago - Stars: 31 - Forks: 2

Ruiyang-061X/Awesome-MLLM-Uncertainty

✨A curated list of papers on the uncertainty in multi-modal large language model (MLLM).

Size: 381 KB - Last synced at: 11 days ago - Pushed at: 19 days ago - Stars: 42 - Forks: 0

jokieleung/awesome-visual-question-answering

A curated list of Visual Question Answering (VQA) (Image/Video Question Answering), Visual Question Generation, Visual Dialog, Visual Commonsense Reasoning, and related areas.

Size: 179 KB - Last synced at: 9 days ago - Pushed at: almost 2 years ago - Stars: 662 - Forks: 95

liuyang-ict/awesome-visual-transformers

[TNNLS] A Comprehensive Survey of Awesome Visual Transformer Literature.

Size: 570 KB - Last synced at: 12 days ago - Pushed at: almost 2 years ago - Stars: 255 - Forks: 25

yshinya6/xbm

Code repository for "Explanation Bottleneck Models" (AAAI2025 Oral)

Language: Python - Size: 536 KB - Last synced at: 3 days ago - Pushed at: 14 days ago - Stars: 6 - Forks: 1

kyegomez/LUMIERE

Implementation of the text to video model LUMIERE from the paper: "A Space-Time Diffusion Model for Video Generation" by Google Research

Language: Python - Size: 2.18 MB - Last synced at: 14 days ago - Pushed at: 3 months ago - Stars: 51 - Forks: 5

OpenMotionLab/MotionGPT

[NeurIPS 2023] MotionGPT: Human Motion as a Foreign Language, a unified motion-language generation model using LLMs

Language: Python - Size: 8.54 MB - Last synced at: 13 days ago - Pushed at: about 1 year ago - Stars: 1,619 - Forks: 103

junchen14/Multi-Modal-Transformer

This repository collects various multi-modal transformer architectures, including image transformers, video transformers, image-language transformers, video-language transformers, and self-supervised learning models. It also collects many useful tutorials and tools in these related domains.

Size: 354 KB - Last synced at: 8 days ago - Pushed at: over 2 years ago - Stars: 225 - Forks: 31

kyegomez/qformer

Implementation of Qformer from BLIP2 in Zeta Lego blocks.

Language: Python - Size: 2.19 MB - Last synced at: 13 days ago - Pushed at: 5 months ago - Stars: 38 - Forks: 0

alessioborgi/Z-SAMB_StyleAligned_MultiReference-MultiModal

Novel framework for Zero-Shot Style Alignment in Text-to-Image generation, incorporating Multi-Modal Context-Awareness and Multi-Reference Style Alignment, using minimal attention sharing, ensuring consistent style transfer without fine-tuning.

Language: Jupyter Notebook - Size: 1.49 GB - Last synced at: 1 day ago - Pushed at: 18 days ago - Stars: 2 - Forks: 0

thu-ml/MMTrustEval

A toolbox for benchmarking trustworthiness of multimodal large language models (MultiTrust, NeurIPS 2024 Track Datasets and Benchmarks)

Language: Python - Size: 15.8 MB - Last synced at: 16 days ago - Pushed at: 26 days ago - Stars: 145 - Forks: 10

v-iashin/SpecVQGAN

Source code for "Taming Visually Guided Sound Generation" (Oral at the BMVC 2021)

Language: Jupyter Notebook - Size: 163 MB - Last synced at: 12 days ago - Pushed at: 9 months ago - Stars: 360 - Forks: 39

RasmussenLab/MOVE

MOVE (Multi-Omics Variational autoEncoder) for integrating multi-omics data and identifying cross modal associations

Language: Jupyter Notebook - Size: 540 MB - Last synced at: 11 days ago - Pushed at: 6 months ago - Stars: 75 - Forks: 27

ses4255/Versatile-OCR-Program

Multi-modal OCR pipeline optimized for ML training (text, math, tables, diagrams)

Size: 3.09 MB - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 0 - Forks: 0

chaohaoyuan/PAAG

Source code for Annotation-guided Protein Design with Multi-Level Domain Alignment. (KDD 2025)

Language: Python - Size: 7.74 MB - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 4 - Forks: 1

SeoBuAs/Advanced_Anomaly_Detection_in_CCTV_Systems_with_VLM

CCTV Anomaly Detection and Logging System

Size: 0 Bytes - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 1 - Forks: 0

clin1223/VLDet

[ICLR 2023] PyTorch implementation of VLDet （https://arxiv.org/abs/2211.14843）

Language: Python - Size: 1.56 MB - Last synced at: 17 days ago - Pushed at: about 1 year ago - Stars: 186 - Forks: 11

saforem2/mmm

Multi-Modal Modeling

Language: Python - Size: 271 KB - Last synced at: 7 days ago - Pushed at: 28 days ago - Stars: 6 - Forks: 0

muanderson/VA_MM_DL

Repo for code relating to the paper 'Enhancing Post-Treatment Visual Acuity Prediction with Multi-Modal Deep Learning on Small-scale Clinical and OCT Datasets'

Language: Jupyter Notebook - Size: 212 MB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 0 - Forks: 1

2U1/SmolVLM-Finetune

An open-source implementation for fine-tuning SmolVLM.

Language: Python - Size: 56.6 KB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 17 - Forks: 3

kyegomez/Simba

A simpler Pytorch + Zeta Implementation of the paper: "SiMBA: Simplified Mamba-based Architecture for Vision and Multivariate Time series"

Language: Python - Size: 2.48 MB - Last synced at: 17 days ago - Pushed at: 5 months ago - Stars: 28 - Forks: 2

2U1/Phi3-Vision-Finetune Fork of GaiZhenbiao/Phi3V-Finetuning

An open-source implementation for fine-tuning Phi3-Vision and Phi3.5-Vision by Microsoft.

Language: Python - Size: 909 KB - Last synced at: 25 days ago - Pushed at: 25 days ago - Stars: 89 - Forks: 16

lanl/EPBD-BERT

Transcription factor binding site prediction for novel DNA sequence data aiding in mutation identification and drug discovery

Language: Jupyter Notebook - Size: 4.17 MB - Last synced at: 1 day ago - Pushed at: 8 months ago - Stars: 7 - Forks: 1

xlang-ai/Spider2-V

[NeurIPS 2024] Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?

Language: Jupyter Notebook - Size: 134 MB - Last synced at: 22 days ago - Pushed at: 8 months ago - Stars: 120 - Forks: 7

mlvlab/Flipped-VQA

Large Language Models are Temporal and Causal Reasoners for Video Question Answering (EMNLP 2023)

Language: Python - Size: 1.24 MB - Last synced at: 26 days ago - Pushed at: 26 days ago - Stars: 75 - Forks: 10

KimRass/CLIP

PyTorch implementation of 'CLIP' (Radford et al., 2021) from scratch and training it on Flickr8k + Flickr30k

Language: Python - Size: 18.3 MB - Last synced at: 20 days ago - Pushed at: about 1 year ago - Stars: 10 - Forks: 0

FareedKhan-dev/all-rag-techniques

Implementation of all RAG techniques in a simpler way

Language: Jupyter Notebook - Size: 1.82 MB - Last synced at: 29 days ago - Pushed at: about 1 month ago - Stars: 894 - Forks: 108

zjukg/MyGO

[Paper][AAAI 2025] (MyGO)Tokenization, Fusion, and Augmentation: Towards Fine-grained Multi-modal Entity Representation

Language: Python - Size: 90.8 MB - Last synced at: 30 days ago - Pushed at: 30 days ago - Stars: 242 - Forks: 4

louis-alexandre-laguet/GoldenLeaf

GoldenLeaf is a Python application for creating an image search system using the CLIP model. It generates descriptions for herbarium images using the Llava model, enhancing multi-modal search capabilities. The system allows automated image description generation, multi-modal data augmentation, and customizable configurations for efficient training.

Language: Jupyter Notebook - Size: 1.46 MB - Last synced at: 30 days ago - Pushed at: 30 days ago - Stars: 0 - Forks: 0

Related Keywords

multi-modal 346 deep-learning 56 pytorch 42 llm 37 artificial-intelligence 36 machine-learning 36 ai 35 computer-vision 23 transformers 23 transformer 22 large-language-models 21 gpt4 19 nlp 19 ml 18 clip 18 multi-modal-learning 15 multi-modality 15 chatbot 15 gpt 14 attention 13 attention-mechanism 12 vision-language-model 12 python 11 multimodal 11 openai 10 gpt-4 10 robotics 9 attention-is-all-you-need 9 object-detection 9 contrastive-learning 9 knowledge-graph 9 tensorflow 9 vision-language 9 llava 9 dataset 8 llama 8 audio 8 rag 8 image-generation 8 language-model 8 open-source 7 natural-language-processing 7 chatgpt 7 generative-ai 7 image-to-image-translation 6 instruction-tuning 6 vector-database 6 streamlit 6 segmentation 6 gan 6 3d 6 cnn 6 alignment 6 lidar 6 point-cloud 5 mllm 5 llama3 5 semantic-search 5 benchmark 5 vlm 5 llms 5 aigc 5 pretraining 5 text-to-image 5 autonomous-driving 5 speech 5 neural-network 5 vqa 4 reid 4 medical-imaging 4 multimodal-deep-learning 4 image 4 deep-neural-networks 4 audio-visual 4 image-text-retrieval 4 inference 4 diffusion-models 4 foundation-models 4 prompt 4 music 4 gemini 4 cross-modal 4 multi-task 4 agent 4 image-registration 4 information-retrieval 4 large-language-model 4 cv 4 llama2 4 multi-view 4 visual-question-answering 4 3d-object-detection 4 vision-and-language 4 image-synthesis 4 deeplearning 4 representation-learning 4 knowledge-graph-completion 4 multi-modal-fusion 4 vision 4 stable-diffusion 4