Topic: "data-centric-ai"
cleanlab/cleanlab
The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
Language: Python - Size: 11.5 MB - Last synced at: 4 days ago - Pushed at: 16 days ago - Stars: 10,486 - Forks: 824

voxel51/fiftyone
Refine high-quality datasets and visual AI models
Language: Python - Size: 1.91 GB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 9,412 - Forks: 622

Docta-ai/docta
A Doctor for your data
Language: Python - Size: 27.8 MB - Last synced at: 16 days ago - Pushed at: 3 months ago - Stars: 3,098 - Forks: 231

code-kern-ai/refinery
The data scientist's open-source choice to scale, assess and maintain natural language data. Treat training data like a software artifact.
Language: Python - Size: 3.54 MB - Last synced at: 15 days ago - Pushed at: 5 months ago - Stars: 1,433 - Forks: 71

Renumics/spotlight
Interactively explore unstructured datasets from your dataframe.
Language: TypeScript - Size: 46.8 MB - Last synced at: 12 days ago - Pushed at: 2 months ago - Stars: 1,164 - Forks: 86

HazyResearch/data-centric-ai
Resources for Data Centric AI
Language: TeX - Size: 917 KB - Last synced at: 21 days ago - Pushed at: over 1 year ago - Stars: 1,108 - Forks: 118

daochenzha/data-centric-AI
A curated, but incomplete, list of data-centric AI resources.
Size: 1.99 MB - Last synced at: about 1 month ago - Pushed at: 10 months ago - Stars: 1,094 - Forks: 78

cleanlab/cleanvision
Automatically find issues in image datasets and practice data-centric computer vision.
Language: Python - Size: 2.12 MB - Last synced at: 17 days ago - Pushed at: 23 days ago - Stars: 1,068 - Forks: 73

Renumics/awesome-open-data-centric-ai
Curated list of open source tooling for data-centric AI on unstructured data.
Size: 572 KB - Last synced at: about 15 hours ago - Pushed at: over 1 year ago - Stars: 717 - Forks: 35

dcai-course/dcai-lab
Lab assignments for Introduction to Data-Centric AI, MIT IAP 2024
Language: Jupyter Notebook - Size: 4.44 MB - Last synced at: about 1 month ago - Pushed at: 2 months ago - Stars: 449 - Forks: 155

gszfwsb/NCFM
Official PyTorch implementation of the paper "Dataset Distillation with Neural Characteristic Function: A Minmax Perspective" (NCFM) in CVPR 2025 (Highlight).
Language: Python - Size: 1.17 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 325 - Forks: 18

GAIR-NLP/ProX
Official repo for "Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale"
Language: Python - Size: 15.1 MB - Last synced at: 7 days ago - Pushed at: 8 days ago - Stars: 236 - Forks: 18

JieyuZ2/wrench
[NeurIPS 2021] WRENCH: Weak supeRvision bENCHmark
Language: Python - Size: 1.81 MB - Last synced at: 21 days ago - Pushed at: about 1 year ago - Stars: 224 - Forks: 33

aai-institute/pyDVL
pyDVL is a library of stable implementations of algorithms for data valuation and influence function computation
Language: Python - Size: 435 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 124 - Forks: 7

dcai-course/dcai-course
Introduction to Data-Centric AI, MIT IAP 2023
Language: CSS - Size: 278 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 98 - Forks: 13

opendataval/opendataval
OpenDataVal: a Unified Benchmark for Data Valuation in Python (NeurIPS 2023)
Language: Python - Size: 23.4 MB - Last synced at: 17 days ago - Pushed at: 3 months ago - Stars: 96 - Forks: 7

yueyu1030/AttrPrompt
[NeurIPS 2023] This is the code for the paper `Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias`.
Language: Python - Size: 705 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 94 - Forks: 5

OFA-Sys/DiverseEvol
Self-Evolved Diverse Data Sampling for Efficient Instruction Tuning
Language: Python - Size: 62 MB - Last synced at: 12 months ago - Pushed at: over 1 year ago - Stars: 60 - Forks: 2

TonyLianLong/UnsupervisedSelectiveLabeling
[ECCV 2022] Official Implementation for Unsupervised Selective Labeling for More Effective Semi-Supervised Learning
Language: Python - Size: 218 KB - Last synced at: about 1 year ago - Pushed at: almost 2 years ago - Stars: 51 - Forks: 5

astutic/Acharya
A Data Centric NER annotation tool for your Named Entity Recognition projects
Size: 11.3 MB - Last synced at: 4 months ago - Pushed at: about 1 year ago - Stars: 45 - Forks: 3

NextBrain-ai/nbsynthetic
nbsynthetic is a simple and robust tabular synthetic data generation library for small and medium-sized datasets
Language: Jupyter Notebook - Size: 2.32 MB - Last synced at: over 1 year ago - Pushed at: about 2 years ago - Stars: 42 - Forks: 7

koalazf99/Awesome-DataCentric-LLM
Trending projects & awesome papers about data-centric LLM studies.
Size: 13.7 KB - Last synced at: 9 days ago - Pushed at: 3 months ago - Stars: 34 - Forks: 2

awesome-mlops/awesome-data-management
A curated list of awesome open source tools and commercial products to catalog, version, and manage data
Size: 4.88 KB - Last synced at: 4 days ago - Pushed at: about 3 years ago - Stars: 32 - Forks: 3

Digital-Dermatology/SelfClean
A holistic self-supervised data cleaning strategy to detect irrelevant samples, near duplicates and label errors (NeurIPS'24).
Language: Python - Size: 37.7 MB - Last synced at: 11 days ago - Pushed at: about 1 month ago - Stars: 31 - Forks: 1

cleanlab/cleanlab-studio
Client interface to Cleanlab Studio and the Trustworthy Language Model
Language: Python - Size: 3.52 MB - Last synced at: 13 days ago - Pushed at: 2 months ago - Stars: 30 - Forks: 8

luo-junyu/Awesome-Data-Efficient-LLM
A list of data-efficient and data-centric LLM (Large Language Model) papers. Our Survey Paper: Towards Efficient LLM Post Training: A Data-centric Perspective
Size: 884 KB - Last synced at: 7 days ago - Pushed at: 2 months ago - Stars: 29 - Forks: 4

nachifur/LLPC
Frontiers in Neuroinformatics 2022: Local Label Point Correction for Edge Detection of Overlapping Cervical Cells
Language: Python - Size: 1.1 MB - Last synced at: 24 days ago - Pushed at: 11 months ago - Stars: 29 - Forks: 4

KibromBerihu/ai4elife
This data-centric AI repository implements a robust deep learning method (LFBNet) for fully automated tumor segmentation in whole-body [18]F-FDG PET/CT images.
Language: Python - Size: 35.2 MB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 28 - Forks: 9

ear-team/bambird
Unsupervised classification to improve the quality of a bird song recording dataset. https://doi.org/10.1016/j.ecoinf.2022.101952
Language: Python - Size: 207 MB - Last synced at: 8 days ago - Pushed at: 9 days ago - Stars: 26 - Forks: 6

voxel51/reconstruction-error-ratios
Estimate dataset difficulty and detect label mistakes using reconstruction error ratios!
Language: Python - Size: 606 KB - Last synced at: 15 days ago - Pushed at: 4 months ago - Stars: 24 - Forks: 0

kennethleungty/Data-Centric-AI-Competition
Code for a Top 5% finish in the Data-Centric AI Competition organized by Andrew Ng and DeepLearning.AI
Language: Jupyter Notebook - Size: 11.4 MB - Last synced at: 19 days ago - Pushed at: over 3 years ago - Stars: 22 - Forks: 3

code-kern-ai/refinery-python-sdk
Official Python SDK for Kern AI refinery.
Language: Python - Size: 171 KB - Last synced at: 17 days ago - Pushed at: 5 months ago - Stars: 19 - Forks: 3

Lichang-Chen/AlpaGasus
A better Alpaca Model Trained with Less Data (only 9k instructions of the original set)
Language: HTML - Size: 6.66 MB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 19 - Forks: 3

SJTU-DMTai/awesome-ml-data-quality-papers
Papers about training data quality management for ML models.
Size: 1.08 MB - Last synced at: 8 months ago - Pushed at: 8 months ago - Stars: 15 - Forks: 2

autonlab/aqua
AQuA: A Benchmarking Tool for Label Quality Assessment
Language: Jupyter Notebook - Size: 3.26 MB - Last synced at: about 1 year ago - Pushed at: over 1 year ago - Stars: 15 - Forks: 0

Living-with-machines/genre-classification
Jupyter book showing how to build an ML powered book genre classifier
Language: Jupyter Notebook - Size: 7.95 MB - Last synced at: 13 days ago - Pushed at: 6 months ago - Stars: 12 - Forks: 2

Nokia-Bell-Labs/data-centric-federated-learning
Enhancing Efficiency in Multidevice Federated Learning through Data Selection
Language: Python - Size: 1.93 MB - Last synced at: 5 months ago - Pushed at: about 1 year ago - Stars: 12 - Forks: 3

fuxiAIlab/NetEaseCrowd-Dataset
NetEaseCrowd dataset, a collection of data obtained from You Ling crowdsourcing platform, Fuxi AI Lab, NetEase.
Size: 50.1 MB - Last synced at: about 1 month ago - Pushed at: 4 months ago - Stars: 9 - Forks: 0

jacobmarks/gpt4-vision-plugin
Chat with your images using GPT-4 Vision!
Language: Python - Size: 10.7 KB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 9 - Forks: 3

mdbloice/Labeller
Quickly set up an image labelling web application for manually tagging images for machine learning tasks.
Language: Python - Size: 91.8 KB - Last synced at: 5 days ago - Pushed at: over 1 year ago - Stars: 9 - Forks: 2

seedatnabeel/DIPS
You can't handle the (dirty) truth: Data-centric insights improve pseudo-labeling
Language: Jupyter Notebook - Size: 40.8 MB - Last synced at: 2 months ago - Pushed at: 10 months ago - Stars: 7 - Forks: 1

jacobmarks/semantic-document-search-plugin
Semantically search through OCR text blocks with Qdrant, Sentence Transformers, and FiftyOne!
Language: Python - Size: 20.5 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 7 - Forks: 0

seedatnabeel/TRIAGE
TRIAGE: Characterizing and auditing training data for improved regression (NeurIPS 2023)
Language: Jupyter Notebook - Size: 22.6 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 6 - Forks: 3

Digital-Dermatology/SelfClean-Revised-Benchmarks
SelfClean revised versions of benchmark datasets for more reliable performance estimation.
Size: 180 KB - Last synced at: 12 months ago - Pushed at: over 1 year ago - Stars: 6 - Forks: 0

seedatnabeel/Data-SUITE
Data-SUITE: Data-centric identification of in-distribution incongruous examples (ICML 2022)
Language: Jupyter Notebook - Size: 4.22 MB - Last synced at: about 2 years ago - Pushed at: about 2 years ago - Stars: 5 - Forks: 4

seedatnabeel/Data-IQ
Data-IQ: Characterizing subgroups with heterogeneous outcomes in tabular data (NeurIPS 2022)
Language: Jupyter Notebook - Size: 14.1 MB - Last synced at: about 2 years ago - Pushed at: about 2 years ago - Stars: 5 - Forks: 2

3lc-ai/ultralytics (fork of ultralytics/ultralytics)
Ultralytics YOLO11 with a 3LC integration
Language: Python - Size: 23.3 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 4 - Forks: 0

iSE-UET-VNU/Cola
Official implementation of our paper: "COLA: Leveraging Local and Global Relationships for Corrupted Label Detection"
Language: Python - Size: 276 KB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 4 - Forks: 0

Blacksujit/100X-Engineers-GenAI-Hackathon-Submission
Dataviz AI is an AI-powered web application that lets users generate animated infographic videos from input data, text, and files. This MVP uses the Pexels API for video content and applies natural language processing (NLP) techniques, including LangChain and Stable Diffusion, to analyze the input and create visually impactful output.
Language: Jupyter Notebook - Size: 113 MB - Last synced at: 4 days ago - Pushed at: 19 days ago - Stars: 4 - Forks: 1

TsingZ0/CoAutoGen
Cloud-Edge Collaboration Platform for Automated Synthetic Dataset Generation
Language: Python - Size: 167 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 4 - Forks: 0

jacobmarks/clustering-plugin
Compute clustering on your data in a visual, intuitive way with FiftyOne and Sklearn!
Language: Python - Size: 59.6 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 4 - Forks: 0

davanstrien/ImageIN
Find illustrations in historic books using computer vision
Language: Jupyter Notebook - Size: 22.9 MB - Last synced at: about 4 hours ago - Pushed at: over 2 years ago - Stars: 4 - Forks: 0

IS2AI/AnyFace
Input-Agnostic Face Detection
Language: Jupyter Notebook - Size: 54.2 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 3 - Forks: 1

Weixin-Liang/data-centric-AI-perspective
Language: Jupyter Notebook - Size: 3.43 MB - Last synced at: 12 months ago - Pushed at: over 3 years ago - Stars: 3 - Forks: 0

jacobmarks/double-band-filter-plugin
Filter a float-valued field on two ranges simultaneously with this FiftyOne Plugin!
Language: Python - Size: 8.79 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 2 - Forks: 0

markdavidmc0/churning-mists
Customer churn train/prediction library with automatic dataset size optimisation features.
Language: Python - Size: 6.84 KB - Last synced at: over 1 year ago - Pushed at: about 3 years ago - Stars: 2 - Forks: 0

Decentralized-AI-Reserach-Lab/FedNS
Collaboratively Learning Federated Models from Noisy Decentralized Data
Language: Python - Size: 1.14 MB - Last synced at: 30 days ago - Pushed at: 30 days ago - Stars: 1 - Forks: 0

voxel51/fiftyone-huggingface-plugins
Hugging Face Plugins for FiftyOne
Language: Python - Size: 19.5 KB - Last synced at: 15 days ago - Pushed at: 12 months ago - Stars: 1 - Forks: 0

xml94/EmbracingLimitedImperfectTrainingDatasets
Embrace limited and imperfect training datasets in plant disease recognition using deep learning.
Size: 14.6 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 1 - Forks: 0

sheikhomar/dcc
Repository for the Data-Centric AI Competition
Language: Jupyter Notebook - Size: 199 MB - Last synced at: about 2 years ago - Pushed at: over 3 years ago - Stars: 1 - Forks: 0

PugtgYosuky/EDCA
Evolutionary Data-Centric AutoML Framework for Efficient Pipelines
Language: Jupyter Notebook - Size: 8.86 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 0 - Forks: 0

ArthurMangussi/AdvML
Adversarial Machine Learning Applied to Missing Data Imputation
Language: Python - Size: 140 MB - Last synced at: 23 days ago - Pushed at: 23 days ago - Stars: 0 - Forks: 0

miriamspsantos/dcai-ecai-tutorial-2024
A multi-view panorama of Data-Centric AI: Techniques, Tools, and Applications (ECAI Tutorial 2024)
Size: 1.8 MB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

dimasthoriq/data-centric-image-classification
Applying various data engineering techniques to an image classification task for the KAIST DS801 term project
Language: Jupyter Notebook - Size: 25.1 MB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 0 - Forks: 0

beigecap/Data_Centric_AI_course
Course on data-centric AI. ITMO University. AI Talent Hub.
Language: Jupyter Notebook - Size: 24.4 MB - Last synced at: 13 days ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

miriamspsantos/data-typology
Implementation of data typology for imbalanced datasets.
Language: MATLAB - Size: 1.29 MB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 0 - Forks: 0

LamineTourelab/dcai-lab (fork of dcai-course/dcai-lab)
Lab assignments for Introduction to Data-Centric AI, MIT IAP 2023
Size: 6.92 MB - Last synced at: over 1 year ago - Pushed at: about 2 years ago - Stars: 0 - Forks: 0

flyswot/book
flyswot book on developing a pragmatic machine learning workflow in a library setting
Language: HTML - Size: 6.68 MB - Last synced at: about 4 hours ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

Tre-Xanh/lapros
Denoise data
Language: Julia - Size: 1.57 MB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0
