An open API service providing repository metadata for many open source software ecosystems.

Topic: "data-centric-ai"

cleanlab/cleanlab

The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.

Language: Python - Size: 11.5 MB - Last synced at: 4 days ago - Pushed at: 16 days ago - Stars: 10,486 - Forks: 824

voxel51/fiftyone

Refine high-quality datasets and visual AI models

Language: Python - Size: 1.91 GB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 9,412 - Forks: 622

Docta-ai/docta

A Doctor for your data

Language: Python - Size: 27.8 MB - Last synced at: 16 days ago - Pushed at: 3 months ago - Stars: 3,098 - Forks: 231

code-kern-ai/refinery

The data scientist's open-source choice to scale, assess and maintain natural language data. Treat training data like a software artifact.

Language: Python - Size: 3.54 MB - Last synced at: 15 days ago - Pushed at: 5 months ago - Stars: 1,433 - Forks: 71

Renumics/spotlight

Interactively explore unstructured datasets from your dataframe.

Language: TypeScript - Size: 46.8 MB - Last synced at: 12 days ago - Pushed at: 2 months ago - Stars: 1,164 - Forks: 86

HazyResearch/data-centric-ai

Resources for Data Centric AI

Language: TeX - Size: 917 KB - Last synced at: 21 days ago - Pushed at: over 1 year ago - Stars: 1,108 - Forks: 118

daochenzha/data-centric-AI

A curated, but incomplete, list of data-centric AI resources.

Size: 1.99 MB - Last synced at: about 1 month ago - Pushed at: 10 months ago - Stars: 1,094 - Forks: 78

cleanlab/cleanvision

Automatically find issues in image datasets and practice data-centric computer vision.

Language: Python - Size: 2.12 MB - Last synced at: 17 days ago - Pushed at: 23 days ago - Stars: 1,068 - Forks: 73

Renumics/awesome-open-data-centric-ai

Curated list of open source tooling for data-centric AI on unstructured data.

Size: 572 KB - Last synced at: about 15 hours ago - Pushed at: over 1 year ago - Stars: 717 - Forks: 35

dcai-course/dcai-lab

Lab assignments for Introduction to Data-Centric AI, MIT IAP 2024 👩🏽‍💻

Language: Jupyter Notebook - Size: 4.44 MB - Last synced at: about 1 month ago - Pushed at: 2 months ago - Stars: 449 - Forks: 155

gszfwsb/NCFM

Official PyTorch implementation of the paper "Dataset Distillation with Neural Characteristic Function: A Minmax Perspective" (NCFM) in CVPR 2025 (Highlight).

Language: Python - Size: 1.17 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 325 - Forks: 18

GAIR-NLP/ProX

Official repo for "Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale"

Language: Python - Size: 15.1 MB - Last synced at: 7 days ago - Pushed at: 8 days ago - Stars: 236 - Forks: 18

JieyuZ2/wrench

[NeurIPS 2021] WRENCH: Weak supeRvision bENCHmark

Language: Python - Size: 1.81 MB - Last synced at: 21 days ago - Pushed at: about 1 year ago - Stars: 224 - Forks: 33

aai-institute/pyDVL

pyDVL is a library of stable implementations of algorithms for data valuation and influence function computation

Language: Python - Size: 435 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 124 - Forks: 7

dcai-course/dcai-course

Introduction to Data-Centric AI, MIT IAP 2023 🤖

Language: CSS - Size: 278 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 98 - Forks: 13

opendataval/opendataval

OpenDataVal: a Unified Benchmark for Data Valuation in Python (NeurIPS 2023)

Language: Python - Size: 23.4 MB - Last synced at: 17 days ago - Pushed at: 3 months ago - Stars: 96 - Forks: 7

yueyu1030/AttrPrompt

[NeurIPS 2023] This is the code for the paper `Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias`.

Language: Python - Size: 705 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 94 - Forks: 5

OFA-Sys/DiverseEvol

Self-Evolved Diverse Data Sampling for Efficient Instruction Tuning

Language: Python - Size: 62 MB - Last synced at: 12 months ago - Pushed at: over 1 year ago - Stars: 60 - Forks: 2

TonyLianLong/UnsupervisedSelectiveLabeling

[ECCV 2022] Official Implementation for Unsupervised Selective Labeling for More Effective Semi-Supervised Learning

Language: Python - Size: 218 KB - Last synced at: about 1 year ago - Pushed at: almost 2 years ago - Stars: 51 - Forks: 5

astutic/Acharya

A Data Centric NER annotation tool for your Named Entity Recognition projects

Size: 11.3 MB - Last synced at: 4 months ago - Pushed at: about 1 year ago - Stars: 45 - Forks: 3

NextBrain-ai/nbsynthetic

nbsynthetic is a simple and robust tabular synthetic data generation library for small and medium-sized datasets

Language: Jupyter Notebook - Size: 2.32 MB - Last synced at: over 1 year ago - Pushed at: about 2 years ago - Stars: 42 - Forks: 7

koalazf99/Awesome-DataCentric-LLM

Trending projects & awesome papers about data-centric LLM studies.

Size: 13.7 KB - Last synced at: 9 days ago - Pushed at: 3 months ago - Stars: 34 - Forks: 2

awesome-mlops/awesome-data-management

A curated list of awesome open source tools and commercial products to catalog, version, and manage data 🚀

Size: 4.88 KB - Last synced at: 4 days ago - Pushed at: about 3 years ago - Stars: 32 - Forks: 3

Digital-Dermatology/SelfClean

🧼🔎 A holistic self-supervised data cleaning strategy to detect irrelevant samples, near duplicates and label errors (NeurIPS'24).

Language: Python - Size: 37.7 MB - Last synced at: 11 days ago - Pushed at: about 1 month ago - Stars: 31 - Forks: 1

cleanlab/cleanlab-studio

Client interface to Cleanlab Studio and the Trustworthy Language Model

Language: Python - Size: 3.52 MB - Last synced at: 13 days ago - Pushed at: 2 months ago - Stars: 30 - Forks: 8

luo-junyu/Awesome-Data-Efficient-LLM

A list of data-efficient and data-centric LLM (Large Language Model) papers. Our Survey Paper: Towards Efficient LLM Post Training: A Data-centric Perspective

Size: 884 KB - Last synced at: 7 days ago - Pushed at: 2 months ago - Stars: 29 - Forks: 4

nachifur/LLPC

Frontiers in Neuroinformatics 2022: Local Label Point Correction for Edge Detection of Overlapping Cervical Cells

Language: Python - Size: 1.1 MB - Last synced at: 24 days ago - Pushed at: 11 months ago - Stars: 29 - Forks: 4

KibromBerihu/ai4elife

This data-centric AI repository implements a robust deep learning method (LFBNet) for fully automated tumor segmentation in whole-body [18]F-FDG PET/CT images.

Language: Python - Size: 35.2 MB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 28 - Forks: 9

ear-team/bambird

Unsupervised classification to improve the quality of a bird song recording dataset. https://doi.org/10.1016/j.ecoinf.2022.101952

Language: Python - Size: 207 MB - Last synced at: 8 days ago - Pushed at: 9 days ago - Stars: 26 - Forks: 6

voxel51/reconstruction-error-ratios

Estimate dataset difficulty and detect label mistakes using reconstruction error ratios!

Language: Python - Size: 606 KB - Last synced at: 15 days ago - Pushed at: 4 months ago - Stars: 24 - Forks: 0

kennethleungty/Data-Centric-AI-Competition

Code for a top 5% finish in the Data-Centric AI Competition organized by Andrew Ng and DeepLearning.AI

Language: Jupyter Notebook - Size: 11.4 MB - Last synced at: 19 days ago - Pushed at: over 3 years ago - Stars: 22 - Forks: 3

code-kern-ai/refinery-python-sdk

Official Python SDK for Kern AI refinery.

Language: Python - Size: 171 KB - Last synced at: 17 days ago - Pushed at: 5 months ago - Stars: 19 - Forks: 3

Lichang-Chen/AlpaGasus

A better Alpaca Model Trained with Less Data (only 9k instructions of the original set)

Language: HTML - Size: 6.66 MB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 19 - Forks: 3

SJTU-DMTai/awesome-ml-data-quality-papers

Papers about training data quality management for ML models.

Size: 1.08 MB - Last synced at: 8 months ago - Pushed at: 8 months ago - Stars: 15 - Forks: 2

autonlab/aqua

AQuA: A Benchmarking Tool for Label Quality Assessment

Language: Jupyter Notebook - Size: 3.26 MB - Last synced at: about 1 year ago - Pushed at: over 1 year ago - Stars: 15 - Forks: 0

Living-with-machines/genre-classification

Jupyter book showing how to build an ML powered book genre classifier

Language: Jupyter Notebook - Size: 7.95 MB - Last synced at: 13 days ago - Pushed at: 6 months ago - Stars: 12 - Forks: 2

Nokia-Bell-Labs/data-centric-federated-learning

Enhancing Efficiency in Multidevice Federated Learning through Data Selection

Language: Python - Size: 1.93 MB - Last synced at: 5 months ago - Pushed at: about 1 year ago - Stars: 12 - Forks: 3

fuxiAIlab/NetEaseCrowd-Dataset

NetEaseCrowd dataset, a collection of data obtained from You Ling crowdsourcing platform, Fuxi AI Lab, NetEase.

Size: 50.1 MB - Last synced at: about 1 month ago - Pushed at: 4 months ago - Stars: 9 - Forks: 0

jacobmarks/gpt4-vision-plugin

Chat with your images using GPT-4 Vision!

Language: Python - Size: 10.7 KB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 9 - Forks: 3

mdbloice/Labeller

Quickly set up an image labelling web application for manually tagging images for machine learning tasks.

Language: Python - Size: 91.8 KB - Last synced at: 5 days ago - Pushed at: over 1 year ago - Stars: 9 - Forks: 2

seedatnabeel/DIPS

You can't handle the (dirty) truth: Data-centric insights improve pseudo-labeling

Language: Jupyter Notebook - Size: 40.8 MB - Last synced at: 2 months ago - Pushed at: 10 months ago - Stars: 7 - Forks: 1

jacobmarks/semantic-document-search-plugin

Semantically search through OCR text blocks with Qdrant, Sentence Transformers, and FiftyOne!

Language: Python - Size: 20.5 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 7 - Forks: 0

seedatnabeel/TRIAGE

TRIAGE: Characterizing and auditing training data for improved regression (NeurIPS 2023)

Language: Jupyter Notebook - Size: 22.6 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 6 - Forks: 3

Digital-Dermatology/SelfClean-Revised-Benchmarks

🧼🔎 SelfClean revised versions of benchmark datasets for more reliable performance estimation.

Size: 180 KB - Last synced at: 12 months ago - Pushed at: over 1 year ago - Stars: 6 - Forks: 0

seedatnabeel/Data-SUITE

Data-SUITE: Data-centric identification of in-distribution incongruous examples (ICML 2022)

Language: Jupyter Notebook - Size: 4.22 MB - Last synced at: about 2 years ago - Pushed at: about 2 years ago - Stars: 5 - Forks: 4

seedatnabeel/Data-IQ

Data-IQ: Characterizing subgroups with heterogeneous outcomes in tabular data (NeurIPS 2022)

Language: Jupyter Notebook - Size: 14.1 MB - Last synced at: about 2 years ago - Pushed at: about 2 years ago - Stars: 5 - Forks: 2

3lc-ai/ultralytics (fork of ultralytics/ultralytics)

Ultralytics YOLO11 with a 3LC integration

Language: Python - Size: 23.3 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 4 - Forks: 0

iSE-UET-VNU/Cola

Official implementation of our paper: "COLA: Leveraging Local and Global Relationships for Corrupted Label Detection"

Language: Python - Size: 276 KB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 4 - Forks: 0

Blacksujit/100X-Engineers-GenAI-Hackathon-Submission

Dataviz AI is an AI-powered web application that lets users generate animated infographic videos from input data, text, and files. This MVP uses the Pexels API for video content and applies natural language processing (NLP) tooling, including LangChain and Stable Diffusion techniques, to analyze the input and create visual impact.

Language: Jupyter Notebook - Size: 113 MB - Last synced at: 4 days ago - Pushed at: 19 days ago - Stars: 4 - Forks: 1

TsingZ0/CoAutoGen

Cloud-Edge Collaboration Platform for Automated Synthetic Dataset Generation

Language: Python - Size: 167 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 4 - Forks: 0

jacobmarks/clustering-plugin

Compute clustering on your data in a visual, intuitive way with FiftyOne and Sklearn!

Language: Python - Size: 59.6 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 4 - Forks: 0

davanstrien/ImageIN

Find illustrations in historic books using computer vision

Language: Jupyter Notebook - Size: 22.9 MB - Last synced at: about 4 hours ago - Pushed at: over 2 years ago - Stars: 4 - Forks: 0

IS2AI/AnyFace

Input-Agnostic Face Detection

Language: Jupyter Notebook - Size: 54.2 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 3 - Forks: 1

Weixin-Liang/data-centric-AI-perspective

Language: Jupyter Notebook - Size: 3.43 MB - Last synced at: 12 months ago - Pushed at: over 3 years ago - Stars: 3 - Forks: 0

jacobmarks/double-band-filter-plugin

Filter a float-valued field on two ranges simultaneously with this FiftyOne Plugin!

Language: Python - Size: 8.79 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 2 - Forks: 0

markdavidmc0/churning-mists

Customer churn train/prediction library with automatic dataset size optimisation features.

Language: Python - Size: 6.84 KB - Last synced at: over 1 year ago - Pushed at: about 3 years ago - Stars: 2 - Forks: 0

Decentralized-AI-Reserach-Lab/FedNS

Collaboratively Learning Federated Models from Noisy Decentralized Data

Language: Python - Size: 1.14 MB - Last synced at: 30 days ago - Pushed at: 30 days ago - Stars: 1 - Forks: 0

voxel51/fiftyone-huggingface-plugins

Hugging Face Plugins for FiftyOne

Language: Python - Size: 19.5 KB - Last synced at: 15 days ago - Pushed at: 12 months ago - Stars: 1 - Forks: 0

xml94/EmbracingLimitedImperfectTrainingDatasets

Embrace limited and imperfect training datasets in plant disease recognition using deep learning.

Size: 14.6 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 1 - Forks: 0

sheikhomar/dcc

Repository for the Data-Centric AI Competition

Language: Jupyter Notebook - Size: 199 MB - Last synced at: about 2 years ago - Pushed at: over 3 years ago - Stars: 1 - Forks: 0

PugtgYosuky/EDCA

Evolutionary Data-Centric AutoML Framework for Efficient Pipelines

Language: Jupyter Notebook - Size: 8.86 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 0 - Forks: 0

ArthurMangussi/AdvML

Adversarial Machine Learning Applied to Missing Data Imputation

Language: Python - Size: 140 MB - Last synced at: 23 days ago - Pushed at: 23 days ago - Stars: 0 - Forks: 0

miriamspsantos/dcai-ecai-tutorial-2024

A multi-view panorama of Data-Centric AI: Techniques, Tools, and Applications (ECAI Tutorial 2024)

Size: 1.8 MB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

dimasthoriq/data-centric-image-classification

Applying various data engineering techniques to an image classification task for the KAIST DS801 term project

Language: Jupyter Notebook - Size: 25.1 MB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 0 - Forks: 0

beigecap/Data_Centric_AI_course

Data-centric AI course. ITMO University, AI Talent Hub.

Language: Jupyter Notebook - Size: 24.4 MB - Last synced at: 13 days ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

miriamspsantos/data-typology

Implementation of data typology for imbalanced datasets.

Language: MATLAB - Size: 1.29 MB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 0 - Forks: 0

LamineTourelab/dcai-lab (fork of dcai-course/dcai-lab)

Lab assignments for Introduction to Data-Centric AI, MIT IAP 2023 👩🏽‍💻

Size: 6.92 MB - Last synced at: over 1 year ago - Pushed at: about 2 years ago - Stars: 0 - Forks: 0

flyswot/book

📕 flyswot book on developing a pragmatic machine learning workflow in a library setting

Language: HTML - Size: 6.68 MB - Last synced at: about 4 hours ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

Tre-Xanh/lapros

Denoise data

Language: Julia - Size: 1.57 MB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

Related Topics
machine-learning (30), data-science (17), deep-learning (16), data-quality (15), computer-vision (11), data-centric-machine-learning (11), data-cleaning (9), python (8), data-centric (8), data-curation (8), fiftyone (5), nlp (5), active-learning (5), ai (5), natural-language-processing (5), data-labeling (4), artificial-intelligence (4), labeling (4), dataset (4), text-classification (4), llm (4), large-language-models (4), data-profiling (4), image-classification (4), automl (3), noisy-labels (3), exploratory-data-analysis (3), data-validation (3), annotation (3), text-annotation (3), robust-machine-learning (3), data-valuation (3), plugin (3), labeling-tool (3), data-visualization (3), outlier-detection (3), synthetic-data (3), transformers (3), weak-supervision (2), llms (2), annotations (2), course (2), homework (2), semi-supervised-learning (2), awesome-list (2), filtering-data (2), missing-data-imputation (2), data-management (2), game-theory (2), bias-detection (2), segmentation (2), object-detection (2), vector-search (2), pre-training (2), data (2), huggingface (2), embeddings (2), dataops (2), imbalanced-learning (2), imbalanced-data (2), data-complexity (2), unstructured-data (2), neural-search (2), glam (2), image-segmentation (2), federated-learning (2), spacy (2), datasets (2), generative-adversarial-network (2), supervised-learning (2), data-quality-monitoring (2), openai-api (2), trustworthy-ai (2), trustworthy-machine-learning (2), responsible-ai (2), data-quality-assessment (2), lab (2), data-generation (1), dataset-generation (1), neural-symbolic (1), api (1), benchmark-framework (1), text-annotation-tool (1), data-programming (1), segmantation (1), point-correction (1), robust-learning (1), overlapping-cell (1), sequence-labeling (1), label-correction (1), edge-detection (1), split-learning (1), machine-learning-library (1), ner (1), ai4db (1), named-entity-recognition (1), data-debugging (1), cybersecurity (1), mlops (1), db4ai (1)