GitHub topics: data-centric-machine-learning
luo-junyu/Awesome-Data-Efficient-LLM
A list of data-efficient and data-centric LLM (Large Language Model) papers. Our Survey Paper: Towards Efficient LLM Post Training: A Data-centric Perspective
Size: 884 KB - Last synced at: 8 days ago - Pushed at: 2 months ago - Stars: 29 - Forks: 4

microsoft/data-centric-satellite-segmentation
Contains implementations of data-centric approaches for improving semantic segmentation on satellite imagery.
Language: Python - Size: 561 KB - Last synced at: 7 days ago - Pushed at: 17 days ago - Stars: 36 - Forks: 1

Docta-ai/docta
A Doctor for your data
Language: Python - Size: 27.8 MB - Last synced at: 17 days ago - Pushed at: 3 months ago - Stars: 3,098 - Forks: 231

Decentralized-AI-Reserach-Lab/FedNS
Collaboratively Learning Federated Models from Noisy Decentralized Data
Language: Python - Size: 1.14 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

daochenzha/data-centric-AI
A curated, but incomplete, list of data-centric AI resources.
Size: 1.99 MB - Last synced at: about 1 month ago - Pushed at: 10 months ago - Stars: 1,094 - Forks: 78

seedatnabeel/DIPS
You can’t handle the (dirty) truth: Data-centric insights improve pseudo-labeling
Language: Jupyter Notebook - Size: 40.8 MB - Last synced at: 2 months ago - Pushed at: 10 months ago - Stars: 7 - Forks: 1

mashijie1028/TrustDD
Code for our paper "Towards Trustworthy Dataset Distillation" (Pattern Recognition 2025)
Language: Python - Size: 639 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 1 - Forks: 0

Nokia-Bell-Labs/data-centric-federated-learning
Enhancing Efficiency in Multidevice Federated Learning through Data Selection
Language: Python - Size: 1.93 MB - Last synced at: 5 months ago - Pushed at: about 1 year ago - Stars: 12 - Forks: 3

miriamspsantos/dcai-ecai-tutorial-2024
A multi-view panorama of Data-Centric AI: Techniques, Tools, and Applications (ECAI Tutorial 2024)
Size: 1.8 MB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

sangmichaelxie/doremi
PyTorch implementation of DoReMi, a method for optimizing the data mixture weights in language modeling datasets
Language: HTML - Size: 24.4 MB - Last synced at: about 1 year ago - Pushed at: over 1 year ago - Stars: 226 - Forks: 30

seedatnabeel/TRIAGE
TRIAGE: Characterizing and auditing training data for improved regression (NeurIPS 2023)
Language: Jupyter Notebook - Size: 22.6 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 6 - Forks: 3

miriamspsantos/data-typology
Implementation of data typology for imbalanced datasets.
Language: MATLAB - Size: 1.29 MB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 0 - Forks: 0

seedatnabeel/Data-SUITE
Data-SUITE: Data-centric identification of in-distribution incongruous examples (ICML 2022)
Language: Jupyter Notebook - Size: 4.22 MB - Last synced at: about 2 years ago - Pushed at: about 2 years ago - Stars: 5 - Forks: 4

seedatnabeel/Data-IQ
Data-IQ: Characterizing subgroups with heterogeneous outcomes in tabular data (NeurIPS 2022)
Language: Jupyter Notebook - Size: 14.1 MB - Last synced at: about 2 years ago - Pushed at: about 2 years ago - Stars: 5 - Forks: 2

ElsevierSoftwareX/SOFTX-D-21-00177 (fork of parichit/DCEM)
Data clustering using the Expectation Maximization algorithm. To cite this Original Software Publication: https://www.sciencedirect.com/science/article/pii/S2352711021001771
Size: 8.6 MB - Last synced at: about 2 years ago - Pushed at: over 3 years ago - Stars: 0 - Forks: 0
