Topic: "data-curation"
cleanlab/cleanlab
The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
Language: Python - Size: 11.5 MB - Last synced at: 4 days ago - Pushed at: about 1 month ago - Stars: 10,518 - Forks: 826

voxel51/fiftyone
Refine high-quality datasets and visual AI models
Language: Python - Size: 1.92 GB - Last synced at: 4 days ago - Pushed at: 6 days ago - Stars: 9,467 - Forks: 629

Docta-ai/docta
A Doctor for your data
Language: Python - Size: 27.8 MB - Last synced at: 3 days ago - Pushed at: 4 months ago - Stars: 3,264 - Forks: 236

visual-layer/fastdup
fastdup is a powerful, free tool designed to rapidly generate valuable insights from image and video datasets. It helps enhance the quality of both images and labels, while significantly reducing data operation costs, all with unmatched scalability.
Language: Python - Size: 1.73 GB - Last synced at: 2 days ago - Pushed at: 4 months ago - Stars: 1,680 - Forks: 82

Renumics/spotlight
Interactively explore unstructured datasets from your dataframe.
Language: TypeScript - Size: 46.8 MB - Last synced at: 3 days ago - Pushed at: 3 months ago - Stars: 1,175 - Forks: 87

daochenzha/data-centric-AI
A curated, but incomplete, list of data-centric AI resources.
Size: 1.99 MB - Last synced at: about 2 months ago - Pushed at: 11 months ago - Stars: 1,094 - Forks: 78

NVIDIA/NeMo-Curator
Scalable data preprocessing and curation toolkit for LLMs
Language: Jupyter Notebook - Size: 7.83 MB - Last synced at: 2 days ago - Pushed at: 3 days ago - Stars: 909 - Forks: 128

Renumics/awesome-open-data-centric-ai
Curated list of open source tooling for data-centric AI on unstructured data.
Size: 572 KB - Last synced at: 8 days ago - Pushed at: over 1 year ago - Stars: 719 - Forks: 35

getmetamapper/metamapper
Metamapper is a data discovery and documentation platform for improving how teams understand and interact with their data.
Language: Python - Size: 33 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 79 - Forks: 6

Renumics/sliceguard
A library for detecting problematic data segments in structured and unstructured data with a few lines of code.
Language: Python - Size: 4.28 MB - Last synced at: 20 days ago - Pushed at: over 1 year ago - Stars: 64 - Forks: 3

LaureBerti/Learn2Clean
Learn2Clean: Optimizing the Sequence of Tasks for Data Preparation and Cleaning
Language: Python - Size: 34.6 MB - Last synced at: 5 days ago - Pushed at: over 2 years ago - Stars: 51 - Forks: 20

UCSC-REAL/DS2
[ICLR 2025] Improving Data Efficiency via Curating LLM-Driven Rating Systems
Language: Python - Size: 18 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 49 - Forks: 5

whythawk/data-as-a-science
Lesson guide and textbook for "Data as a Science" course.
Language: Jupyter Notebook - Size: 7.58 MB - Last synced at: about 1 year ago - Pushed at: almost 4 years ago - Stars: 41 - Forks: 9

x-CK-x/Dataset-Curation-Tool
A tool for downloading from public image boards (which allow scraping), previewing your images and tags, and editing them. Additional tabs let you download other code repositories as well as S.O.T.A. diffusion and auto-tag/caption models. Custom datasets can be added!
Language: Python - Size: 13.9 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 35 - Forks: 7

Digital-Dermatology/SelfClean
🧼🔎 A holistic self-supervised data cleaning strategy to detect irrelevant samples, near duplicates and label errors (NeurIPS'24).
Language: Python - Size: 37.7 MB - Last synced at: about 1 month ago - Pushed at: 2 months ago - Stars: 31 - Forks: 1

cleanlab/cleanlab-studio
Client interface to Cleanlab Studio and the Trustworthy Language Model
Language: Python - Size: 3.52 MB - Last synced at: about 1 month ago - Pushed at: 3 months ago - Stars: 30 - Forks: 8

iwangjian/TopDial
Target-oriented Proactive Dialogue Systems with Personalization: Problem Formulation and Dataset Curation (EMNLP 2023)
Language: Python - Size: 1.16 MB - Last synced at: 30 days ago - Pushed at: about 1 year ago - Stars: 30 - Forks: 1

PennLINC/CuBIDS
Curation of BIDS (CuBIDS): A sanity-preserving software package for processing BIDS datasets.
Language: Python - Size: 8.56 MB - Last synced at: 4 days ago - Pushed at: about 1 month ago - Stars: 25 - Forks: 12

TieuLongPhan/SynRBL
Rebalancing chemical reaction
Language: Python - Size: 137 MB - Last synced at: 22 days ago - Pushed at: about 2 months ago - Stars: 21 - Forks: 2

neo-chem/awesome-chemical-data
Curated list of known efforts in collecting and/or curating chemical/materials data
Size: 98.6 KB - Last synced at: 9 days ago - Pushed at: over 4 years ago - Stars: 21 - Forks: 1

WolframResearch/Data-Curation-Training
Language: Mathematica - Size: 3.61 MB - Last synced at: about 1 month ago - Pushed at: over 7 years ago - Stars: 13 - Forks: 7

pg-space/panspace
Embedding-based indexing for compact storage, rapid querying, and curation of bacterial pan-genomes
Language: Jupyter Notebook - Size: 22.8 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 10 - Forks: 0

VIDA-NYU/openclean-core
Data Cleaning and Data Profiling Library for Python
Language: Python - Size: 44.9 MB - Last synced at: about 1 month ago - Pushed at: about 3 years ago - Stars: 10 - Forks: 3

mcsorkun/AqSolDB
AqSolDB: A curated aqueous solubility dataset containing 9,982 unique compounds.
Language: Python - Size: 3.06 MB - Last synced at: over 1 year ago - Pushed at: about 5 years ago - Stars: 7 - Forks: 2

Grelot/global_fish_genetic_diversity
Code I wrote for the paper "Global determinants of freshwater and marine fish genetic diversity" (Nature Communications, 2020)
Language: R - Size: 82.4 MB - Last synced at: over 1 year ago - Pushed at: over 5 years ago - Stars: 7 - Forks: 0

UHBristolDataScience/ICNARC-to-Philips-Linkage
Code for data linkage (curation of research database).
Language: Jupyter Notebook - Size: 1.58 MB - Last synced at: about 1 year ago - Pushed at: almost 2 years ago - Stars: 6 - Forks: 3

thehyve/tmtk 📦
tranSMART Arborist ETL toolkit
Language: Python - Size: 3.89 MB - Last synced at: 3 months ago - Pushed at: almost 5 years ago - Stars: 6 - Forks: 4

ARUP-CAS/aiscr-webamcr
Archaeological Map of the Czech Republic (AMCR)
Language: Python - Size: 16.9 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 5 - Forks: 0

ELIXIR-Norway-Training/DMP-writing-workshop
Teaching material for DMP writing course
Size: 490 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 4 - Forks: 7

Academich/reagent_emb_vis
Reaction data exploration: a map of reagents with regions of similar reagent purpose.
Language: Python - Size: 230 KB - Last synced at: 7 months ago - Pushed at: 7 months ago - Stars: 4 - Forks: 1

bluestero/urlgenie
Python package to make URL extraction, generalization, validation, and filtration easy.
Language: Python - Size: 204 KB - Last synced at: 7 days ago - Pushed at: 11 months ago - Stars: 4 - Forks: 1

cgnorthcutt/reliablity_framework_for_rag
Demo showing how the Trustworthy Language Model adds reliability to LLM outputs and improves RAG, agents, and data enrichment workflows. It can be used to improve fine-tuning of LLMs, accuracy of LLM outputs, and smart routing for RAG and agents.
Language: Jupyter Notebook - Size: 18.4 MB - Last synced at: about 1 month ago - Pushed at: about 1 year ago - Stars: 4 - Forks: 2

voxel51/fiftyone_mlflow_plugin
Track model training experiments with MLflow and FiftyOne!
Language: Python - Size: 149 KB - Last synced at: about 1 month ago - Pushed at: about 1 year ago - Stars: 4 - Forks: 0

ARUP-CAS/aiscr-digiarchiv-2
Digital archive of the AMČR
Language: Java - Size: 18.7 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 3 - Forks: 0

halbritter-lab/gene-curator
Gene Curator is an open-source platform for managing and curating genetic data. It facilitates gene data analysis, entry, and reporting, serving genetics researchers with tools for efficient data handling.
Language: Vue - Size: 22.9 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 3 - Forks: 1

thehyve/arborist 📦
TranSMART Arborist: Graphical tool for reshaping your data for the tranSMART data warehouse.
Language: JavaScript - Size: 441 KB - Last synced at: about 1 year ago - Pushed at: over 2 years ago - Stars: 3 - Forks: 0

UAL-RE/ldcoolp-figshare
Python tool using the Figshare API for data curation
Language: Python - Size: 79.1 KB - Last synced at: 25 days ago - Pushed at: almost 3 years ago - Stars: 3 - Forks: 1

Henrium/ET-AL
Entropy-targeted active learning for bias mitigation in materials data.
Language: Python - Size: 19.4 MB - Last synced at: about 1 month ago - Pushed at: 4 months ago - Stars: 2 - Forks: 1

gliff-ai/curate
gliff.ai CURATE – a user-friendly browser interface for curating large multidimensional image datasets for machine learning development
Language: TypeScript - Size: 3.57 MB - Last synced at: 19 days ago - Pushed at: 5 months ago - Stars: 2 - Forks: 0

yago-mendoza/suskind-knowledge-graph
Graph-based NLP framework leveraging a curated database and an intuitive CLI for advanced, context-rich language understanding.
Language: Python - Size: 271 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 2 - Forks: 0

hubmapconsortium/py-hubmap-inventory
Package that builds a JSON inventory/manifest from public primary or derived datasets
Language: Python - Size: 330 KB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 2 - Forks: 5

datahiv3/Legalese-Nodes
Comprehensive framework for Legalese Nodes in the DataHive ecosystem, including legal data indexing, curation, and legal intelligence layers.
Size: 1.79 MB - Last synced at: about 2 months ago - Pushed at: 7 months ago - Stars: 2 - Forks: 0

johannesuhl/hisdac-es
HISDAC-ES: Creating historical settlement data for Spain (1900-2020) based on cadastral building footprint data
Language: Python - Size: 19.3 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 2 - Forks: 1

IQTLabs/VennData 📦
One of the biggest barriers to widespread machine learning adoption is the difficulty in collecting a 'good' dataset. There is an overall consensus that a 'good' dataset is a big dataset, but we believe that we can do better. As such, the VennData project was created to develop tools to guide the collection, curation, augmentation and validation of data.
Language: Jupyter Notebook - Size: 115 MB - Last synced at: about 1 year ago - Pushed at: over 4 years ago - Stars: 2 - Forks: 1

gage1145/quicR
Open-source R toolkit for RT-QuIC data analysis
Language: R - Size: 21.9 MB - Last synced at: 30 days ago - Pushed at: 30 days ago - Stars: 1 - Forks: 0

tznurmin/TEA_curated_data
Curated data and source articles for microbial strains (Strain Tagger) and human microbial pathogens (Pathogen Identifier) datasets. Over 3,500 tagged entities from hundreds of full-length articles.
Size: 20.7 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

JoeLollo21/WikiFreaks
An open repository of Wikipedia data + final project for LIS 546 at UW; by Joe Lollo and Lily Woodard.
Language: HTML - Size: 1.33 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 1 - Forks: 0

JoeLollo21/LIS-54X
Assignments and activities for Data Curation I and II in the MLIS program at the University of Washington.
Language: XSLT - Size: 7.02 MB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 1 - Forks: 0

cmrn-rhi/covid19-crf-analysis
COVID19 Case Report Form Analysis - data and collection forms.
Size: 6.04 MB - Last synced at: over 1 year ago - Pushed at: almost 3 years ago - Stars: 1 - Forks: 0

mars-aria/superhero_data_analysis
For this human-centered data science project, I analyzed some data on the Gender characteristics of Superheroes and Villains to determine the ratio of female characters that appear in comic books compared to their male counterparts.
Language: Python - Size: 53.7 KB - Last synced at: about 2 years ago - Pushed at: about 3 years ago - Stars: 1 - Forks: 0

RaulRC/Covid-19
Some analysis on public datasets [WIP]
Language: Jupyter Notebook - Size: 7.55 MB - Last synced at: about 2 years ago - Pushed at: almost 5 years ago - Stars: 1 - Forks: 0

AGENTproject/historic_pheno_data_analysis
Analysis of wheat and barley historical data from 9 AGENT genebanks.
Language: Jupyter Notebook - Size: 159 MB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 0 - Forks: 0

apelullo/yelp_health_data_curation_ops
An AWS-based data pipeline to extract, process, store, and monitor Yelp "health-related" facility data in support of ongoing health system initiatives.
Language: Jupyter Notebook - Size: 1.17 MB - Last synced at: about 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

laura-budurlean/Data-Wrangling-Exercise-RO4532A
This R script performs data wrangling, cleaning, and transformation tasks for a fictitious study RO4532A. It processes multiple sheets from an Excel file, merges and reshapes the data, and generates a curated dataset.
Language: R - Size: 27.3 KB - Last synced at: about 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

caumente/multi_task_breast_cancer
Multi-task framework for breast cancer segmentation and classification
Language: Python - Size: 1.21 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

ameencaslam/Filter-V2
Image Filter Tool V2 is a powerful web-based application designed to streamline the process of filtering, categorizing, and managing large image datasets. It offers customizable layouts, multiple selection modes, and real-time progress tracking.
Language: HTML - Size: 27.3 KB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0

lodac/curation-ontology
Repository for Data Curation Process Ontology
Size: 1.72 MB - Last synced at: about 1 month ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 1

USGCRP/gcis-conventions
Repository for the collection, management, and versioning of the GCIS data management conventions.
Size: 1.4 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

ShreyaPatil1199/Gender_By_Name
This dataset compiles the number of occurrences of male and female baby names during specific time periods. It then calculates the probability of a name based on the total count. The data comes directly from government authorities, ensuring its credibility.
Language: Jupyter Notebook - Size: 2.02 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

fer-aguirre/data-annotator 📦
Web application for text-based data labeling 🏷️
Language: Python - Size: 392 KB - Last synced at: about 1 year ago - Pushed at: almost 2 years ago - Stars: 0 - Forks: 0

rainbowmycelium/ConSequences
R script for renaming GenBank sequences, filling in missing molecular marker data, and concatenating sequences
Language: R - Size: 24.4 KB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 0 - Forks: 0

arkansas-research-platform/RBioTools
The RBiotools package provides support for basic comparative microbial genomics, with functions parameterized by a list of GenBank accession numbers and R implementations of Prodigal, RNAmmer, and Linclust.
Language: R - Size: 10.4 MB - Last synced at: over 1 year ago - Pushed at: about 2 years ago - Stars: 0 - Forks: 2

wmacmillan/data-products
Business glossary and discussion of data product terms.
Language: HTML - Size: 1.15 MB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

Wimmics/SameLive
This program discovers equivalence links (owl:sameAs) for a given set of URIs dynamically and online using SPARQL queries.
Language: Python - Size: 110 KB - Last synced at: about 1 year ago - Pushed at: about 3 years ago - Stars: 0 - Forks: 0

skoc/pdxnet-preprocessing
Language: HTML - Size: 2.09 MB - Last synced at: over 1 year ago - Pushed at: about 3 years ago - Stars: 0 - Forks: 0

acdh-oeaw/tokeneditor 📦
TokenEditor is a web application for manual annotation (or manual review of automatic annotations) of text. Albeit primarily aimed at reviewing PoS tags and lemmas, it is fully customizable, to support any annotation levels.
Language: JavaScript - Size: 1.27 MB - Last synced at: 2 months ago - Pushed at: over 3 years ago - Stars: 0 - Forks: 0

kthrog/LIS-546-guest-lecture
Materials from a guest lecture entitled "Beyond Data Standards," prepared for University of Washington's LIS 546 (Data Curation II) in Spring 2021.
Size: 26.6 MB - Last synced at: about 2 years ago - Pushed at: about 4 years ago - Stars: 0 - Forks: 0

lucia15/Diplodatos2020-Practicos-Mentoria-Coronavirus
Practical exercises for the "Diploma in Data Sciences, Machine Learning and its applications", in which I was a mentor.
Language: Jupyter Notebook - Size: 1.19 MB - Last synced at: 7 months ago - Pushed at: over 4 years ago - Stars: 0 - Forks: 0

Randhir123/data-integration
Language: Jupyter Notebook - Size: 8.79 KB - Last synced at: almost 2 years ago - Pushed at: about 7 years ago - Stars: 0 - Forks: 0
