An open API service providing repository metadata for many open source software ecosystems.

Topic: "data-curation"

cleanlab/cleanlab

The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.

Language: Python - Size: 11.5 MB - Last synced at: 4 days ago - Pushed at: about 1 month ago - Stars: 10,518 - Forks: 826

voxel51/fiftyone

Refine high-quality datasets and visual AI models

Language: Python - Size: 1.92 GB - Last synced at: 4 days ago - Pushed at: 6 days ago - Stars: 9,467 - Forks: 629

Docta-ai/docta

A Doctor for your data

Language: Python - Size: 27.8 MB - Last synced at: 3 days ago - Pushed at: 4 months ago - Stars: 3,264 - Forks: 236

visual-layer/fastdup

fastdup is a powerful, free tool designed to rapidly generate valuable insights from image and video datasets. It helps enhance the quality of both images and labels, while significantly reducing data operation costs, all with unmatched scalability.

Language: Python - Size: 1.73 GB - Last synced at: 2 days ago - Pushed at: 4 months ago - Stars: 1,680 - Forks: 82

Renumics/spotlight

Interactively explore unstructured datasets from your dataframe.

Language: TypeScript - Size: 46.8 MB - Last synced at: 3 days ago - Pushed at: 3 months ago - Stars: 1,175 - Forks: 87

daochenzha/data-centric-AI

A curated, but incomplete, list of data-centric AI resources.

Size: 1.99 MB - Last synced at: about 2 months ago - Pushed at: 11 months ago - Stars: 1,094 - Forks: 78

NVIDIA/NeMo-Curator

Scalable data pre-processing and curation toolkit for LLMs

Language: Jupyter Notebook - Size: 7.83 MB - Last synced at: 2 days ago - Pushed at: 3 days ago - Stars: 909 - Forks: 128

Renumics/awesome-open-data-centric-ai

Curated list of open source tooling for data-centric AI on unstructured data.

Size: 572 KB - Last synced at: 8 days ago - Pushed at: over 1 year ago - Stars: 719 - Forks: 35

getmetamapper/metamapper

Metamapper is a data discovery and documentation platform for improving how teams understand and interact with their data.

Language: Python - Size: 33 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 79 - Forks: 6

Renumics/sliceguard

A library for detecting problematic data segments in structured and unstructured data with a few lines of code.

Language: Python - Size: 4.28 MB - Last synced at: 20 days ago - Pushed at: over 1 year ago - Stars: 64 - Forks: 3

LaureBerti/Learn2Clean

Learn2Clean: Optimizing the Sequence of Tasks for Data Preparation and Cleaning

Language: Python - Size: 34.6 MB - Last synced at: 5 days ago - Pushed at: over 2 years ago - Stars: 51 - Forks: 20

UCSC-REAL/DS2

[ICLR 2025] Improving Data Efficiency via Curating LLM-Driven Rating Systems

Language: Python - Size: 18 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 49 - Forks: 5

whythawk/data-as-a-science

Lesson guide and textbook for "Data as a Science" course.

Language: Jupyter Notebook - Size: 7.58 MB - Last synced at: about 1 year ago - Pushed at: almost 4 years ago - Stars: 41 - Forks: 9

x-CK-x/Dataset-Curation-Tool

A tool for downloading from public image boards (which allow scraping), previewing your images and tags, and editing your images and tags. Additional tabs allow downloading other desired code repositories as well as S.O.T.A. diffusion and auto-tag/caption models for your purposes. Custom datasets can be added!

Language: Python - Size: 13.9 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 35 - Forks: 7

Digital-Dermatology/SelfClean

🧼🔎 A holistic self-supervised data cleaning strategy to detect irrelevant samples, near duplicates and label errors (NeurIPS'24).

Language: Python - Size: 37.7 MB - Last synced at: about 1 month ago - Pushed at: 2 months ago - Stars: 31 - Forks: 1

cleanlab/cleanlab-studio

Client interface to Cleanlab Studio and the Trustworthy Language Model

Language: Python - Size: 3.52 MB - Last synced at: about 1 month ago - Pushed at: 3 months ago - Stars: 30 - Forks: 8

iwangjian/TopDial

Target-oriented Proactive Dialogue Systems with Personalization: Problem Formulation and Dataset Curation (EMNLP 2023)

Language: Python - Size: 1.16 MB - Last synced at: 30 days ago - Pushed at: about 1 year ago - Stars: 30 - Forks: 1

PennLINC/CuBIDS

Curation of BIDS (CuBIDS): A sanity-preserving software package for processing BIDS datasets.

Language: Python - Size: 8.56 MB - Last synced at: 4 days ago - Pushed at: about 1 month ago - Stars: 25 - Forks: 12

TieuLongPhan/SynRBL

Rebalancing chemical reaction

Language: Python - Size: 137 MB - Last synced at: 22 days ago - Pushed at: about 2 months ago - Stars: 21 - Forks: 2

neo-chem/awesome-chemical-data

Curated list of known efforts in collecting and/or curating of chemical/materials data

Size: 98.6 KB - Last synced at: 9 days ago - Pushed at: over 4 years ago - Stars: 21 - Forks: 1

WolframResearch/Data-Curation-Training

Language: Mathematica - Size: 3.61 MB - Last synced at: about 1 month ago - Pushed at: over 7 years ago - Stars: 13 - Forks: 7

pg-space/panspace

Embedding-based indexing for compact storage, rapid querying, and curation of bacterial pan-genomes

Language: Jupyter Notebook - Size: 22.8 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 10 - Forks: 0

VIDA-NYU/openclean-core

Data Cleaning and Data Profiling Library for Python

Language: Python - Size: 44.9 MB - Last synced at: about 1 month ago - Pushed at: about 3 years ago - Stars: 10 - Forks: 3

mcsorkun/AqSolDB

AqSolDB: A curated aqueous solubility dataset containing 9,982 unique compounds.

Language: Python - Size: 3.06 MB - Last synced at: over 1 year ago - Pushed at: about 5 years ago - Stars: 7 - Forks: 2

Grelot/global_fish_genetic_diversity

Code I wrote for the paper "Global determinants of freshwater and marine fish genetic diversity", Nature Communications, 2020

Language: R - Size: 82.4 MB - Last synced at: over 1 year ago - Pushed at: over 5 years ago - Stars: 7 - Forks: 0

UHBristolDataScience/ICNARC-to-Philips-Linkage

Code for data linkage (curation of research database).

Language: Jupyter Notebook - Size: 1.58 MB - Last synced at: about 1 year ago - Pushed at: almost 2 years ago - Stars: 6 - Forks: 3

thehyve/tmtk 📦

tranSMART Arborist ETL toolkit

Language: Python - Size: 3.89 MB - Last synced at: 3 months ago - Pushed at: almost 5 years ago - Stars: 6 - Forks: 4

ARUP-CAS/aiscr-webamcr

Archaeological Map of the Czech Republic (AMCR)

Language: Python - Size: 16.9 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 5 - Forks: 0

ELIXIR-Norway-Training/DMP-writing-workshop

Teaching material for DMP writing course

Size: 490 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 4 - Forks: 7

Academich/reagent_emb_vis

Reaction data exploration: a map of reagents with regions of similar reagent purpose.

Language: Python - Size: 230 KB - Last synced at: 7 months ago - Pushed at: 7 months ago - Stars: 4 - Forks: 1

bluestero/urlgenie

Python package to make URL extraction, generalization, validation, and filtration easy.

Language: Python - Size: 204 KB - Last synced at: 7 days ago - Pushed at: 11 months ago - Stars: 4 - Forks: 1

cgnorthcutt/reliablity_framework_for_rag

Demo showing how the Trustworthy Language Model adds reliability to LLM outputs and improves RAG, agents, and data enrichment workflows. It can be used to improve fine-tuning of LLMs, accuracy of LLM outputs, and smart routing for RAG and agents.

Language: Jupyter Notebook - Size: 18.4 MB - Last synced at: about 1 month ago - Pushed at: about 1 year ago - Stars: 4 - Forks: 2

voxel51/fiftyone_mlflow_plugin

Track model training experiments with MLflow and FiftyOne!

Language: Python - Size: 149 KB - Last synced at: about 1 month ago - Pushed at: about 1 year ago - Stars: 4 - Forks: 0

ARUP-CAS/aiscr-digiarchiv-2

Digital Archive of the AMČR

Language: Java - Size: 18.7 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 3 - Forks: 0

halbritter-lab/gene-curator

Gene Curator is an open-source platform for managing and curating genetic data. It facilitates gene data analysis, entry, and reporting, serving genetics researchers with tools for efficient data handling.

Language: Vue - Size: 22.9 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 3 - Forks: 1

thehyve/arborist 📦

TranSMART Arborist: Graphical tool for reshaping your data for the tranSMART data warehouse.

Language: JavaScript - Size: 441 KB - Last synced at: about 1 year ago - Pushed at: over 2 years ago - Stars: 3 - Forks: 0

UAL-RE/ldcoolp-figshare

Python tool using the Figshare API for data curation

Language: Python - Size: 79.1 KB - Last synced at: 25 days ago - Pushed at: almost 3 years ago - Stars: 3 - Forks: 1

Henrium/ET-AL

Entropy-targeted active learning for bias mitigation in materials data.

Language: Python - Size: 19.4 MB - Last synced at: about 1 month ago - Pushed at: 4 months ago - Stars: 2 - Forks: 1

gliff-ai/curate

gliff.ai CURATE – a user-friendly browser interface for curating large multidimensional image datasets for machine learning development

Language: TypeScript - Size: 3.57 MB - Last synced at: 19 days ago - Pushed at: 5 months ago - Stars: 2 - Forks: 0

yago-mendoza/suskind-knowledge-graph

Graph-based NLP framework leveraging a curated database and an intuitive CLI for advanced, context-rich language understanding.

Language: Python - Size: 271 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 2 - Forks: 0

hubmapconsortium/py-hubmap-inventory

Package that builds a JSON inventory/manifest from public primary or derived datasets

Language: Python - Size: 330 KB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 2 - Forks: 5

datahiv3/Legalese-Nodes

Comprehensive framework for Legalese Nodes in the DataHive ecosystem, including legal data indexing, curation, and legal intelligence layers.

Size: 1.79 MB - Last synced at: about 2 months ago - Pushed at: 7 months ago - Stars: 2 - Forks: 0

johannesuhl/hisdac-es

HISDAC-ES: Creating historical settlement data for Spain (1900-2020) based on cadastral building footprint data

Language: Python - Size: 19.3 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 2 - Forks: 1

IQTLabs/VennData 📦

One of the biggest barriers to widespread machine learning adoption is the difficulty in collecting a 'good' dataset. There is an overall consensus that a 'good' dataset is a big dataset, but we believe that we can do better. As such the VennData project was created to develop tools to guide in the collection, curation, augmentation and validation of data.

Language: Jupyter Notebook - Size: 115 MB - Last synced at: about 1 year ago - Pushed at: over 4 years ago - Stars: 2 - Forks: 1

gage1145/quicR

Open-source R toolkit for RT-QuIC data analysis

Language: R - Size: 21.9 MB - Last synced at: 30 days ago - Pushed at: 30 days ago - Stars: 1 - Forks: 0

tznurmin/TEA_curated_data

Curated data and source articles for microbial strains (Strain Tagger) and human microbial pathogens (Pathogen Identifier) datasets. Over 3,500 tagged entities from hundreds of full-length articles.

Size: 20.7 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

JoeLollo21/WikiFreaks

An open repository of Wikipedia data + final project for LIS 546 at UW; by Joe Lollo and Lily Woodard.

Language: HTML - Size: 1.33 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 1 - Forks: 0

JoeLollo21/LIS-54X

Assignments and activities for Data Curation I and II in the MLIS program at the University of Washington.

Language: XSLT - Size: 7.02 MB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 1 - Forks: 0

cmrn-rhi/covid19-crf-analysis

COVID19 Case Report Form Analysis - data and collection forms.

Size: 6.04 MB - Last synced at: over 1 year ago - Pushed at: almost 3 years ago - Stars: 1 - Forks: 0

mars-aria/superhero_data_analysis

For this human-centered data science project, I analyzed data on the gender characteristics of superheroes and villains to determine the ratio of female to male characters appearing in comic books.

Language: Python - Size: 53.7 KB - Last synced at: about 2 years ago - Pushed at: about 3 years ago - Stars: 1 - Forks: 0

RaulRC/Covid-19

Some analysis on public datasets [WIP]

Language: Jupyter Notebook - Size: 7.55 MB - Last synced at: about 2 years ago - Pushed at: almost 5 years ago - Stars: 1 - Forks: 0

AGENTproject/historic_pheno_data_analysis

Analysis of wheat and barley historical data from 9 AGENT genebanks.

Language: Jupyter Notebook - Size: 159 MB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 0 - Forks: 0

apelullo/yelp_health_data_curation_ops

An AWS-based data pipeline to extract, process, store, and monitor Yelp "health-related" facility data in support of ongoing health system initiatives.

Language: Jupyter Notebook - Size: 1.17 MB - Last synced at: about 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

laura-budurlean/Data-Wrangling-Exercise-RO4532A

This R script performs data wrangling, cleaning, and transformation tasks for a fictitious study RO4532A. It processes multiple sheets from an Excel file, merges and reshapes the data, and generates a curated dataset.

Language: R - Size: 27.3 KB - Last synced at: about 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

caumente/multi_task_breast_cancer

Multi-task framework for breast cancer segmentation and classification

Language: Python - Size: 1.21 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

ameencaslam/Filter-V2

Image Filter Tool V2 is a powerful web-based application designed to streamline the process of filtering, categorizing, and managing large image datasets, with customizable layouts, multiple selection modes, and real-time progress tracking.

Language: HTML - Size: 27.3 KB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0

lodac/curation-ontology

Repository for Data Curation Process Ontology

Size: 1.72 MB - Last synced at: about 1 month ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 1

USGCRP/gcis-conventions

Repository for the collection, management, and versioning of the GCIS data management conventions.

Size: 1.4 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

ShreyaPatil1199/Gender_By_Name

This dataset compiles the number of occurrences of male and female baby names during specific time periods. It then calculates the probability of a name based on the total count. The data comes directly from government authorities, ensuring its credibility.

Language: Jupyter Notebook - Size: 2.02 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

fer-aguirre/data-annotator 📦

Web application for text-based data labeling 🏷️

Language: Python - Size: 392 KB - Last synced at: about 1 year ago - Pushed at: almost 2 years ago - Stars: 0 - Forks: 0

rainbowmycelium/ConSequences

R script for renaming GenBank sequences, filling in missing molecular marker data, and concatenating sequences

Language: R - Size: 24.4 KB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 0 - Forks: 0

arkansas-research-platform/RBioTools

The RBiotools package provides support for basic comparative microbial genomics, with functions parameterized by a list of GenBank accession numbers and R implementations of Prodigal, RNAmmer, and Linclust.

Language: R - Size: 10.4 MB - Last synced at: over 1 year ago - Pushed at: about 2 years ago - Stars: 0 - Forks: 2

wmacmillan/data-products

Business glossary and discussion of data product terms.

Language: HTML - Size: 1.15 MB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

Wimmics/SameLive

This program discovers equivalence links (owl:sameAs) for a given set of URIs dynamically and online using SPARQL queries.

Language: Python - Size: 110 KB - Last synced at: about 1 year ago - Pushed at: about 3 years ago - Stars: 0 - Forks: 0

skoc/pdxnet-preprocessing

Language: HTML - Size: 2.09 MB - Last synced at: over 1 year ago - Pushed at: about 3 years ago - Stars: 0 - Forks: 0

acdh-oeaw/tokeneditor 📦

TokenEditor is a web application for manual annotation (or manual review of automatic annotations) of text. Albeit primarily aimed at reviewing PoS tags and lemmas, it is fully customizable, to support any annotation levels.

Language: JavaScript - Size: 1.27 MB - Last synced at: 2 months ago - Pushed at: over 3 years ago - Stars: 0 - Forks: 0

kthrog/LIS-546-guest-lecture

Materials from a guest lecture entitled, "Beyond Data Standards," prepared for University of Washington's LIS 546 (Data Curation II) in Spring 2021.

Size: 26.6 MB - Last synced at: about 2 years ago - Pushed at: about 4 years ago - Stars: 0 - Forks: 0

lucia15/Diplodatos2020-Practicos-Mentoria-Coronavirus

Practical assignments from the "Diploma in Data Sciences, Machine Learning and its applications", in which I was a mentor.

Language: Jupyter Notebook - Size: 1.19 MB - Last synced at: 7 months ago - Pushed at: over 4 years ago - Stars: 0 - Forks: 0

Randhir123/data-integration

Language: Jupyter Notebook - Size: 8.79 KB - Last synced at: almost 2 years ago - Pushed at: about 7 years ago - Stars: 0 - Forks: 0

Related Topics
data-science (13), data-cleaning (11), machine-learning (11), python (10), data-centric-ai (8), data-analysis (8), data-visualization (7), computer-vision (6), deep-learning (6), data-quality (6), visualization (5), active-learning (4), bioinformatics (4), data (4), outlier-detection (4), data-processing (3), image-classification (3), artificial-intelligence (3), metadata (3), dataset (3), r (3), exploratory-data-analysis (3), noisy-labels (3), data-labeling (3), data-wrangling (3), repository (3), llms (2), genomics (2), web-application (2), data-centric-machine-learning (2), nlp (2), fair (2), transmart (2), digital-archive (2), gender (2), archaeology (2), hacktoberfest (2), dataops (2), unstructured-data (2), object-detection (2), ontology (2), llm (2), natural-language-processing (2), data-validation (2), data-profiling (2), genetics (2), research-data (2), user-interface (2), r-programming (2), open-source (2), language-model (2), datasets (2), large-language-models (2), synonym-matching (1), semantics (1), data-governance (1), knowledge-representation (1), graph-theory (1), design-algorithm (1), data-handling (1), command-line-tool (1), training-data (1), strain-identification (1), pathogens (1), ner (1), named-entity-recognition (1), corpus (1), biomedical-datasets (1), annotated-data (1), data-management (1), yelp-dataset (1), pangenome (1), url-generalization (1), generalization (1), data-sanitization (1), data-cleansing (1), progress-tracking (1), visualization-tools (1), image-filtering (1), image-dataset-management (1), image-categorization (1), flask (1), dataset-filtering (1), dataset-cleaning (1), mlflow (1), fiftyone-datasets (1), data-deposition (1), fiftyone (1), experiment-tracking (1), legalese-nodes (1), legal-intelligence (1), teaching-material (1), dialogue-systems (1), decentralized-legal-framework (1), datahive (1), blockchain (1), personalization (1), uspto (1), figshare (1), reagents (1)