An open API service providing repository metadata for many open source software ecosystems.

Topic: "data-quality"

GokuMohandas/Made-With-ML

Learn how to design, develop, deploy and iterate on production-grade ML applications.

Language: Jupyter Notebook - Size: 3.82 MB - Last synced at: about 7 hours ago - Pushed at: about 1 year ago - Stars: 42,744 - Forks: 6,640

eugeneyan/applied-ml

📚 Papers & tech blogs by companies sharing their work on data science & machine learning in production.

Size: 388 KB - Last synced at: 6 months ago - Pushed at: about 1 year ago - Stars: 27,827 - Forks: 3,736

ydataai/ydata-profiling

1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.

Language: Python - Size: 840 MB - Last synced at: 2 days ago - Pushed at: 7 days ago - Stars: 13,123 - Forks: 1,742

cleanlab/cleanlab

Cleanlab's open-source library is the standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.

Language: Python - Size: 11.5 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 10,867 - Forks: 850

great-expectations/great_expectations

Always know what to expect from your data.

Language: Python - Size: 222 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 10,690 - Forks: 1,606

voxel51/fiftyone

Refine high-quality datasets and visual AI models

Language: Python - Size: 2.01 GB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 9,855 - Forks: 662

open-metadata/OpenMetadata

OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance powered by a central metadata repository, in-depth column level lineage, and seamless team collaboration.

Language: TypeScript - Size: 2.66 GB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 7,433 - Forks: 1,403

evidentlyai/evidently

Evidently is an open-source ML and LLM observability framework. Evaluate, test, and monitor any AI-powered system or data pipeline. From tabular data to Gen AI. 100+ metrics.

Language: Jupyter Notebook - Size: 320 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 6,578 - Forks: 722

feast-dev/feast

The Open Source Feature Store for AI/ML

Language: Python - Size: 160 MB - Last synced at: about 8 hours ago - Pushed at: about 9 hours ago - Stars: 6,313 - Forks: 1,140

treeverse/lakeFS

lakeFS - Data version control for your data lake | Git for data

Language: Go - Size: 153 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 4,850 - Forks: 391

GokuMohandas/mlops-course

Learn how to design, develop, deploy and iterate on production-grade ML applications.

Language: Jupyter Notebook - Size: 7.16 MB - Last synced at: 3 months ago - Pushed at: about 1 year ago - Stars: 3,125 - Forks: 551

datafold/data-diff 📦

Compare tables within or across databases

Language: Python - Size: 3.98 MB - Last synced at: 19 days ago - Pushed at: over 1 year ago - Stars: 2,988 - Forks: 294

whylabs/whylogs

An open-source data logging library for machine learning models and data pipelines. 📚 Provides visibility into data quality & model performance over time. 🛡️ Supports privacy-preserving data collection, ensuring safety & robustness. 📈

Language: Jupyter Notebook - Size: 181 MB - Last synced at: 10 days ago - Pushed at: 8 months ago - Stars: 2,749 - Forks: 130

sodadata/soda-core

⚡ Data quality testing for the modern data stack (SQL, Spark, and Pandas) https://www.soda.io

Language: Python - Size: 4.49 MB - Last synced at: 4 days ago - Pushed at: 5 days ago - Stars: 2,158 - Forks: 243

featureform/featureform

The Virtual Feature Store. Turn your existing data infrastructure into a feature store.

Language: Go - Size: 217 MB - Last synced at: 10 days ago - Pushed at: 2 months ago - Stars: 1,941 - Forks: 98

feathr-ai/feathr

Feathr – A scalable, unified data and AI engineering platform for enterprise

Language: Scala - Size: 29.4 MB - Last synced at: 7 days ago - Pushed at: over 1 year ago - Stars: 1,906 - Forks: 235

re-data/re-data

re_data - fix data issues before your users & CEO would discover them 😊

Language: HTML - Size: 76.5 MB - Last synced at: 7 days ago - Pushed at: over 1 year ago - Stars: 1,570 - Forks: 122

opendatadiscovery/odd-platform

The first open-source data discovery and observability platform. We make life easy for data practitioners so you can focus on your business.

Language: Java - Size: 27.9 MB - Last synced at: 4 months ago - Pushed at: 7 months ago - Stars: 1,317 - Forks: 123

NVIDIA-NeMo/Curator

Scalable data preprocessing and curation toolkit for LLMs

Language: Python - Size: 16 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 1,130 - Forks: 168

daochenzha/data-centric-AI

A curated, but incomplete, list of data-centric AI resources.

Size: 1.99 MB - Last synced at: about 2 months ago - Pushed at: about 1 year ago - Stars: 1,116 - Forks: 78

cleanlab/cleanvision

Automatically find issues in image datasets and practice data-centric computer vision.

Language: Python - Size: 2.12 MB - Last synced at: 23 days ago - Pushed at: 5 months ago - Stars: 1,106 - Forks: 75

rstudio/pointblank

Data quality assessment and metadata reporting for data frames and database tables

Language: R - Size: 105 MB - Last synced at: 2 days ago - Pushed at: 13 days ago - Stars: 987 - Forks: 59

opendatadiscovery/awesome-data-catalogs

📙 Awesome Data Catalogs and Observability Platforms.

Size: 312 KB - Last synced at: 2 days ago - Pushed at: 24 days ago - Stars: 900 - Forks: 64

WeBankFinTech/Qualitis

Qualitis is a one-stop data quality management platform that supports quality verification, notification, and management for various data sources. It is used to solve data quality problems caused by data processing. https://github.com/WeBankFinTech/Qualitis

Language: Java - Size: 48.6 MB - Last synced at: 5 months ago - Pushed at: 6 months ago - Stars: 738 - Forks: 308

kennethleungty/Failed-ML

Compilation of high-profile real-world examples of failed machine learning projects

Size: 5.08 MB - Last synced at: 5 months ago - Pushed at: about 1 year ago - Stars: 729 - Forks: 48

datavane/datavines

Know your data better! Datavines is a next-gen data observability platform that supports metadata management and data quality.

Language: Java - Size: 22.8 MB - Last synced at: 28 days ago - Pushed at: 28 days ago - Stars: 644 - Forks: 184

bitol-io/open-data-contract-standard

Home of the Open Data Contract Standard (ODCS).

Language: Ruby - Size: 8.4 MB - Last synced at: 26 days ago - Pushed at: 26 days ago - Stars: 523 - Forks: 53

polyaxon/traceml

Engine for ML/Data tracking, visualization, explainability, drift detection, and dashboards for Polyaxon.

Language: Python - Size: 118 MB - Last synced at: 3 days ago - Pushed at: 3 months ago - Stars: 520 - Forks: 46

InfuseAI/piperider

Code review for data in dbt

Language: Python - Size: 32.6 MB - Last synced at: 11 days ago - Pushed at: 8 months ago - Stars: 489 - Forks: 24

encord-team/encord-active

The toolkit to test, validate, and evaluate your models and surface, curate, and prioritize the most valuable data for labeling.

Language: Python - Size: 264 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 449 - Forks: 26

MigoXLab/dingo

Dingo: A Comprehensive AI Data Quality Evaluation Tool

Language: JavaScript - Size: 21.9 MB - Last synced at: about 17 hours ago - Pushed at: about 19 hours ago - Stars: 438 - Forks: 43

alibaba/feathub

FeatHub - A stream-batch unified feature store for real-time machine learning

Language: Python - Size: 4.56 MB - Last synced at: 19 days ago - Pushed at: over 1 year ago - Stars: 338 - Forks: 57

Data-Centric-AI-Community/awesome-data-centric-ai

Open-Source Software, Tutorials, and Research on Data-Centric AI 🤖

Language: Jupyter Notebook - Size: 6.73 MB - Last synced at: 8 days ago - Pushed at: over 1 year ago - Stars: 338 - Forks: 47

data-drift/data-drift

Metrics Observability & Troubleshooting

Language: HTML - Size: 11.7 MB - Last synced at: 11 days ago - Pushed at: over 1 year ago - Stars: 323 - Forks: 12

databrickslabs/dqx

Databricks framework to validate the data quality of PySpark DataFrames

Language: Python - Size: 2.95 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 309 - Forks: 59

posit-dev/pointblank

Data validation made beautiful and powerful

Language: Python - Size: 160 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 272 - Forks: 19

adidas/lakehouse-engine

The Lakehouse Engine is a configuration driven Spark framework, written in Python, serving as a scalable and distributed engine for several lakehouse algorithms, data flows and utilities for Data Products.

Language: Python - Size: 8.79 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 256 - Forks: 45

ubisoft/mobydq

🐳 Tool to automate data quality checks on data pipelines

Language: Vue - Size: 188 MB - Last synced at: 19 days ago - Pushed at: almost 3 years ago - Stars: 255 - Forks: 62

GAIR-NLP/ProX

[ICML 2025] Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale

Language: Python - Size: 15.1 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 248 - Forks: 17

BdR76/CSVLint

CSV Lint plug-in for Notepad++ for syntax highlighting, csv validation, automatic column and datatype detecting, fixed width datasets, change datetime format, decimal separator, sort data, count unique values, convert to xml, json, sql etc. A plugin for data cleaning and working with messy data files.

Language: C# - Size: 13.3 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 205 - Forks: 18

canimus/cuallee

Possibly the fastest DataFrame-agnostic quality check library in town.

Language: Python - Size: 2.28 MB - Last synced at: 3 days ago - Pushed at: 6 days ago - Stars: 202 - Forks: 20

frederick0329/TracIn

Implementation of Estimating Training Data Influence by Tracing Gradient Descent (NeurIPS 2020)

Language: Jupyter Notebook - Size: 98 MB - Last synced at: almost 2 years ago - Pushed at: over 3 years ago - Stars: 196 - Forks: 15

whylabs/whylogs-java 📦

Profile and monitor your ML data pipeline end-to-end

Language: Java - Size: 5.95 MB - Last synced at: 8 months ago - Pushed at: almost 4 years ago - Stars: 178 - Forks: 7

Hyhyhyhyhyhyh/Django-Data-quality-system

Data governance and data quality checking/monitoring platform (Django + jQuery + MySQL)

Language: Python - Size: 19 MB - Last synced at: almost 2 years ago - Pushed at: over 2 years ago - Stars: 175 - Forks: 73

astronomer/airflow-provider-great-expectations

Great Expectations Airflow operator

Language: Python - Size: 2.55 MB - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 168 - Forks: 58

OHDSI/DataQualityDashboard

A tool to help improve data quality standards in observational data science.

Language: JavaScript - Size: 14.9 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 165 - Forks: 102

dqops/dqo

Data Quality and Observability platform for the whole data lifecycle, from profiling new data sources to full automation with Data Observability. Configure data quality checks from the UI or in YAML files, let DQOps run the data quality checks daily to detect data quality issues.

Language: Java - Size: 95.4 MB - Last synced at: 8 days ago - Pushed at: about 1 month ago - Stars: 161 - Forks: 35

atrocore/atrocore

AtroCore is an open-source Data Platform, Data Management and Master Data Management (MDM) software, which can be used to quickly create any business application.

Language: JavaScript - Size: 115 MB - Last synced at: 4 days ago - Pushed at: 5 days ago - Stars: 160 - Forks: 45

AKSW/RDFUnit

An RDF Unit Testing Suite

Language: Java - Size: 8.54 MB - Last synced at: about 1 month ago - Pushed at: almost 2 years ago - Stars: 159 - Forks: 41

aai-institute/pyDVL

pyDVL is a library of stable implementations of algorithms for data valuation and influence function computation

Language: Python - Size: 436 MB - Last synced at: 2 days ago - Pushed at: 4 months ago - Stars: 136 - Forks: 7

DataKitchen/data-observability-installer

Installer for DataKitchen's Open Source Data Observability Products. Data breaks. Servers break. Your toolchain breaks. Ensure your team is the first to know and the first to solve with visibility across and down your data estate. Save time with simple, fast data quality test generation and execution. Trust your data, tools, and systems end to end.

Language: Python - Size: 211 KB - Last synced at: 27 days ago - Pushed at: 27 days ago - Stars: 124 - Forks: 12

Seddryck/NBi

NBi is a testing framework (add-on to NUnit) for Business Intelligence and Data Access. The main goal of this framework is to let users create tests with a declarative approach based on an XML syntax. With NBi, you don't need to develop C# or Java code to specify your tests, nor do you need Visual Studio or Eclipse to compile your test suite. Just create an XML file and let the framework interpret it and run your tests. The framework is designed as an add-on to NUnit, but it can easily be ported to other testing frameworks.

Language: C# - Size: 15.7 MB - Last synced at: 6 days ago - Pushed at: 4 months ago - Stars: 108 - Forks: 37

re-data/dbt-re-data

re_data - fix data issues before your users & CEO would discover them 😊

Language: Python - Size: 4.12 MB - Last synced at: 7 days ago - Pushed at: over 1 year ago - Stars: 99 - Forks: 41

evidentlyai/ml_observability_course

Free Open-source ML observability course for data scientists and ML engineers. Learn how to monitor and debug your ML models in production.

Language: Jupyter Notebook - Size: 25.2 MB - Last synced at: about 1 month ago - Pushed at: over 1 year ago - Stars: 92 - Forks: 32

Data-Centric-AI-Community/awesome-python-for-data-science

A curated list of awesome resources such as books, tutorials, courses, open-source libraries, exercises, and other materials that support Pythonistas in the making, and Pythonistas migrating into Data Science! 📊

Language: Jupyter Notebook - Size: 51.8 MB - Last synced at: 9 days ago - Pushed at: over 1 year ago - Stars: 89 - Forks: 19

aws-samples/amazon-deequ-glue

Automated data quality suggestions and analysis with Deequ on AWS Glue

Language: Scala - Size: 2.1 MB - Last synced at: about 1 month ago - Pushed at: over 2 years ago - Stars: 86 - Forks: 24

Swiple/swiple

Swiple enables you to easily observe, understand, validate and improve the quality of your data

Language: Python - Size: 201 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 84 - Forks: 11

awesome-mlops/awesome-ml-monitoring

A curated list of awesome open source tools and commercial products for monitoring data quality, monitoring model performance, and profiling data 🚀

Size: 4.88 KB - Last synced at: 4 days ago - Pushed at: over 1 year ago - Stars: 84 - Forks: 8

SJTU-DMTai/awesome-ml-data-quality-papers

Papers about training data quality management for ML models.

Size: 1.07 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 83 - Forks: 6

GClunies/reflekt

Define, govern, and model event data for warehouse-first product analytics.

Language: Python - Size: 5.84 MB - Last synced at: 13 days ago - Pushed at: about 1 year ago - Stars: 83 - Forks: 4

great-expectations/great_expectations_action

A GitHub Action that makes it easy to use Great Expectations to validate your data pipelines in your CI workflows.

Language: Jupyter Notebook - Size: 25.5 MB - Last synced at: 9 days ago - Pushed at: over 1 year ago - Stars: 81 - Forks: 13

socialpoint-labs/sqlbucket 📦

Lightweight library to write, orchestrate and test your SQL ETL. Writing ETL with data integrity in mind.

Language: Python - Size: 463 KB - Last synced at: 10 days ago - Pushed at: over 1 year ago - Stars: 73 - Forks: 8

kevinadhiguna/dqlab-career-track

A collection of scripts written to complete DQLab Data Analyst Career Track 📊

Language: Python - Size: 5.5 MB - Last synced at: 5 months ago - Pushed at: almost 3 years ago - Stars: 72 - Forks: 49

Impetus/jumbune

Jumbune, an open source BigData APM & Data Quality Management Platform for Data Clouds. Enterprise feature offering is available at http://jumbune.com. More details of open source offering are at,

Language: Java - Size: 31.7 MB - Last synced at: 5 months ago - Pushed at: over 2 years ago - Stars: 70 - Forks: 32

data-catering/data-caterer Fork of pflooky/data-caterer

Test data management tool for any data source, batch or real-time. Generate, validate and clean up data all in one tool.

Language: Scala - Size: 47.6 MB - Last synced at: 6 days ago - Pushed at: 17 days ago - Stars: 68 - Forks: 8

sodadata/soda-spark

Soda Spark is a PySpark library that helps you with testing your data in Spark Dataframes

Language: Python - Size: 118 KB - Last synced at: 2 days ago - Pushed at: about 3 years ago - Stars: 64 - Forks: 8

sanzujinwu/sanzuwu

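The Sanzuwu data middle platform integrates data planning, data ingestion, data development, data warehousing, data governance, data assets, data services, data operations, and system administration modules into one. It breaks down data barriers, solves the data-silo problem, enables low-code visual data development, and supports the digital transformation of governments and enterprises.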

Size: 51.5 MB - Last synced at: 7 months ago - Pushed at: 7 months ago - Stars: 63 - Forks: 0

DataKitchen/dataops-testgen

DataOps Data Quality TestGen is part of DataKitchen's Open Source Data Observability. DataOps TestGen delivers simple, fast data quality test generation and execution through data profiling, new dataset hygiene review, AI generation of data quality validation tests, ongoing testing of data refreshes, and continuous anomaly monitoring.

Language: Python - Size: 6.69 MB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 61 - Forks: 3

ucd-dnp/leila

Library for data quality evaluation and interaction with the datos.gov.co portal

Language: Jupyter Notebook - Size: 29.7 MB - Last synced at: 7 days ago - Pushed at: about 1 month ago - Stars: 61 - Forks: 22

bufbuild/bufstream-demo

A demo of Bufstream, a drop-in replacement for Apache Kafka that's 8x less expensive to operate and brings broker-side schema awareness to Kafka

Language: Go - Size: 162 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 59 - Forks: 9

provectus/data-quality-gate

Data Quality Gate based on AWS

Language: Python - Size: 24 MB - Last synced at: 19 days ago - Pushed at: about 1 year ago - Stars: 57 - Forks: 5

anerv/BikeDNA

BikeDNA: Bicycle Infrastructure Data & Network Assessment

Language: Jupyter Notebook - Size: 584 MB - Last synced at: 3 days ago - Pushed at: about 1 year ago - Stars: 56 - Forks: 4

sparkdq-community/sparkdq

A declarative PySpark framework for row- and aggregate-level data quality validation.

Language: Python - Size: 6.96 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 52 - Forks: 6

KentHsu/Udacity-Data-Engineering-Nanodgree

Udacity Data Engineering Nanodegree Program

Language: Jupyter Notebook - Size: 2.12 MB - Last synced at: 5 months ago - Pushed at: over 4 years ago - Stars: 52 - Forks: 59

davidberenstein1957/dataset-viber

Dataset Viber is your chill repo for data collection, annotation and vibe checks.

Language: Python - Size: 1.3 MB - Last synced at: 22 days ago - Pushed at: about 1 year ago - Stars: 47 - Forks: 12

astutic/Acharya

A Data Centric NER annotation tool for your Named Entity Recognition projects

Size: 11.3 MB - Last synced at: 9 months ago - Pushed at: over 1 year ago - Stars: 45 - Forks: 3

GIScience/ohsome-quality-api

Data quality estimations for OpenStreetMap

Language: Python - Size: 21.9 MB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 42 - Forks: 7

ropensci/daiquiri

Data quality reporting for temporal datasets.

Language: R - Size: 31 MB - Last synced at: about 20 hours ago - Pushed at: 2 months ago - Stars: 38 - Forks: 3

tkleykamp/DataOps

DataOps for Government

Size: 76.2 KB - Last synced at: about 1 year ago - Pushed at: almost 7 years ago - Stars: 34 - Forks: 9

cleanlab/cleanlab-studio

Client interface to Cleanlab Studio and the Trustworthy Language Model

Language: Python - Size: 3.52 MB - Last synced at: 3 months ago - Pushed at: 7 months ago - Stars: 32 - Forks: 9

mikulskibartosz/check-engine 📦

Data validation library for PySpark 3.0.0

Language: Python - Size: 94.7 KB - Last synced at: about 2 years ago - Pushed at: almost 3 years ago - Stars: 32 - Forks: 4

emilyriederer/convo

R package based on "Column Names as Contracts" blog post (https://emilyriederer.netlify.app/post/column-name-contracts/)

Language: R - Size: 1.2 MB - Last synced at: 7 days ago - Pushed at: almost 4 years ago - Stars: 31 - Forks: 1

mfcabrera/hooqu

hooqu is a library built on top of Pandas-like Dataframes for defining "unit tests for data". This is a spiritual port of Apache Deequ to Python

Language: Python - Size: 209 KB - Last synced at: 21 days ago - Pushed at: 9 months ago - Stars: 29 - Forks: 1

bolcom/hive_compared_bq

hive_compared_bq compares/validates 2 (SQL like) tables, and graphically shows the rows/columns that are different.

Language: Python - Size: 229 KB - Last synced at: 25 days ago - Pushed at: over 7 years ago - Stars: 28 - Forks: 9

FIIT-IAU/IAU-course

Intelligent Data Analysis (IAU_B) @ FIIT STU in Bratislava

Language: Jupyter Notebook - Size: 61.5 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 27 - Forks: 7

aws-samples/monitoring-apache-iceberg-table-metadata-layer

Sample code to collect Apache Iceberg metrics for table monitoring

Language: Python - Size: 787 KB - Last synced at: 3 months ago - Pushed at: about 1 year ago - Stars: 27 - Forks: 4

mrpowers-io/tsumugi-spark

SparkConnect Server plugin and protobuf messages for the Amazon Deequ Data Quality Engine.

Language: Python - Size: 1.64 MB - Last synced at: 2 months ago - Pushed at: 7 months ago - Stars: 26 - Forks: 6

google/dqm 📦

A simple platform dedicated to data quality issues detection, especially in the context of online advertising.

Language: Python - Size: 2.4 MB - Last synced at: 5 days ago - Pushed at: about 1 year ago - Stars: 26 - Forks: 4

andre-salvati/databricks-template

A production-ready PySpark project template with medallion architecture, Python packaging, unit tests, integration tests, CI/CD automation, Databricks Asset Bundles, and DQX data quality framework.

Language: Python - Size: 1.49 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 24 - Forks: 7

timgent/data-flare

Data quality control tool built on Spark and Deequ

Language: Scala - Size: 2.59 MB - Last synced at: 5 months ago - Pushed at: 6 months ago - Stars: 24 - Forks: 10

PEDSnet/Data-Quality-Analysis

The PEDSnet Data Quality Assessment Toolkit (OMOP CDM)

Language: R - Size: 1.11 MB - Last synced at: over 1 year ago - Pushed at: over 4 years ago - Stars: 24 - Forks: 6

bitol-io/open-data-product-standard

Home of the Open Data Product Standard (ODPS).

Language: Shell - Size: 968 KB - Last synced at: 26 days ago - Pushed at: 26 days ago - Stars: 23 - Forks: 1

aivanzhang/panda_patrol

Language: Python - Size: 33.2 MB - Last synced at: about 7 hours ago - Pushed at: about 1 year ago - Stars: 23 - Forks: 0

Oslandia/osm-data-classification 📦

Migrated to: https://gitlab.com/Oslandia/osm-data-classification

Language: Python - Size: 108 MB - Last synced at: over 2 years ago - Pushed at: almost 6 years ago - Stars: 23 - Forks: 7

scienxlab/redflag

Safety net for machine learning pipelines. Plays nice with sklearn and pandas.

Language: Python - Size: 10.7 MB - Last synced at: 6 days ago - Pushed at: over 1 year ago - Stars: 21 - Forks: 6

DP6/penguin-datalayer-collect

A data layer quality monitoring and validation module, this solution is part of the Raft Suite ecosystem.

Language: HCL - Size: 2.07 MB - Last synced at: about 10 hours ago - Pushed at: over 2 years ago - Stars: 21 - Forks: 4

realdatadriven/etlx

This project is an ETL / ELT Framework powered by DuckDB, designed to seamlessly integrate and process data from diverse sources. It leverages Markdown as a configuration medium, where YAML blocks define metadata for each data source, and embedded SQL blocks specify the extraction, transformation, and loading logic.

Language: Go - Size: 3.03 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 20 - Forks: 2

Azure/Verified-Telemetry 📦

Azure Verified Telemetry for IoT is a state-of-the-art solution to seamlessly determine the health of the sensor in real-time.

Language: C - Size: 5.8 MB - Last synced at: 2 days ago - Pushed at: over 1 year ago - Stars: 20 - Forks: 6

ismaildawoodjee/GreatEx

A project for exploring how Great Expectations can be used to ensure data quality and validate batches within a data pipeline defined in Airflow.

Language: Python - Size: 1.73 MB - Last synced at: over 1 year ago - Pushed at: about 3 years ago - Stars: 20 - Forks: 6

lukasvermeer/srm

The SRM Checker Chrome extension is no longer being maintained and has been retired. The website and related documentation will remain available for reference, but the extension itself will no longer be updated, supported, or guaranteed to work.

Language: JavaScript - Size: 1.85 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 19 - Forks: 11

Related Topics
data-engineering (79), python (79), data-science (79), data-quality-checks (51), machine-learning (50), data-validation (35), data-quality-monitoring (35), spark (34), data-profiling (34), data (33), sql (32), data-analysis (31), data-cleaning (28), pandas (26), pyspark (25), data-visualization (25), etl (24), data-governance (24), dataquality (20), mlops (19), dbt (19), data-observability (18), data-quality-assessment (16), data-centric-ai (16), deep-learning (15), docker (15), data-testing (15), validation (14), analytics (14), data-management (14), exploratory-data-analysis (14), data-pipeline (14), r (13), eda (13), big-data (12), database (12), postgresql (11), metadata (11), jupyter-notebook (11), great-expectations (11), data-transformation (10), dataops (10), hacktoberfest (10), data-wrangling (10), data-quality-measurement (10), data-monitoring (10), snowflake (10), data-reliability (10), apache-spark (9), natural-language-processing (9), computer-vision (9), llm (8), databricks (8), bigquery (8), airflow (8), data-catalog (8), monitoring (8), data-contracts (8), etl-pipeline (8), javascript (8), data-quality-report (8), feature-engineering (7), data-curation (7), dataset (7), streamlit (7), python3 (7), quality (7), open-source (7), ai (7), data-warehouse (7), data-lineage (7), pipeline (7), observability (6), synthetic-data (6), data-quality-framework (6), artificial-intelligence (6), datasets (6), statistics (6), tabular-data (6), data-exploration (6), gcp (6), csv (6), duckdb (6), data-processing (6), visualization (6), data-unit-tests (5), business-intelligence (5), data-centric (5), llms (5), dataframes (5), ml (5), typescript (5), openstreetmap (5), data-preparation (5), missing-data (5), mysql (5), dbt-packages (4), redshift (4), data-modeling (4), datatesting (4)