An open API service providing repository metadata for many open source software ecosystems.

GitHub topics: data-profiling

ydataai/ydata-profiling

1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.

Language: Python - Size: 840 MB - Last synced at: about 17 hours ago - Pushed at: 4 days ago - Stars: 12,893 - Forks: 1,715

zaw-may/Supply-Chain-Analysis

Optimizing and Forecasting Supply Chain Performance

Language: HTML - Size: 2.76 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 0 - Forks: 0

great-expectations/great_expectations

Always know what to expect from your data.

Language: Python - Size: 218 MB - Last synced at: about 21 hours ago - Pushed at: about 21 hours ago - Stars: 10,379 - Forks: 1,577

dqops/dqo

Data Quality and Observability platform for the whole data lifecycle, from profiling new data sources to full automation with Data Observability. Configure data quality checks from the UI or in YAML files, and let DQOps run them daily to detect data quality issues.

Language: Java - Size: 91 MB - Last synced at: about 23 hours ago - Pushed at: 2 days ago - Stars: 147 - Forks: 28

open-metadata/OpenMetadata

OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance powered by a central metadata repository, in-depth column level lineage, and seamless team collaboration.

Language: TypeScript - Size: 1.8 GB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 6,619 - Forks: 1,216

cleanlab/cleanlab

The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.

Language: Python - Size: 11.5 MB - Last synced at: 1 day ago - Pushed at: about 1 month ago - Stars: 10,518 - Forks: 826

polyaxon/haupt

Lineage metadata API, artifacts streams, sandbox, API, and spaces for Polyaxon

Language: Python - Size: 1.14 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 453 - Forks: 209

Desbordante/desbordante-core

Desbordante is a high-performance data profiler that is capable of discovering many different patterns in data using various algorithms. It also allows running data cleaning scenarios using these algorithms. Desbordante has a console version and an easy-to-use web application.

Language: C++ - Size: 143 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 401 - Forks: 76

sodadata/soda-core

⚡ Data quality testing for the modern data stack (SQL, Spark, and Pandas) https://www.soda.io

Language: Python - Size: 3.87 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 2,084 - Forks: 234

DataKitchen/data-observability-installer

Installer for DataKitchen's Open Source Data Observability Products. Data breaks. Servers break. Your toolchain breaks. Ensure your team is the first to know and the first to solve with visibility across and down your data estate. Save time with simple, fast data quality test generation and execution. Trust your data, tools, and systems end to end.

Language: Python - Size: 358 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 117 - Forks: 10

open-metadata/openmetadata-site

Open Standard for Metadata. A Single place to Discover, Collaborate and Get your data right.

Language: TypeScript - Size: 54.6 MB - Last synced at: 6 days ago - Pushed at: 7 days ago - Stars: 14 - Forks: 11

hi-primus/optimus

🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark

Language: Python - Size: 110 MB - Last synced at: 3 days ago - Pushed at: 5 months ago - Stars: 1,508 - Forks: 232

fbdesignpro/sweetviz

Visualize and compare datasets, target values and associations, with one line of code.

Language: Python - Size: 15.3 MB - Last synced at: 3 days ago - Pushed at: 9 months ago - Stars: 3,009 - Forks: 285

databrickslabs/dqx

Databricks framework to validate Data Quality of pySpark DataFrames

Language: Python - Size: 1.9 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 252 - Forks: 35

polyaxon/traceml

Engine for ML/Data tracking, visualization, explainability, drift detection, and dashboards for Polyaxon.

Language: Python - Size: 118 MB - Last synced at: 22 days ago - Pushed at: 22 days ago - Stars: 515 - Forks: 44

datavane/datavines

Know your data better! Datavines is a next-gen data observability platform that supports metadata management and data quality.

Language: Java - Size: 22.4 MB - Last synced at: 24 days ago - Pushed at: 25 days ago - Stars: 607 - Forks: 180

opendatadiscovery/odd-platform

First open-source data discovery and observability platform. We make life easy for data practitioners so you can focus on your business.

Language: Java - Size: 27.9 MB - Last synced at: about 1 month ago - Pushed at: 3 months ago - Stars: 1,306 - Forks: 122

cleanlab/cleanvision

Automatically find issues in image datasets and practice data-centric computer vision.

Language: Python - Size: 2.12 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1,068 - Forks: 73

HansMeershoek/pytics

Language: HTML - Size: 1.48 MB - Last synced at: 5 days ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

tsegall/fta

Metadata/data identification Java library. Identifies Semantic Type information (e.g. Gender, Age, Color, Country,...). Extensive country/language support. Extensible via user-defined plugins. Comprehensive Profiling support.

Language: Java - Size: 6.9 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 28 - Forks: 3

ing-bank/popmon

Monitor the stability of a Pandas or Spark dataframe ⚙︎

Language: Python - Size: 5.79 MB - Last synced at: about 1 month ago - Pushed at: 4 months ago - Stars: 500 - Forks: 36

dm4ml/gate

Drift detection module for machine learning pipelines.

Language: Python - Size: 1.46 MB - Last synced at: 18 days ago - Pushed at: almost 2 years ago - Stars: 23 - Forks: 2

avrtt/gnomych 📦

Cleans and validates raw data against predefined rules

Language: Python - Size: 12.7 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 2 - Forks: 0

InfuseAI/piperider

Code review for data in dbt

Language: Python - Size: 32.6 MB - Last synced at: 26 days ago - Pushed at: 4 months ago - Stars: 487 - Forks: 23

cleanlab/cleanlab-studio

Client interface to Cleanlab Studio and the Trustworthy Language Model

Language: Python - Size: 3.52 MB - Last synced at: about 1 month ago - Pushed at: 3 months ago - Stars: 30 - Forks: 8

ray310/Panda-Helper

Panda-Helper is a simple, open-source, Python data-profiling utility for Pandas' DataFrames and Series.

Language: Python - Size: 779 KB - Last synced at: about 19 hours ago - Pushed at: about 2 months ago - Stars: 3 - Forks: 1

Swiple/swiple

Swiple enables you to easily observe, understand, validate and improve the quality of your data

Language: Python - Size: 180 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 82 - Forks: 11

VIDA-NYU/auctus

Dataset search engine, discovering data from a variety of sources, profiling it, and allowing advanced queries on the index

Language: Python - Size: 10.4 MB - Last synced at: about 1 month ago - Pushed at: over 1 year ago - Stars: 43 - Forks: 9

hi-primus/bumblebee

🚕 A spreadsheet-like data preparation web app that works over Optimus (Pandas, Dask, cuDF, Dask-cuDF, Spark and Vaex)

Language: Vue - Size: 23 MB - Last synced at: 12 days ago - Pushed at: almost 2 years ago - Stars: 141 - Forks: 35

FritscheLab/whiteRRabbit

whiteRRabbit: An R-based data profiling tool for efficiently scanning large CSV/TSV files to generate comprehensive summary statistics and data quality metrics, inspired by OHDSI WhiteRabbit.

Language: R - Size: 69.3 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

darenasc/auto-fes

Automated exploration of files in a folder structure to extract metadata and potential usage of information.

Language: Python - Size: 676 KB - Last synced at: about 1 month ago - Pushed at: 5 months ago - Stars: 7 - Forks: 0

IMAbril/RENIS

Language: Jupyter Notebook - Size: 4.18 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

CoDS-GCS/kglids

Linked Data Science powered by Knowledge Graphs

Language: Python - Size: 156 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 10 - Forks: 1

sidkris/megaprofiler

A Python library for automatic data profiling and validation

Language: Python - Size: 148 MB - Last synced at: 5 days ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

opendatadiscovery/odd-collector 📦

Open-source metadata collector based on ODD Specification

Language: Python - Size: 1.96 MB - Last synced at: about 2 months ago - Pushed at: over 1 year ago - Stars: 43 - Forks: 13

mzj14/function-dependency-exploration

Homework for exploring function dependencies in data sets

Language: Python - Size: 547 KB - Last synced at: about 2 months ago - Pushed at: about 8 years ago - Stars: 0 - Forks: 0

raymon-ai/raymon

The official http://raymon.ai data profiling and logging library.

Language: Python - Size: 78.7 MB - Last synced at: 23 days ago - Pushed at: about 3 years ago - Stars: 18 - Forks: 1

analyst-amitbisht/ydata-profiling

This repository showcases my learning process of automating EDA using 'ydata-profiling'

Language: HTML - Size: 812 KB - Last synced at: about 2 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0

mwill5/quick_data_profiling

A comprehensive tool for quick data profiling and exploratory data analysis.

Language: HTML - Size: 711 KB - Last synced at: 7 months ago - Pushed at: 7 months ago - Stars: 1 - Forks: 0

TechnikInterlytics/VerityExamples

Data file examples and user guides for VerityPy and VerityDotNet libraries

Language: HTML - Size: 3.84 MB - Last synced at: 7 months ago - Pushed at: 7 months ago - Stars: 0 - Forks: 0

ahmadassaf/roomba

A Node.js tool to examine the correctness of Open Data Metadata and build custom dataset profiles

Language: JavaScript - Size: 3.36 MB - Last synced at: about 1 month ago - Pushed at: over 1 year ago - Stars: 12 - Forks: 3

hpcc-systems/DataPatterns

HPCC Systems ECL bundle that provides some basic data profiling and research tools to an ECL programmer

Language: ECL - Size: 691 KB - Last synced at: 3 months ago - Pushed at: 8 months ago - Stars: 3 - Forks: 4

teejay-tems/Electronics-Retailers

This Power BI project analyzes transactional data for a global electronics retailer, focusing on sales performance, product profitability, customer behavior, and the impact of currency on revenue. It offers insights into seasonal trends, order revenue, customer locations, delivery times, and the differences between in-store and online sales.

Size: 1.95 MB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

SJTU-DMTai/awesome-ml-data-quality-papers

Papers about training data quality management for ML models.

Size: 1.08 MB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 15 - Forks: 2

pflooky/data-caterer-example

Example API implementation for Data Caterer

Language: Scala - Size: 1.83 MB - Last synced at: 4 days ago - Pushed at: over 1 year ago - Stars: 3 - Forks: 2

apicrafter/metacrafter

Metadata and data identification tool and Python library. Identifies PII, common identifiers, language specific identifiers. Fully customizable and flexible rules

Language: Python - Size: 81.7 MB - Last synced at: 9 months ago - Pushed at: 10 months ago - Stars: 43 - Forks: 6

nagapv/edexplore

A simple widget for interactive EDA / QA. Works on top of Pandas [in Jupyter Notebook] using IPyWidgets with a sprinkle of Regex.

Language: Python - Size: 1.8 MB - Last synced at: 17 days ago - Pushed at: 11 months ago - Stars: 0 - Forks: 0

LieseB-1746743/data-cleaning

Data cleaning tool.

Language: JavaScript - Size: 1.81 MB - Last synced at: 11 months ago - Pushed at: about 4 years ago - Stars: 9 - Forks: 5

hadarsharon/compars

DataFrame comparison done right, powered by Rust with polars (AKA the bear-agnostic 🐻 🐼 🐨 🐻‍❄️ DataFrame comparison library)

Language: Python - Size: 36.1 KB - Last synced at: 28 days ago - Pushed at: 12 months ago - Stars: 0 - Forks: 0

psebenick/data-profiling

A set of scripts to pull metadata and data profiling metrics from relational database systems

Language: Python - Size: 44.9 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 67 - Forks: 19

jmakeig/data-profile

Sandbox to test out ideas for profiling document data

Language: JavaScript - Size: 122 KB - Last synced at: about 1 year ago - Pushed at: about 2 years ago - Stars: 0 - Forks: 0

Jakob-L-M/partial-SPIDER

A partialized version of the SPIDER Algorithm for inclusion dependency discovery

Language: Java - Size: 240 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

CCFGomes/Yelp-InsightX

Yelp InsightX is a comprehensive analysis conducted on the Yelp dataset, focusing on profiling, understanding, and deriving insights from the data.

Size: 13.7 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

mtna/rds-r

R package to simplify the usage of the RDS REST API and provide convenience in accessing data and metadata.

Language: R - Size: 1.41 MB - Last synced at: about 1 year ago - Pushed at: almost 2 years ago - Stars: 4 - Forks: 0

bodybuilders-team/ist-meic-cd-g03

Data Science project of group 03 - MEIC @ IST 2023/2024.

Language: Python - Size: 146 MB - Last synced at: about 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

ismaildawoodjee/GreatEx

A project for exploring how Great Expectations can be used to ensure data quality and validate batches within a data pipeline defined in Airflow.

Language: Python - Size: 1.73 MB - Last synced at: about 1 year ago - Pushed at: over 2 years ago - Stars: 20 - Forks: 6

aparna-chavan/New-York-City-Food-Inspections

Language: TSQL - Size: 12.3 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

opendatadiscovery/odd-collector-gcp 📦

Open-source GCP metadata collector based on ODD Specification

Language: Python - Size: 188 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 4 - Forks: 0

SebastianSchmidl/distod

DISTOD algorithm: Distributed discovery of bidirectional order dependencies

Language: Scala - Size: 17.9 MB - Last synced at: about 1 month ago - Pushed at: about 3 years ago - Stars: 9 - Forks: 2

Divya171997/Big-Data-for-Computational-Finance-Forex_Exchange_Rate

Analysis of a forex exchange rate dataset covering its history over a period of time: time series analysis; data cleansing and transformation of the dataset into the structure required for time series analysis and machine learning; visualization of the dataset based on that analysis; and application of the ARIMA model, a machine learning algorithm, to forecast future values.

Language: Jupyter Notebook - Size: 2.22 MB - Last synced at: over 1 year ago - Pushed at: almost 5 years ago - Stars: 0 - Forks: 0

VIDA-NYU/sato Fork of megagonlabs/sato

Fork of Sato for easy deployment as a Python package

Language: Python - Size: 32.6 MB - Last synced at: over 1 year ago - Pushed at: over 4 years ago - Stars: 0 - Forks: 0

dartwinshu/dibimbing-python-for-data-analysis

Course of business intelligence bootcamp by Dibimbing

Language: Jupyter Notebook - Size: 1.19 MB - Last synced at: over 1 year ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

statsim/profile

Profile. Generate data profiles in the browser (work in progress)

Language: JavaScript - Size: 1.27 MB - Last synced at: 3 months ago - Pushed at: over 1 year ago - Stars: 3 - Forks: 0

christianbors/OpenRefineQualityMetrics

MetricDoc is an interactive visual exploration environment for assessing data quality

Language: JavaScript - Size: 5.83 MB - Last synced at: about 1 month ago - Pushed at: about 5 years ago - Stars: 8 - Forks: 1

Marini97/Functional-Dependency

Project for the topic of Data Profiling for the course Advanced Topics in Computer Science.

Language: Python - Size: 94.2 MB - Last synced at: almost 2 years ago - Pushed at: about 2 years ago - Stars: 1 - Forks: 0

baligoyem/dataqtor

🔍Your Data Quality Detector / Gain insight into your data and get it ready for use before you start working with it 💡📊🛠💎

Language: Python - Size: 9.43 MB - Last synced at: almost 2 years ago - Pushed at: over 2 years ago - Stars: 14 - Forks: 6

xxl4tomxu98/data-engineering-python-great-expectations

Demo on Data Engineering using Great Expectations API

Language: Jupyter Notebook - Size: 6 MB - Last synced at: about 2 years ago - Pushed at: over 3 years ago - Stars: 4 - Forks: 0

wisemuffin/data-cicd-tpch

Language: Python - Size: 710 KB - Last synced at: about 2 years ago - Pushed at: about 3 years ago - Stars: 8 - Forks: 0

6point6/bytefreq Fork of minkymorgan/bytefreq

a nix DataProfiler for deep analysis of data quality on tabular files

Language: Awk - Size: 12.6 MB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 1 - Forks: 3

samapatil1729/Tools-In-Data-Science

This course will teach students to use popular tools for sourcing data, transforming it, building and optimizing models, communicating these as visual stories, and deploying them in production.

Language: HTML - Size: 25.1 MB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 2 - Forks: 0

giagiannis/data-profiler

Data profiler is an attempt to model the behavior of a given operator for a set of datasets.

Language: Go - Size: 1.48 MB - Last synced at: 11 months ago - Pushed at: over 6 years ago - Stars: 3 - Forks: 1

gandalf1819/NYCOpenData-Profiling-Analysis

Open Data Profiling, Quality and Analysis on NYC OpenData dataset with semantic profiling using fuzzy ratio, Levenshtein distance and regex

Language: Jupyter Notebook - Size: 17.9 MB - Last synced at: about 2 years ago - Pushed at: over 4 years ago - Stars: 6 - Forks: 4

bballamudi/deequ Fork of awslabs/deequ

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.

Size: 68.9 MB - Last synced at: almost 2 years ago - Pushed at: over 5 years ago - Stars: 0 - Forks: 0

danielsobrado/dockerizations

My Dockerizations

Language: Dockerfile - Size: 8.79 KB - Last synced at: about 2 years ago - Pushed at: almost 3 years ago - Stars: 0 - Forks: 0

dagpunk/rimhistory

RimWorld game save data analyzer

Language: Python - Size: 9.03 MB - Last synced at: 6 days ago - Pushed at: 10 months ago - Stars: 2 - Forks: 0

mtna/rds-js

Simplify usage of the RDS API for TypeScript/JavaScript developers

Language: TypeScript - Size: 1.93 MB - Last synced at: 21 days ago - Pushed at: 8 months ago - Stars: 4 - Forks: 1

p-disha/NYC-Open-Dataset-Analysis

Identified data types for each distinct column value on 1900 data sets. For each column, summarized the semantic types present in the column, using fuzzy logic and Levenshtein distance. Identified the 3 most frequent 311 complaint types by borough and derived inferences from them.

Language: Python - Size: 25.4 KB - Last synced at: 27 days ago - Pushed at: about 5 years ago - Stars: 1 - Forks: 0

amogh9594/r-language

Language: HTML - Size: 1020 KB - Last synced at: almost 2 years ago - Pushed at: over 3 years ago - Stars: 0 - Forks: 0

dikoharyadhanto/Data-Analyst-Capstone-Project-Coursera

Data Analyst Capstone Project in Coursera

Language: Jupyter Notebook - Size: 2.67 MB - Last synced at: almost 2 years ago - Pushed at: over 3 years ago - Stars: 0 - Forks: 3

rounayak/Data-Profiling-Tool

The program compares two files at a time and does the following: 1. Gathers metadata on the individual tables (column count, record count, list of columns with data types, etc.). 2. Identifies matching columns between tables based on names as well as data, using machine learning to handle syntactic as well as semantic variations of column names for accurate matching. 3. Finds duplicate columns in a single table, with the option to deduplicate if required. 4. Finds columns with missing data or null values.

Language: Python - Size: 1.95 KB - Last synced at: 4 months ago - Pushed at: about 7 years ago - Stars: 3 - Forks: 0

mtna/rds-js-examples

TypeScript/JavaScript example code using the RDS API

Language: TypeScript - Size: 7.04 MB - Last synced at: about 1 year ago - Pushed at: over 2 years ago - Stars: 3 - Forks: 3

bballamudi/Optimus Fork of hi-primus/optimus

🚚 Agile Data Science Workflows made easy with PySpark

Size: 71.2 MB - Last synced at: almost 2 years ago - Pushed at: over 5 years ago - Stars: 0 - Forks: 0

arghyab0/FIFA19DreamTeam

An R Notebook to perform basic data profiling and exploratory data analysis on the FIFA19 players dataset and create a dream team of the top 11 players considering various player attributes.

Language: HTML - Size: 3.1 MB - Last synced at: about 2 years ago - Pushed at: almost 4 years ago - Stars: 0 - Forks: 0

b-feldmann/hitucc

Distributable UCC Discovery Algorithm based on Akka

Language: Java - Size: 352 KB - Last synced at: about 2 years ago - Pushed at: over 5 years ago - Stars: 1 - Forks: 0

bballamudi/great_expectations Fork of great-expectations/great_expectations

Always know what to expect from your data.

Size: 19.3 MB - Last synced at: almost 2 years ago - Pushed at: over 5 years ago - Stars: 0 - Forks: 0