Ecosyste.ms: Repos

An open API service providing repository metadata for many open source software ecosystems.

GitHub topics: data-profiling

dqops/dqo

Data Quality and Observability platform for the whole data lifecycle, from profiling new data sources to full automation with Data Observability. Configure data quality checks from the UI or in YAML files, let DQOps run the data quality checks daily to detect data quality issues.

Language: Java - Size: 73.3 MB - Last synced: about 11 hours ago - Pushed: about 14 hours ago - Stars: 57 - Forks: 12

tsegall/fta

Metadata/data identification Java library. Identifies Semantic Type information (e.g. Gender, Age, Color, Country,...). Extensive country/language support. Extensible via user-defined plugins. Comprehensive Profiling support.

Language: Java - Size: 6.81 MB - Last synced: about 11 hours ago - Pushed: about 14 hours ago - Stars: 22 - Forks: 2

DataKitchen/data-observability-installer

Installer for DataKitchen's Open Source Data Observability Products. Data breaks. Servers break. Your toolchain breaks. Ensure your team is the first to know and the first to solve with visibility across and down your data estate. Save time with simple, fast data quality test generation and execution. Trust your data, tools, and systems end to end.

Language: Python - Size: 231 KB - Last synced: about 14 hours ago - Pushed: about 15 hours ago - Stars: 28 - Forks: 0

ydataai/ydata-profiling

1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.

Language: Python - Size: 594 MB - Last synced: about 14 hours ago - Pushed: about 18 hours ago - Stars: 12,104 - Forks: 1,632

cleanlab/cleanlab-studio

Client interface for all things Cleanlab Studio

Language: Python - Size: 2.7 MB - Last synced: 3 days ago - Pushed: 3 days ago - Stars: 21 - Forks: 4

sodadata/soda-core

:zap: Data quality testing for the modern data stack (SQL, Spark, and Pandas) https://www.soda.io

Language: Python - Size: 2.68 MB - Last synced: 3 days ago - Pushed: 4 days ago - Stars: 1,778 - Forks: 185

cleanlab/cleanlab

The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.

Language: Python - Size: 11.1 MB - Last synced: 5 days ago - Pushed: 6 days ago - Stars: 8,814 - Forks: 677

cleanlab/cleanvision

Automatically find issues in image datasets and practice data-centric computer vision.

Language: Python - Size: 2.11 MB - Last synced: 10 days ago - Pushed: 27 days ago - Stars: 926 - Forks: 69

hadarsharon/compars

DataFrame comparison done right, powered by Rust with polars (AKA the bear-agnostic 🐻 🐼 🐨 🐻‍❄️ DataFrame comparison library)

Language: Python - Size: 36.1 KB - Last synced: 15 days ago - Pushed: 15 days ago - Stars: 0 - Forks: 0

Swiple/swiple

Swiple enables you to easily observe, understand, validate and improve the quality of your data

Language: Python - Size: 122 MB - Last synced: 19 days ago - Pushed: 19 days ago - Stars: 78 - Forks: 10

open-metadata/OpenMetadata

Open Standard for Metadata. A Single place to Discover, Collaborate and Get your data right.

Language: TypeScript - Size: 1.3 GB - Last synced: 19 days ago - Pushed: 19 days ago - Stars: 4,168 - Forks: 837

hpcc-systems/DataPatterns

HPCC Systems ECL bundle that provides some basic data profiling and research tools to an ECL programmer

Language: ECL - Size: 621 KB - Last synced: 18 days ago - Pushed: 19 days ago - Stars: 3 - Forks: 4

InfuseAI/piperider

Code review for data in dbt

Language: Python - Size: 32.6 MB - Last synced: 7 days ago - Pushed: 2 months ago - Stars: 469 - Forks: 21

fbdesignpro/sweetviz

Visualize and compare datasets, target values and associations, with one line of code.

Language: Python - Size: 15.3 MB - Last synced: 22 days ago - Pushed: 6 months ago - Stars: 2,839 - Forks: 268

polyaxon/traceml

Engine for ML/Data tracking, visualization, explainability, drift detection, and dashboards for Polyaxon.

Language: Python - Size: 118 MB - Last synced: about 1 month ago - Pushed: about 1 month ago - Stars: 492 - Forks: 43

ing-bank/popmon

Monitor the stability of a Pandas or Spark dataframe ⚙

Language: Python - Size: 5.77 MB - Last synced: about 18 hours ago - Pushed: 4 months ago - Stars: 487 - Forks: 35

open-metadata/openmetadata-site

Open Standard for Metadata. A Single place to Discover, Collaborate and Get your data right.

Language: CSS - Size: 40.5 MB - Last synced: 25 days ago - Pushed: 25 days ago - Stars: 11 - Forks: 8

SJTU-Quant/awesome-ml-data-quality-papers

Papers about training data quality management for ML models.

Size: 854 KB - Last synced: about 1 month ago - Pushed: about 1 month ago - Stars: 7 - Forks: 0

Desbordante/desbordante-core

Desbordante is a high-performance data profiler that is capable of discovering many different patterns in data using various algorithms. It also allows to run data cleaning scenarios using these algorithms. Desbordante has a console version and an easy-to-use web application.

Language: C++ - Size: 126 MB - Last synced: about 1 month ago - Pushed: about 1 month ago - Stars: 62 - Forks: 50

polyaxon/haupt

Lineage metadata API, artifacts streams, sandbox, API, and spaces for Polyaxon

Language: Python - Size: 1000 KB - Last synced: about 1 month ago - Pushed: about 1 month ago - Stars: 452 - Forks: 213

psebenick/data-profiling

a set of scripts to pull meta data and data profiling metrics from relational database systems

Language: Python - Size: 44.9 KB - Last synced: about 1 month ago - Pushed: about 1 month ago - Stars: 67 - Forks: 19

opendatadiscovery/odd-platform

First open-source data discovery and observability platform. We make a life for data practitioners easy so you can focus on your business.

Language: Java - Size: 28.1 MB - Last synced: about 1 month ago - Pushed: about 1 month ago - Stars: 1,104 - Forks: 91

great-expectations/great_expectations

Always know what to expect from your data.

Language: Python - Size: 189 MB - Last synced: about 1 month ago - Pushed: about 1 month ago - Stars: 9,420 - Forks: 1,464

jmakeig/data-profile

Sandbox to test out ideas for profiling document data

Language: JavaScript - Size: 122 KB - Last synced: about 1 month ago - Pushed: about 1 year ago - Stars: 0 - Forks: 0

opendatadiscovery/odd-collector 📦

Open-source metadata collector based on ODD Specification

Language: Python - Size: 1.96 MB - Last synced: 22 days ago - Pushed: 7 months ago - Stars: 41 - Forks: 13

hi-primus/optimus

:truck: Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark

Language: Python - Size: 110 MB - Last synced: about 1 month ago - Pushed: about 2 months ago - Stars: 1,441 - Forks: 233

Jakob-L-M/partial-SPIDER

A partialized version of the SPIDER Algorithm for inclusion dependency discovery

Language: Java - Size: 240 KB - Last synced: 15 days ago - Pushed: 16 days ago - Stars: 0 - Forks: 0

CCFGomes/Yelp-InsightX

Yelp InsightX is a comprehensive analysis conducted on the Yelp dataset, focusing on profiling, understanding, and deriving insights from the data.

Size: 13.7 KB - Last synced: about 2 months ago - Pushed: about 2 months ago - Stars: 0 - Forks: 0

apicrafter/metacrafter

Metadata and data identification tool and Python library. Identifies PII, common identifiers, language specific identifiers. Fully customizable and flexible rules

Language: Python - Size: 81.7 MB - Last synced: 9 days ago - Pushed: 3 months ago - Stars: 38 - Forks: 6

mtna/rds-r

R package to simplify the usage of the RDS REST API and provide convenience in accessing data and metadata.

Language: R - Size: 1.41 MB - Last synced: about 2 months ago - Pushed: 10 months ago - Stars: 4 - Forks: 0

bodybuilders-team/ist-meic-cd-g03

Data Science project of group 03 - MEIC @ IST 2023/2024.

Language: Python - Size: 146 MB - Last synced: about 1 month ago - Pushed: 5 months ago - Stars: 0 - Forks: 0

dm4ml/gate

Drift detection module for machine learning pipelines.

Language: Python - Size: 1.46 MB - Last synced: 18 days ago - Pushed: 11 months ago - Stars: 19 - Forks: 1

ismaildawoodjee/GreatEx

A project for exploring how Great Expectations can be used to ensure data quality and validate batches within a data pipeline defined in Airflow.

Language: Python - Size: 1.73 MB - Last synced: about 1 month ago - Pushed: over 1 year ago - Stars: 20 - Forks: 6

CoDS-GCS/kglids

Linked Data Science powered by Knowledge Graphs

Language: Python - Size: 156 MB - Last synced: 4 months ago - Pushed: 4 months ago - Stars: 6 - Forks: 1

aparna-chavan/New-York-City-Food-Inspections

Language: TSQL - Size: 12.3 MB - Last synced: 6 months ago - Pushed: 6 months ago - Stars: 0 - Forks: 0

pflooky/data-caterer-example

Example API implementation for Data Caterer

Language: Scala - Size: 1.81 MB - Last synced: 5 months ago - Pushed: 5 months ago - Stars: 2 - Forks: 1

opendatadiscovery/odd-collector-gcp 📦

Open-source GCP metadata collector based on ODD Specification

Language: Python - Size: 188 KB - Last synced: 4 months ago - Pushed: 8 months ago - Stars: 4 - Forks: 0

darenasc/auto-fes

Automated exploration of files in a folder structure to extract metadata and potential usage of information.

Language: Python - Size: 88.9 KB - Last synced: 7 months ago - Pushed: 7 months ago - Stars: 4 - Forks: 0

hi-primus/bumblebee

🚕 A spreadsheet-like data preparation web app that works over Optimus (Pandas, Dask, cuDF, Dask-cuDF, Spark and Vaex)

Language: Vue - Size: 23 MB - Last synced: 7 months ago - Pushed: 10 months ago - Stars: 130 - Forks: 34

Divya171997/Big-Data-for-Computational-Finance-Forex_Exchange_Rate

Analysis of a forex exchange rate dataset, covering its history over a period of time: time series analysis, data cleansing and transformation of the dataset into the format or structure required for time series analysis and machine learning, visualization of the dataset based on that time series analysis, and application of the ARIMA model, a machine learning algorithm, to forecast future values.

Language: Jupyter Notebook - Size: 2.22 MB - Last synced: 7 months ago - Pushed: almost 4 years ago - Stars: 0 - Forks: 0

ahmadassaf/roomba

A Node.js tool to examine the correctness of Open Data Metadata and build custom dataset profiles

Language: JavaScript - Size: 3.36 MB - Last synced: about 1 month ago - Pushed: 8 months ago - Stars: 12 - Forks: 3

VIDA-NYU/sato Fork of megagonlabs/sato

Fork of Sato for easy deployment as a Python package

Language: Python - Size: 32.6 MB - Last synced: 8 months ago - Pushed: over 3 years ago - Stars: 0 - Forks: 0

VIDA-NYU/auctus

Dataset search engine, discovering data from a variety of sources, profiling it, and allowing advanced queries on the index

Language: Python - Size: 10.1 MB - Last synced: 8 months ago - Pushed: over 1 year ago - Stars: 38 - Forks: 11

dartwinshu/dibimbing-python-for-data-analysis

Course of business intelligence bootcamp by Dibimbing

Language: Jupyter Notebook - Size: 1.19 MB - Last synced: 9 months ago - Pushed: over 1 year ago - Stars: 0 - Forks: 0

ray310/Panda-Helper

Panda-Helper: Data profiling utility for Pandas DataFrames and Series

Language: Python - Size: 67.4 KB - Last synced: 11 days ago - Pushed: almost 2 years ago - Stars: 2 - Forks: 0

mzj14/function-dependency-exploration

Homework for exploring function dependencies in data sets

Language: Python - Size: 547 KB - Last synced: 10 months ago - Pushed: about 7 years ago - Stars: 0 - Forks: 0

giagiannis/data-profiler

Data profiler is an attempt to model the behavior of a given operator for a set of datasets.

Language: Go - Size: 1.48 MB - Last synced: 11 months ago - Pushed: over 5 years ago - Stars: 3 - Forks: 1

statsim/profile

Profile. Generate data profiles in the browser (work in progress)

Language: JavaScript - Size: 1.19 MB - Last synced: 11 months ago - Pushed: 11 months ago - Stars: 3 - Forks: 0

christianbors/OpenRefineQualityMetrics

MetricDoc is an interactive visual exploration environment for assessing data quality

Language: JavaScript - Size: 5.83 MB - Last synced: about 1 month ago - Pushed: about 4 years ago - Stars: 8 - Forks: 1

Marini97/Functional-Dependency

Project for the topic of Data Profiling for the course Advanced Topics in Computer Science.

Language: Python - Size: 94.2 MB - Last synced: 12 months ago - Pushed: about 1 year ago - Stars: 1 - Forks: 0

baligoyem/dataqtor

πŸ”Your Data Quality Detector / Gain insight into your data and get it ready for use before you start working with it πŸ’‘πŸ“ŠπŸ› πŸ’Ž

Language: Python - Size: 9.43 MB - Last synced: about 1 year ago - Pushed: over 1 year ago - Stars: 14 - Forks: 6

xxl4tomxu98/data-engineering-python-great-expectations

Demo on Data Engineering using Great Expectations API

Language: Jupyter Notebook - Size: 6 MB - Last synced: about 1 year ago - Pushed: over 2 years ago - Stars: 4 - Forks: 0

wisemuffin/data-cicd-tpch

Language: Python - Size: 710 KB - Last synced: about 1 year ago - Pushed: about 2 years ago - Stars: 8 - Forks: 0

CodeLionX/distod

DISTOD algorithm: Distributed discovery of bidirectional order dependencies

Language: Scala - Size: 17.9 MB - Last synced: about 1 year ago - Pushed: over 2 years ago - Stars: 4 - Forks: 2

6point6/bytefreq Fork of minkymorgan/bytefreq

a nix DataProfiler for deep analysis of data quality on tabular files

Language: Awk - Size: 12.6 MB - Last synced: about 1 year ago - Pushed: over 1 year ago - Stars: 1 - Forks: 3

samapatil1729/Tools-In-Data-Science

This course will teach students to use popular tools for sourcing data, transforming it, building and optimizing models, communicating these as visual stories, and deploying them in production.

Language: HTML - Size: 25.1 MB - Last synced: about 1 year ago - Pushed: over 1 year ago - Stars: 2 - Forks: 0

gandalf1819/NYCOpenData-Profiling-Analysis

Open Data Profiling, Quality and Analysis on NYC OpenData dataset with semantic profiling using fuzzy ratio, Levenshtein distance and regex

Language: Jupyter Notebook - Size: 17.9 MB - Last synced: about 1 year ago - Pushed: over 3 years ago - Stars: 6 - Forks: 4

bballamudi/deequ Fork of awslabs/deequ

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.

Size: 68.9 MB - Last synced: 11 months ago - Pushed: over 4 years ago - Stars: 0 - Forks: 0

danielsobrado/dockerizations

My Dockerizations

Language: Dockerfile - Size: 8.79 KB - Last synced: about 1 year ago - Pushed: almost 2 years ago - Stars: 0 - Forks: 0

viperior/rimhistory

RimWorld game save data analyzer

Language: Python - Size: 9.03 MB - Last synced: 14 days ago - Pushed: 15 days ago - Stars: 2 - Forks: 0

raymon-ai/raymon

The official http://raymon.ai data profiling and logging library.

Language: Python - Size: 78.7 MB - Last synced: about 2 months ago - Pushed: about 2 years ago - Stars: 18 - Forks: 1

mtna/rds-js

Simplify usage of the RDS API for TypeScript/JavaScript developers

Language: TypeScript - Size: 1.71 MB - Last synced: 18 days ago - Pushed: over 1 year ago - Stars: 4 - Forks: 1

p-disha/NYC-Open-Dataset-Analysis

Identified data types for each distinct column value on 1900 data sets. For each column, summarized semantic types present in the column, using Fuzzy Logic, Levenshtein distance. Identified & derived inference the 3 most frequent 311 complaint types by borough.

Language: Python - Size: 25.4 KB - Last synced: about 1 year ago - Pushed: about 4 years ago - Stars: 1 - Forks: 0

amogh9594/r-language

Language: HTML - Size: 1020 KB - Last synced: about 1 year ago - Pushed: over 2 years ago - Stars: 0 - Forks: 0

dikoharyadhanto/Data-Analyst-Capstone-Project-Coursera

Data Analyst Capstone Project in Coursera

Language: Jupyter Notebook - Size: 2.67 MB - Last synced: almost 1 year ago - Pushed: over 2 years ago - Stars: 0 - Forks: 3

mtna/rds-js-examples

TypeScript/JavaScript example code using the RDS API

Language: TypeScript - Size: 7.04 MB - Last synced: about 2 months ago - Pushed: over 1 year ago - Stars: 3 - Forks: 3

bballamudi/Optimus Fork of hi-primus/optimus

:truck: Agile Data Science Workflows made easy with Pyspark

Size: 71.2 MB - Last synced: 11 months ago - Pushed: over 4 years ago - Stars: 0 - Forks: 0

arghyab0/FIFA19DreamTeam

A R Notebook to perform basic data profiling and exploratory data analysis on the FIFA19 players dataset and create a dream-team of the top 11 players considering various player attributes.

Language: HTML - Size: 3.1 MB - Last synced: about 1 year ago - Pushed: almost 3 years ago - Stars: 0 - Forks: 0

b-feldmann/hitucc

Distributable UCC Discovery Algorithm based on Akka

Language: Java - Size: 352 KB - Last synced: about 1 year ago - Pushed: over 4 years ago - Stars: 1 - Forks: 0

bballamudi/great_expectations Fork of great-expectations/great_expectations

Always know what to expect from your data.

Size: 19.3 MB - Last synced: 11 months ago - Pushed: over 4 years ago - Stars: 0 - Forks: 0