Ecosyste.ms: Repos
An open API service providing repository metadata for many open source software ecosystems.
GitHub topics: data-profiling
dqops/dqo
Data Quality and Observability platform for the whole data lifecycle, from profiling new data sources to full automation with Data Observability. Configure data quality checks from the UI or in YAML files, let DQOps run the data quality checks daily to detect data quality issues.
Language: Java - Size: 73.3 MB - Last synced: about 11 hours ago - Pushed: about 14 hours ago - Stars: 57 - Forks: 12
tsegall/fta
Metadata/data identification Java library. Identifies Semantic Type information (e.g. Gender, Age, Color, Country,...). Extensive country/language support. Extensible via user-defined plugins. Comprehensive Profiling support.
Language: Java - Size: 6.81 MB - Last synced: about 11 hours ago - Pushed: about 14 hours ago - Stars: 22 - Forks: 2
DataKitchen/data-observability-installer
Installer for DataKitchen's Open Source Data Observability Products. Data breaks. Servers break. Your toolchain breaks. Ensure your team is the first to know and the first to solve with visibility across and down your data estate. Save time with simple, fast data quality test generation and execution. Trust your data, tools, and systems end to end.
Language: Python - Size: 231 KB - Last synced: about 14 hours ago - Pushed: about 15 hours ago - Stars: 28 - Forks: 0
ydataai/ydata-profiling
1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.
Language: Python - Size: 594 MB - Last synced: about 14 hours ago - Pushed: about 18 hours ago - Stars: 12,104 - Forks: 1,632
cleanlab/cleanlab-studio
Client interface for all things Cleanlab Studio
Language: Python - Size: 2.7 MB - Last synced: 3 days ago - Pushed: 3 days ago - Stars: 21 - Forks: 4
sodadata/soda-core
:zap: Data quality testing for the modern data stack (SQL, Spark, and Pandas) https://www.soda.io
Language: Python - Size: 2.68 MB - Last synced: 3 days ago - Pushed: 4 days ago - Stars: 1,778 - Forks: 185
cleanlab/cleanlab
The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
Language: Python - Size: 11.1 MB - Last synced: 5 days ago - Pushed: 6 days ago - Stars: 8,814 - Forks: 677
cleanlab/cleanvision
Automatically find issues in image datasets and practice data-centric computer vision.
Language: Python - Size: 2.11 MB - Last synced: 10 days ago - Pushed: 27 days ago - Stars: 926 - Forks: 69
hadarsharon/compars
DataFrame comparison done right, powered by Rust with polars (AKA the bear-agnostic 🐻 🐼 🐨 🐻‍❄️ DataFrame comparison library)
Language: Python - Size: 36.1 KB - Last synced: 15 days ago - Pushed: 15 days ago - Stars: 0 - Forks: 0
Swiple/swiple
Swiple enables you to easily observe, understand, validate and improve the quality of your data
Language: Python - Size: 122 MB - Last synced: 19 days ago - Pushed: 19 days ago - Stars: 78 - Forks: 10
open-metadata/OpenMetadata
Open Standard for Metadata. A Single place to Discover, Collaborate and Get your data right.
Language: TypeScript - Size: 1.3 GB - Last synced: 19 days ago - Pushed: 19 days ago - Stars: 4,168 - Forks: 837
hpcc-systems/DataPatterns
HPCC Systems ECL bundle that provides some basic data profiling and research tools to an ECL programmer
Language: ECL - Size: 621 KB - Last synced: 18 days ago - Pushed: 19 days ago - Stars: 3 - Forks: 4
InfuseAI/piperider
Code review for data in dbt
Language: Python - Size: 32.6 MB - Last synced: 7 days ago - Pushed: 2 months ago - Stars: 469 - Forks: 21
fbdesignpro/sweetviz
Visualize and compare datasets, target values and associations, with one line of code.
Language: Python - Size: 15.3 MB - Last synced: 22 days ago - Pushed: 6 months ago - Stars: 2,839 - Forks: 268
polyaxon/traceml
Engine for ML/Data tracking, visualization, explainability, drift detection, and dashboards for Polyaxon.
Language: Python - Size: 118 MB - Last synced: about 1 month ago - Pushed: about 1 month ago - Stars: 492 - Forks: 43
ing-bank/popmon
Monitor the stability of a Pandas or Spark dataframe
Language: Python - Size: 5.77 MB - Last synced: about 18 hours ago - Pushed: 4 months ago - Stars: 487 - Forks: 35
open-metadata/openmetadata-site
Open Standard for Metadata. A Single place to Discover, Collaborate and Get your data right.
Language: CSS - Size: 40.5 MB - Last synced: 25 days ago - Pushed: 25 days ago - Stars: 11 - Forks: 8
SJTU-Quant/awesome-ml-data-quality-papers
Papers about training data quality management for ML models.
Size: 854 KB - Last synced: about 1 month ago - Pushed: about 1 month ago - Stars: 7 - Forks: 0
Desbordante/desbordante-core
Desbordante is a high-performance data profiler capable of discovering many different patterns in data using various algorithms. It also allows running data cleaning scenarios using these algorithms. Desbordante has a console version and an easy-to-use web application.
Language: C++ - Size: 126 MB - Last synced: about 1 month ago - Pushed: about 1 month ago - Stars: 62 - Forks: 50
polyaxon/haupt
Lineage metadata API, artifacts streams, sandbox, API, and spaces for Polyaxon
Language: Python - Size: 1000 KB - Last synced: about 1 month ago - Pushed: about 1 month ago - Stars: 452 - Forks: 213
psebenick/data-profiling
A set of scripts to pull metadata and data profiling metrics from relational database systems
Language: Python - Size: 44.9 KB - Last synced: about 1 month ago - Pushed: about 1 month ago - Stars: 67 - Forks: 19
opendatadiscovery/odd-platform
First open-source data discovery and observability platform. We make life easy for data practitioners so you can focus on your business.
Language: Java - Size: 28.1 MB - Last synced: about 1 month ago - Pushed: about 1 month ago - Stars: 1,104 - Forks: 91
great-expectations/great_expectations
Always know what to expect from your data.
Language: Python - Size: 189 MB - Last synced: about 1 month ago - Pushed: about 1 month ago - Stars: 9,420 - Forks: 1,464
jmakeig/data-profile
Sandbox to test out ideas for profiling document data
Language: JavaScript - Size: 122 KB - Last synced: about 1 month ago - Pushed: about 1 year ago - Stars: 0 - Forks: 0
opendatadiscovery/odd-collector
Open-source metadata collector based on ODD Specification
Language: Python - Size: 1.96 MB - Last synced: 22 days ago - Pushed: 7 months ago - Stars: 41 - Forks: 13
hi-primus/optimus
:truck: Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark
Language: Python - Size: 110 MB - Last synced: about 1 month ago - Pushed: about 2 months ago - Stars: 1,441 - Forks: 233
Jakob-L-M/partial-SPIDER
A partialized version of the SPIDER Algorithm for inclusion dependency discovery
Language: Java - Size: 240 KB - Last synced: 15 days ago - Pushed: 16 days ago - Stars: 0 - Forks: 0
CCFGomes/Yelp-InsightX
Yelp InsightX is a comprehensive analysis conducted on the Yelp dataset, focusing on profiling, understanding, and deriving insights from the data.
Size: 13.7 KB - Last synced: about 2 months ago - Pushed: about 2 months ago - Stars: 0 - Forks: 0
apicrafter/metacrafter
Metadata and data identification tool and Python library. Identifies PII, common identifiers, language specific identifiers. Fully customizable and flexible rules
Language: Python - Size: 81.7 MB - Last synced: 9 days ago - Pushed: 3 months ago - Stars: 38 - Forks: 6
mtna/rds-r
R package to simplify the usage of the RDS REST API and provide convenience in accessing data and metadata.
Language: R - Size: 1.41 MB - Last synced: about 2 months ago - Pushed: 10 months ago - Stars: 4 - Forks: 0
bodybuilders-team/ist-meic-cd-g03
Data Science project of group 03 - MEIC @ IST 2023/2024.
Language: Python - Size: 146 MB - Last synced: about 1 month ago - Pushed: 5 months ago - Stars: 0 - Forks: 0
dm4ml/gate
Drift detection module for machine learning pipelines.
Language: Python - Size: 1.46 MB - Last synced: 18 days ago - Pushed: 11 months ago - Stars: 19 - Forks: 1
ismaildawoodjee/GreatEx
A project for exploring how Great Expectations can be used to ensure data quality and validate batches within a data pipeline defined in Airflow.
Language: Python - Size: 1.73 MB - Last synced: about 1 month ago - Pushed: over 1 year ago - Stars: 20 - Forks: 6
CoDS-GCS/kglids
Linked Data Science powered by Knowledge Graphs
Language: Python - Size: 156 MB - Last synced: 4 months ago - Pushed: 4 months ago - Stars: 6 - Forks: 1
aparna-chavan/New-York-City-Food-Inspections
Language: TSQL - Size: 12.3 MB - Last synced: 6 months ago - Pushed: 6 months ago - Stars: 0 - Forks: 0
pflooky/data-caterer-example
Example API implementation for Data Caterer
Language: Scala - Size: 1.81 MB - Last synced: 5 months ago - Pushed: 5 months ago - Stars: 2 - Forks: 1
opendatadiscovery/odd-collector-gcp
Open-source GCP metadata collector based on ODD Specification
Language: Python - Size: 188 KB - Last synced: 4 months ago - Pushed: 8 months ago - Stars: 4 - Forks: 0
darenasc/auto-fes
Automated exploration of files in a folder structure to extract metadata and potential usage of information.
Language: Python - Size: 88.9 KB - Last synced: 7 months ago - Pushed: 7 months ago - Stars: 4 - Forks: 0
hi-primus/bumblebee
A spreadsheet-like data preparation web app that works over Optimus (Pandas, Dask, cuDF, Dask-cuDF, Spark and Vaex)
Language: Vue - Size: 23 MB - Last synced: 7 months ago - Pushed: 10 months ago - Stars: 130 - Forks: 34
Divya171997/Big-Data-for-Computational-Finance-Forex_Exchange_Rate
Analysis of a forex exchange rate dataset covering its history over time: time series analysis; data cleansing and transformation of the dataset into the structure required for time series analysis and machine learning; visualization of the dataset based on that analysis; and application of an ARIMA model to forecast future values.
Language: Jupyter Notebook - Size: 2.22 MB - Last synced: 7 months ago - Pushed: almost 4 years ago - Stars: 0 - Forks: 0
ahmadassaf/roomba
A Node.js tool to examine the correctness of Open Data Metadata and build custom dataset profiles
Language: JavaScript - Size: 3.36 MB - Last synced: about 1 month ago - Pushed: 8 months ago - Stars: 12 - Forks: 3
VIDA-NYU/sato (fork of megagonlabs/sato)
Fork of Sato for easy deployment as a Python package
Language: Python - Size: 32.6 MB - Last synced: 8 months ago - Pushed: over 3 years ago - Stars: 0 - Forks: 0
VIDA-NYU/auctus
Dataset search engine, discovering data from a variety of sources, profiling it, and allowing advanced queries on the index
Language: Python - Size: 10.1 MB - Last synced: 8 months ago - Pushed: over 1 year ago - Stars: 38 - Forks: 11
dartwinshu/dibimbing-python-for-data-analysis
Course of business intelligence bootcamp by Dibimbing
Language: Jupyter Notebook - Size: 1.19 MB - Last synced: 9 months ago - Pushed: over 1 year ago - Stars: 0 - Forks: 0
ray310/Panda-Helper
Panda-Helper: Data profiling utility for Pandas DataFrames and Series
Language: Python - Size: 67.4 KB - Last synced: 11 days ago - Pushed: almost 2 years ago - Stars: 2 - Forks: 0
mzj14/function-dependency-exploration
Homework for exploring function dependencies in data sets
Language: Python - Size: 547 KB - Last synced: 10 months ago - Pushed: about 7 years ago - Stars: 0 - Forks: 0
giagiannis/data-profiler
Data profiler is an attempt to model the behavior of a given operator for a set of datasets.
Language: Go - Size: 1.48 MB - Last synced: 11 months ago - Pushed: over 5 years ago - Stars: 3 - Forks: 1
statsim/profile
Profile. Generate data profiles in the browser (work in progress)
Language: JavaScript - Size: 1.19 MB - Last synced: 11 months ago - Pushed: 11 months ago - Stars: 3 - Forks: 0
christianbors/OpenRefineQualityMetrics
MetricDoc is an interactive visual exploration environment for assessing data quality
Language: JavaScript - Size: 5.83 MB - Last synced: about 1 month ago - Pushed: about 4 years ago - Stars: 8 - Forks: 1
Marini97/Functional-Dependency
Project for the topic of Data Profiling for the course Advanced Topics in Computer Science.
Language: Python - Size: 94.2 MB - Last synced: 12 months ago - Pushed: about 1 year ago - Stars: 1 - Forks: 0
baligoyem/dataqtor
Your Data Quality Detector / Gain insight into your data and get it ready for use before you start working with it
Language: Python - Size: 9.43 MB - Last synced: about 1 year ago - Pushed: over 1 year ago - Stars: 14 - Forks: 6
xxl4tomxu98/data-engineering-python-great-expectations
Demo on Data Engineering using Great Expectations API
Language: Jupyter Notebook - Size: 6 MB - Last synced: about 1 year ago - Pushed: over 2 years ago - Stars: 4 - Forks: 0
wisemuffin/data-cicd-tpch
Language: Python - Size: 710 KB - Last synced: about 1 year ago - Pushed: about 2 years ago - Stars: 8 - Forks: 0
CodeLionX/distod
DISTOD algorithm: Distributed discovery of bidirectional order dependencies
Language: Scala - Size: 17.9 MB - Last synced: about 1 year ago - Pushed: over 2 years ago - Stars: 4 - Forks: 2
6point6/bytefreq (fork of minkymorgan/bytefreq)
a nix DataProfiler for deep analysis of data quality on tabular files
Language: Awk - Size: 12.6 MB - Last synced: about 1 year ago - Pushed: over 1 year ago - Stars: 1 - Forks: 3
samapatil1729/Tools-In-Data-Science
This course will teach students to use popular tools for sourcing data, transforming it, building and optimizing models, communicating these as visual stories, and deploying them in production.
Language: HTML - Size: 25.1 MB - Last synced: about 1 year ago - Pushed: over 1 year ago - Stars: 2 - Forks: 0
gandalf1819/NYCOpenData-Profiling-Analysis
Open Data Profiling, Quality and Analysis on NYC OpenData dataset with semantic profiling using fuzzy ratio, Levenshtein distance and regex
Language: Jupyter Notebook - Size: 17.9 MB - Last synced: about 1 year ago - Pushed: over 3 years ago - Stars: 6 - Forks: 4
bballamudi/deequ (fork of awslabs/deequ)
Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Size: 68.9 MB - Last synced: 11 months ago - Pushed: over 4 years ago - Stars: 0 - Forks: 0
danielsobrado/dockerizations
My Dockerizations
Language: Dockerfile - Size: 8.79 KB - Last synced: about 1 year ago - Pushed: almost 2 years ago - Stars: 0 - Forks: 0
viperior/rimhistory
RimWorld game save data analyzer
Language: Python - Size: 9.03 MB - Last synced: 14 days ago - Pushed: 15 days ago - Stars: 2 - Forks: 0
raymon-ai/raymon
The official http://raymon.ai data profiling and logging library.
Language: Python - Size: 78.7 MB - Last synced: about 2 months ago - Pushed: about 2 years ago - Stars: 18 - Forks: 1
mtna/rds-js
Simplify usage of the RDS API for TypeScript/JavaScript developers
Language: TypeScript - Size: 1.71 MB - Last synced: 18 days ago - Pushed: over 1 year ago - Stars: 4 - Forks: 1
p-disha/NYC-Open-Dataset-Analysis
Identified data types for each distinct column value across 1900 data sets. For each column, summarized the semantic types present using fuzzy logic and Levenshtein distance. Identified the 3 most frequent 311 complaint types by borough and derived inferences from them.
Language: Python - Size: 25.4 KB - Last synced: about 1 year ago - Pushed: about 4 years ago - Stars: 1 - Forks: 0
amogh9594/r-language
Language: HTML - Size: 1020 KB - Last synced: about 1 year ago - Pushed: over 2 years ago - Stars: 0 - Forks: 0
dikoharyadhanto/Data-Analyst-Capstone-Project-Coursera
Data Analyst Capstone Project in Coursera
Language: Jupyter Notebook - Size: 2.67 MB - Last synced: almost 1 year ago - Pushed: over 2 years ago - Stars: 0 - Forks: 3
mtna/rds-js-examples
TypeScript/JavaScript example code using the RDS API
Language: TypeScript - Size: 7.04 MB - Last synced: about 2 months ago - Pushed: over 1 year ago - Stars: 3 - Forks: 3
bballamudi/Optimus (fork of hi-primus/optimus)
:truck: Agile Data Science Workflows made easy with Pyspark
Size: 71.2 MB - Last synced: 11 months ago - Pushed: over 4 years ago - Stars: 0 - Forks: 0
arghyab0/FIFA19DreamTeam
An R Notebook to perform basic data profiling and exploratory data analysis on the FIFA19 players dataset and create a dream team of the top 11 players considering various player attributes.
Language: HTML - Size: 3.1 MB - Last synced: about 1 year ago - Pushed: almost 3 years ago - Stars: 0 - Forks: 0
b-feldmann/hitucc
Distributable UCC Discovery Algorithm based on Akka
Language: Java - Size: 352 KB - Last synced: about 1 year ago - Pushed: over 4 years ago - Stars: 1 - Forks: 0
bballamudi/great_expectations (fork of great-expectations/great_expectations)
Always know what to expect from your data.
Size: 19.3 MB - Last synced: 11 months ago - Pushed: over 4 years ago - Stars: 0 - Forks: 0