GitHub topics: record-linkage
maxharlow/csvmatch
🔎 Finds fuzzy matches between CSV files
Language: Python - Size: 158 KB - Last synced at: 2 days ago - Pushed at: about 2 months ago - Stars: 188 - Forks: 21

J535D165/recordlinkage
A powerful and modular toolkit for record linkage and duplicate detection in Python
Language: Python - Size: 70 MB - Last synced at: 3 days ago - Pushed at: over 1 year ago - Stars: 1,005 - Forks: 156

moj-analytical-services/splink
Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends
Language: Python - Size: 101 MB - Last synced at: 4 days ago - Pushed at: 5 days ago - Stars: 1,592 - Forks: 180

dedupeio/dedupe
🆔 A Python library for accurate and scalable fuzzy matching, record deduplication and entity-resolution.
Language: Python - Size: 5.98 MB - Last synced at: 4 days ago - Pushed at: 6 months ago - Stars: 4,295 - Forks: 560

openvenues/libpostal
A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
Language: C - Size: 36.3 MB - Last synced at: 5 days ago - Pushed at: 27 days ago - Stars: 4,251 - Forks: 433

OlivierBinette/Awesome-Entity-Resolution
List of entity resolution software and resources.
Size: 28.3 KB - Last synced at: 5 days ago - Pushed at: 3 months ago - Stars: 67 - Forks: 8

spindle-health/carduus
PySpark implementation of the Open Privacy Preserving Record Linkage (OPPRL) specification.
Language: Python - Size: 1.75 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 14 - Forks: 1

ajl2718/whereabouts
Fast, accurate, open-source geocoding in Python
Language: Python - Size: 7.59 MB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 35 - Forks: 6

dedupeio/dedupe-examples
🆔 Examples for using the dedupe library
Language: Python - Size: 5.12 MB - Last synced at: 1 day ago - Pushed at: 10 months ago - Stars: 412 - Forks: 214

Yomguithereal/talisman
Straightforward fuzzy matching, information retrieval and NLP building blocks for JavaScript.
Language: JavaScript - Size: 3.39 MB - Last synced at: 3 days ago - Pushed at: 11 months ago - Stars: 715 - Forks: 47

matchID-project/backend
Backend (Docker & API) for matchID project
Language: Python - Size: 9.99 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 11 - Forks: 14

k3jph/phonics-in-r
Phonetic Spelling Algorithms in R
Language: R - Size: 443 KB - Last synced at: 11 days ago - Pushed at: about 1 year ago - Stars: 31 - Forks: 8

OlivierBinette/er-evaluation
An End-to-End Evaluation Framework for Entity Resolution Systems
Language: Python - Size: 62.4 MB - Last synced at: 11 days ago - Pushed at: over 1 year ago - Stars: 28 - Forks: 9

vaneseltine/nominally
A maximum-strength name parser for record linkage.
Language: Python - Size: 1.09 MB - Last synced at: 11 days ago - Pushed at: 21 days ago - Stars: 37 - Forks: 1

ncn-foreigners/blocking
An R package for blocking records for record linkage / data deduplication based on approximate nearest neighbours algorithms.
Language: R - Size: 131 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 11 - Forks: 0

fritshermans/deduplipy
Python package for deduplication/entity resolution using active learning
Language: Python - Size: 521 KB - Last synced at: 11 days ago - Pushed at: 9 months ago - Stars: 79 - Forks: 9

J535D165/data-matching-software
A list of free data matching and record linkage software.
Size: 93.8 KB - Last synced at: 18 days ago - Pushed at: over 1 year ago - Stars: 382 - Forks: 42

ipums/hlink
Hierarchical record linkage at scale
Language: Python - Size: 13.3 MB - Last synced at: 25 days ago - Pushed at: 25 days ago - Stars: 12 - Forks: 2

vintasoftware/entity-embed
PyTorch library for transforming entities like companies, products, etc. into vectors to support scalable Record Linkage / Entity Resolution using Approximate Nearest Neighbors.
Language: Jupyter Notebook - Size: 11.4 MB - Last synced at: 7 days ago - Pushed at: over 2 years ago - Stars: 153 - Forks: 16

Bergvca/string_grouper
Super Fast String Matching in Python
Language: Python - Size: 2.59 MB - Last synced at: 27 days ago - Pushed at: 2 months ago - Stars: 367 - Forks: 76

NickCrews/mismo
The SQL/Ibis powered sklearn of record linkage
Language: Python - Size: 9.72 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 15 - Forks: 3

Senzing/awesome
Curated list of awesome software and resources for Senzing, The First Real-Time AI for Entity Resolution.
Language: Python - Size: 244 KB - Last synced at: 18 days ago - Pushed at: about 1 month ago - Stars: 57 - Forks: 2

PatentsView/PatentsView-Evaluation 📦
Evaluation and benchmarking of PatentsView disambiguation algorithms
Language: Python - Size: 156 MB - Last synced at: about 1 month ago - Pushed at: over 1 year ago - Stars: 13 - Forks: 8

ADBond/splinkclickhouse
Allows Clickhouse to be used as the execution engine for Splink
Language: Python - Size: 959 KB - Last synced at: 10 days ago - Pushed at: 3 months ago - Stars: 5 - Forks: 0

dell-research-harvard/linktransformer
A convenient way to link, deduplicate, aggregate and cluster data(frames) in Python using deep learning
Language: Python - Size: 1.81 MB - Last synced at: 10 days ago - Pushed at: about 2 months ago - Stars: 118 - Forks: 11

sssairohit/enm
Excel Name Matching is a Python-based automation tool that standardizes names in an Excel file using fuzzy matching techniques. It ensures consistency for data processing, making it easier to use VLOOKUP and other operations.
Language: Python - Size: 35.2 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

ul-mds/gecko
Python library for the generation and mutation of realistic personal identification data at scale
Language: Python - Size: 5.51 MB - Last synced at: about 1 month ago - Pushed at: 4 months ago - Stars: 6 - Forks: 1

J535D165/recordlinkage-annotator
A browser user interface for manual labeling of record pairs.
Language: JavaScript - Size: 3.49 MB - Last synced at: about 2 months ago - Pushed at: almost 2 years ago - Stars: 46 - Forks: 8

dedupeio/csvdedupe
🆔 Command line tool for deduplicating CSV files
Language: Python - Size: 1.12 MB - Last synced at: about 22 hours ago - Pushed at: about 5 years ago - Stars: 420 - Forks: 83

usc-isi-i2/rltk
Record Linkage ToolKit (Find and link entities)
Language: Python - Size: 9.59 MB - Last synced at: 5 days ago - Pushed at: almost 2 years ago - Stars: 110 - Forks: 23

zouzias/spark-lucenerdd
Spark RDD with Lucene's query and entity linkage capabilities
Language: Scala - Size: 11.3 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 125 - Forks: 36

dice-group/LIMES
Link Discovery Framework for Metric Spaces.
Language: JavaScript - Size: 38.4 MB - Last synced at: about 2 months ago - Pushed at: 9 months ago - Stars: 130 - Forks: 54

data61/blocklib
Python implementations of record linkage blocking techniques.
Language: Python - Size: 1.13 MB - Last synced at: 7 days ago - Pushed at: over 1 year ago - Stars: 20 - Forks: 4

maxharlow/textmatch
🔎 Finds fuzzy matches between datasets
Language: Python - Size: 120 KB - Last synced at: about 1 month ago - Pushed at: 4 months ago - Stars: 12 - Forks: 0

ihmeuw/person_linkage_case_study
Emulates the methods the US Census Bureau uses to link people across multiple data sources, using open-source software (Splink) and simulated data (from pseudopeople).
Language: HTML - Size: 4.43 MB - Last synced at: about 1 month ago - Pushed at: 7 months ago - Stars: 3 - Forks: 0

krokane/movie_sites_entity_linking
Entity resolution project linking common movies between IMDb and Rotten Tomatoes using blocking, string similarity functions, and record linkage techniques. After the common entities are found, a knowledge graph is created to visualize the dataset using a schema ontology.
Language: Jupyter Notebook - Size: 26.2 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

Wikidata/soweego
Link Wikidata items to large catalogs
Language: Python - Size: 7.87 MB - Last synced at: 5 days ago - Pushed at: 3 months ago - Stars: 96 - Forks: 9

Xhst/data-engineering-projects
Projects for the Data Engineering course taught by Professor Paolo Merialdo at Roma Tre University.
Language: Python - Size: 114 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 1

data61/clkhash
CLK hash: hash PII for entity matching
Language: Python - Size: 3.49 MB - Last synced at: 7 days ago - Pushed at: 13 days ago - Stars: 47 - Forks: 9

ing-bank/spark-matcher
Record matching and entity resolution at scale in Spark
Language: Python - Size: 579 KB - Last synced at: about 1 month ago - Pushed at: over 1 year ago - Stars: 34 - Forks: 8

ErcinDedeoglu/Postalized
The ultimate address parsing tool. Effortlessly parse and expand postal data with our cutting-edge technology. Simplify your mailing, enhance accuracy, and embrace the future of postal efficiency. Get Postalized—where precision meets convenience.
Language: C - Size: 5.98 MB - Last synced at: 3 months ago - Pushed at: 4 months ago - Stars: 2 - Forks: 1

Evnsn/awsome-entity-resolution
A collection of awesome resources regarding Record Linkage.
Size: 13.7 KB - Last synced at: 18 days ago - Pushed at: 9 months ago - Stars: 7 - Forks: 0

iesl/stance
Learned string similarity for entity names using optimal transport.
Language: Python - Size: 71.3 KB - Last synced at: about 1 month ago - Pushed at: over 4 years ago - Stars: 35 - Forks: 3

ngmarchant/comparator
Similarity and distance measures for clustering and record linkage applications in R
Language: R - Size: 275 KB - Last synced at: 11 days ago - Pushed at: about 3 years ago - Stars: 18 - Forks: 0

cjerzak/LinkOrgs-software
LinkOrgs: An R package for linking records on organizations using half a billion open-collaborated records from LinkedIn
Language: R - Size: 90.8 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 11 - Forks: 1

dobraczka/klinker
🧱 blocking methods for entity resolution
Language: Python - Size: 1.19 MB - Last synced at: about 1 month ago - Pushed at: 8 months ago - Stars: 6 - Forks: 0

NHSDigital/mps_diagnostics
Interpretable metadata for the results of NHS England record linkage
Language: Python - Size: 537 KB - Last synced at: about 2 months ago - Pushed at: about 1 year ago - Stars: 3 - Forks: 0

tteofili/certa
CERTA - Computing Entity Resolution explanations with TriAngles
Language: Python - Size: 26.8 MB - Last synced at: about 1 month ago - Pushed at: 5 months ago - Stars: 5 - Forks: 3

dobraczka/eche
🕸️ Little helper for handling entity clusters
Language: Python - Size: 95.7 KB - Last synced at: about 1 month ago - Pushed at: about 1 year ago - Stars: 3 - Forks: 0

cleanzr/clevr
Clustering and Link Prediction Evaluation in R
Language: R - Size: 114 KB - Last synced at: about 1 month ago - Pushed at: over 1 year ago - Stars: 12 - Forks: 3

Felipecastanog/final_project_ENEL645
PRIVACY-PRESERVING RECORD LINKAGE METHODS FOR HOMELESSNESS DATA
Language: Jupyter Notebook - Size: 4.32 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0

data61/anonlink
Python implementation of anonymous linkage using cryptographic linkage keys
Language: Python - Size: 3.19 MB - Last synced at: 7 days ago - Pushed at: about 1 year ago - Stars: 65 - Forks: 8

ul-mds/pprl
Collection of software packages for performing privacy-preserving record linkage based on Bloom filters
Language: Python - Size: 332 KB - Last synced at: 9 days ago - Pushed at: 6 months ago - Stars: 1 - Forks: 0

t2solve/recordlinkagenet
library for dataset comparison
Language: C# - Size: 279 KB - Last synced at: about 2 months ago - Pushed at: over 1 year ago - Stars: 2 - Forks: 0

J535D165/FEBRL-fork-v0.4.2
Fork of the Freely Extensible Biomedical Record Linkage program
Language: Python - Size: 6.36 MB - Last synced at: about 2 months ago - Pushed at: over 8 years ago - Stars: 24 - Forks: 21

data61/anonlink-entity-service
Privacy Preserving Record Linkage Service
Language: Python - Size: 12.2 MB - Last synced at: 7 days ago - Pushed at: about 2 years ago - Stars: 26 - Forks: 8

ngmarchant/oasis
A Python package for efficient evaluation based on OASIS (Optimal Asymptotic Sequential Importance Sampling).
Language: Python - Size: 16.3 MB - Last synced at: 22 days ago - Pushed at: almost 4 years ago - Stars: 15 - Forks: 3

moj-analytical-services/splink_graph 📦
pyspark-parallelised functions producing graph-theoretical metrics in connected component clusters for use in record-linkage (or other domains)
Language: HTML - Size: 2.71 MB - Last synced at: about 1 month ago - Pushed at: over 1 year ago - Stars: 10 - Forks: 3

ul-mds/gecko-examples
Example scripts for generating data with Gecko
Language: Python - Size: 36.1 KB - Last synced at: about 2 months ago - Pushed at: about 1 year ago - Stars: 2 - Forks: 0

joshuacortez/data-matching-workflow
A workflow template for deduplication and record linkage using the Dedupe library
Language: Jupyter Notebook - Size: 3.47 MB - Last synced at: 10 months ago - Pushed at: almost 5 years ago - Stars: 0 - Forks: 0

chansooligans/oagdedupe
Developed for Use by NY Office of the Attorney General: A Python library for scalable entity resolution, using active learning to learn blocking configurations, generate comparison pairs, then classify matches
Language: Python - Size: 1.64 MB - Last synced at: 3 days ago - Pushed at: over 1 year ago - Stars: 2 - Forks: 2

data61/anonlink-client
Language: Python - Size: 3.67 MB - Last synced at: 7 days ago - Pushed at: about 2 years ago - Stars: 5 - Forks: 2

jimbrig/lossrunAnalyzer 📦
R Package and Shiny App to Analyze Insurance Lossruns
Language: R - Size: 11.7 KB - Last synced at: 6 months ago - Pushed at: over 5 years ago - Stars: 4 - Forks: 0

cleanzr/dblink
Distributed Bayesian Entity Resolution in Apache Spark
Language: Scala - Size: 455 KB - Last synced at: 9 months ago - Pushed at: almost 4 years ago - Stars: 57 - Forks: 9

tteofili/cheapER
Low Cost Entity Resolution with Transformers
Language: Jupyter Notebook - Size: 10.6 MB - Last synced at: about 2 months ago - Pushed at: 11 months ago - Stars: 0 - Forks: 0

thomaswyrick/duplicate-data-generator
A Python script for generating duplicate data to test the performance of record linkage and master data management systems.
Language: Python - Size: 12.1 MB - Last synced at: 11 months ago - Pushed at: 12 months ago - Stars: 6 - Forks: 2

foxcroftjn/PAKDD-Class-Ratio
Supplementary code for "Class ratio and its implications for reproducibility and performance in record linkage" presented at The Pacific-Asia Conference on Knowledge Discovery and Data Mining 2024.
Language: Jupyter Notebook - Size: 34.4 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

catalyst-cooperative/ccai-entity-matching 📦
An exploration of generalizable approaches to unsupervised entity matching for use in linking tabular public energy data sources.
Language: Jupyter Notebook - Size: 12.2 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 1 - Forks: 1

ul-mds/gecko-data
Example data sources as a starting point for working with Gecko
Language: Jupyter Notebook - Size: 4.66 MB - Last synced at: about 2 months ago - Pushed at: about 1 year ago - Stars: 1 - Forks: 0

ikatic/StringMetrics
The StringMetrics project implements 7 string metric algorithms: Hamming, Dice, Jaro, Jaro-Winkler, Soundex, Levenshtein, and Damerau-Levenshtein. Metrics compare strings using IMetric interface providing an approximate similarity score from 0 (no match) to 1 (exact match) useful in data cleansing, record linkage, NLP, fraud detection, etc.
Language: C# - Size: 45.9 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

gpoulter/pydedupe 📦
(Archived) A Python library for record linkage and deduplication.
Language: Python - Size: 1.36 MB - Last synced at: 11 months ago - Pushed at: about 1 year ago - Stars: 19 - Forks: 2

zzachw/MedLink
KDD'23 | MedLink: De-Identified Patient Health Record Linkage
Language: Jupyter Notebook - Size: 321 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

andreac0/BeRTo-RecordLinkageTool
Python-based tool to link legal entity datasets when no common ID is available, using name and address information
Language: Jupyter Notebook - Size: 16.5 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 3 - Forks: 0

DecioXXIV/ID-hw6-DataIntegration Fork of AlessandroPesare/Progetto_Finale_IDD
Repository for HW6, Data Engineering course (Corso di Ingegneria dei Dati) 2023/24
Language: Jupyter Notebook - Size: 46.8 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

ropeladder/record-linkage-resources
Resources for tackling record linkage / deduplication / data matching problems
Size: 26.4 KB - Last synced at: about 1 year ago - Pushed at: over 1 year ago - Stars: 103 - Forks: 16

fgregg/smered
Mirror of https://bitbucket.org/resteorts/smered
Language: Java - Size: 4.48 MB - Last synced at: about 1 month ago - Pushed at: about 8 years ago - Stars: 5 - Forks: 0

UltraArceus3/AttributeSelectionAlgorithm
This project is an algorithm that helps users find attributes that are good for performing record linkage.
Language: C++ - Size: 24.5 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 1

ufbmi/onefl-deduper
Tools for EHR patient de-duplication (aka entity resolution)
Language: Python - Size: 19.4 MB - Last synced at: 5 days ago - Pushed at: about 2 years ago - Stars: 12 - Forks: 4

john-thuo1/RecordLinkage
Brief Overview of record linkage implementation
Language: Jupyter Notebook - Size: 155 KB - Last synced at: about 1 year ago - Pushed at: over 1 year ago - Stars: 2 - Forks: 0

tteofili/er-utils
utilities for working with Entity Resolution models
Language: Python - Size: 35.2 KB - Last synced at: about 2 months ago - Pushed at: over 3 years ago - Stars: 2 - Forks: 0

ziqizhang/scholarlydata
Experimental code for author name and affiliation linking/disambiguation
Language: Java - Size: 148 MB - Last synced at: about 1 year ago - Pushed at: almost 4 years ago - Stars: 4 - Forks: 2

KirovVerst/qlink
Entity Resolution and Record Linkage library
Language: Python - Size: 4.84 MB - Last synced at: 4 days ago - Pushed at: about 2 years ago - Stars: 7 - Forks: 0

OlivierBinette/groupbyrule
Deduplicate data using fuzzy and deterministic matching rules.
Language: Python - Size: 11.9 MB - Last synced at: about 1 year ago - Pushed at: about 2 years ago - Stars: 7 - Forks: 0

cleanzr/dblinkR
An R interface for the dblink Spark application
Language: R - Size: 19 MB - Last synced at: 9 months ago - Pushed at: almost 4 years ago - Stars: 5 - Forks: 1

cleanzr/representr
Create representative records post-record linkage
Language: R - Size: 1.02 MB - Last synced at: 9 months ago - Pushed at: over 1 year ago - Stars: 7 - Forks: 0

ae3000/matchain
Record linkage - simple, flexible, efficient.
Language: Python - Size: 3.96 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

CangyuanLi/floof
Fuzzy matching made easy
Language: Rust - Size: 359 KB - Last synced at: 27 days ago - Pushed at: about 1 year ago - Stars: 5 - Forks: 0

mgranchelli/ingegneria-dei-dati-2022-23
Homework for the 2022-2023 Ingegneria dei dati (Data Engineering) course at Roma Tre University.
Language: Jupyter Notebook - Size: 32.1 MB - Last synced at: over 1 year ago - Pushed at: about 2 years ago - Stars: 0 - Forks: 0

RecordLinkageIG/RecordLinkageIG.github.io
Blog of the American Statistical Association's Record Linkage Interest Group.
Language: HTML - Size: 6.24 MB - Last synced at: about 1 year ago - Pushed at: over 2 years ago - Stars: 4 - Forks: 2

adityarbhat/Data-Challenge-Projects
Contains solution notebooks of attempted data challenges
Language: Jupyter Notebook - Size: 28.1 MB - Last synced at: over 1 year ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

a-wars/AGIW_DeepER
Implementation of DeepER system (record linkage with neural networks)
Language: Jupyter Notebook - Size: 30.8 MB - Last synced at: almost 2 years ago - Pushed at: almost 6 years ago - Stars: 3 - Forks: 0

cleanzr/fasthash
Performs unique entity estimation corresponding to Chen, Shrivastava, Steorts (2018).
Language: Python - Size: 1.6 MB - Last synced at: 9 months ago - Pushed at: over 6 years ago - Stars: 14 - Forks: 3

coletl/geocode
A short guide to approximate geocoding
Language: HTML - Size: 178 MB - Last synced at: almost 2 years ago - Pushed at: about 7 years ago - Stars: 0 - Forks: 0

dtmlinh/Food-Inspections-PostgreSQL
A database management system for restaurant inspection records, restaurant-related tweets, and other relevant data.
Language: Python - Size: 9.55 MB - Last synced at: almost 2 years ago - Pushed at: over 4 years ago - Stars: 0 - Forks: 1

saifmahamood/sortableChallenge
My entry to a data analysis / record linkage coding challenge
Language: Python - Size: 427 KB - Last synced at: almost 2 years ago - Pushed at: over 7 years ago - Stars: 0 - Forks: 1

gcgbarbosa/rl-accuracy
Simple software that generates features and assesses the accuracy of record linkage.
Language: Jupyter Notebook - Size: 59.6 KB - Last synced at: almost 2 years ago - Pushed at: almost 6 years ago - Stars: 0 - Forks: 0

purple29th/purpledproject
A META (FACEBOOK) PROJECT - Purpled allows artists to distribute content and monetize their artistry. Contribute to the success of both new and experienced artists. Every like, play, remark, and repost reverberates, establishing a creator's reputation, motivating them, and expanding their reach, so you always have great music at your fingertips.
Language: Java - Size: 1.59 MB - Last synced at: almost 2 years ago - Pushed at: almost 3 years ago - Stars: 0 - Forks: 0

DForshner/RecordLinkagePipelineDemo
Exploring linking records from disparate data sources
Language: C# - Size: 980 KB - Last synced at: almost 2 years ago - Pushed at: almost 3 years ago - Stars: 1 - Forks: 0

magabrielaa/computer-science-applications
Range of computer science applications using Python.
Language: Python - Size: 1.43 MB - Last synced at: about 1 year ago - Pushed at: almost 2 years ago - Stars: 0 - Forks: 0

fsentin/anon-reclinkage
K-Anonymization & Record-linkage Attack
Language: Jupyter Notebook - Size: 1.55 MB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 0 - Forks: 0
