GitHub topics: minhash
sourmash-bio/sourmash
Quickly search, compare, and analyze genomic and metagenomic data sets.
Language: Python - Size: 46.9 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 502 - Forks: 83

dynatrace-oss/hash4j
Dynatrace hash library for Java
Language: Java - Size: 37 MB - Last synced at: 1 day ago - Pushed at: 2 days ago - Stars: 103 - Forks: 11

codelibs/minhash
Provides tools for the b-bit MinHash algorithm.
Language: Java - Size: 46.9 KB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 35 - Forks: 10

Callidon/bloom-filters
JS implementation of probabilistic data structures: Bloom Filter (and its derivatives), HyperLogLog, Count-Min Sketch, Top-K and MinHash
Language: TypeScript - Size: 9.16 MB - Last synced at: 6 days ago - Pushed at: 26 days ago - Stars: 398 - Forks: 47

ekzhu/datasketch
MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble and HNSW
Language: Python - Size: 5.68 MB - Last synced at: 14 days ago - Pushed at: 11 months ago - Stars: 2,667 - Forks: 296

shaltielshmid/MinHashSharp
A Robust Library in C# for Similarity Estimation
Language: C# - Size: 39.1 KB - Last synced at: 9 days ago - Pushed at: over 1 year ago - Stars: 7 - Forks: 1

beowolx/rensa
High-performance MinHash implementation in Rust with Python bindings for efficient similarity estimation and deduplication of large datasets
Language: Python - Size: 112 KB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 80 - Forks: 9

RagnarGrootKoerkamp/simd-sketch
Compute bottom-s sketches and s-buckets sketches, using simd-minimizers crate.
Language: Rust - Size: 60.5 KB - Last synced at: 9 days ago - Pushed at: about 1 month ago - Stars: 16 - Forks: 1

serega/gaoya
Locality Sensitive Hashing
Language: Rust - Size: 236 KB - Last synced at: 15 days ago - Pushed at: almost 2 years ago - Stars: 72 - Forks: 7

dnbaker/sketch
C++ Implementations of sketch data structures with SIMD Parallelism, including Python bindings
Language: C++ - Size: 4.43 MB - Last synced at: 16 days ago - Pushed at: 9 months ago - Stars: 152 - Forks: 13

h4sh5/bcddb
cross-architecture binary comparison database
Language: Python - Size: 255 KB - Last synced at: 11 days ago - Pushed at: 6 months ago - Stars: 8 - Forks: 2

justinbt1/Akin
Python library for detecting near duplicate texts in a corpus at scale.
Language: Python - Size: 2.77 MB - Last synced at: 6 days ago - Pushed at: about 2 months ago - Stars: 8 - Forks: 0

src-d/minhashcuda
Weighted MinHash implementation on CUDA (multi-gpu).
Language: C++ - Size: 89.8 KB - Last synced at: 4 days ago - Pushed at: over 1 year ago - Stars: 117 - Forks: 24

bigmlcom/sketchy
Sketching Algorithms for Clojure (bloom filter, min-hash, hyper-loglog, count-min sketch)
Language: Clojure - Size: 147 KB - Last synced at: 9 days ago - Pushed at: almost 2 years ago - Stars: 148 - Forks: 18

sourmash-bio/wort
A database for signatures of public genomic sources
Language: Python - Size: 526 KB - Last synced at: 4 days ago - Pushed at: about 2 months ago - Stars: 18 - Forks: 2

dselivanov/LSHR
Locality Sensitive Hashing In R
Language: R - Size: 98.6 KB - Last synced at: 9 days ago - Pushed at: over 6 years ago - Stars: 40 - Forks: 13

LiveRamp/HyperMinHash-java
Union, intersection, and set cardinality in loglog space
Language: Java - Size: 572 KB - Last synced at: 27 days ago - Pushed at: almost 2 years ago - Stars: 56 - Forks: 10

guenthermi/fast_minh
Python package for fast MinHash calculation and operations
Language: C++ - Size: 19.5 KB - Last synced at: 27 days ago - Pushed at: 9 months ago - Stars: 1 - Forks: 0

markusorsi/mapchiral
Chiral version of the MinHashed Atom-Pair Fingerprint
Language: Python - Size: 323 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 19 - Forks: 5

lgautier/mashing-pumpkins
Minhash and maxhash library in Python, combining flexibility, expressivity, and performance.
Language: C - Size: 1.4 MB - Last synced at: 4 days ago - Pushed at: 4 months ago - Stars: 21 - Forks: 3

duhaime/minhash
Quickly estimate the similarity between many sets
Language: JavaScript - Size: 1010 KB - Last synced at: 11 days ago - Pushed at: over 2 years ago - Stars: 51 - Forks: 11

edawson/rkmh
Classify sequencing reads using MinHash.
Language: C++ - Size: 33.3 MB - Last synced at: 20 days ago - Pushed at: about 5 years ago - Stars: 48 - Forks: 4

shreyansh26/MinHash-Implemenation
A simple MinHash implementation based on the explanation in the Mining of Massive Datasets course by Stanford
Language: Python - Size: 7.4 MB - Last synced at: about 2 months ago - Pushed at: over 2 years ago - Stars: 2 - Forks: 0

oertl/treeminhash
TreeMinHash: Fast Sketching for Weighted Jaccard Similarity Estimation
Language: C++ - Size: 2.62 MB - Last synced at: 10 days ago - Pushed at: about 2 years ago - Stars: 14 - Forks: 3

oertl/probminhash
ProbMinHash – A Class of Locality-Sensitive Hash Algorithms for the (Probability) Jaccard Similarity
Language: C++ - Size: 6.26 MB - Last synced at: 10 days ago - Pushed at: over 4 years ago - Stars: 42 - Forks: 6

YaleDHLab/intertext 📦
Detect and visualize text reuse
Language: Python - Size: 3.11 MB - Last synced at: 5 months ago - Pushed at: 8 months ago - Stars: 115 - Forks: 10

davidsvy/Neural-Scam-Artist
Web Scraping, Document Deduplication & GPT-2 Fine-tuning with a newly created scam dataset.
Language: Python - Size: 188 KB - Last synced at: 5 months ago - Pushed at: over 3 years ago - Stars: 24 - Forks: 1

imzoc/fast-similarity-methods-rust 📦
Rust implementation of alignment-free similarity estimation methods.
Language: Rust - Size: 3.57 MB - Last synced at: 8 months ago - Pushed at: 8 months ago - Stars: 0 - Forks: 0

Forthoney/doc_sim
Approximate document similarity with Minhash + Locality Sensitive Hashing
Language: Ruby - Size: 48.8 KB - Last synced at: 11 days ago - Pushed at: over 1 year ago - Stars: 1 - Forks: 2

esteinig/sketchy
Genomic neighbor typing of bacterial pathogens using MinHash 🐀
Language: Rust - Size: 20 MB - Last synced at: 5 days ago - Pushed at: over 2 years ago - Stars: 44 - Forks: 3

BlaCkinkGJ/catch-me-if-you-can
plagiarism detector
Language: Python - Size: 62.5 KB - Last synced at: 5 months ago - Pushed at: about 4 years ago - Stars: 23 - Forks: 3

sskender/analysis-of-massive-datasets
Analysis of Massive Datasets FER labs
Language: Python - Size: 19 MB - Last synced at: about 1 month ago - Pushed at: almost 3 years ago - Stars: 2 - Forks: 0

codelibs/elasticsearch-minhash
Elasticsearch plugin for the b-bit MinHash algorithm
Language: Java - Size: 250 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 60 - Forks: 14

andrewmcloud/consimilo
A Clojure library for querying large data-sets on similarity
Language: Clojure - Size: 536 KB - Last synced at: 9 days ago - Pushed at: about 6 years ago - Stars: 63 - Forks: 4

paul-sud/bigbed-jaccard
A tool to approximate the Jaccard similarity of bigBed files from functional genomic datasets
Language: Jupyter Notebook - Size: 777 KB - Last synced at: 11 months ago - Pushed at: almost 5 years ago - Stars: 1 - Forks: 0

ppw0/minhash
find similar text files quickly
Language: Python - Size: 53.7 KB - Last synced at: 5 months ago - Pushed at: almost 4 years ago - Stars: 6 - Forks: 1

edawson/mkmh
Generate kmers/minimizers/hashes/MinHash signatures, including with multiple kmer sizes.
Language: C++ - Size: 204 KB - Last synced at: 20 days ago - Pushed at: over 4 years ago - Stars: 24 - Forks: 2

vokter/vokter-scheduler
(WIP)
Size: 0 Bytes - Last synced at: 12 months ago - Pushed at: about 8 years ago - Stars: 0 - Forks: 0

vokter/vokter-client-java
Sample Jetty/Jersey2 server that interoperates with a running Vokter server (https://github.com/vokter/vokter).
Language: Java - Size: 7.81 KB - Last synced at: 12 months ago - Pushed at: almost 9 years ago - Stars: 0 - Forks: 0

vokter/vokter-server
(WIP) HTTP server that deploys and distributes Vokter (https://github.com/vokter/vokter) through a REST API.
Size: 3.91 KB - Last synced at: 12 months ago - Pushed at: about 8 years ago - Stars: 0 - Forks: 0

hugofpaiva/mpei-p1 📦
Practical assignment for the Probabilistic Methods for Informatics Engineering course, UA 2019/2020
Language: Java - Size: 39.9 MB - Last synced at: 12 months ago - Pushed at: about 4 years ago - Stars: 0 - Forks: 1

sdevalapurkar/similar-questions
👯 Algorithms using Jaccard similarity to identify questions from a list that are similar to one another
Language: Python - Size: 13.6 MB - Last synced at: about 1 year ago - Pushed at: about 6 years ago - Stars: 0 - Forks: 0

dynatrace-research/set-sketch-paper
SetSketch: Filling the Gap between MinHash and HyperLogLog
Language: C++ - Size: 23.7 MB - Last synced at: 12 months ago - Pushed at: over 3 years ago - Stars: 46 - Forks: 5

cristianovagos/bloomfilter
Bloom Filter and MinHash techniques built in MatLab
Language: Matlab - Size: 355 KB - Last synced at: about 1 year ago - Pushed at: about 8 years ago - Stars: 0 - Forks: 3

macmanes-lab/MCBS913_2019
This is the repo for the Spring 2019 version of MCBS913
Language: Python - Size: 1.89 MB - Last synced at: about 1 year ago - Pushed at: almost 6 years ago - Stars: 1 - Forks: 1

ekzhu/minhash-lsh
Minhash LSH in Golang
Language: Go - Size: 22.5 KB - Last synced at: 10 days ago - Pushed at: over 5 years ago - Stars: 25 - Forks: 15

pNre/Sketching
Collection of sketching algorithms in Swift
Language: Swift - Size: 52.7 KB - Last synced at: 4 days ago - Pushed at: over 4 years ago - Stars: 3 - Forks: 2

oertl/bagminhash
BagMinHash - Minwise Hashing Algorithm for Weighted Sets
Language: C++ - Size: 1.02 MB - Last synced at: 10 days ago - Pushed at: over 4 years ago - Stars: 26 - Forks: 6

XAH30/LSH-vs-Finesse
An implementation of the LSH (Locality-Sensitive Hashing) and Finesse algorithms, designed to find similar data based on their hashes
Language: C++ - Size: 5.72 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 1 - Forks: 0

mbrg/py-hyperminhash
HyperLogLog with intersection
Language: Python - Size: 70.3 KB - Last synced at: 12 months ago - Pushed at: about 4 years ago - Stars: 4 - Forks: 0

fturati/floc-minhash-attacks
Implementation for the attacks of the paper "Locality-Sensitive Hashing Does Not Guarantee Privacy! Attacks on Google's FLoC and the MinHash Hierarchy System"
Language: Python - Size: 17 MB - Last synced at: over 1 year ago - Pushed at: almost 2 years ago - Stars: 0 - Forks: 0

PauloMaced0/restaurant-recommender 📦
Development of an interactive system for restaurant recommendation, utilizing filtering algorithms like MinHash and Bloom Filter for analysis and personalized suggestions based on user evaluations.
Language: MATLAB - Size: 2.87 MB - Last synced at: 12 months ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

AmbarChatterjee/ADM_HW4_Group3
This repository contains code and analysis for a homework assignment on recommendation systems and clustering algorithms in Python. Implements techniques like minhash, LSH, feature engineering, dimensionality reduction, K-means and DBSCAN clustering.
Language: Jupyter Notebook - Size: 48.1 MB - Last synced at: about 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

davidefiocco/dockerized-elasticsearch-duplicate-finder
Attempt to use MinHash to find duplicates in an Elasticsearch index
Language: Python - Size: 11.7 KB - Last synced at: 18 days ago - Pushed at: 12 months ago - Stars: 2 - Forks: 0

nekcht/minhash-lsh-evaluation
Assessing MinHash LSH for text similarity. Compares with kNN using BART embeddings as ground truth. Involves data preprocessing, shingle creation, LSH experiments. Findings inform LSH's efficiency in document similarity tasks, enhancing understanding of LSH techniques.
Language: Jupyter Notebook - Size: 367 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

EdDuarte/similarity-search-java
Easy-to-use Java similarity algorithms for text and numeric-series
Language: Java - Size: 149 KB - Last synced at: over 1 year ago - Pushed at: about 5 years ago - Stars: 18 - Forks: 10

will-rowe/groot
A resistome profiler for Graphing Resistance Out Of meTagenomes
Language: Go - Size: 12.5 MB - Last synced at: over 1 year ago - Pushed at: almost 5 years ago - Stars: 61 - Forks: 6

mattilyra/LSH
Locality Sensitive Hashing using MinHash in Python/Cython to detect near duplicate text documents
Language: Python - Size: 513 KB - Last synced at: over 1 year ago - Pushed at: almost 2 years ago - Stars: 259 - Forks: 72

nepiskopos/duplicate-questions-detection-lsh
Knowledge extraction through Data Analysis, including Locality Sensitive Hashing (LSH).
Language: Jupyter Notebook - Size: 423 KB - Last synced at: over 1 year ago - Pushed at: almost 3 years ago - Stars: 0 - Forks: 0

gurushida/mnemophonix
A simple audio fingerprinting system
Language: C - Size: 316 KB - Last synced at: over 1 year ago - Pushed at: over 2 years ago - Stars: 25 - Forks: 4

AIn0n/FMHD
Fast MinHash Distances algorithms collection
Language: C++ - Size: 288 KB - Last synced at: about 1 year ago - Pushed at: over 1 year ago - Stars: 4 - Forks: 1

HuangQiang/k-FreqItems
Massive Sparse Data Clustering Based on Frequent Items (SIGMOD 2023)
Language: Cuda - Size: 25.5 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 5 - Forks: 1

micts/jss
Fast Jaccard similarity search for abstract sets (documents, products, users, etc.) using MinHashing and Locality Sensitive Hashing
Language: Python - Size: 23.4 KB - Last synced at: over 1 year ago - Pushed at: almost 5 years ago - Stars: 3 - Forks: 0

rigvedpatki/data-mining-assignment-1
Finding Similar Items: Textually Similar Documents
Language: TypeScript - Size: 267 KB - Last synced at: over 1 year ago - Pushed at: over 6 years ago - Stars: 0 - Forks: 0

92amartins/minhash-example
MinHash Example
Language: Jupyter Notebook - Size: 2.93 KB - Last synced at: over 1 year ago - Pushed at: about 7 years ago - Stars: 1 - Forks: 0

CaroseKYS/minhash-test-java
Test program for minhash
Language: Java - Size: 5.86 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

holopoj/FHCP
Implementation of the paper "Finding Highly Correlated Pairs with Powerful Pruning" in Java.
Language: Java - Size: 1.56 MB - Last synced at: over 1 year ago - Pushed at: about 8 years ago - Stars: 0 - Forks: 1

mariofv/DocSim
MinHash text analyzer developed for the Algorithmics course.
Language: C++ - Size: 43.1 MB - Last synced at: over 1 year ago - Pushed at: over 7 years ago - Stars: 0 - Forks: 1

lstasiak/Big-Data-Algorithms-exercises
Set of tasks solved in Big Data Algorithms course
Language: Scala - Size: 3.06 MB - Last synced at: over 1 year ago - Pushed at: over 3 years ago - Stars: 0 - Forks: 0

Cheng-Lin-Li/Spark
Python 2.7 code and learning notes for Spark 2.1.1
Language: Python - Size: 2.62 MB - Last synced at: 14 days ago - Pushed at: over 6 years ago - Stars: 24 - Forks: 6

wherefortravel/minhash-node-rs
MinHash and LSH index written in Rust for Node.js
Language: Rust - Size: 207 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 12 - Forks: 1

privacy-lsh/floc-minhash
Implementation for the attacks of the paper "Locality-Sensitive Hashing Does Not Guarantee Privacy! Attacks on Google's FLoC and the MinHash Hierarchy System".
Language: Python - Size: 16.9 MB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 0 - Forks: 0

kmuinfosec/MUSEUM
Scalable and Multifaceted Search and Its Application for Malware Binary Files
Language: Python - Size: 461 KB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 1 - Forks: 0

Mamiglia/ADM-LT_HW1
Homework 1 for the course Advanced Data Mining and Language Technology
Language: Jupyter Notebook - Size: 2.8 MB - Last synced at: about 1 month ago - Pushed at: about 2 years ago - Stars: 0 - Forks: 0

luizirber/2017-recomb
Poster presented at RECOMB 2017
Size: 2.66 MB - Last synced at: 4 days ago - Pushed at: almost 7 years ago - Stars: 7 - Forks: 3

ryputtam/Locality-Sensitive-Hashing-Plagiarism-Detection
Implementation of Locality Sensitive Hashing to detect plagiarism
Language: Jupyter Notebook - Size: 170 KB - Last synced at: 10 months ago - Pushed at: about 4 years ago - Stars: 2 - Forks: 1

steven-s/text-shingles
k-shingling for text to help compare similarity
Language: Python - Size: 11.7 KB - Last synced at: about 2 years ago - Pushed at: over 5 years ago - Stars: 15 - Forks: 11

luizirber/sourmash-rust 📦
Rust implementation of sourmash core functionality
Language: Rust - Size: 4.08 MB - Last synced at: 4 days ago - Pushed at: over 6 years ago - Stars: 9 - Forks: 0

vascoalramos/mpei 📦
Probability Methods for Informatics Engineering | UA 2018/2019
Language: Java - Size: 39.8 MB - Last synced at: about 2 years ago - Pushed at: about 5 years ago - Stars: 0 - Forks: 0

eduardosantoshf/search-repeated-news 📦
MPEI Project - Search repeated news (using Bloom filter) and similar news (using MinHash) from a news API.
Language: Java - Size: 32.3 MB - Last synced at: about 9 hours ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 1

zahraDehghanian97/MinHashing_Spark
Implements cosine similarity and MinHashing (with AND/OR operators on bands) to find roads similar to a specific road in a real traffic dataset, using PySpark.
Language: Jupyter Notebook - Size: 65.4 KB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 1 - Forks: 0

santurini/MinHash-LSH-From-Scratch
Implementing a simplified copy of Shazam application from scratch using MinHashing and LSH.
Language: Python - Size: 210 KB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 1

KarimLulu/locality-sensitive-hashing-knn
Approximate k-Nearest Neighbours in high-dimensional space via Locality Sensitive Hashing
Language: Jupyter Notebook - Size: 265 KB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

gibranfp/Sampled-MinHashing
A method to mine beyond-pairwise relationships using Min-Hashing for large-scale pattern discovery
Language: C - Size: 522 KB - Last synced at: over 1 year ago - Pushed at: over 3 years ago - Stars: 25 - Forks: 8

oma219/pacsketch
Network Anomaly Detection Using Probabilistic Data Structures
Language: C++ - Size: 12.8 MB - Last synced at: about 2 years ago - Pushed at: over 3 years ago - Stars: 2 - Forks: 0

travisbrady/flajolet
Probabilistic data structures for OCaml
Language: OCaml - Size: 220 KB - Last synced at: about 2 years ago - Pushed at: about 7 years ago - Stars: 38 - Forks: 3

TSunny007/Document-Similarity
Using Jaccard-Similarity and Minhashing to determine similarity between two text documents
Language: Jupyter Notebook - Size: 26.4 KB - Last synced at: about 2 years ago - Pushed at: about 7 years ago - Stars: 6 - Forks: 3

W3ndige/karton-similarity
Aurora karton for similarity matching.
Language: Python - Size: 20.5 KB - Last synced at: about 2 years ago - Pushed at: almost 4 years ago - Stars: 2 - Forks: 0

haradama/pHash
Software to identify known plasmid sequence from metagenomic assembly using Minhash
Language: Go - Size: 15.9 MB - Last synced at: about 2 months ago - Pushed at: about 6 years ago - Stars: 3 - Forks: 0

steven-s/minhash-document-clusters
Minhash clustering of text documents
Language: Scala - Size: 33.2 KB - Last synced at: about 2 years ago - Pushed at: over 7 years ago - Stars: 4 - Forks: 1

vokter/vokter
Document store that periodically checks for changes in web documents
Language: Java - Size: 120 MB - Last synced at: 12 months ago - Pushed at: over 2 years ago - Stars: 6 - Forks: 2

davidsbatista/MuSICo
A Minwise Hashing Method for Addressing Relationship Extraction from Text
Language: Java - Size: 37.4 MB - Last synced at: about 1 year ago - Pushed at: about 8 years ago - Stars: 5 - Forks: 2

fengxu1996/similarity_find
Compute the similarity between multiple texts
Language: C++ - Size: 43.9 KB - Last synced at: over 1 year ago - Pushed at: over 6 years ago - Stars: 1 - Forks: 0

W3ndige/karton-minhash
Aurora karton for calculating minhash from input dataset.
Language: Python - Size: 16.6 KB - Last synced at: about 2 years ago - Pushed at: about 4 years ago - Stars: 0 - Forks: 0

pedrovt/mpei
Probability Methods for Informatics Engineering (University of Aveiro)
Language: Matlab - Size: 4.96 MB - Last synced at: about 2 years ago - Pushed at: over 7 years ago - Stars: 1 - Forks: 0

FilipeLopesPires/SpellChecker
SpellChecker: an application to check for spelling errors.
Language: Java - Size: 3.54 MB - Last synced at: 3 months ago - Pushed at: about 4 years ago - Stars: 0 - Forks: 1

ahmetcandiroglu/reddit-analyzed
Finding similar subreddits using MinHash and SimRank algorithms
Language: Python - Size: 1.02 MB - Last synced at: about 2 years ago - Pushed at: over 6 years ago - Stars: 0 - Forks: 0

haradama/PlasmidPicker
Software to identify plasmid sequence data from metagenome using logistic regression and Minhash
Language: Python - Size: 133 MB - Last synced at: 10 days ago - Pushed at: over 6 years ago - Stars: 6 - Forks: 2

bufistov/plagiat-detector
Language: Python - Size: 10.7 KB - Last synced at: 5 months ago - Pushed at: almost 5 years ago - Stars: 0 - Forks: 0

hscspring/sto
MinHash and LSH Based Store and Query.
Language: Python - Size: 9.77 KB - Last synced at: 2 months ago - Pushed at: almost 3 years ago - Stars: 0 - Forks: 0
