Ecosyste.ms: Repos
An open API service providing repository metadata for many open source software ecosystems.
GitHub topics: minhash
Callidon/bloom-filters
JS implementation of probabilistic data structures: Bloom Filter (and its derivatives), HyperLogLog, Count-Min Sketch, Top-K and MinHash
Language: TypeScript - Size: 7.93 MB - Last synced: 1 day ago - Pushed: 1 day ago - Stars: 348 - Forks: 38
sourmash-bio/wort
A database for signatures of public genomic sources
Language: Python - Size: 528 KB - Last synced: 6 days ago - Pushed: 5 months ago - Stars: 17 - Forks: 2
dnbaker/sketch
C++ Implementations of sketch data structures with SIMD Parallelism, including Python bindings
Language: C++ - Size: 4.43 MB - Last synced: 3 days ago - Pushed: 2 months ago - Stars: 149 - Forks: 14
ekzhu/datasketch
MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble and HNSW
Language: Python - Size: 5.58 MB - Last synced: 8 days ago - Pushed: about 2 months ago - Stars: 2,360 - Forks: 288
h4sh5/bcddb
cross-architecture binary comparison database
Language: Python - Size: 252 KB - Last synced: 17 days ago - Pushed: 17 days ago - Stars: 7 - Forks: 2
guenthermi/fast_minh
Python package for fast MinHash calculation and operations
Language: C++ - Size: 16.6 KB - Last synced: 25 days ago - Pushed: 25 days ago - Stars: 0 - Forks: 0
vokter/vokter-scheduler
(WIP)
Size: 0 Bytes - Last synced: 29 days ago - Pushed: over 7 years ago - Stars: 0 - Forks: 0
vokter/vokter-client-java
Sample Jetty/Jersey2 server that interoperates with a running Vokter server (https://github.com/vokter/vokter).
Language: Java - Size: 7.81 KB - Last synced: 29 days ago - Pushed: almost 8 years ago - Stars: 0 - Forks: 0
vokter/vokter-server
(WIP) HTTP server that deploys and distributes Vokter (https://github.com/vokter/vokter) through a REST API.
Size: 3.91 KB - Last synced: 29 days ago - Pushed: over 7 years ago - Stars: 0 - Forks: 0
hugofpaiva/mpei-p1 📦
Practical assignment for the course Probabilistic Methods for Informatics Engineering, UA 2019/2020
Language: Java - Size: 39.9 MB - Last synced: 29 days ago - Pushed: about 3 years ago - Stars: 0 - Forks: 1
LiveRamp/HyperMinHash-java
Union, intersection, and set cardinality in loglog space
Language: Java - Size: 572 KB - Last synced: 20 days ago - Pushed: 11 months ago - Stars: 51 - Forks: 13
dynatrace-oss/hash4j
Dynatrace hash library for Java
Language: Java - Size: 40.4 MB - Last synced: about 1 month ago - Pushed: about 2 months ago - Stars: 73 - Forks: 9
davidsvy/Neural-Scam-Artist
Web Scraping, Document Deduplication & GPT-2 Fine-tuning with a newly created scam dataset.
Language: Python - Size: 188 KB - Last synced: 8 days ago - Pushed: over 2 years ago - Stars: 23 - Forks: 1
sourmash-bio/sourmash
Quickly search, compare, and analyze genomic and metagenomic data sets.
Language: Python - Size: 35 MB - Last synced: about 1 month ago - Pushed: about 1 month ago - Stars: 431 - Forks: 77
luizirber/sourmash-rust 📦
Rust implementation of sourmash core functionality
Language: Rust - Size: 4.08 MB - Last synced: about 1 month ago - Pushed: over 5 years ago - Stars: 9 - Forks: 0
sdevalapurkar/similar-questions
👯 Algorithms using Jaccard similarity to identify questions from a list that are similar to one another
Language: Python - Size: 13.6 MB - Last synced: about 1 month ago - Pushed: over 5 years ago - Stars: 0 - Forks: 0
serega/gaoya
Locality Sensitive Hashing
Language: Rust - Size: 236 KB - Last synced: 26 days ago - Pushed: 11 months ago - Stars: 48 - Forks: 4
dynatrace-research/set-sketch-paper
SetSketch: Filling the Gap between MinHash and HyperLogLog
Language: C++ - Size: 23.7 MB - Last synced: 29 days ago - Pushed: almost 3 years ago - Stars: 46 - Forks: 5
duhaime/minhash
Quickly estimate the similarity between many sets
Language: JavaScript - Size: 1010 KB - Last synced: about 1 month ago - Pushed: over 1 year ago - Stars: 47 - Forks: 11
codelibs/elasticsearch-minhash
Elasticsearch plugin for the b-bit MinHash algorithm
Language: Java - Size: 250 KB - Last synced: about 1 month ago - Pushed: about 1 month ago - Stars: 60 - Forks: 14
src-d/minhashcuda
Weighted MinHash implementation on CUDA (multi-gpu).
Language: C++ - Size: 89.8 KB - Last synced: about 1 month ago - Pushed: 6 months ago - Stars: 110 - Forks: 23
cristianovagos/bloomfilter
Bloom Filter and MinHash techniques built in MATLAB
Language: Matlab - Size: 355 KB - Last synced: 2 months ago - Pushed: about 7 years ago - Stars: 0 - Forks: 3
macmanes-lab/MCBS913_2019
This is the repo for the Spring 2019 version of MCBS913
Language: Python - Size: 1.89 MB - Last synced: 2 months ago - Pushed: about 5 years ago - Stars: 1 - Forks: 1
ekzhu/minhash-lsh
Minhash LSH in Golang
Language: Go - Size: 22.5 KB - Last synced: 25 days ago - Pushed: over 4 years ago - Stars: 25 - Forks: 13
YaleDHLab/intertext
Detect and visualize text reuse
Language: Python - Size: 3.1 MB - Last synced: about 1 month ago - Pushed: about 1 year ago - Stars: 110 - Forks: 11
rushyaP/Locality-Sensitive-Hashing-Plagiarism-Detection
Implementation of Locality Sensitive Hashing to detect plagiarism
Language: Jupyter Notebook - Size: 170 KB - Last synced: 3 months ago - Pushed: about 3 years ago - Stars: 2 - Forks: 1
esteinig/sketchy
Genomic neighbor typing of bacterial pathogens using MinHash :rat:
Language: Rust - Size: 20 MB - Last synced: 17 days ago - Pushed: over 1 year ago - Stars: 43 - Forks: 3
codelibs/minhash
This provides tools for the b-bit MinHash algorithm.
Language: Java - Size: 46.9 KB - Last synced: about 1 month ago - Pushed: 5 months ago - Stars: 34 - Forks: 10
justinbt1/Akin
Python library for detecting near duplicate texts in a corpus at scale using Locality Sensitive Hashing, as described in chapter three of Mining Massive Datasets.
Language: Python - Size: 154 KB - Last synced: 3 months ago - Pushed: 3 months ago - Stars: 5 - Forks: 0
pNre/Sketching
Collection of sketching algorithms in Swift
Language: Swift - Size: 52.7 KB - Last synced: 11 days ago - Pushed: over 3 years ago - Stars: 3 - Forks: 2
XAH30/LSH-vs-Finesse
In this repository you can find an implementation of the LSH (Locality-Sensitive Hashing) and Finesse algorithms, designed to find similar data based on their hashes
Language: C++ - Size: 5.72 MB - Last synced: 4 months ago - Pushed: 4 months ago - Stars: 1 - Forks: 0
shaltielshmid/MinHashSharp
A Robust Library in C# for Similarity Estimation
Language: C# - Size: 39.1 KB - Last synced: 27 days ago - Pushed: 6 months ago - Stars: 1 - Forks: 0
mbrg/py-hyperminhash
HyperLogLog with intersection
Language: Python - Size: 70.3 KB - Last synced: 14 days ago - Pushed: about 3 years ago - Stars: 4 - Forks: 0
fturati/floc-minhash-attacks
Implementation for the attacks of the paper "Locality-Sensitive Hashing Does Not Guarantee Privacy! Attacks on Google's FLoC and the MinHash Hierarchy System"
Language: Python - Size: 17 MB - Last synced: 5 months ago - Pushed: 11 months ago - Stars: 0 - Forks: 0
lgautier/mashing-pumpkins
Minhash and maxhash library in Python, combining flexibility, expressivity, and performance.
Language: C - Size: 1.37 MB - Last synced: 21 days ago - Pushed: 5 months ago - Stars: 19 - Forks: 3
PauloMaced0/restaurant-recommender 📦
Development of an interactive system for restaurant recommendation, utilizing filtering algorithms like MinHash and Bloom Filter for analysis and personalized suggestions based on user evaluations.
Language: MATLAB - Size: 2.87 MB - Last synced: 26 days ago - Pushed: 5 months ago - Stars: 0 - Forks: 0
AmbarChatterjee/ADM_HW4_Group3
This repository contains code and analysis for a homework assignment on recommendation systems and clustering algorithms in Python. Implements techniques like minhash, LSH, feature engineering, dimensionality reduction, K-means and DBSCAN clustering.
Language: Jupyter Notebook - Size: 48.1 MB - Last synced: 4 months ago - Pushed: 5 months ago - Stars: 0 - Forks: 0
davidefiocco/dockerized-elasticsearch-duplicate-finder
Attempt to use MinHash to find duplicates in an Elasticsearch index
Language: Python - Size: 11.7 KB - Last synced: 20 days ago - Pushed: 20 days ago - Stars: 2 - Forks: 0
nekcht/minhash-lsh-evaluation
Assessing MinHash LSH for text similarity. Compares with kNN using BART embeddings as ground truth. Involves data preprocessing, shingle creation, LSH experiments. Findings inform LSH's efficiency in document similarity tasks, enhancing understanding of LSH techniques.
Language: Jupyter Notebook - Size: 367 KB - Last synced: 6 months ago - Pushed: 6 months ago - Stars: 0 - Forks: 0
BlaCkinkGJ/catch-me-if-you-can
plagiarism detector
Language: Python - Size: 62.5 KB - Last synced: about 2 months ago - Pushed: about 3 years ago - Stars: 22 - Forks: 3
EdDuarte/similarity-search-java
Easy-to-use Java similarity algorithms for text and numeric-series
Language: Java - Size: 149 KB - Last synced: 7 months ago - Pushed: over 4 years ago - Stars: 18 - Forks: 10
will-rowe/groot
A resistome profiler for Graphing Resistance Out Of meTagenomes
Language: Go - Size: 12.5 MB - Last synced: 7 months ago - Pushed: about 4 years ago - Stars: 61 - Forks: 6
oertl/bagminhash
BagMinHash - Minwise Hashing Algorithm for Weighted Sets
Language: C++ - Size: 1.02 MB - Last synced: 7 months ago - Pushed: over 3 years ago - Stars: 25 - Forks: 6
npredey/GeneNetworks
Language: Python - Size: 60.5 KB - Last synced: 7 months ago - Pushed: about 6 years ago - Stars: 0 - Forks: 0
mattilyra/LSH
Locality Sensitive Hashing using MinHash in Python/Cython to detect near duplicate text documents
Language: Python - Size: 513 KB - Last synced: 7 months ago - Pushed: 12 months ago - Stars: 259 - Forks: 72
haradama/PlasmidPicker
Software to identify plasmid sequence data from metagenome using logistic regression and Minhash
Language: Python - Size: 133 MB - Last synced: 7 months ago - Pushed: over 5 years ago - Stars: 6 - Forks: 2
edawson/rkmh
Classify sequencing reads using MinHash.
Language: C++ - Size: 33.3 MB - Last synced: 7 months ago - Pushed: about 4 years ago - Stars: 43 - Forks: 4
nepiskopos/duplicate-questions-detection-lsh
Knowledge extraction through Data Analysis, including Locality Sensitive Hashing (LSH).
Language: Jupyter Notebook - Size: 423 KB - Last synced: 7 months ago - Pushed: almost 2 years ago - Stars: 0 - Forks: 0
gurushida/mnemophonix
A simple audio fingerprinting system
Language: C - Size: 316 KB - Last synced: 8 months ago - Pushed: over 1 year ago - Stars: 25 - Forks: 4
AIn0n/FMHD
Fast MinHash Distances algorithms collection
Language: C++ - Size: 288 KB - Last synced: 4 months ago - Pushed: 8 months ago - Stars: 4 - Forks: 1
ppw0/minhash
find similar text files quickly
Language: Python - Size: 53.7 KB - Last synced: 12 days ago - Pushed: about 3 years ago - Stars: 6 - Forks: 2
HuangQiang/k-FreqItems
Massive Sparse Data Clustering Based on Frequent Items (SIGMOD 2023)
Language: Cuda - Size: 25.5 MB - Last synced: 8 months ago - Pushed: 8 months ago - Stars: 5 - Forks: 1
micts/jss
Fast Jaccard similarity search for abstract sets (documents, products, users, etc.) using MinHashing and Locality Sensitive Hashing
Language: Python - Size: 23.4 KB - Last synced: 9 months ago - Pushed: about 4 years ago - Stars: 3 - Forks: 0
sskender/analysis-of-massive-datasets
Analysis of Massive Datasets FER labs
Language: Python - Size: 19 MB - Last synced: 9 months ago - Pushed: almost 2 years ago - Stars: 1 - Forks: 0
Forthoney/doc_sim
Approximate document similarity with Minhash + Locality Sensitive Hashing
Language: Ruby - Size: 48.8 KB - Last synced: 14 days ago - Pushed: 8 months ago - Stars: 0 - Forks: 0
rigvedpatki/data-mining-assignment-1
Finding Similar Items: Textually Similar Documents
Language: TypeScript - Size: 267 KB - Last synced: 9 months ago - Pushed: over 5 years ago - Stars: 0 - Forks: 0
92amartins/minhash-example
MinHash Example
Language: Jupyter Notebook - Size: 2.93 KB - Last synced: 9 months ago - Pushed: about 6 years ago - Stars: 1 - Forks: 0
CaroseKYS/minhash-test-java
Test program for minhash
Language: Java - Size: 5.86 KB - Last synced: 10 months ago - Pushed: 10 months ago - Stars: 0 - Forks: 0
holopoj/FHCP
Implementation of the paper "Finding Highly Correlated Pairs with Powerful Pruning" in Java.
Language: Java - Size: 1.56 MB - Last synced: 10 months ago - Pushed: about 7 years ago - Stars: 0 - Forks: 1
anastasia/minhash
Language: Python - Size: 16.6 KB - Last synced: 10 months ago - Pushed: over 5 years ago - Stars: 2 - Forks: 1
mariofv/DocSim
Minhash text analyzer developed during Algorithmics subject.
Language: C++ - Size: 43.1 MB - Last synced: 10 months ago - Pushed: over 6 years ago - Stars: 0 - Forks: 1
dselivanov/LSHR
Locality Sensitive Hashing In R
Language: R - Size: 98.6 KB - Last synced: 10 months ago - Pushed: over 5 years ago - Stars: 39 - Forks: 13
lstasiak/Big-Data-Algorithms-exercises
Set of tasks solved in Big Data Algorithms course
Language: Scala - Size: 3.06 MB - Last synced: 10 months ago - Pushed: almost 3 years ago - Stars: 0 - Forks: 0
wherefortravel/minhash-node-rs
MinHash and LSH index written in Rust for Node.js
Language: Rust - Size: 207 KB - Last synced: 10 months ago - Pushed: 10 months ago - Stars: 12 - Forks: 1
bigmlcom/sketchy
Sketching Algorithms for Clojure (bloom filter, min-hash, hyper-loglog, count-min sketch)
Language: Clojure - Size: 147 KB - Last synced: about 1 month ago - Pushed: 12 months ago - Stars: 146 - Forks: 18
privacy-lsh/floc-minhash
Implementation for the attacks of the paper "Locality-Sensitive Hashing Does Not Guarantee Privacy! Attacks on Google's FLoC and the MinHash Hierarchy System".
Language: Python - Size: 16.9 MB - Last synced: 12 months ago - Pushed: 12 months ago - Stars: 0 - Forks: 0
kmuinfosec/MUSEUM
Scalable and Multifaceted Search and Its Application for Malware Binary Files
Language: Python - Size: 461 KB - Last synced: about 1 year ago - Pushed: about 1 year ago - Stars: 1 - Forks: 0
andrewmcloud/consimilo
A Clojure library for querying large data-sets on similarity
Language: Clojure - Size: 536 KB - Last synced: 13 days ago - Pushed: over 5 years ago - Stars: 62 - Forks: 4
Mamiglia/ADM-LT_HW1
Homework 1 for the course Advanced Data Mining and Language Technology
Language: Jupyter Notebook - Size: 2.8 MB - Last synced: about 1 month ago - Pushed: about 1 year ago - Stars: 0 - Forks: 0
shreyansh26/MinHash-Implemenation
A simple MinHash implementation based on the explanation in the Mining of Massive Datasets course by Stanford
Language: Python - Size: 7.4 MB - Last synced: about 1 month ago - Pushed: over 1 year ago - Stars: 1 - Forks: 0
luizirber/2017-recomb
Poster presented at RECOMB 2017
Size: 2.66 MB - Last synced: about 1 month ago - Pushed: almost 6 years ago - Stars: 7 - Forks: 3
oertl/probminhash
ProbMinHash – A Class of Locality-Sensitive Hash Algorithms for the (Probability) Jaccard Similarity
Language: C++ - Size: 6.26 MB - Last synced: about 1 year ago - Pushed: over 3 years ago - Stars: 33 - Forks: 3
edawson/mkmh
Generate kmers/minimizers/hashes/MinHash signatures, including with multiple kmer sizes.
Language: C++ - Size: 204 KB - Last synced: about 1 year ago - Pushed: over 3 years ago - Stars: 23 - Forks: 2
steven-s/text-shingles
k-shingling for text to help compare similarity
Language: Python - Size: 11.7 KB - Last synced: about 1 year ago - Pushed: over 4 years ago - Stars: 15 - Forks: 11
vascoalramos/mpei 📦
Probability Methods for Informatics Engineering | UA 2018/2019
Language: Java - Size: 39.8 MB - Last synced: about 1 year ago - Pushed: over 4 years ago - Stars: 0 - Forks: 0
eduardosantoshf/search-repeated-news 📦
MPEI Project - Search repeated news (using Bloom filter) and similar news (using MinHash) from a news API.
Language: Java - Size: 32.3 MB - Last synced: about 1 year ago - Pushed: almost 2 years ago - Stars: 0 - Forks: 1
zahraDehghanian97/MinHashing_Spark
In this repo, I implement cosine similarity and a MinHashing function (with AND/OR operators on bands) to find roads similar to a specific road in a real traffic dataset using PySpark.
Language: Jupyter Notebook - Size: 65.4 KB - Last synced: about 1 year ago - Pushed: over 1 year ago - Stars: 1 - Forks: 0
oertl/treeminhash
TreeMinHash: Fast Sketching for Weighted Jaccard Similarity Estimation
Language: C++ - Size: 2.62 MB - Last synced: about 1 year ago - Pushed: over 1 year ago - Stars: 12 - Forks: 3
santurini/MinHash-LSH-From-Scratch
Implementing a simplified copy of Shazam application from scratch using MinHashing and LSH.
Language: Python - Size: 210 KB - Last synced: about 1 year ago - Pushed: over 1 year ago - Stars: 0 - Forks: 1
KarimLulu/locality-sensitive-hashing-knn
Approximate k-Nearest Neighbours in high-dimensional space via Locality Sensitive Hashing
Language: Jupyter Notebook - Size: 265 KB - Last synced: about 1 year ago - Pushed: over 1 year ago - Stars: 0 - Forks: 0
gibranfp/Sampled-MinHashing
A method to mine beyond-pairwise relationships using Min-Hashing for large-scale pattern discovery
Language: C - Size: 522 KB - Last synced: 7 months ago - Pushed: over 2 years ago - Stars: 25 - Forks: 8
Cheng-Lin-Li/Spark
Python 2.7 code and learning notes for Spark 2.1.1
Language: Python - Size: 2.62 MB - Last synced: about 1 year ago - Pushed: almost 6 years ago - Stars: 23 - Forks: 6
oma219/pacsketch
Network Anomaly Detection Using Probabilistic Data Structures
Language: C++ - Size: 12.8 MB - Last synced: about 1 year ago - Pushed: over 2 years ago - Stars: 2 - Forks: 0
travisbrady/flajolet
Probabilistic data structures for OCaml
Language: OCaml - Size: 220 KB - Last synced: about 1 year ago - Pushed: over 6 years ago - Stars: 38 - Forks: 3
TSunny007/Document-Similarity
Using Jaccard-Similarity and Minhashing to determine similarity between two text documents
Language: Jupyter Notebook - Size: 26.4 KB - Last synced: about 1 year ago - Pushed: about 6 years ago - Stars: 6 - Forks: 3
W3ndige/karton-similarity
Aurora karton for similarity matching.
Language: Python - Size: 20.5 KB - Last synced: about 1 year ago - Pushed: almost 3 years ago - Stars: 2 - Forks: 0
haradama/pHash
Software to identify known plasmid sequence from metagenomic assembly using Minhash
Language: Go - Size: 15.9 MB - Last synced: about 1 year ago - Pushed: over 5 years ago - Stars: 3 - Forks: 0
steven-s/minhash-document-clusters
Minhash clustering of text documents
Language: Scala - Size: 33.2 KB - Last synced: about 1 year ago - Pushed: over 6 years ago - Stars: 4 - Forks: 1
vokter/vokter
Document store that periodically checks for changes in web documents
Language: Java - Size: 120 MB - Last synced: 28 days ago - Pushed: over 1 year ago - Stars: 6 - Forks: 2
davidsbatista/MuSICo
A Minwise Hashing Method for Addressing Relationship Extraction from Text
Language: Java - Size: 37.4 MB - Last synced: about 1 month ago - Pushed: about 7 years ago - Stars: 5 - Forks: 2
fengxu1996/similarity_find
Compute the similarity between multiple texts
Language: C++ - Size: 43.9 KB - Last synced: 5 months ago - Pushed: almost 6 years ago - Stars: 1 - Forks: 0
W3ndige/karton-minhash
Aurora karton for calculating minhash from input dataset.
Language: Python - Size: 16.6 KB - Last synced: about 1 year ago - Pushed: about 3 years ago - Stars: 0 - Forks: 0
pedrovt/mpei
Probability Methods for Informatics Engineering (University of Aveiro)
Language: Matlab - Size: 4.96 MB - Last synced: about 1 year ago - Pushed: over 6 years ago - Stars: 1 - Forks: 0
FilipePires98/SpellChecker
SpellChecker: an application to check for spelling errors.
Language: Java - Size: 3.54 MB - Last synced: about 1 year ago - Pushed: about 3 years ago - Stars: 0 - Forks: 1
ahmetcandiroglu/reddit-analyzed
Finding similar subreddits using MinHash and SimRank algorithms
Language: Python - Size: 1.02 MB - Last synced: about 1 year ago - Pushed: over 5 years ago - Stars: 0 - Forks: 0
hscspring/sto
MinHash and LSH Based Store and Query.
Language: Python - Size: 9.77 KB - Last synced: about 1 year ago - Pushed: almost 2 years ago - Stars: 0 - Forks: 0
zxmeng/SimilarityDetection
Similarity Detection on Wikipedia Articles using MinHash and Random Projection implemented in Hadoop/Spark
Language: Java - Size: 69.5 MB - Last synced: about 1 year ago - Pushed: almost 6 years ago - Stars: 1 - Forks: 1
tkukurin/Lab.Bioinformatics
University work. Approximate aligner for long DNA sequences. Estimates Jaccard similarity from k-mers via minimizers and MinHash, then uses it as a sequence identity proxy.
Language: Java - Size: 90.3 MB - Last synced: about 1 year ago - Pushed: over 6 years ago - Stars: 0 - Forks: 0
joaocps/mpei-bloomfilter
Probabilistic methods for computer engineering - Final Project
Language: Java - Size: 596 KB - Last synced: about 1 year ago - Pushed: over 4 years ago - Stars: 0 - Forks: 0
DearMadMan/minhash
An implementation of the minhash algorithm in golang
Language: Go - Size: 2.93 KB - Last synced: 9 months ago - Pushed: almost 5 years ago - Stars: 2 - Forks: 0