Ecosyste.ms: Repos

An open API service providing repository metadata for many open source software ecosystems.

GitHub topics: minhash

Callidon/bloom-filters

JS implementation of probabilistic data structures: Bloom Filter (and its derivatives), HyperLogLog, Count-Min Sketch, Top-K and MinHash

Language: TypeScript - Size: 7.93 MB - Last synced: 1 day ago - Pushed: 1 day ago - Stars: 348 - Forks: 38

sourmash-bio/wort

A database for signatures of public genomic sources

Language: Python - Size: 528 KB - Last synced: 6 days ago - Pushed: 5 months ago - Stars: 17 - Forks: 2

dnbaker/sketch

C++ Implementations of sketch data structures with SIMD Parallelism, including Python bindings

Language: C++ - Size: 4.43 MB - Last synced: 3 days ago - Pushed: 2 months ago - Stars: 149 - Forks: 14

ekzhu/datasketch

MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble and HNSW

Language: Python - Size: 5.58 MB - Last synced: 8 days ago - Pushed: about 2 months ago - Stars: 2,360 - Forks: 288

h4sh5/bcddb

cross-architecture binary comparison database

Language: Python - Size: 252 KB - Last synced: 17 days ago - Pushed: 17 days ago - Stars: 7 - Forks: 2

guenthermi/fast_minh

Python package for fast MinHash calculation and operations

Language: C++ - Size: 16.6 KB - Last synced: 25 days ago - Pushed: 25 days ago - Stars: 0 - Forks: 0

vokter/vokter-scheduler

(WIP)

Size: 0 Bytes - Last synced: 29 days ago - Pushed: over 7 years ago - Stars: 0 - Forks: 0

vokter/vokter-client-java

Sample Jetty/Jersey2 server that interoperates with a running Vokter server (https://github.com/vokter/vokter).

Language: Java - Size: 7.81 KB - Last synced: 29 days ago - Pushed: almost 8 years ago - Stars: 0 - Forks: 0

vokter/vokter-server

(WIP) HTTP server that deploys and distributes Vokter (https://github.com/vokter/vokter) through a REST API.

Size: 3.91 KB - Last synced: 29 days ago - Pushed: over 7 years ago - Stars: 0 - Forks: 0

hugofpaiva/mpei-p1 📦

Practical assignment for the Probabilistic Methods for Informatics Engineering course, UA 2019/2020

Language: Java - Size: 39.9 MB - Last synced: 29 days ago - Pushed: about 3 years ago - Stars: 0 - Forks: 1

LiveRamp/HyperMinHash-java

Union, intersection, and set cardinality in loglog space

Language: Java - Size: 572 KB - Last synced: 20 days ago - Pushed: 11 months ago - Stars: 51 - Forks: 13

dynatrace-oss/hash4j

Dynatrace hash library for Java

Language: Java - Size: 40.4 MB - Last synced: about 1 month ago - Pushed: about 2 months ago - Stars: 73 - Forks: 9

davidsvy/Neural-Scam-Artist

Web Scraping, Document Deduplication & GPT-2 Fine-tuning with a newly created scam dataset.

Language: Python - Size: 188 KB - Last synced: 8 days ago - Pushed: over 2 years ago - Stars: 23 - Forks: 1

sourmash-bio/sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.

Language: Python - Size: 35 MB - Last synced: about 1 month ago - Pushed: about 1 month ago - Stars: 431 - Forks: 77

luizirber/sourmash-rust 📦

Rust implementation of sourmash core functionality

Language: Rust - Size: 4.08 MB - Last synced: about 1 month ago - Pushed: over 5 years ago - Stars: 9 - Forks: 0

sdevalapurkar/similar-questions

👯 Algorithms using Jaccard similarity to identify questions from a list that are similar to one another

Language: Python - Size: 13.6 MB - Last synced: about 1 month ago - Pushed: over 5 years ago - Stars: 0 - Forks: 0

serega/gaoya

Locality Sensitive Hashing

Language: Rust - Size: 236 KB - Last synced: 26 days ago - Pushed: 11 months ago - Stars: 48 - Forks: 4

dynatrace-research/set-sketch-paper

SetSketch: Filling the Gap between MinHash and HyperLogLog

Language: C++ - Size: 23.7 MB - Last synced: 29 days ago - Pushed: almost 3 years ago - Stars: 46 - Forks: 5

duhaime/minhash

Quickly estimate the similarity between many sets

Language: JavaScript - Size: 1010 KB - Last synced: about 1 month ago - Pushed: over 1 year ago - Stars: 47 - Forks: 11

codelibs/elasticsearch-minhash

Elasticsearch plugin for the b-bit MinHash algorithm

Language: Java - Size: 250 KB - Last synced: about 1 month ago - Pushed: about 1 month ago - Stars: 60 - Forks: 14

src-d/minhashcuda

Weighted MinHash implementation on CUDA (multi-gpu).

Language: C++ - Size: 89.8 KB - Last synced: about 1 month ago - Pushed: 6 months ago - Stars: 110 - Forks: 23

cristianovagos/bloomfilter

Bloom Filter and MinHash techniques built in MATLAB

Language: Matlab - Size: 355 KB - Last synced: 2 months ago - Pushed: about 7 years ago - Stars: 0 - Forks: 3

macmanes-lab/MCBS913_2019

This is the repo for the Spring 2019 version of MCBS913

Language: Python - Size: 1.89 MB - Last synced: 2 months ago - Pushed: about 5 years ago - Stars: 1 - Forks: 1

ekzhu/minhash-lsh

Minhash LSH in Golang

Language: Go - Size: 22.5 KB - Last synced: 25 days ago - Pushed: over 4 years ago - Stars: 25 - Forks: 13

YaleDHLab/intertext

Detect and visualize text reuse

Language: Python - Size: 3.1 MB - Last synced: about 1 month ago - Pushed: about 1 year ago - Stars: 110 - Forks: 11

rushyaP/Locality-Sensitive-Hashing-Plagiarism-Detection

Implementation of Locality Sensitive Hashing to detect plagiarism

Language: Jupyter Notebook - Size: 170 KB - Last synced: 3 months ago - Pushed: about 3 years ago - Stars: 2 - Forks: 1

esteinig/sketchy

Genomic neighbor typing of bacterial pathogens using MinHash :rat:

Language: Rust - Size: 20 MB - Last synced: 17 days ago - Pushed: over 1 year ago - Stars: 43 - Forks: 3

codelibs/minhash

This provides tools for the b-bit MinHash algorithm.

Language: Java - Size: 46.9 KB - Last synced: about 1 month ago - Pushed: 5 months ago - Stars: 34 - Forks: 10

justinbt1/Akin

Python library for detecting near duplicate texts in a corpus at scale using Locality Sensitive Hashing, as described in chapter three of Mining Massive Datasets.

Language: Python - Size: 154 KB - Last synced: 3 months ago - Pushed: 3 months ago - Stars: 5 - Forks: 0

pNre/Sketching

Collection of sketching algorithms in Swift

Language: Swift - Size: 52.7 KB - Last synced: 11 days ago - Pushed: over 3 years ago - Stars: 3 - Forks: 2

XAH30/LSH-vs-Finesse

In this repository you can find an implementation of the LSH (Locality-Sensitive Hashing) and Finesse algorithms, designed to find similar data based on their hashes

Language: C++ - Size: 5.72 MB - Last synced: 4 months ago - Pushed: 4 months ago - Stars: 1 - Forks: 0

shaltielshmid/MinHashSharp

A Robust Library in C# for Similarity Estimation

Language: C# - Size: 39.1 KB - Last synced: 27 days ago - Pushed: 6 months ago - Stars: 1 - Forks: 0

mbrg/py-hyperminhash

HyperLogLog with intersection

Language: Python - Size: 70.3 KB - Last synced: 14 days ago - Pushed: about 3 years ago - Stars: 4 - Forks: 0

fturati/floc-minhash-attacks

Implementation for the attacks of the paper "Locality-Sensitive Hashing Does Not Guarantee Privacy! Attacks on Google's FLoC and the MinHash Hierarchy System"

Language: Python - Size: 17 MB - Last synced: 5 months ago - Pushed: 11 months ago - Stars: 0 - Forks: 0

lgautier/mashing-pumpkins

Minhash and maxhash library in Python, combining flexibility, expressivity, and performance.

Language: C - Size: 1.37 MB - Last synced: 21 days ago - Pushed: 5 months ago - Stars: 19 - Forks: 3

PauloMaced0/restaurant-recommender 📦

Development of an interactive system for restaurant recommendation, utilizing filtering algorithms like MinHash and Bloom Filter for analysis and personalized suggestions based on user evaluations.

Language: MATLAB - Size: 2.87 MB - Last synced: 26 days ago - Pushed: 5 months ago - Stars: 0 - Forks: 0

AmbarChatterjee/ADM_HW4_Group3

This repository contains code and analysis for a homework assignment on recommendation systems and clustering algorithms in Python. Implements techniques like minhash, LSH, feature engineering, dimensionality reduction, K-means and DBSCAN clustering.

Language: Jupyter Notebook - Size: 48.1 MB - Last synced: 4 months ago - Pushed: 5 months ago - Stars: 0 - Forks: 0

davidefiocco/dockerized-elasticsearch-duplicate-finder

Attempt to use MinHash to find duplicates in an Elasticsearch index

Language: Python - Size: 11.7 KB - Last synced: 20 days ago - Pushed: 20 days ago - Stars: 2 - Forks: 0

nekcht/minhash-lsh-evaluation

Assessing MinHash LSH for text similarity. Compares with kNN using BART embeddings as ground truth. Involves data preprocessing, shingle creation, LSH experiments. Findings inform LSH's efficiency in document similarity tasks, enhancing understanding of LSH techniques.

Language: Jupyter Notebook - Size: 367 KB - Last synced: 6 months ago - Pushed: 6 months ago - Stars: 0 - Forks: 0

BlaCkinkGJ/catch-me-if-you-can

plagiarism detector

Language: Python - Size: 62.5 KB - Last synced: about 2 months ago - Pushed: about 3 years ago - Stars: 22 - Forks: 3

EdDuarte/similarity-search-java

Easy-to-use Java similarity algorithms for text and numeric-series

Language: Java - Size: 149 KB - Last synced: 7 months ago - Pushed: over 4 years ago - Stars: 18 - Forks: 10

will-rowe/groot

A resistome profiler for Graphing Resistance Out Of meTagenomes

Language: Go - Size: 12.5 MB - Last synced: 7 months ago - Pushed: about 4 years ago - Stars: 61 - Forks: 6

oertl/bagminhash

BagMinHash - Minwise Hashing Algorithm for Weighted Sets

Language: C++ - Size: 1.02 MB - Last synced: 7 months ago - Pushed: over 3 years ago - Stars: 25 - Forks: 6

npredey/GeneNetworks

Language: Python - Size: 60.5 KB - Last synced: 7 months ago - Pushed: about 6 years ago - Stars: 0 - Forks: 0

mattilyra/LSH

Locality Sensitive Hashing using MinHash in Python/Cython to detect near duplicate text documents

Language: Python - Size: 513 KB - Last synced: 7 months ago - Pushed: 12 months ago - Stars: 259 - Forks: 72

haradama/PlasmidPicker

Software to identify plasmid sequence data from metagenome using logistic regression and Minhash

Language: Python - Size: 133 MB - Last synced: 7 months ago - Pushed: over 5 years ago - Stars: 6 - Forks: 2

edawson/rkmh

Classify sequencing reads using MinHash.

Language: C++ - Size: 33.3 MB - Last synced: 7 months ago - Pushed: about 4 years ago - Stars: 43 - Forks: 4

nepiskopos/duplicate-questions-detection-lsh

Knowledge extraction through Data Analysis, including Locality Sensitive Hashing (LSH).

Language: Jupyter Notebook - Size: 423 KB - Last synced: 7 months ago - Pushed: almost 2 years ago - Stars: 0 - Forks: 0

gurushida/mnemophonix

A simple audio fingerprinting system

Language: C - Size: 316 KB - Last synced: 8 months ago - Pushed: over 1 year ago - Stars: 25 - Forks: 4

AIn0n/FMHD

Fast MinHash Distances algorithms collection

Language: C++ - Size: 288 KB - Last synced: 4 months ago - Pushed: 8 months ago - Stars: 4 - Forks: 1

ppw0/minhash

find similar text files quickly

Language: Python - Size: 53.7 KB - Last synced: 12 days ago - Pushed: about 3 years ago - Stars: 6 - Forks: 2

HuangQiang/k-FreqItems

Massive Sparse Data Clustering Based on Frequent Items (SIGMOD 2023)

Language: Cuda - Size: 25.5 MB - Last synced: 8 months ago - Pushed: 8 months ago - Stars: 5 - Forks: 1

micts/jss

Fast Jaccard similarity search for abstract sets (documents, products, users, etc.) using MinHashing and Locality Sensitive Hashing

Language: Python - Size: 23.4 KB - Last synced: 9 months ago - Pushed: about 4 years ago - Stars: 3 - Forks: 0

sskender/analysis-of-massive-datasets

Analysis of Massive Datasets FER labs

Language: Python - Size: 19 MB - Last synced: 9 months ago - Pushed: almost 2 years ago - Stars: 1 - Forks: 0

Forthoney/doc_sim

Approximate document similarity with Minhash + Locality Sensitive Hashing

Language: Ruby - Size: 48.8 KB - Last synced: 14 days ago - Pushed: 8 months ago - Stars: 0 - Forks: 0

rigvedpatki/data-mining-assignment-1

Finding Similar Items: Textually Similar Documents

Language: TypeScript - Size: 267 KB - Last synced: 9 months ago - Pushed: over 5 years ago - Stars: 0 - Forks: 0

92amartins/minhash-example

MinHash Example

Language: Jupyter Notebook - Size: 2.93 KB - Last synced: 9 months ago - Pushed: about 6 years ago - Stars: 1 - Forks: 0

CaroseKYS/minhash-test-java

Test program for MinHash

Language: Java - Size: 5.86 KB - Last synced: 10 months ago - Pushed: 10 months ago - Stars: 0 - Forks: 0

holopoj/FHCP

Implementation of the paper "Finding Highly Correlated Pairs with Powerful Pruning" in Java.

Language: Java - Size: 1.56 MB - Last synced: 10 months ago - Pushed: about 7 years ago - Stars: 0 - Forks: 1

anastasia/minhash

Language: Python - Size: 16.6 KB - Last synced: 10 months ago - Pushed: over 5 years ago - Stars: 2 - Forks: 1

mariofv/DocSim

Minhash text analyzer developed during Algorithmics subject.

Language: C++ - Size: 43.1 MB - Last synced: 10 months ago - Pushed: over 6 years ago - Stars: 0 - Forks: 1

dselivanov/LSHR

Locality Sensitive Hashing In R

Language: R - Size: 98.6 KB - Last synced: 10 months ago - Pushed: over 5 years ago - Stars: 39 - Forks: 13

lstasiak/Big-Data-Algorithms-exercises

Set of tasks solved in Big Data Algorithms course

Language: Scala - Size: 3.06 MB - Last synced: 10 months ago - Pushed: almost 3 years ago - Stars: 0 - Forks: 0

wherefortravel/minhash-node-rs

MinHash and LSH index written in Rust for Node.js

Language: Rust - Size: 207 KB - Last synced: 10 months ago - Pushed: 10 months ago - Stars: 12 - Forks: 1

bigmlcom/sketchy

Sketching Algorithms for Clojure (bloom filter, min-hash, hyper-loglog, count-min sketch)

Language: Clojure - Size: 147 KB - Last synced: about 1 month ago - Pushed: 12 months ago - Stars: 146 - Forks: 18

privacy-lsh/floc-minhash

Implementation for the attacks of the paper "Locality-Sensitive Hashing Does Not Guarantee Privacy! Attacks on Google's FLoC and the MinHash Hierarchy System".

Language: Python - Size: 16.9 MB - Last synced: 12 months ago - Pushed: 12 months ago - Stars: 0 - Forks: 0

kmuinfosec/MUSEUM

Scalable and Multifaceted Search and Its Application for Malware Binary Files

Language: Python - Size: 461 KB - Last synced: about 1 year ago - Pushed: about 1 year ago - Stars: 1 - Forks: 0

andrewmcloud/consimilo

A Clojure library for querying large data-sets on similarity

Language: Clojure - Size: 536 KB - Last synced: 13 days ago - Pushed: over 5 years ago - Stars: 62 - Forks: 4

Mamiglia/ADM-LT_HW1

Homework 1 for the course Advanced Data Mining and Language Technology

Language: Jupyter Notebook - Size: 2.8 MB - Last synced: about 1 month ago - Pushed: about 1 year ago - Stars: 0 - Forks: 0

shreyansh26/MinHash-Implemenation

A simple MinHash implementation based on the explanation in the Mining of Massive Datasets course by Stanford

Language: Python - Size: 7.4 MB - Last synced: about 1 month ago - Pushed: over 1 year ago - Stars: 1 - Forks: 0

luizirber/2017-recomb

Poster presented at RECOMB 2017

Size: 2.66 MB - Last synced: about 1 month ago - Pushed: almost 6 years ago - Stars: 7 - Forks: 3

oertl/probminhash

ProbMinHash – A Class of Locality-Sensitive Hash Algorithms for the (Probability) Jaccard Similarity

Language: C++ - Size: 6.26 MB - Last synced: about 1 year ago - Pushed: over 3 years ago - Stars: 33 - Forks: 3

edawson/mkmh

Generate kmers/minimizers/hashes/MinHash signatures, including with multiple kmer sizes.

Language: C++ - Size: 204 KB - Last synced: about 1 year ago - Pushed: over 3 years ago - Stars: 23 - Forks: 2

steven-s/text-shingles

k-shingling for text to help compare similarity

Language: Python - Size: 11.7 KB - Last synced: about 1 year ago - Pushed: over 4 years ago - Stars: 15 - Forks: 11

vascoalramos/mpei 📦

Probability Methods for Informatics Engineering | UA 2018/2019

Language: Java - Size: 39.8 MB - Last synced: about 1 year ago - Pushed: over 4 years ago - Stars: 0 - Forks: 0

eduardosantoshf/search-repeated-news 📦

MPEI Project - Search repeated news (using Bloom filter) and similar news (using MinHash) from a news API.

Language: Java - Size: 32.3 MB - Last synced: about 1 year ago - Pushed: almost 2 years ago - Stars: 0 - Forks: 1

zahraDehghanian97/MinHashing_Spark

Implements cosine similarity and MinHashing (with AND/OR operators on bands) to find roads similar to a specific road in a real traffic dataset using PySpark.

Language: Jupyter Notebook - Size: 65.4 KB - Last synced: about 1 year ago - Pushed: over 1 year ago - Stars: 1 - Forks: 0

oertl/treeminhash

TreeMinHash: Fast Sketching for Weighted Jaccard Similarity Estimation

Language: C++ - Size: 2.62 MB - Last synced: about 1 year ago - Pushed: over 1 year ago - Stars: 12 - Forks: 3

santurini/MinHash-LSH-From-Scratch

Implementing a simplified copy of Shazam application from scratch using MinHashing and LSH.

Language: Python - Size: 210 KB - Last synced: about 1 year ago - Pushed: over 1 year ago - Stars: 0 - Forks: 1

KarimLulu/locality-sensitive-hashing-knn

Approximate k-Nearest Neighbours in high-dimensional space via Locality Sensitive Hashing

Language: Jupyter Notebook - Size: 265 KB - Last synced: about 1 year ago - Pushed: over 1 year ago - Stars: 0 - Forks: 0

gibranfp/Sampled-MinHashing

A method to mine beyond-pairwise relationships using Min-Hashing for large-scale pattern discovery

Language: C - Size: 522 KB - Last synced: 7 months ago - Pushed: over 2 years ago - Stars: 25 - Forks: 8

Cheng-Lin-Li/Spark

Python 2.7 code and learning notes for Spark 2.1.1

Language: Python - Size: 2.62 MB - Last synced: about 1 year ago - Pushed: almost 6 years ago - Stars: 23 - Forks: 6

oma219/pacsketch

Network Anomaly Detection Using Probabilistic Data Structures

Language: C++ - Size: 12.8 MB - Last synced: about 1 year ago - Pushed: over 2 years ago - Stars: 2 - Forks: 0

travisbrady/flajolet

Probabilistic data structures for OCaml

Language: OCaml - Size: 220 KB - Last synced: about 1 year ago - Pushed: over 6 years ago - Stars: 38 - Forks: 3

TSunny007/Document-Similarity

Using Jaccard-Similarity and Minhashing to determine similarity between two text documents

Language: Jupyter Notebook - Size: 26.4 KB - Last synced: about 1 year ago - Pushed: about 6 years ago - Stars: 6 - Forks: 3

W3ndige/karton-similarity

Aurora karton for similarity matching.

Language: Python - Size: 20.5 KB - Last synced: about 1 year ago - Pushed: almost 3 years ago - Stars: 2 - Forks: 0

haradama/pHash

Software to identify known plasmid sequence from metagenomic assembly using Minhash

Language: Go - Size: 15.9 MB - Last synced: about 1 year ago - Pushed: over 5 years ago - Stars: 3 - Forks: 0

steven-s/minhash-document-clusters

Minhash clustering of text documents

Language: Scala - Size: 33.2 KB - Last synced: about 1 year ago - Pushed: over 6 years ago - Stars: 4 - Forks: 1

vokter/vokter

Document store that periodically checks for changes in web documents

Language: Java - Size: 120 MB - Last synced: 28 days ago - Pushed: over 1 year ago - Stars: 6 - Forks: 2

davidsbatista/MuSICo

A Minwise Hashing Method for Addressing Relationship Extraction from Text

Language: Java - Size: 37.4 MB - Last synced: about 1 month ago - Pushed: about 7 years ago - Stars: 5 - Forks: 2

fengxu1996/similarity_find

Compute the similarity among multiple texts

Language: C++ - Size: 43.9 KB - Last synced: 5 months ago - Pushed: almost 6 years ago - Stars: 1 - Forks: 0

W3ndige/karton-minhash

Aurora karton for calculating minhash from input dataset.

Language: Python - Size: 16.6 KB - Last synced: about 1 year ago - Pushed: about 3 years ago - Stars: 0 - Forks: 0

pedrovt/mpei

Probability Methods for Informatics Engineering (University of Aveiro)

Language: Matlab - Size: 4.96 MB - Last synced: about 1 year ago - Pushed: over 6 years ago - Stars: 1 - Forks: 0

FilipePires98/SpellChecker

SpellChecker: an application to check for spelling errors.

Language: Java - Size: 3.54 MB - Last synced: about 1 year ago - Pushed: about 3 years ago - Stars: 0 - Forks: 1

ahmetcandiroglu/reddit-analyzed

Finding similar subreddits using MinHash and SimRank algorithms

Language: Python - Size: 1.02 MB - Last synced: about 1 year ago - Pushed: over 5 years ago - Stars: 0 - Forks: 0

hscspring/sto

MinHash and LSH Based Store and Query.

Language: Python - Size: 9.77 KB - Last synced: about 1 year ago - Pushed: almost 2 years ago - Stars: 0 - Forks: 0

zxmeng/SimilarityDetection

Similarity Detection on Wikipedia Articles using MinHash and Random Projection implemented in Hadoop/Spark

Language: Java - Size: 69.5 MB - Last synced: about 1 year ago - Pushed: almost 6 years ago - Stars: 1 - Forks: 1

tkukurin/Lab.Bioinformatics

University work. Approximate aligner for long DNA sequences. Estimates Jaccard similarity from k-mers via minimizers and MinHash, then uses it as a sequence identity proxy.

Language: Java - Size: 90.3 MB - Last synced: about 1 year ago - Pushed: over 6 years ago - Stars: 0 - Forks: 0

joaocps/mpei-bloomfilter

Probabilistic methods for computer engineering - Final Project

Language: Java - Size: 596 KB - Last synced: about 1 year ago - Pushed: over 4 years ago - Stars: 0 - Forks: 0

DearMadMan/minhash

An implementation of the minhash algorithm in golang

Language: Go - Size: 2.93 KB - Last synced: 9 months ago - Pushed: almost 5 years ago - Stars: 2 - Forks: 0

Related Keywords
minhash (106), lsh (29), locality-sensitive-hashing (23), jaccard-similarity (19), bloom-filter (18), similarity (14), minhash-lsh-algorithm (13), python (11), hyperloglog (11), java (11), bioinformatics (10), simhash (8), document-similarity (6), jaccard-similarity-estimation (6), lsh-algorithm (6), hashing (5), jaccard-distance (5), data-mining (5), count-min-sketch (5), text-diff (5), diffmatchpatch (4), differences-detected (4), notifications (4), jaccard (4), quartz (4), metagenomics (4), similarity-search (4), deduplication (4), sketch (4), minhash-sketches (4), cosine-similarity (4), sketching (4), hash-functions (3), hash (3), clustering (3), work-in-progress (3), plagiarism-detection (3), cardinality-estimation (3), matlab (3), machine-learning (3), elasticsearch (3), text-mining (3), minwise-hashing-algorithm (3), minwise-hashing (3), minhash-similarity (3), big-data (3), kmer (3), rust (3), shingling (3), fasta (3), search (3), spark (3), cosine-distance (3), sourmash (3), data-sketches (3), algorithm (2), approximate-nearest-neighbor-search (2), tf-idf (2), clojure (2), hamming-distance (2), probabilistic-programming (2), probability-distribution (2), random-number-generators (2), estimation (2), random-variables (2), jupyter-notebook (2), cybersecurity (2), malware (2), golang (2), contigs (2), logistic-regression (2), weighted-sets (2), metagenome (2), java-library (2), plasmids (2), duplicates (2), c (2), jaccard-index (2), privacy (2), numpy (2), statistics (2), lsh-implementation (2), map-reduce (2), nanopore (2), mapreduce (2), recommender-system (2), alignment (2), sketch-data-structures (2), cuckoo-filter (2), dropwizard (2), lsh-forest (2), jersey2 (2), nlp (2), hyperloglog-sketches (2), tomcat (2), lsh-ensemble (2), distributed (2), hash-algorithm (2), jetty (2), rest-api (2)