Topic: "simhash"
james-bowman/nlp
Selected Machine Learning algorithms for natural language processing and semantic analysis in Golang
Language: Go - Size: 396 KB - Last synced at: 9 months ago - Pushed at: almost 4 years ago - Stars: 445 - Forks: 45

sean-public/python-hashes
Interesting (non-cryptographic) hashes implemented in pure Python.
Language: Python - Size: 29.3 KB - Last synced at: 8 months ago - Pushed at: over 3 years ago - Stars: 240 - Forks: 43

sing1ee/simhash-java
A simple implementation of the simhash algorithm in Java.
Language: Java - Size: 1.52 MB - Last synced at: 22 days ago - Pushed at: over 4 years ago - Stars: 155 - Forks: 80

bbalet/stopwords
Removes the most frequent words (stop words) from text content. Based on a curated list of language statistics.
Language: Go - Size: 89.8 KB - Last synced at: 10 months ago - Pushed at: almost 2 years ago - Stars: 137 - Forks: 25

dynatrace-oss/hash4j
Dynatrace hash library for Java
Language: Java - Size: 37 MB - Last synced at: about 8 hours ago - Pushed at: about 9 hours ago - Stars: 103 - Forks: 11

serega/gaoya
Locality Sensitive Hashing
Language: Rust - Size: 236 KB - Last synced at: 13 days ago - Pushed at: almost 2 years ago - Stars: 72 - Forks: 7

vkandy/simhash-js
Simhash implementation in Javascript
Language: JavaScript - Size: 49.8 KB - Last synced at: about 2 years ago - Pushed at: almost 8 years ago - Stars: 37 - Forks: 15

hybridtheory/floc-simhash
A fast python implementation of the SimHash algorithm.
Language: Python - Size: 27.3 KB - Last synced at: 30 days ago - Pushed at: over 3 years ago - Stars: 27 - Forks: 7

NETkiddy/simhash_similarity
Text similarity using simhash
Language: Go - Size: 6.84 KB - Last synced at: 10 months ago - Pushed at: about 6 years ago - Stars: 24 - Forks: 9

memosstilvi/simhash
Open Source Implementation of Simhash in Python
Language: Python - Size: 4.88 KB - Last synced at: 3 months ago - Pushed at: over 7 years ago - Stars: 24 - Forks: 5

KeremZaman/semantic-sh
semantic-sh is a SimHash implementation that detects and groups similar texts by leveraging word vectors and transformer-based language models (BERT).
Language: Python - Size: 40 KB - Last synced at: about 2 years ago - Pushed at: over 4 years ago - Stars: 23 - Forks: 3

holsee/spirit_fingers
Elixir SimHash NIFs written in Rust
Language: Elixir - Size: 3.36 MB - Last synced at: 3 days ago - Pushed at: about 2 years ago - Stars: 19 - Forks: 2

nnnet/superminhash
SuperMinHash: A New Minwise Hashing Algorithm for Jaccard Similarity Estimation, Simhash and SimhashIndex
Language: Python - Size: 19.5 KB - Last synced at: over 1 year ago - Pushed at: over 2 years ago - Stars: 19 - Forks: 7

Marcnuth/deduplication
Remove duplicate documents/videos/images via popular algorithms such as SimHash, SpotSig, Shingling, etc.
Language: Python - Size: 22.5 KB - Last synced at: 19 days ago - Pushed at: over 1 year ago - Stars: 18 - Forks: 6

liuaiting/Financial-News-Analysis
China Merchants Bank FinTech competition, second round: financial news analysis
Language: Python - Size: 86.9 MB - Last synced at: about 2 years ago - Pushed at: almost 5 years ago - Stars: 17 - Forks: 6

haoyuhu/gosimhash
A simhasher for Chinese documents implemented in Golang, a direct translation of yanyiwu/gosimhash
Language: Go - Size: 3.97 MB - Last synced at: 8 days ago - Pushed at: over 7 years ago - Stars: 17 - Forks: 5

preciz/similarity
A library for cosine similarity & simhash calculation
Language: Elixir - Size: 47.9 KB - Last synced at: 22 days ago - Pushed at: 9 months ago - Stars: 16 - Forks: 2

zyocum/dedup
Find duplicate text files.
Language: Python - Size: 19.5 KB - Last synced at: 4 days ago - Pushed at: 3 months ago - Stars: 14 - Forks: 3

armchairtheorist/simhash2
A rewrite of Bookmate's simhash gem, which is an implementation of Moses Charikar's simhashes in Ruby.
Language: Ruby - Size: 27.3 KB - Last synced at: about 3 hours ago - Pushed at: over 6 years ago - Stars: 14 - Forks: 3

jinshuai86/Spider
A Java-based multithreaded web crawler framework
Language: Java - Size: 338 KB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 11 - Forks: 5

mokeeqian/copydetector
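A visual system for homework duplicate checking, plagiarism detection, and text similarity analysis, built on Spring Boot and Google's open-source simhash algorithm; integrates the jplag, MOSS, and singleCloud tool suites for multi-angle duplicate checking. Ref: https://github.com/ALuShu/checksystem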
Language: JavaScript - Size: 71.6 MB - Last synced at: about 2 years ago - Pushed at: about 2 years ago - Stars: 11 - Forks: 2

oduwsdl/off-topic-memento-toolkit
This system evaluates a collection of mementos (archived web pages) to determine which are off topic. The collection can be part of an Archive-It collection, a single TimeMap, or stored in a WARC file.
Language: Python - Size: 93.7 MB - Last synced at: 7 days ago - Pushed at: over 3 years ago - Stars: 9 - Forks: 3

ALuShu/checksystem
A simHash-based web system for checking homework for duplication
Language: JavaScript - Size: 4.42 MB - Last synced at: almost 2 years ago - Pushed at: almost 3 years ago - Stars: 8 - Forks: 0

Xenia101/KeyStroke-Dynamics
⌨️ User Verification based on Keystroke Dynamics / Two-factor Authentication technology based on Key-Stroke
Language: Python - Size: 562 KB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 6 - Forks: 2

Derek-Wds/Code_Plagiarism_Detection
Code plagiarism system based on Simhash and Nicad.
Language: Python - Size: 40 MB - Last synced at: about 2 years ago - Pushed at: over 6 years ago - Stars: 6 - Forks: 4

xblanc33/simhash-js Fork of vkandy/simhash-js
Simhash implementation in Javascript
Language: JavaScript - Size: 51.8 KB - Last synced at: 2 days ago - Pushed at: almost 8 years ago - Stars: 5 - Forks: 6

jiangnanboy/text-de-duplication
Text de-duplication
Size: 12.7 KB - Last synced at: about 1 month ago - Pushed at: almost 5 years ago - Stars: 4 - Forks: 1

hengfeiyang/simhash
A Golang implementation of the Simhash algorithm
Language: Go - Size: 1.95 KB - Last synced at: about 2 months ago - Pushed at: over 6 years ago - Stars: 3 - Forks: 1

Qyokizzzz/simhash
The extended version of simhash supports fingerprint extraction of documents and images.
Language: Python - Size: 551 KB - Last synced at: 20 days ago - Pushed at: over 2 years ago - Stars: 2 - Forks: 0

shenwei356/simhash-eval
Language: Go - Size: 2.91 MB - Last synced at: almost 2 years ago - Pushed at: over 2 years ago - Stars: 2 - Forks: 1

sskender/analysis-of-massive-datasets
Analysis of Massive Datasets FER labs
Language: Python - Size: 19 MB - Last synced at: about 1 month ago - Pushed at: almost 3 years ago - Stars: 2 - Forks: 0

php-lsys/simhash
simhash extension for PHP: determines text similarity
Language: C - Size: 19.5 KB - Last synced at: about 1 year ago - Pushed at: about 3 years ago - Stars: 2 - Forks: 2

lifefloating/contentcore
Crawler content processing service (for personal use)
Language: Python - Size: 87.9 KB - Last synced at: about 1 month ago - Pushed at: almost 5 years ago - Stars: 2 - Forks: 0

LuoZijun/rust-jieba
Rust jieba
Language: Rust - Size: 1.97 MB - Last synced at: about 13 hours ago - Pushed at: over 6 years ago - Stars: 2 - Forks: 0

linyshdhhcb/BERT-SimHashHomeworkCheck-Backend
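A duplicate-checking system for university student homework based on SimHash and BERT. It combines the SimHash algorithm with the BERT-Base-Chinese model, Vue3, Spring Boot3, EasyExcel, and HanLP to implement intelligent duplicate checking. Supports batch file processing, comparison against past homework, and automatic generation of detailed Excel duplicate-check reports. Integrates Jaccard, Hamming distance, hash, cosine, image, and weighted similarity algorithms to accurately evaluate file similarity.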
Language: Java - Size: 1.32 MB - Last synced at: about 1 month ago - Pushed at: about 2 months ago - Stars: 1 - Forks: 0

tomfran/crawler
A web crawler written in Rust
Language: Rust - Size: 5.33 MB - Last synced at: about 2 months ago - Pushed at: 11 months ago - Stars: 1 - Forks: 0

XAH30/LSH-vs-Finesse
This repository contains implementations of the LSH (Locality-Sensitive Hashing) and Finesse algorithms, designed to find similar data based on their hashes
Language: C++ - Size: 5.72 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 1 - Forks: 0

zyocum/simphon
Proof-of-concept for measuring similarity of phoneme sequences using locality sensitive hashing (LSH).
Language: Jupyter Notebook - Size: 1.23 MB - Last synced at: 4 days ago - Pushed at: over 1 year ago - Stars: 1 - Forks: 0

manmolecular/history-fp
:feet: Create a behavioral fingerprint based on your zsh command line history
Language: Python - Size: 6.84 KB - Last synced at: about 2 months ago - Pushed at: over 1 year ago - Stars: 1 - Forks: 0

justinfargnoli/simhash
A barebones implementation of the simhash data sketching algorithm.
Language: Go - Size: 7.81 KB - Last synced at: 10 months ago - Pushed at: over 3 years ago - Stars: 1 - Forks: 0

Xenia101/Illegal-Copyright-Detection-System-WEB-
Illegal Copyright Detection System WEB
Language: Python - Size: 2.74 MB - Last synced at: 12 months ago - Pushed at: about 5 years ago - Stars: 1 - Forks: 0

rihenperry/csuci-mscs-thesis-dist-web-crawler
Documents my master's-level thesis work on building a continuous, topical web crawler based on Mercator (1999)
Language: TeX - Size: 27.4 MB - Last synced at: about 1 year ago - Pushed at: over 5 years ago - Stars: 1 - Forks: 0

nemosharma6/event-coding
event coding using spark and stanford-core-nlp
Language: Scala - Size: 3.85 MB - Last synced at: about 2 years ago - Pushed at: almost 6 years ago - Stars: 1 - Forks: 0

mgunn001/tmvis Fork of oduwsdl/tmvis
A research project on thumbnail visualization to summarize webpage changes over time
Language: JavaScript - Size: 5.57 MB - Last synced at: about 1 year ago - Pushed at: over 6 years ago - Stars: 1 - Forks: 2

igrgurina/SimHash
College project (Analysis of massive data sets) - C# implementation of big data algorithms (2017/2018)
Language: C# - Size: 10.7 KB - Last synced at: about 2 years ago - Pushed at: almost 7 years ago - Stars: 1 - Forks: 0

qingniufly/scala-simhash
Simhash algorithm using Jcseg for word segmentation and jenkins-hash for hashing. Written in Scala
Language: Scala - Size: 2.01 MB - Last synced at: about 2 years ago - Pushed at: about 8 years ago - Stars: 1 - Forks: 1

natasa-dz/nosql-bigdata-engine
A high-performance key-value store in Go with LSM tree, compaction algorithms, rate limiting, and support for probabilistic data structures like Bloom Filter and SimHash. It also features range scan and list operations with pagination.
Language: Go - Size: 294 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

eggdropsoap/tilsh
tilsh implements the TLSH locality-sensitive hash algorithm suite
Language: JavaScript - Size: 25.4 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

ErfanMomeniii/simhash
A lightweight Go package implementing Charikar's Simhash algorithm for generating hash fingerprints and calculating similarity, ideal for deduplication and content fingerprinting
Language: Go - Size: 11.7 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

Moe131/webcrawler
Python web crawler designed to scrape websites
Language: Python - Size: 3.52 MB - Last synced at: 12 days ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

MajaJuri/Analiza-velikih-skupova-podataka
Implementation of the algorithms presented in the course Analysis of Massive Datasets ("Analiza velikih skupova podataka", AVSP)
Language: Java - Size: 1.03 MB - Last synced at: 16 days ago - Pushed at: 11 months ago - Stars: 0 - Forks: 0

fturati/floc-minhash-attacks
Implementation for the attacks of the paper "Locality-Sensitive Hashing Does Not Guarantee Privacy! Attacks on Google's FLoC and the MinHash Hierarchy System"
Language: Python - Size: 17 MB - Last synced at: over 1 year ago - Pushed at: almost 2 years ago - Stars: 0 - Forks: 0

privacy-lsh/floc-minhash
Implementation for the attacks of the paper "Locality-Sensitive Hashing Does Not Guarantee Privacy! Attacks on Google's FLoC and the MinHash Hierarchy System".
Language: Python - Size: 16.9 MB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 0 - Forks: 0

nepiskopos/duplicate-questions-detection-lsh
Knowledge extraction through Data Analysis, including Locality Sensitive Hashing (LSH).
Language: Jupyter Notebook - Size: 423 KB - Last synced at: over 1 year ago - Pushed at: almost 3 years ago - Stars: 0 - Forks: 0

innerNULL/osimhash
A deduplication library built on top of [SIMHASH](https://github.com/yanyiwu/simhash).
Language: C++ - Size: 33.2 KB - Last synced at: about 2 years ago - Pushed at: about 3 years ago - Stars: 0 - Forks: 0

dbrcina/AVSP-FER-2020-21
Lab solutions for Analysis of Massive Datasets ("Analiza velikih skupova podataka") course at FER 2020/21
Language: Java - Size: 1.32 MB - Last synced at: over 1 year ago - Pushed at: about 3 years ago - Stars: 0 - Forks: 0

long-gong/datasets-E2H
Datasets Euclidean to Hamming Conversion
Language: C++ - Size: 186 MB - Last synced at: about 1 month ago - Pushed at: over 4 years ago - Stars: 0 - Forks: 1

Chant00/simhash Fork of 1e0ng/simhash
A Python Implementation of Simhash Algorithm
Language: Python - Size: 1.69 MB - Last synced at: 8 months ago - Pushed at: about 5 years ago - Stars: 0 - Forks: 0

fpopic/avsp
(Class) Big Data Analysis Course Assignments
Language: Java - Size: 28.2 MB - Last synced at: about 2 months ago - Pushed at: about 8 years ago - Stars: 0 - Forks: 0
