An open API service providing repository metadata for many open source software ecosystems.

Topic: "simhash"

james-bowman/nlp

Selected Machine Learning algorithms for natural language processing and semantic analysis in Golang

Language: Go - Size: 396 KB - Last synced at: 9 months ago - Pushed at: almost 4 years ago - Stars: 445 - Forks: 45

sean-public/python-hashes

Interesting (non-cryptographic) hashes implemented in pure Python.

Language: Python - Size: 29.3 KB - Last synced at: 8 months ago - Pushed at: over 3 years ago - Stars: 240 - Forks: 43

sing1ee/simhash-java

A simple implementation of the simhash algorithm in Java.

Language: Java - Size: 1.52 MB - Last synced at: 22 days ago - Pushed at: over 4 years ago - Stars: 155 - Forks: 80

bbalet/stopwords

Removes the most frequent words (stop words) from text content. Based on a curated list of language statistics.

Language: Go - Size: 89.8 KB - Last synced at: 10 months ago - Pushed at: almost 2 years ago - Stars: 137 - Forks: 25

dynatrace-oss/hash4j

Dynatrace hash library for Java

Language: Java - Size: 37 MB - Last synced at: about 8 hours ago - Pushed at: about 9 hours ago - Stars: 103 - Forks: 11

serega/gaoya

Locality Sensitive Hashing

Language: Rust - Size: 236 KB - Last synced at: 13 days ago - Pushed at: almost 2 years ago - Stars: 72 - Forks: 7

vkandy/simhash-js

Simhash implementation in Javascript

Language: JavaScript - Size: 49.8 KB - Last synced at: about 2 years ago - Pushed at: almost 8 years ago - Stars: 37 - Forks: 15

hybridtheory/floc-simhash

A fast python implementation of the SimHash algorithm.

Language: Python - Size: 27.3 KB - Last synced at: 30 days ago - Pushed at: over 3 years ago - Stars: 27 - Forks: 7

NETkiddy/simhash_similarity

Text similarity using simhash

Language: Go - Size: 6.84 KB - Last synced at: 10 months ago - Pushed at: about 6 years ago - Stars: 24 - Forks: 9

memosstilvi/simhash

Open Source Implementation of Simhash in Python

Language: Python - Size: 4.88 KB - Last synced at: 3 months ago - Pushed at: over 7 years ago - Stars: 24 - Forks: 5

KeremZaman/semantic-sh

semantic-sh is a SimHash implementation that detects and groups similar texts by harnessing the power of word vectors and transformer-based language models (BERT).

Language: Python - Size: 40 KB - Last synced at: about 2 years ago - Pushed at: over 4 years ago - Stars: 23 - Forks: 3

holsee/spirit_fingers

Elixir SimHash NIFs written in Rust

Language: Elixir - Size: 3.36 MB - Last synced at: 3 days ago - Pushed at: about 2 years ago - Stars: 19 - Forks: 2

nnnet/superminhash

SuperMinHash: A New Minwise Hashing Algorithm for Jaccard Similarity Estimation, Simhash and SimhashIndex

Language: Python - Size: 19.5 KB - Last synced at: over 1 year ago - Pushed at: over 2 years ago - Stars: 19 - Forks: 7

Marcnuth/deduplication

Remove duplicate documents/videos/images via popular algorithms such as SimHash, SpotSig, Shingling, etc.

Language: Python - Size: 22.5 KB - Last synced at: 19 days ago - Pushed at: over 1 year ago - Stars: 18 - Forks: 6

liuaiting/Financial-News-Analysis

China Merchants Bank FinTech competition, semifinal round: financial news analysis

Language: Python - Size: 86.9 MB - Last synced at: about 2 years ago - Pushed at: almost 5 years ago - Stars: 17 - Forks: 6

haoyuhu/gosimhash

A simhasher for Chinese documents implemented in Golang, a straightforward port of yanyiwu/gosimhash

Language: Go - Size: 3.97 MB - Last synced at: 8 days ago - Pushed at: over 7 years ago - Stars: 17 - Forks: 5

preciz/similarity

A library for cosine similarity & simhash calculation

Language: Elixir - Size: 47.9 KB - Last synced at: 22 days ago - Pushed at: 9 months ago - Stars: 16 - Forks: 2

zyocum/dedup

Find duplicate text files.

Language: Python - Size: 19.5 KB - Last synced at: 4 days ago - Pushed at: 3 months ago - Stars: 14 - Forks: 3

armchairtheorist/simhash2

A rewrite of Bookmate's simhash gem, which is an implementation of Moses Charikar's simhashes in Ruby.

Language: Ruby - Size: 27.3 KB - Last synced at: about 3 hours ago - Pushed at: over 6 years ago - Stars: 14 - Forks: 3

jinshuai86/Spider

A Java-based multithreaded web crawler framework

Language: Java - Size: 338 KB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 11 - Forks: 5

mokeeqian/copydetector

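A visual system for homework duplicate checking, plagiarism detection, and text similarity analysis, built on Spring Boot and Google's open-source simhash algorithm; integrates the jplag, MOSS, and singleCloud tool suites for multi-faceted duplicate checking. Ref: https://github.com/ALuShu/checksystem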

Language: JavaScript - Size: 71.6 MB - Last synced at: about 2 years ago - Pushed at: about 2 years ago - Stars: 11 - Forks: 2

oduwsdl/off-topic-memento-toolkit

This system evaluates a collection of mementos (archived web pages) to determine which are off topic. The collection can be part of an Archive-It collection, a single TimeMap, or stored in a WARC file.

Language: Python - Size: 93.7 MB - Last synced at: 7 days ago - Pushed at: over 3 years ago - Stars: 9 - Forks: 3

ALuShu/checksystem

A web-based homework duplicate-checking system built on simHash

Language: JavaScript - Size: 4.42 MB - Last synced at: almost 2 years ago - Pushed at: almost 3 years ago - Stars: 8 - Forks: 0

Xenia101/KeyStroke-Dynamics

⌨️ User Verification based on Keystroke Dynamics / Two-factor Authentication technology based on Key-Stroke

Language: Python - Size: 562 KB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 6 - Forks: 2

Derek-Wds/Code_Plagiarism_Detection

Code plagiarism system based on Simhash and Nicad.

Language: Python - Size: 40 MB - Last synced at: about 2 years ago - Pushed at: over 6 years ago - Stars: 6 - Forks: 4

xblanc33/simhash-js Fork of vkandy/simhash-js

Simhash implementation in Javascript

Language: JavaScript - Size: 51.8 KB - Last synced at: 2 days ago - Pushed at: almost 8 years ago - Stars: 5 - Forks: 6

jiangnanboy/text-de-duplication

Text de-duplication

Size: 12.7 KB - Last synced at: about 1 month ago - Pushed at: almost 5 years ago - Stars: 4 - Forks: 1

hengfeiyang/simhash

A Golang implementation of the Simhash algorithm

Language: Go - Size: 1.95 KB - Last synced at: about 2 months ago - Pushed at: over 6 years ago - Stars: 3 - Forks: 1

Qyokizzzz/simhash

The extended version of simhash supports fingerprint extraction of documents and images.

Language: Python - Size: 551 KB - Last synced at: 20 days ago - Pushed at: over 2 years ago - Stars: 2 - Forks: 0

shenwei356/simhash-eval

Language: Go - Size: 2.91 MB - Last synced at: almost 2 years ago - Pushed at: over 2 years ago - Stars: 2 - Forks: 1

sskender/analysis-of-massive-datasets

Analysis of Massive Datasets FER labs

Language: Python - Size: 19 MB - Last synced at: about 1 month ago - Pushed at: almost 3 years ago - Stars: 2 - Forks: 0

php-lsys/simhash

simhash PHP extension: determines text similarity

Language: C - Size: 19.5 KB - Last synced at: about 1 year ago - Pushed at: about 3 years ago - Stars: 2 - Forks: 2

lifefloating/contentcore

Crawler content processing service (for personal use)

Language: Python - Size: 87.9 KB - Last synced at: about 1 month ago - Pushed at: almost 5 years ago - Stars: 2 - Forks: 0

LuoZijun/rust-jieba

Rust jieba

Language: Rust - Size: 1.97 MB - Last synced at: about 13 hours ago - Pushed at: over 6 years ago - Stars: 2 - Forks: 0

linyshdhhcb/BERT-SimHashHomeworkCheck-Backend

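A duplicate-checking system for university student homework based on SimHash and BERT. It combines the SimHash algorithm and the BERT-Base-Chinese model with Vue3, Spring Boot3, EasyExcel, and HanLP to provide intelligent duplicate checking. Supports batch file processing, comparison against past assignments, and automatic generation of detailed Excel duplicate-check reports. Integrates Jaccard, Hamming distance, hash, cosine, image, and weighted similarity algorithms to assess file similarity accurately.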

Language: Java - Size: 1.32 MB - Last synced at: about 1 month ago - Pushed at: about 2 months ago - Stars: 1 - Forks: 0

tomfran/crawler

A web crawler written in Rust

Language: Rust - Size: 5.33 MB - Last synced at: about 2 months ago - Pushed at: 11 months ago - Stars: 1 - Forks: 0

XAH30/LSH-vs-Finesse

In this repository you can find implementations of the LSH (Locality-Sensitive Hashing) and Finesse algorithms, designed to find similar data based on their hashes

Language: C++ - Size: 5.72 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 1 - Forks: 0

zyocum/simphon

Proof-of-concept for measuring similarity of phoneme sequences using locality sensitive hashing (LSH).

Language: Jupyter Notebook - Size: 1.23 MB - Last synced at: 4 days ago - Pushed at: over 1 year ago - Stars: 1 - Forks: 0

manmolecular/history-fp

🐾 Create a behavioral fingerprint based on your zsh command line history

Language: Python - Size: 6.84 KB - Last synced at: about 2 months ago - Pushed at: over 1 year ago - Stars: 1 - Forks: 0

justinfargnoli/simhash

A barebones implementation of the simhash data sketching algorithm.

Language: Go - Size: 7.81 KB - Last synced at: 10 months ago - Pushed at: over 3 years ago - Stars: 1 - Forks: 0

Xenia101/Illegal-Copyright-Detection-System-WEB-

Illegal Copyright Detection System WEB

Language: Python - Size: 2.74 MB - Last synced at: 12 months ago - Pushed at: about 5 years ago - Stars: 1 - Forks: 0

rihenperry/csuci-mscs-thesis-dist-web-crawler

Documents my master's-level thesis work on building a continuous, topical web crawler based on Mercator (1999)

Language: TeX - Size: 27.4 MB - Last synced at: about 1 year ago - Pushed at: over 5 years ago - Stars: 1 - Forks: 0

nemosharma6/event-coding

event coding using spark and stanford-core-nlp

Language: Scala - Size: 3.85 MB - Last synced at: about 2 years ago - Pushed at: almost 6 years ago - Stars: 1 - Forks: 0

mgunn001/tmvis Fork of oduwsdl/tmvis

A research project on thumbnail visualization to summarize webpage changes over time

Language: JavaScript - Size: 5.57 MB - Last synced at: about 1 year ago - Pushed at: over 6 years ago - Stars: 1 - Forks: 2

igrgurina/SimHash

College project (Analysis of massive data sets) - C# implementation of big data algorithms (2017/2018)

Language: C# - Size: 10.7 KB - Last synced at: about 2 years ago - Pushed at: almost 7 years ago - Stars: 1 - Forks: 0

qingniufly/scala-simhash

Simhash algorithm using Jcseg for word segmentation and jenkins-hash for hashing. Written in Scala

Language: Scala - Size: 2.01 MB - Last synced at: about 2 years ago - Pushed at: about 8 years ago - Stars: 1 - Forks: 1

natasa-dz/nosql-bigdata-engine

A high-performance key-value store in Go with LSM tree, compaction algorithms, rate limiting, and support for probabilistic data structures like Bloom Filter and SimHash. It also features range scan and list operations with pagination.

Language: Go - Size: 294 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

eggdropsoap/tilsh

tilsh implements the TLSH locality-sensitive hash algorithm suite

Language: JavaScript - Size: 25.4 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

ErfanMomeniii/simhash

A lightweight Go package implementing Charikar's Simhash algorithm for generating hash fingerprints and calculating similarity, ideal for deduplication and content fingerprinting

Language: Go - Size: 11.7 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

Moe131/webcrawler

Python web crawler designed to scrape websites

Language: Python - Size: 3.52 MB - Last synced at: 12 days ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

MajaJuri/Analiza-velikih-skupova-podataka

Implementation of algorithms presented in the course Analysis of Massive Datasets (Analiza velikih skupova podataka, AVSP)

Language: Java - Size: 1.03 MB - Last synced at: 16 days ago - Pushed at: 11 months ago - Stars: 0 - Forks: 0

fturati/floc-minhash-attacks

Implementation for the attacks of the paper "Locality-Sensitive Hashing Does Not Guarantee Privacy! Attacks on Google's FLoC and the MinHash Hierarchy System"

Language: Python - Size: 17 MB - Last synced at: over 1 year ago - Pushed at: almost 2 years ago - Stars: 0 - Forks: 0

privacy-lsh/floc-minhash

Implementation for the attacks of the paper "Locality-Sensitive Hashing Does Not Guarantee Privacy! Attacks on Google's FLoC and the MinHash Hierarchy System".

Language: Python - Size: 16.9 MB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 0 - Forks: 0

nepiskopos/duplicate-questions-detection-lsh

Knowledge extraction through Data Analysis, including Locality Sensitive Hashing (LSH).

Language: Jupyter Notebook - Size: 423 KB - Last synced at: over 1 year ago - Pushed at: almost 3 years ago - Stars: 0 - Forks: 0

innerNULL/osimhash

A deduplication library built on top of [SIMHASH](https://github.com/yanyiwu/simhash).

Language: C++ - Size: 33.2 KB - Last synced at: about 2 years ago - Pushed at: about 3 years ago - Stars: 0 - Forks: 0

dbrcina/AVSP-FER-2020-21

Lab solutions for Analysis of Massive Datasets ("Analiza velikih skupova podataka") course at FER 2020/21

Language: Java - Size: 1.32 MB - Last synced at: over 1 year ago - Pushed at: about 3 years ago - Stars: 0 - Forks: 0

long-gong/datasets-E2H

Datasets Euclidean to Hamming Conversion

Language: C++ - Size: 186 MB - Last synced at: about 1 month ago - Pushed at: over 4 years ago - Stars: 0 - Forks: 1

Chant00/simhash Fork of 1e0ng/simhash

A Python Implementation of Simhash Algorithm

Language: Python - Size: 1.69 MB - Last synced at: 8 months ago - Pushed at: about 5 years ago - Stars: 0 - Forks: 0

fpopic/avsp

(Class) Big Data Analysis Course Assignments

Language: Java - Size: 28.2 MB - Last synced at: about 2 months ago - Pushed at: about 8 years ago - Stars: 0 - Forks: 0