GitHub topics: document-similarity

Repositories

piskvorky/gensim

Topic Modelling for Humans

Language: Python - Size: 102 MB - Last synced at: 1 day ago - Pushed at: about 1 month ago - Stars: 16,287 - Forks: 4,416

alpturkoral/LSA

Topic modeling and document similarity using Latent Semantic Analysis (LSA) in Python

Language: Jupyter Notebook - Size: 10.7 KB - Last synced at: about 1 month ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

Assesses MinHash LSH for text similarity, comparing it against kNN over BART embeddings as ground truth. Covers data preprocessing, shingle creation, and LSH experiments; the findings characterize LSH's efficiency in document similarity tasks.

Language: Jupyter Notebook - Size: 370 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

zbmed-semtec/hybrid-pre-doc2vec-doc-relevance

Hybrid approach combining dictionary-based NER and doc2vec

Language: Jupyter Notebook - Size: 23.9 MB - Last synced at: 3 months ago - Pushed at: over 1 year ago - Stars: 1 - Forks: 0

zbmed-semtec/word2doc2vec-doc-relevance

An approach exploring and assessing literature-based doc-2-doc recommendations using word2vec combined with doc2vec, and applying it to TREC and RELISH datasets

Language: Python - Size: 13.2 MB - Last synced at: 3 months ago - Pushed at: about 1 year ago - Stars: 1 - Forks: 0

zbmed-semtec/doc2vec-doc-relevance

An approach exploring and assessing literature-based doc-2-doc recommendations using doc2vec and applying it to the RELISH dataset.

Language: Python - Size: 9.55 MB - Last synced at: 3 months ago - Pushed at: 10 months ago - Stars: 1 - Forks: 0

maryamzaman30/Question-Answering-with-Transformers

Natural Language Processing Internship - Elevvo Pathways

Language: Jupyter Notebook - Size: 469 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

andrewmcloud/consimilo

A Clojure library for querying large data-sets on similarity

Language: Clojure - Size: 536 KB - Last synced at: about 1 month ago - Pushed at: almost 7 years ago - Stars: 65 - Forks: 4

waghmareaniket/document-similarity-in-langchain-openai

Uses OpenAI's embedding model to convert a query and a list of cricket-related documents into vector representations. It calculates cosine similarity between the query and each document to find the most relevant one. Finally, it prints the query, the most similar document, and their similarity score.

Language: Python - Size: 4.88 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0
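
The query-versus-documents ranking described in this entry can be sketched as follows. This is a minimal illustration, not the repository's code: the `embed` function is a hypothetical stand-in that builds bag-of-words count vectors instead of calling an embedding model, and the sample documents are made up.

```python
import math
from collections import Counter


def embed(text):
    # Hypothetical stand-in for an embedding model: a sparse term-count vector.
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    # Dot product over shared terms divided by the product of vector norms.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, documents):
    # Score every document against the query and return (score, document).
    q = embed(query)
    scored = [(cosine_similarity(q, embed(d)), d) for d in documents]
    return max(scored, key=lambda pair: pair[0])


if __name__ == "__main__":
    docs = [
        "the batsman scored a century",
        "the fast bowler took five wickets",
    ]
    score, doc = most_similar("fast bowler", docs)
    print(doc, round(score, 3))
```

Swapping `embed` for a real embedding call would leave the ranking logic unchanged, provided the vectors are converted to a form `cosine_similarity` accepts.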

tahirkorma/langchain-models

A modular LangChain-based repository integrating Chat Models, traditional LLMs, and Embedding Models from OpenAI, Anthropic, Google, Hugging Face, and Local Inference. Built with the latest LangChain version for rapid prototyping of modern NLP and AI agent workflows.

Language: Python - Size: 8.79 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

lukacupic/PDF-Document-Management-and-Search-System

Bachelor's Thesis at FER, University of Zagreb, 2018.

Language: Java - Size: 56 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 1 - Forks: 1

allenai/aspire

Repo for Aspire - A scientific document similarity model based on matching fine-grained aspects of scientific papers.

Language: Python - Size: 268 KB - Last synced at: about 1 month ago - Pushed at: over 2 years ago - Stars: 54 - Forks: 6

oborchers/Fast_Sentence_Embeddings

Compute Sentence Embeddings Fast!

Language: Jupyter Notebook - Size: 2.86 MB - Last synced at: 6 months ago - Pushed at: almost 3 years ago - Stars: 623 - Forks: 84

izikeros/sentence-plagiarism

Compare sentences from input document with all sentences from reference documents - find very similar ones.

Language: Python - Size: 263 KB - Last synced at: about 2 months ago - Pushed at: 6 months ago - Stars: 3 - Forks: 0

axyc777/Plagiarism-Checker

Plagiarism Checker is a Flask-based web application that allows users to upload .txt or .docx files and checks for plagiarism using advanced text comparison and optional Google Search (via SERP API). It uses Natural Language Processing (NLP), cosine similarity, and keyword extraction to intelligently compare documents or check them online.

Language: Python - Size: 17.6 KB - Last synced at: 7 months ago - Pushed at: 7 months ago - Stars: 0 - Forks: 0

hiropppe/text-models

Topic Modeling in Cython

Language: Jupyter Notebook - Size: 48.9 MB - Last synced at: 6 months ago - Pushed at: 8 months ago - Stars: 1 - Forks: 0

IlyaGusev/tgcontest

Telegram Data Clustering contest solution by Mindful Squirrel

Language: HTML - Size: 14.1 MB - Last synced at: 8 months ago - Pushed at: over 3 years ago - Stars: 96 - Forks: 25

eggdropsoap/tilsh

tilsh implements the TLSH locality-sensitive hash algorithm suite

Language: JavaScript - Size: 25.4 KB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 0 - Forks: 0

zbmed-semtec/wmd-word2vec

An approach exploring and assessing literature-based doc-2-doc recommendations using word2vec and word mover's distance and applying it to the RELISH dataset.

Language: Python - Size: 1.1 MB - Last synced at: 3 months ago - Pushed at: 10 months ago - Stars: 0 - Forks: 0

Mohammed-3tef/Document_Similarity

A C++ program to measure the similarity between two text documents using efficient algorithms like cosine similarity, with support for preprocessing and customization.

Language: C++ - Size: 149 KB - Last synced at: 4 months ago - Pushed at: 10 months ago - Stars: 0 - Forks: 0

zayedrais/DocumentSearchEngine

Document Search Engine project with TF-IDF and the Google Universal Sentence Encoder model

Language: Jupyter Notebook - Size: 28.6 MB - Last synced at: 7 months ago - Pushed at: over 2 years ago - Stars: 53 - Forks: 24

shreyansh26/MinHash-Implemenation

A simple MinHash implementation based on the explanation in the Mining of Massive Datasets course by Stanford

Language: Python - Size: 7.4 MB - Last synced at: 6 months ago - Pushed at: about 3 years ago - Stars: 2 - Forks: 0

Forthoney/doc_sim

Approximate document similarity with Minhash + Locality Sensitive Hashing

Language: Ruby - Size: 48.8 KB - Last synced at: about 1 month ago - Pushed at: about 2 years ago - Stars: 1 - Forks: 2

DrKenReid/Generalized-Analysis-of-Text-Data

A comprehensive toolkit for analyzing text data using various AI and NLP techniques, including topic modeling, sentiment analysis, and text classification, demonstrated on the 20 Newsgroups dataset.

Language: Jupyter Notebook - Size: 1.45 MB - Last synced at: 4 months ago - Pushed at: over 1 year ago - Stars: 1 - Forks: 0

meenavyas/Misc

Contains interesting projects like Cat face detection, cat face recognition, code generation, Building chatbot, finding similar documents, image segmentation, UCI credit card, anomaly detection, MNIST etc.

Language: Jupyter Notebook - Size: 47.6 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 18 - Forks: 36

developersaintt/Document-Similarity

was curious about how plagiarism checker works, ended up learning about something completely different 😂

Language: Python - Size: 7.81 KB - Last synced at: over 1 year ago - Pushed at: over 5 years ago - Stars: 0 - Forks: 0

NourKamaly/TheArtInOurWorlds-NASA-Space-Apps

NASA space apps 2022 local winner (Cairo). This project is the solution designed for the NASA space apps challenge hackathon 2022 by team NASART solving challenge: The Art in Our Worlds.

Language: Jupyter Notebook - Size: 96.5 MB - Last synced at: over 1 year ago - Pushed at: about 3 years ago - Stars: 0 - Forks: 4

Vincent96034/DocumentSimilarity

Q3 of Final Project Assignment of the course 'Foundations of Data Science' @ CBS

Language: Python - Size: 4.14 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

Dipesh-Pokhrel/doc_similarity

Similarity between two documents.

Language: Python - Size: 3.91 KB - Last synced at: almost 2 years ago - Pushed at: over 3 years ago - Stars: 0 - Forks: 0

parvez86/Smart-Recruitment-System

A simple Django-based resume ranker website where recruiters post their jobs and candidates apply for their desired vacancies. The system computes the document similarity between the job description and the candidate resumes, generates similarity scores using the KNN model, and ranks or shortlists the candidate resumes.

Language: HTML - Size: 175 MB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 3 - Forks: 0

captv89/findSimilarDocs

A PoC on document comparison using various methods in NLP

Language: Jupyter Notebook - Size: 382 KB - Last synced at: 4 months ago - Pushed at: almost 2 years ago - Stars: 0 - Forks: 0

EslamElbassel/Indexing-and-Documents-Similarity

Measures the similarity between documents by calculating their Jaccard similarity and provides a score based on how similar their sentences are to each other

Language: Java - Size: 7.81 KB - Last synced at: almost 2 years ago - Pushed at: over 3 years ago - Stars: 0 - Forks: 0

AustinZuniga/Auto-tagging-of-Theses-and-Dissertations-of-Bicol-University-Searching-and-Matching-

A system for automatic tagging of metadata of theses and dissertations from Bicol University

Language: Python - Size: 22.5 KB - Last synced at: about 2 years ago - Pushed at: over 7 years ago - Stars: 1 - Forks: 0

priyanka-ddit/NLP

This repository demonstrates how to explore the spiritual world using NLP techniques like sentiment analysis, topic modeling, information retrieval and text summarization.

Language: Jupyter Notebook - Size: 3.05 MB - Last synced at: about 2 years ago - Pushed at: over 4 years ago - Stars: 1 - Forks: 0

abhilampard/Simple-Plagiarism-Checker

Web Application for checking the similarity between query and document using the concept of Cosine Similarity.

Language: Python - Size: 6.84 KB - Last synced at: about 2 years ago - Pushed at: about 3 years ago - Stars: 79 - Forks: 59

now-youre-gittin-it/nlp-workplace-comedies

NLP on American workplace comedy TV pilot transcripts using multiple NLP libraries in Python.

Language: Jupyter Notebook - Size: 17.2 MB - Last synced at: over 2 years ago - Pushed at: over 3 years ago - Stars: 0 - Forks: 0

malteos/wikipedia-article-recommendations

Survey data and Python code for the ICADL 2021 paper "A Qualitative Evaluation of User Preference for Link-based vs. Text-based Recommendations of Wikipedia Articles"

Language: Jupyter Notebook - Size: 746 KB - Last synced at: over 1 year ago - Pushed at: about 4 years ago - Stars: 5 - Forks: 0

Siddhantmest/Categorizing-amazon-products

Classifying products into categories using NLP techniques

Language: Jupyter Notebook - Size: 556 KB - Last synced at: over 2 years ago - Pushed at: almost 4 years ago - Stars: 0 - Forks: 0

shmsi/document-ranking

Document ranking with word embeddings

Language: Python - Size: 43 KB - Last synced at: over 2 years ago - Pushed at: almost 6 years ago - Stars: 1 - Forks: 0

MSVCode/doc-similarity

Simple document similarity module implemented in NodeJS

Language: JavaScript - Size: 3.91 KB - Last synced at: about 1 month ago - Pushed at: almost 8 years ago - Stars: 2 - Forks: 2

ribbas/Highlite

Document comparison tool

Language: Python - Size: 8.66 MB - Last synced at: over 2 years ago - Pushed at: about 8 years ago - Stars: 0 - Forks: 1

shrebox/Natural-Language-Processing

Compilation of Natural Language Processing (NLP) codes. BONUS: Link to Information Retrieval (IR) codes compilation. (check out the README)

Language: Python - Size: 1.85 MB - Last synced at: over 2 years ago - Pushed at: about 4 years ago - Stars: 12 - Forks: 0

mdietrichstein/ir-search-engine-rust

Rust-based text search engine from scratch supporting multiple document similarity metrics (TF-IDF, BM25, BM25VA)

Language: Rust - Size: 132 KB - Last synced at: about 2 months ago - Pushed at: over 4 years ago - Stars: 5 - Forks: 0

zuliani99/All-Pairs-Docs-Similarity

Given a set of documents and a minimum required similarity threshold, find the number of document pairs that exceed the threshold

Language: Jupyter Notebook - Size: 14.1 MB - Last synced at: over 1 year ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

nunososorio/docxmatch

DocxMatch is a Streamlit app that analyzes the similarity between Word files.

Language: Python - Size: 43.9 KB - Last synced at: over 2 years ago - Pushed at: over 2 years ago - Stars: 2 - Forks: 1

nicoDs96/Document-Similarity-using-Python-and-PySpark

Document Similarity with Apache Spark using Locality Sensitive Hashing and Python

Language: Jupyter Notebook - Size: 444 KB - Last synced at: over 2 years ago - Pushed at: about 3 years ago - Stars: 6 - Forks: 1

massanishi/document_similarity_algorithms_experiments

Document similarity algorithms experiment - Jaccard, TF-IDF, Doc2vec, USE, and BERT.

Language: Python - Size: 28.3 KB - Last synced at: almost 3 years ago - Pushed at: over 5 years ago - Stars: 67 - Forks: 29

Sarthakjain1206/Intelligent_Document_Finder

Document Search Engine Tool

Language: Python - Size: 56.3 MB - Last synced at: over 2 years ago - Pushed at: almost 3 years ago - Stars: 57 - Forks: 13

biovino1/BuffettLetters

NLP of Warren Buffett's annual letter to shareholders

Language: Jupyter Notebook - Size: 10.9 MB - Last synced at: over 2 years ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

maxoodf/tgnews

Telegram Data Clustering Contest (Bossy Gnu's submission)

Language: C++ - Size: 41 KB - Last synced at: 8 months ago - Pushed at: almost 5 years ago - Stars: 4 - Forks: 2

TSunny007/StackOverflowAnalytics

Data mining on stack overflow Q/A data to understand the landscape of languages and developers in computer science

Language: Jupyter Notebook - Size: 21.9 MB - Last synced at: almost 3 years ago - Pushed at: over 7 years ago - Stars: 1 - Forks: 1

Sarthakjain1206/Intelligent-Document-Finder

A tool that can find any of your documents using semantic search

Language: Python - Size: 43.5 MB - Last synced at: over 2 years ago - Pushed at: almost 3 years ago - Stars: 7 - Forks: 1

PolunLin/doc_similiarty

Language: HTML - Size: 229 KB - Last synced at: 8 months ago - Pushed at: over 3 years ago - Stars: 1 - Forks: 0

tejaspradhan/AI-based-Hiring-Platform

A two-ended hiring web application built using Flask. The application uses document similarity techniques for recommendation.

Language: HTML - Size: 4.51 MB - Last synced at: over 2 years ago - Pushed at: over 3 years ago - Stars: 0 - Forks: 3

sabareeswarans11/SearchEngine_InvertedIndex

Information Retrieval: Document Similarity Measure Pre-processing to Build Document Vectors for Web Page Content Analysis.

Language: Jupyter Notebook - Size: 2.22 MB - Last synced at: over 2 years ago - Pushed at: over 3 years ago - Stars: 0 - Forks: 0

TSunny007/Document-Similarity

Using Jaccard-Similarity and Minhashing to determine similarity between two text documents

Language: Jupyter Notebook - Size: 26.4 KB - Last synced at: almost 3 years ago - Pushed at: over 7 years ago - Stars: 6 - Forks: 3
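
As a rough illustration of the technique named in this entry (not the repository's code), the sketch below computes exact Jaccard similarity over word shingles and then estimates it from MinHash signatures. The shingle size, the number of hash functions, and the use of seeded BLAKE2b as the hash family are all assumptions chosen for brevity.

```python
import hashlib


def shingles(text, k=3):
    # Set of overlapping k-word shingles; short texts yield a single shingle.
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def jaccard(a, b):
    # Exact Jaccard similarity: |intersection| / |union|.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash(seed, item):
    # One member of the hash family, selected by seed (assumed construction).
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=128):
    # For each hash function keep the minimum hash value over the (non-empty) set.
    return [min(_hash(seed, x) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    # The fraction of matching signature positions estimates Jaccard similarity.
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


if __name__ == "__main__":
    a = shingles("the quick brown fox jumps over the lazy dog")
    b = shingles("the quick brown fox leaps over the lazy dog")
    print("exact:", jaccard(a, b))
    print("estimate:", estimate_jaccard(minhash_signature(a), minhash_signature(b)))
```

The estimate converges toward the exact value as `num_hashes` grows; signatures are fixed-length, which is what makes them convenient inputs for locality-sensitive hashing.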

omarabdelaz1z/Inverted-Index-and-Document-Similarity

Language: Python - Size: 113 KB - Last synced at: over 2 years ago - Pushed at: over 4 years ago - Stars: 1 - Forks: 0

jungsoh/wordvecs-word-analogy-by-document-similarity

Use of word embeddings and document similarity to solve word analogy problems

Language: Python - Size: 65.1 MB - Last synced at: over 2 years ago - Pushed at: almost 4 years ago - Stars: 0 - Forks: 0

mohammaduzair9/Document-Searching

Document searching from queries using Inverted index

Language: Python - Size: 717 KB - Last synced at: almost 3 years ago - Pushed at: over 7 years ago - Stars: 2 - Forks: 0

iboraham/job-finder

The framework that finds a perfect job match for you provided through scraped data from indeed.co.uk.

Language: Python - Size: 19.7 MB - Last synced at: over 2 years ago - Pushed at: over 4 years ago - Stars: 1 - Forks: 0

adriamoya/bcpnews

Classifying news articles with deep learning to build an automatic newsletter

Language: Jupyter Notebook - Size: 70 MB - Last synced at: over 2 years ago - Pushed at: over 7 years ago - Stars: 1 - Forks: 0

1tangerine1day/search_engine

A search engine for PubMed articles

Language: JavaScript - Size: 5.91 MB - Last synced at: over 2 years ago - Pushed at: over 5 years ago - Stars: 0 - Forks: 0

topcat/pubmed-docsim

Code to train an LSI model using PubMed OA medical documents and to use pre-trained PubMed models on your own corpus for document similarity.

Language: Python - Size: 4.88 KB - Last synced at: over 2 years ago - Pushed at: almost 7 years ago - Stars: 2 - Forks: 1

tifaniwarnita/Document-Similarity

Document similarity using cosine distance, tf-idf, and latent semantic analysis.

Language: R - Size: 51.4 MB - Last synced at: almost 3 years ago - Pushed at: almost 9 years ago - Stars: 0 - Forks: 5

yadhu98/Document-Similarity-using-Python

This program checks document similarity using the Natural Language Toolkit (NLTK) and cosine similarity.

Language: Python - Size: 3.91 KB - Last synced at: over 1 year ago - Pushed at: over 7 years ago - Stars: 1 - Forks: 1

Bit-Nation/notary

The Bitnation Jurisdiction Public Notary DApp

Language: JavaScript - Size: 139 KB - Last synced at: 6 months ago - Pushed at: about 7 years ago - Stars: 2 - Forks: 1

jgiere/DocGraph

Index documents in Apache Solr and see similarities in the document's contents.

Language: Java - Size: 243 KB - Last synced at: about 2 years ago - Pushed at: about 8 years ago - Stars: 0 - Forks: 0

JeetThakare/NaturalLangProcessing

NLP Projects

Language: Python - Size: 514 KB - Last synced at: almost 3 years ago - Pushed at: about 8 years ago - Stars: 0 - Forks: 0

zeitunik/Big-Data

Big data homework solutions

Language: Python - Size: 146 KB - Last synced at: over 2 years ago - Pushed at: over 8 years ago - Stars: 0 - Forks: 0

Related Keywords

python 16, nlp 13, cosine-similarity 12, machine-learning 10, natural-language-processing 10, tf-idf 9, topic-modeling 7, information-retrieval 7, minhash 6, word-embeddings 5, search-engine 5, plagiarism-detection 5, nltk 4, jaccard-similarity 4, lsh 4, phase-one 4, ontoclue 4, deep-learning 4, doc2doc-relevance 4, cpp 3, flask 3, nltk-python 3, text-summarization 3, document-embeddings 3, inverted-index 3, latent-semantic-analysis 3, pos-tagging 3, semantic-search 3, data-science 3, fasttext 3, tensorflow 3, pandas 3, document-clustering 3, word2vec 3, locality-sensitive-hashing 3, ner 2, similarity 2, latent-dirichlet-allocation 2, langchain 2, scrapy 2, doc2vec 2, pyspark 2, jupyter-notebook 2, tfidf-vectorizer 2, search 2, tfidf 2, text-analysis 2, scikit-learn 2, nlp-machine-learning 2, django 2, document-search 2, knn 2, data-mining 2, gensim 2, word-similarity 2, cython 2, sentence-similarity 2, matplotlib 2, clustering 2, plagiarism-checker 2, python-flask 2, universal-sentence-encoder 2, sentiment-analysis 2, cosine-distance 2, dependency-parser 1, live-website-scraping 1, command-line-tool 1, bert-model 1, local-winner 1, nasa 1, nasa-space-apps-cairo-2022 1, nasa-spaceapps-challenge 1, bachelor-thesis 1, nasa-spaceapps-challenge-2022 1, tsne 1, network-visualization 1, speech-to-text 1, stable-diffusion 1, text-to-image 1, text-classification 1, text-to-speech 1, pca 1, the-art-in-our-worlds 1, keyword-extraction 1, newsgroups 1, linear-discriminant-analysis 1, artificial-intelligence 1, semester-1 1, jobsearch 1, cbs 1, mongodb 1, autotagging 1, exploratory-data-analysis 1, extracting-features 1, sklearn 1, sentime 1, data-cleaning 1, text-mining 1, beir 1, newspaper 1