Topic: "url-normalization"
sindresorhus/normalize-url
Normalize a URL
Language: JavaScript - Size: 124 KB - Last synced at: 5 days ago - Pushed at: about 1 year ago - Stars: 847 - Forks: 123

patternhelloworld/url-knife
Extract and decompose (fuzzy) URLs (including emails, which are conceptually a part of URLs) in texts with Area-Pattern-based modularity
Language: TypeScript - Size: 958 KB - Last synced at: 15 days ago - Pushed at: 4 months ago - Stars: 352 - Forks: 15

adbar/courlan
Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters
Language: Python - Size: 547 KB - Last synced at: 14 days ago - Pushed at: 5 months ago - Stars: 137 - Forks: 9

xojoc/cleanurl
Remove clutter from URLs and return a canonicalized version
Language: Python - Size: 77.1 KB - Last synced at: 9 months ago - Pushed at: 12 months ago - Stars: 17 - Forks: 3

hanover-computing/canonicize-url
Get a stable, canonical version of any URL, with DNS and HTTPS checks, redirects, tracker stripping, and canonical link extraction!
Language: JavaScript - Size: 1.57 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 12 - Forks: 0

vladkens/url-normalize
Normalize URLs to a standardized form. HTTPS by default, flexible configuration, custom protocols, domain extraction, humanizing URLs, and punycode support. Both CJS & ESM modules available.
Language: TypeScript - Size: 159 KB - Last synced at: 13 days ago - Pushed at: 3 months ago - Stars: 9 - Forks: 2

seroperson/urlopt4s
Allows you to remove ad/tracking query params from a given URL in Scala
Language: Scala - Size: 77.1 KB - Last synced at: about 1 month ago - Pushed at: 6 months ago - Stars: 8 - Forks: 0

toimik/UrlNormalization
URL normalizer to canonicalize (standardize) the text representation of a URL to determine if differently-formatted URLs are identical
Language: C# - Size: 57.6 KB - Last synced at: about 1 month ago - Pushed at: about 1 year ago - Stars: 5 - Forks: 0

chipslays/php-url-fingerprint
Pathor is a PHP library for normalizing, analyzing, and comparing URLs.
Language: PHP - Size: 21.5 KB - Last synced at: 16 days ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

simonpierreboucher/Crawler
A robust, modular web crawler built in Python for extracting and saving content from websites. This crawler is specifically designed to extract text content from both HTML and PDF files, saving them in a structured format with metadata.
Language: Python - Size: 87.9 KB - Last synced at: about 2 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0

Manu-sh/http_normalizer
HTTP URL normalization for web crawlers
Language: C++ - Size: 481 KB - Last synced at: 2 months ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

Atanukumardey/NLP-NoteBooks
Natural Language Processing related notebooks (Machine Learning)
Language: Jupyter Notebook - Size: 301 KB - Last synced at: almost 2 years ago - Pushed at: over 3 years ago - Stars: 0 - Forks: 0
