Topic: "web-data-extraction"
MohamedHmini/iww
AI-based web wrapper for web content extraction
Language: Python - Size: 59.2 MB - Last synced at: about 2 months ago - Pushed at: over 2 years ago - Stars: 100 - Forks: 14

neurons-me/this.url
The this.url class fetches and parses URL data, returning an object with structured information that can be kept in a database or other storage and used by machine learning algorithms.
Language: JavaScript - Size: 2.08 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 60 - Forks: 0

lightfeed/extractor
Use LLMs to robustly extract structured data from HTML and markdown
Language: TypeScript - Size: 181 KB - Last synced at: about 21 hours ago - Pushed at: about 21 hours ago - Stars: 37 - Forks: 3

luminati-io/java-web-scraping
Quick guide, with code examples, on how to use Java for web scraping
Size: 201 KB - Last synced at: about 2 months ago - Pushed at: 6 months ago - Stars: 16 - Forks: 4

dstark5/gnews-scraper
GNewsScraper is a TypeScript package that scrapes article data from Google News based on a keyword or phrase. It returns the results as an array of JSON objects, making it convenient to access and use the scraped information
Language: TypeScript - Size: 153 KB - Last synced at: 14 days ago - Pushed at: almost 2 years ago - Stars: 12 - Forks: 3

jjonescz/awe
AI-based web extractor
Language: Python - Size: 2.16 MB - Last synced at: about 2 months ago - Pushed at: over 2 years ago - Stars: 11 - Forks: 2

Boomslet/Web_Crawler
Open-source web crawler
Language: Python - Size: 34.2 KB - Last synced at: over 2 years ago - Pushed at: almost 7 years ago - Stars: 9 - Forks: 6

wbsg-uni-mannheim/WDCFramework
Java framework used by the Web Data Commons project to extract Microdata, Microformats, and RDFa data, web graphs, and HTML tables from the web crawls provided by the Common Crawl Foundation.
Language: Java - Size: 2.46 MB - Last synced at: over 2 years ago - Pushed at: over 2 years ago - Stars: 8 - Forks: 1

kaizenplatform/FacebookInsightsConnector
The Tableau Web Data Connector for Facebook Insights API
Language: JavaScript - Size: 225 KB - Last synced at: about 2 months ago - Pushed at: almost 8 years ago - Stars: 8 - Forks: 4

lekhmanrus/real-shot-pdf
RealShotPDF is a Chrome extension designed to simplify the process of creating PDF documents from web content. The extension allows users to navigate through selected webpages, parse and display links in a tree view, and generate PDFs for the chosen pages. It operates locally without sending any data to external servers.
Language: TypeScript - Size: 406 KB - Last synced at: 2 months ago - Pushed at: over 1 year ago - Stars: 6 - Forks: 1

DemonMartin/scrappey-wrapper
An API wrapper for Scrappey.com written in Node.js (Cloudflare bypass & solver)
Language: JavaScript - Size: 61.5 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 6 - Forks: 0

lightfeed/sdk
Lightfeed SDK to search and filter web data
Language: Python - Size: 124 KB - Last synced at: about 20 hours ago - Pushed at: about 21 hours ago - Stars: 5 - Forks: 1

ranajahanzaib/wdx
A web data extraction library written in Go.
Language: Go - Size: 82 KB - Last synced at: about 14 hours ago - Pushed at: about 2 months ago - Stars: 2 - Forks: 0

oxpath/oxpath
OXPath from Oxford
Language: Java - Size: 4.69 MB - Last synced at: over 2 years ago - Pushed at: about 3 years ago - Stars: 2 - Forks: 1

hoxhaeris/get_muitiple
Get and process multiple resources from the web, using asyncio (aiohttp) to fetch the data and multiprocessing/multithreading to process it.
Language: Python - Size: 18.6 KB - Last synced at: over 2 years ago - Pushed at: over 4 years ago - Stars: 2 - Forks: 0

wbsg-uni-mannheim/wdc-page
This repository contains the source files of the Web Data Commons website and is used to maintain the site. The Web Data Commons project extracts structured data from the Common Crawl
Language: HTML - Size: 57.3 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 1 - Forks: 1

yumeangelica/store_data_extractor
A Python-based web data extractor designed to monitor online stores and track product updates in real time. This project is developed as a standalone module but is also part of the larger jirai_sweeties project, where it integrates with additional features.
Language: Python - Size: 0 Bytes - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

BigDataIA-Spring2025-4/DAMG7245_Assignment01
A Streamlit-based app with a FastAPI backend for extracting structured data (text, images, tables) from websites and PDFs. Processed data is stored in AWS S3 and rendered in a markdown-standardized format. APIs are deployed on Google Cloud Run Service
Language: Jupyter Notebook - Size: 90.7 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

mibrahimbashir/customer_reviews
A Comprehensive Script To Extract Customer Reviews For Machine Learning
Language: Python - Size: 10.7 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

gonzalopezgil/scraping-interface
Python-based desktop app for effortless web scraping
Language: Python - Size: 4.15 MB - Last synced at: over 1 year ago - Pushed at: almost 2 years ago - Stars: 0 - Forks: 0

sc10ntech/extract-site-metadata
Metadata extractor for the sprawling web ⚙️
Language: TypeScript - Size: 2.69 MB - Last synced at: 19 days ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 1

wbsg-uni-mannheim/StructuredDataProfiler
Java project for profiling the results of the yearly Web Data Commons extraction of structured data with RDFa, Microdata, Microformats, and embedded JSON-LD annotations.
Language: Java - Size: 101 KB - Last synced at: over 2 years ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

wbsg-uni-mannheim/schemaorg-tables
This repository contains the code and data download links to reproduce the building process of the 2021 Schema.org Table Corpus.
Language: Python - Size: 26.4 KB - Last synced at: about 2 years ago - Pushed at: about 4 years ago - Stars: 0 - Forks: 0

chelvanai/Web-data-scrap
Web data scraping with Scrapy
Language: Python - Size: 9.77 KB - Last synced at: almost 2 years ago - Pushed at: over 5 years ago - Stars: 0 - Forks: 0

dariga-sm/Word-Frequency-in-Moby-Dick
Scrape the novel Moby Dick from Project Gutenberg using the Python package requests, extract words from the web data using BeautifulSoup, and analyze the distribution of words using the Natural Language Toolkit (nltk)
Language: HTML - Size: 1.01 MB - Last synced at: about 1 year ago - Pushed at: over 5 years ago - Stars: 0 - Forks: 0
