An open API service providing repository metadata for many open source software ecosystems.

Topic: "web-data-extraction"

MohamedHmini/iww

AI-based web wrapper for web content extraction

Language: Python - Size: 59.2 MB - Last synced at: about 2 months ago - Pushed at: over 2 years ago - Stars: 100 - Forks: 14

neurons-me/this.url

The this.url class is designed to fetch and parse URL data, returning an object with structured information that can be kept in a database or other storage and then used by machine learning algorithms.

Language: JavaScript - Size: 2.08 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 60 - Forks: 0

lightfeed/extractor

Use LLMs to robustly extract structured data from HTML and markdown

Language: TypeScript - Size: 181 KB - Last synced at: about 21 hours ago - Pushed at: about 21 hours ago - Stars: 37 - Forks: 3

luminati-io/java-web-scraping

Quick guide with a code example showing how to use Java for web scraping

Size: 201 KB - Last synced at: about 2 months ago - Pushed at: 6 months ago - Stars: 16 - Forks: 4

dstark5/gnews-scraper

GNewsScraper is a TypeScript package that scrapes article data from Google News based on a keyword or phrase. It returns the results as an array of JSON objects, making the scraped information convenient to access and use.

Language: TypeScript - Size: 153 KB - Last synced at: 14 days ago - Pushed at: almost 2 years ago - Stars: 12 - Forks: 3

jjonescz/awe

AI-based web extractor

Language: Python - Size: 2.16 MB - Last synced at: about 2 months ago - Pushed at: over 2 years ago - Stars: 11 - Forks: 2

Boomslet/Web_Crawler

Open-source web crawler

Language: Python - Size: 34.2 KB - Last synced at: over 2 years ago - Pushed at: almost 7 years ago - Stars: 9 - Forks: 6

wbsg-uni-mannheim/WDCFramework

Java framework used by the Web Data Commons project to extract Microdata, Microformats, and RDFa data, web graphs, and HTML tables from the web crawls provided by the Common Crawl Foundation.

Language: Java - Size: 2.46 MB - Last synced at: over 2 years ago - Pushed at: over 2 years ago - Stars: 8 - Forks: 1

kaizenplatform/FacebookInsightsConnector

The Tableau Web Data Connector for Facebook Insights API

Language: JavaScript - Size: 225 KB - Last synced at: about 2 months ago - Pushed at: almost 8 years ago - Stars: 8 - Forks: 4

lekhmanrus/real-shot-pdf

RealShotPDF is a Chrome extension designed to simplify the process of creating PDF documents from web content. The extension allows users to navigate through selected webpages, parse and display links in a tree view, and generate PDFs for the chosen pages. It operates locally without sending any data to external servers.

Language: TypeScript - Size: 406 KB - Last synced at: 2 months ago - Pushed at: over 1 year ago - Stars: 6 - Forks: 1

DemonMartin/scrappey-wrapper

An API wrapper for Scrappey.com written in Node.js (Cloudflare bypass & solver)

Language: JavaScript - Size: 61.5 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 6 - Forks: 0

lightfeed/sdk

Lightfeed SDK to search and filter web data

Language: Python - Size: 124 KB - Last synced at: about 20 hours ago - Pushed at: about 21 hours ago - Stars: 5 - Forks: 1

ranajahanzaib/wdx

A web data extraction library written in Go.

Language: Go - Size: 82 KB - Last synced at: about 14 hours ago - Pushed at: about 2 months ago - Stars: 2 - Forks: 0

oxpath/oxpath

OXPath from Oxford

Language: Java - Size: 4.69 MB - Last synced at: over 2 years ago - Pushed at: about 3 years ago - Stars: 2 - Forks: 1

hoxhaeris/get_muitiple

Get and process multiple resources from the web, using asyncio (aiohttp) to fetch the data and multiprocessing/multithreading to process it.

Language: Python - Size: 18.6 KB - Last synced at: over 2 years ago - Pushed at: over 4 years ago - Stars: 2 - Forks: 0
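
The fetch-then-process pattern this entry describes can be sketched as follows. This is a minimal illustration, not the repository's code: `fetch`, `process`, and `get_multiple` are hypothetical names, the network call is simulated with `asyncio.sleep` so the sketch runs without aiohttp, and a thread pool stands in for the multiprocessing/multithreading step.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


async def fetch(url: str) -> str:
    """Stand-in for an aiohttp request (assumption): simulates network latency."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


def process(body: str) -> int:
    """Hypothetical processing step: here, just the length of the body."""
    return len(body)


async def get_multiple(urls: list[str]) -> list[int]:
    # Fetch all resources concurrently on the event loop.
    bodies = await asyncio.gather(*(fetch(u) for u in urls))
    # Hand processing to a worker pool so it does not block the loop.
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=4) as pool:
        tasks = [loop.run_in_executor(pool, process, b) for b in bodies]
        return await asyncio.gather(*tasks)


if __name__ == "__main__":
    print(asyncio.run(get_multiple(["a", "bb"])))
```

For CPU-bound processing, a `ProcessPoolExecutor` could replace the thread pool in the same `run_in_executor` call.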

wbsg-uni-mannheim/wdc-page

This repository contains the source files of the Web Data Commons website and is used to maintain the site. The Web Data Commons project extracts structured data from the Common Crawl.

Language: HTML - Size: 57.3 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 1 - Forks: 1

yumeangelica/store_data_extractor

A Python-based web data extractor designed to monitor online stores and track product updates in real-time. This project is developed as a standalone module but is also part of the larger jirai_sweeties project, where it integrates with additional features.

Language: Python - Size: 0 Bytes - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

BigDataIA-Spring2025-4/DAMG7245_Assignment01

A Streamlit-based app with a FastAPI backend for extracting structured data (text, images, tables) from websites and PDFs. Processed data is stored in AWS S3 and rendered in a markdown-standardized format. The APIs are deployed on Google Cloud Run.

Language: Jupyter Notebook - Size: 90.7 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

mibrahimbashir/customer_reviews

A Comprehensive Script To Extract Customer Reviews For Machine Learning

Language: Python - Size: 10.7 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

gonzalopezgil/scraping-interface

Python-based desktop app for effortless web scraping

Language: Python - Size: 4.15 MB - Last synced at: over 1 year ago - Pushed at: almost 2 years ago - Stars: 0 - Forks: 0

sc10ntech/extract-site-metadata

Metadata extractor for the sprawling web ⚙️

Language: TypeScript - Size: 2.69 MB - Last synced at: 19 days ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 1

wbsg-uni-mannheim/StructuredDataProfiler

Java project for profiling the results of the yearly Web Data Commons extraction of structured data with RDFa, Microdata, Microformats, and embedded JSON-LD annotations.

Language: Java - Size: 101 KB - Last synced at: over 2 years ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

wbsg-uni-mannheim/schemaorg-tables

This repository contains the code and data download links to reproduce the building process of the 2021 Schema.org Table Corpus.

Language: Python - Size: 26.4 KB - Last synced at: about 2 years ago - Pushed at: about 4 years ago - Stars: 0 - Forks: 0

chelvanai/Web-data-scrap

Web data scraping with Scrapy

Language: Python - Size: 9.77 KB - Last synced at: almost 2 years ago - Pushed at: over 5 years ago - Stars: 0 - Forks: 0

dariga-sm/Word-Frequency-in-Moby-Dick

Scrape the novel Moby Dick from the Project Gutenberg website using the Python package requests, extract words from this web data using BeautifulSoup, and analyze the distribution of words using the Natural Language Toolkit (nltk).

Language: HTML - Size: 1.01 MB - Last synced at: about 1 year ago - Pushed at: over 5 years ago - Stars: 0 - Forks: 0
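
The extract-and-count steps described in this entry might look roughly like the sketch below. It is an illustration under stated assumptions, not the repository's notebook: the standard library's `html.parser`, `re`, and `collections.Counter` stand in for BeautifulSoup and nltk so the example runs without third-party packages, the download step is omitted, and `TextExtractor`, `word_frequencies`, and `SAMPLE` are hypothetical names with made-up sample markup.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document, dropping the tags."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def word_frequencies(html: str) -> Counter:
    # Strip the markup, keeping only the text content.
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Lowercase and tokenize into runs of letters and apostrophes.
    text = " ".join(parser.chunks).lower()
    return Counter(re.findall(r"[a-z']+", text))


# Hypothetical sample input; a real run would use the downloaded novel.
SAMPLE = "<html><body><p>The whale, the whale!</p><p>Call me Ishmael.</p></body></html>"

if __name__ == "__main__":
    for word, count in word_frequencies(SAMPLE).most_common(3):
        print(word, count)
```

This sketch does not remove stop words, so common words such as "the" will dominate the counts on a full text.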