An open API service providing repository metadata for many open source software ecosystems.

Topic: "web-data-extraction"

firecrawl/firecrawl

🔥 The Web Data API for AI - Turn entire websites into LLM-ready markdown or structured data

Language: TypeScript - Size: 75.4 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 67,706 - Forks: 5,258

ScrapeGraphAI/Scrapegraph-ai

Python scraper based on AI

Language: Python - Size: 15.2 MB - Last synced at: 4 days ago - Pushed at: 13 days ago - Stars: 21,791 - Forks: 1,893

MohamedHmini/iww

AI based web-wrapper for web-content-extraction

Language: Python - Size: 59.2 MB - Last synced at: 25 days ago - Pushed at: almost 3 years ago - Stars: 100 - Forks: 14

neurons-me/this.url

The this.url class is designed to fetch and parse URL data, returning an object with structured information that can then be used for machine learning algorithms in a database or other storage.

Language: JavaScript - Size: 2.08 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 59 - Forks: 0

lightfeed/extractor

Using LLMs and AI Browser Automation to Robustly Extract Web Data

Language: TypeScript - Size: 245 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 48 - Forks: 4

luminati-io/java-web-scraping

Quick guide, with a code example, on how to use Java for web scraping

Size: 201 KB - Last synced at: 7 months ago - Pushed at: 11 months ago - Stars: 16 - Forks: 4

dstark5/gnews-scraper

GNewsScraper is a TypeScript package that scrapes article data from Google News based on a keyword or phrase. It returns the results as an array of JSON objects, making it convenient to access and use the scraped information.

Language: TypeScript - Size: 153 KB - Last synced at: about 2 months ago - Pushed at: over 2 years ago - Stars: 13 - Forks: 3

jjonescz/awe

AI-based web extractor

Language: Python - Size: 2.16 MB - Last synced at: about 1 month ago - Pushed at: over 2 years ago - Stars: 12 - Forks: 2

Boomslet/Web_Crawler

Open-source web crawler

Language: Python - Size: 34.2 KB - Last synced at: over 2 years ago - Pushed at: over 7 years ago - Stars: 9 - Forks: 6

SaurabhSSB/BookMiner

A pipeline to scrape, extract, and analyze book data from web pages and turn it into insights.

Language: HTML - Size: 280 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 8 - Forks: 0

wbsg-uni-mannheim/WDCFramework

Java Framework which is used by the Web Data Commons project to extract Microdata, Microformats and RDFa data, Web graphs, and HTML tables from the web crawls provided by the Common Crawl Foundation.

Language: Java - Size: 2.46 MB - Last synced at: over 2 years ago - Pushed at: almost 3 years ago - Stars: 8 - Forks: 1

kaizenplatform/FacebookInsightsConnector

The Tableau Web Data Connector for Facebook Insights API

Language: JavaScript - Size: 225 KB - Last synced at: about 1 month ago - Pushed at: over 8 years ago - Stars: 8 - Forks: 4

lekhmanrus/real-shot-pdf

RealShotPDF is a Chrome extension designed to simplify the process of creating PDF documents from web content. The extension allows users to navigate through selected webpages, parse and display links in a tree view, and generate PDFs for the chosen pages. It operates locally without sending any data to external servers.

Language: TypeScript - Size: 406 KB - Last synced at: 2 months ago - Pushed at: over 1 year ago - Stars: 6 - Forks: 1

DemonMartin/scrappey-wrapper

An API wrapper for Scrappey.com written in Node.js (Cloudflare bypass & solver)

Language: JavaScript - Size: 61.5 KB - Last synced at: about 2 years ago - Pushed at: about 2 years ago - Stars: 6 - Forks: 0

lightfeed/sdk

Lightfeed SDK to search and filter web data

Language: Python - Size: 128 KB - Last synced at: 2 months ago - Pushed at: 6 months ago - Stars: 5 - Forks: 1

ranajahanzaib/wdx

A web data extraction library written in golang.

Language: Go - Size: 82 KB - Last synced at: about 1 month ago - Pushed at: 7 months ago - Stars: 2 - Forks: 0

oxpath/oxpath

OXPath from Oxford

Language: Java - Size: 4.69 MB - Last synced at: over 2 years ago - Pushed at: over 3 years ago - Stars: 2 - Forks: 1

hoxhaeris/get_muitiple

Get and process multiple resources from web, using asyncio (aiohttp) to fetch the data and multiprocessing/multithreading for processing it.

Language: Python - Size: 18.6 KB - Last synced at: over 2 years ago - Pushed at: over 4 years ago - Stars: 2 - Forks: 0

wbsg-uni-mannheim/wdc-page

This repository contains the source files of the Web Data Commons website and is used to maintain the site. The Web Data Commons project extracts structured data from the Common Crawl.

Language: HTML - Size: 57.3 MB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 1 - Forks: 1

nichoxLashall/shopee-scraper

Shopee product data scraper, API-based

Language: Python - Size: 7.81 KB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 0 - Forks: 0

Starc123914/playwright-scraper

Automated web data extraction using browsers

Language: JavaScript - Size: 20.5 KB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 0 - Forks: 0

Ramun-123/dealroom-scraper

Dealroom data extraction tool

Language: Python - Size: 0 Bytes - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 0 - Forks: 0

Evan5764/zoopla-co-uk-scraper

UK property data crawler

Language: Python - Size: 0 Bytes - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 0 - Forks: 0

wyattowalsh/proxywhirl

Rotating proxy system

Language: Python - Size: 13.9 MB - Last synced at: 18 days ago - Pushed at: 18 days ago - Stars: 0 - Forks: 0

lengoctam449-cloud/backlink-crawler

Backlink crawler for web data

Language: Python - Size: 1.66 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

yumeangelica/store_data_extractor

A Python-based web data extractor designed to monitor online stores and track product updates in real-time. This project is developed as a standalone module but is also part of the larger jirai_sweeties project, where it integrates with additional features.

Language: Python - Size: 27.3 KB - Last synced at: 9 days ago - Pushed at: 7 months ago - Stars: 0 - Forks: 0

BigDataIA-Spring2025-4/DAMG7245_Assignment01

A Streamlit-based app with a FastAPI backend for extracting structured data (text, images, tables) from websites and PDFs. Processed data is stored in AWS S3 and rendered in a markdown-standardized format. APIs are deployed on Google Cloud Run Service.

Language: Jupyter Notebook - Size: 90.7 MB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 0 - Forks: 0

mibrahimbashir/customer_reviews

A Comprehensive Script To Extract Customer Reviews For Machine Learning

Language: Python - Size: 10.7 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

gonzalopezgil/scraping-interface

Python-based desktop app for effortless web scraping

Language: Python - Size: 4.15 MB - Last synced at: over 1 year ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

sc10ntech/extract-site-metadata

Metadata extractor for the sprawling web ⚙️

Language: TypeScript - Size: 2.69 MB - Last synced at: about 2 months ago - Pushed at: almost 3 years ago - Stars: 0 - Forks: 1

wbsg-uni-mannheim/StructuredDataProfiler

Java project for profiling the results of the yearly Web Data Commons extraction of structured data with RDFa, Microdata, Microformat, and Embedded JSON-LD annotations.

Language: Java - Size: 101 KB - Last synced at: over 2 years ago - Pushed at: about 3 years ago - Stars: 0 - Forks: 0

wbsg-uni-mannheim/schemaorg-tables

This repository contains the code and data download links to reproduce the building process of the 2021 Schema.org Table Corpus.

Language: Python - Size: 26.4 KB - Last synced at: over 2 years ago - Pushed at: over 4 years ago - Stars: 0 - Forks: 0

chelvanai/Web-data-scrap

Web data scraping with Scrapy

Language: Python - Size: 9.77 KB - Last synced at: over 2 years ago - Pushed at: about 6 years ago - Stars: 0 - Forks: 0

dariga-sm/Word-Frequency-in-Moby-Dick

Scrapes the novel Moby Dick from the Project Gutenberg website using the Python package requests, extracts words from this web data using BeautifulSoup, and analyzes the distribution of words using the Natural Language Toolkit (nltk).

Language: HTML - Size: 1.01 MB - Last synced at: over 1 year ago - Pushed at: about 6 years ago - Stars: 0 - Forks: 0

Related Topics
web-scraping (12), python (7), data-extraction (7), crawler (5), webscraping (5), llm (4), python3 (4), scraping (3), rag (3), scrapy (3), web-crawling (3), web-crawler (3), ai-agents (3), markdown (3), data-pipeline (3), schema-org (3), scraper (3), data (2), llm-scraper (2), api-scraping (2), llm-extraction (2), knowledge-base (2), etl (2), beautifulsoup (2), data-engineering (2), web-search (2), web-scraper (2), information-extraction (2), microdata (2), ai-crawler (2), metadata-extraction (2), ai (2), json-ld (2), web (2), web-data (2), ai-scraping (2), ai-search (2), html-to-markdown (2), gpt-integration (1), knowledgebase (1), link-parsing (1), local-data-processing (1), pdf (1), proxy-scraper (1), proxy-list (1), web-content-capture (1), pdf-merger (1), proxy-checker (1), pdf-downloader (1), dataextraction (1), pdf-generation (1), webpage-to-pdf (1), proxy (1), pdf-generator (1), project-portfolio (1), go-scraper (1), mongodb (1), nextjs (1), data-mining (1), library (1), web-content-extractor (1), web-mining (1), business-intelligence (1), data-integration (1), embedding-search (1), extract (1), structured-data (1), vector-database (1), web-data-management (1), ai-assistant (1), angular (1), browser-extension (1), chrome-extension (1), data-preservation (1), proxy-server (1), firefox-scraping (1), headless-browser-scraping (1), javascript-data-scraper (1), nodejs-crawler (1), playwright-automation (1), website-data-collection (1), category-search (1), e-commerce-data (1), flash-sales (1), product-details (1), product-ratings (1), python-scraper (1), shop-details (1), shopee-scraping (1), article-extractor (1), google-gemini (1), html-parser (1), nlp (1), openai (1), rss-feed (1), web-extraction (1), automated-scraper (1), large-language-model (1), scraping-python (1), web-crawlers (1)