An open API service providing repository metadata for many open source software ecosystems.

GitHub / omar-elmaria / python_scrapy_airflow_pipeline

This repo contains a full-fledged Python script that scrapes a JavaScript-rendered website, cleans the data, and pushes the results to a cloud-based database. The workflow is orchestrated with Airflow to run automatically.
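
The scrape, clean, and load stages described above can be sketched as three small functions. This is only an illustrative outline, not the repository's actual code: the function names, the sample rows, and the `items` table are hypothetical, the scraping stage is replaced by hard-coded data, and an in-memory SQLite database stands in for the cloud database. In an Airflow deployment, each function would typically become its own task.

```python
import sqlite3


def scrape():
    # Hypothetical stand-in for the scraping stage: returns hard-coded raw rows
    # instead of fetching a JavaScript-rendered page.
    return [
        {"name": "  Widget A ", "price": "19.99"},
        {"name": "Widget B", "price": "n/a"},
        {"name": "", "price": "5.00"},
    ]


def clean(rows):
    # Trim whitespace, convert prices to floats, and drop rows that have
    # an unparseable price or an empty name.
    cleaned = []
    for row in rows:
        name = row["name"].strip()
        try:
            price = float(row["price"])
        except ValueError:
            continue
        if name:
            cleaned.append((name, price))
    return cleaned


def load(records, conn):
    # Insert the cleaned records and return the resulting row count.
    # SQLite is an assumption here, used in place of a cloud database.
    conn.execute("CREATE TABLE IF NOT EXISTS items (name TEXT, price REAL)")
    conn.executemany("INSERT INTO items VALUES (?, ?)", records)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]


def run_pipeline(conn):
    # Chain the three stages in the order the description gives them.
    return load(clean(scrape()), conn)
```

Calling `run_pipeline(sqlite3.connect(":memory:"))` runs all three stages against a throwaway database, which makes the flow easy to check locally before wiring it into a scheduler.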

JSON API: http://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/omar-elmaria%2Fpython_scrapy_airflow_pipeline
PURL: pkg:github/omar-elmaria/python_scrapy_airflow_pipeline

Stars: 3
Forks: 0
Open issues: 0

License: None
Language: Python
Size: 179 KB
Dependencies parsed at: Pending

Created at: over 2 years ago
Updated at: about 2 years ago
Pushed at: over 2 years ago
Last synced at: about 2 years ago

Topics: airflow, anti-bot, data-mining, dynamic-websites, javascript-rendered-websites, proxy-api, proxy-scraper, python, scrapy, spiders, web-crawling, web-scraping
