An open API service providing repository metadata for many open source software ecosystems.

GitHub / aksh-patel1 / parallel-web-scraper-on-cloud

This project demonstrates an event-driven architecture for parallel web scraping and data processing on AWS. The scraper job runs on AWS Batch, collects data from multiple web pages simultaneously, and stores it in S3. The processing job, triggered by AWS EventBridge, processes the scraped data and updates a Google Sheet.
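The scraping step described above can be sketched with the Python standard library alone. This is a minimal illustration, not the repository's actual code: the function names `fetch_page`, `object_key`, and `scrape_all`, the key naming scheme, and the `store` callback are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Optional
from urllib.parse import urlparse
from urllib.request import urlopen


def fetch_page(url: str, timeout: float = 10.0) -> bytes:
    """Download one page; a stand-in for the real scraping logic."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def object_key(url: str, prefix: str = "scraped") -> str:
    """Derive a storage key from a URL (hypothetical naming scheme)."""
    parsed = urlparse(url)
    path = parsed.path.strip("/").replace("/", "_") or "index"
    return f"{prefix}/{parsed.netloc}/{path}.html"


def scrape_all(
    urls: Iterable[str],
    fetch: Callable[[str], bytes] = fetch_page,
    store: Optional[Callable[[str, bytes], None]] = None,
    max_workers: int = 8,
) -> Dict[str, str]:
    """Fetch every URL concurrently and pass each body to `store`.

    Returns a mapping from URL to the key its body was stored under.
    An exception raised by `fetch` propagates when its result is consumed.
    """
    url_list = list(urls)
    results: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map runs fetches in parallel but yields results in input order.
        for url, body in zip(url_list, pool.map(fetch, url_list)):
            key = object_key(url)
            if store is not None:
                store(key, body)
            results[url] = key
    return results
```

In an AWS deployment, `store` could plausibly wrap an S3 upload (for example boto3's `put_object`), and an EventBridge rule would then start the processing job; both of those pieces are left out here because the source does not describe how they are wired.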

JSON API: http://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aksh-patel1%2Fparallel-web-scraper-on-cloud
PURL: pkg:github/aksh-patel1/parallel-web-scraper-on-cloud

Stars: 2
Forks: 0
Open issues: 0

License: None
Language: Python
Size: 9.77 KB
Dependencies parsed at: Pending

Created at: about 1 year ago
Updated at: 4 months ago
Pushed at: about 1 year ago
Last synced at: 6 days ago

Topics: aws, aws-batch, aws-ecr, aws-eventbridge, aws-s3, data-preprocessing, docker, event-driven-architecture, eventdrivenarchitecture, python, web-scraping
