Website scraping tool
- Budget: £363 (approx. $454)
- Proposals: 44
- Remote
- Project #4148435
- Status: Expired
Description
Experience Level: Expert
# Composite Use Case for myClerk.ai Web Scraping Tool Development
## Project Overview
The myClerk.ai project aims to automate the collection, organization, and monthly update of documents from approximately 10,827 UK council websites, including 10,450 parish and town councils and 377 larger councils. This initiative seeks to make council documents easily accessible and searchable, covering essential materials such as constitutional documents, terms of reference, minutes, and planning documents.
## Objectives
- **Automate Document Extraction:** Develop a scraping tool to automate the retrieval of PDF documents across varied council websites, accounting for the unique structure and content of each site.
- **Efficient Data Organization:** Utilize council reference codes to systematically organize documents on a web server.
- **Monthly Updates:** Implement a mechanism to capture new documents on a monthly basis without duplicating existing files.
- **Link Monitoring and Notifications:** Create a system to track and report broken links and facilitate updates or notifications to site administrators.
- **Data Categorization for Larger Councils:** Classify documents on larger council websites for more efficient retrieval and analysis.
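The objectives above hinge on a deterministic naming scheme keyed to the council reference codes. As a minimal sketch (the actual code format and storage layout are not specified in the brief, so `build_document_key` and its path segments are illustrative assumptions):

```python
from datetime import date

def build_document_key(council_ref: str, doc_type: str, filename: str,
                       retrieved: date) -> str:
    """Build a deterministic storage key so each council's documents
    land under its reference code, grouped by document type and by
    the monthly cycle in which they were retrieved."""
    return f"{council_ref}/{doc_type}/{retrieved:%Y-%m}/{filename}"

# Example: minutes for a parish council retrieved in the March cycle.
key = build_document_key("E04001234", "minutes", "2024-02-meeting.pdf",
                         date(2024, 3, 1))
```

A layout like this makes monthly runs idempotent: re-running a cycle writes to the same keys, so nothing is stored twice.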
## Database Structure
The development leverages a hybrid database approach:
- **Relational Database (PostgreSQL):** Hosts a comprehensive list of councils and their metadata, crucial for guiding the scraping tool to the correct websites for document extraction.
- **Vector Database:** Reserved for storing processed text extracted from the PDFs to support content-based search; this element is separate from the scraping-tool task.
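The relational side can be sketched as a single councils table that the scraper consults before each run. Production would use PostgreSQL on RDS as described; `sqlite3` is used here only to keep the example self-contained, and the column names are illustrative assumptions:

```python
import sqlite3

# Minimal sketch of the councils table that guides the crawler.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE councils (
        ref_code     TEXT PRIMARY KEY,  -- council reference code
        name         TEXT NOT NULL,
        council_type TEXT NOT NULL,     -- e.g. 'parish', 'town', 'larger'
        website_url  TEXT NOT NULL,
        last_scraped TEXT               -- ISO date of the last monthly cycle
    )
""")
conn.execute(
    "INSERT INTO councils VALUES (?, ?, ?, ?, ?)",
    ("E04001234", "Example Parish Council", "parish",
     "https://example-parish.gov.uk", None),
)

# The scraper selects the sites due for this month's cycle.
due = conn.execute(
    "SELECT ref_code, website_url FROM councils "
    "WHERE last_scraped IS NULL OR last_scraped < ?",
    ("2024-03-01",),
).fetchall()
```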
## Suggested Technologies
- **Web Scraping and Data Organization:** Python, with libraries such as BeautifulSoup, Scrapy, and Requests for web scraping and automation. AWS S3 for document storage and PostgreSQL on AWS RDS for data management.
- **Server and Hosting:** AWS Lambda for cost-effective routine downloading tasks and Amazon Aurora Serverless for RDS to dynamically adjust computational capacity.
- **Notification System:** AWS Lambda and SNS for monitoring and identifying broken links, sending notifications for action.
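For the notification piece, the link-check results can be folded into a message before publishing. The sketch below only builds the subject and body; actually sending it would use the boto3 SNS `publish` call inside a Lambda handler, which is omitted here, and the message format is an assumption:

```python
def summarize_broken_links(results):
    """Given (url, http_status) pairs from a link check, return the
    subject and body of a notification message. A status of None
    means the request failed entirely (timeout, DNS error)."""
    broken = [(url, status) for url, status in results
              if status is None or status >= 400]
    lines = [f"{url} -> {status if status is not None else 'no response'}"
             for url, status in broken]
    return {
        "subject": f"{len(broken)} broken council document link(s)",
        "body": "\n".join(lines),
    }

report = summarize_broken_links([
    ("https://example-council.gov.uk/minutes.pdf", 200),
    ("https://example-council.gov.uk/old-agenda.pdf", 404),
    ("https://example-council.gov.uk/offline.pdf", None),
])
```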
## Crawling and Scraping Process
- **Crawling:** Implement a depth-controlled crawler to navigate each council's website, identifying webpages with PDF links at all levels.
- **Scraping and Downloading:** After crawling, the tool will download the identified PDFs, checking each against previous downloads to avoid duplication. The tool must adapt to the diverse web structures of council sites to ensure comprehensive document retrieval.
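The link-identification step of the crawler can be sketched with the standard-library HTML parser. A real crawler would fetch each page (e.g. with Requests or Scrapy, as suggested above) and recurse into same-site links up to a configured depth; only the PDF-link extraction is shown here:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class PdfLinkExtractor(HTMLParser):
    """Collect absolute links to PDF files from a single page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if href.lower().endswith(".pdf"):
            # Resolve relative links against the page URL.
            self.pdf_links.append(urljoin(self.base_url, href))

page = """<html><body>
  <a href="/docs/minutes-2024-02.pdf">February minutes</a>
  <a href="/meetings/">Meetings index</a>
</body></html>"""
parser = PdfLinkExtractor("https://example-parish.gov.uk/")
parser.feed(page)
```

Non-PDF links such as `/meetings/` would be queued for further crawling rather than downloaded, which is how the depth-controlled traversal reaches PDFs at all levels of a site.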
## Monthly Update Cycle
- The tool will perform a complete cycle each month, identifying and downloading new or updated documents based on changes in file details, thereby keeping the database current without accumulating duplicates.
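One way to make the monthly cycle duplicate-free is to record a content hash for every downloaded file. This is a sketch under the assumption that hashes are persisted between cycles (in practice, in the PostgreSQL database); comparing `Content-Length` or `Last-Modified` headers first would avoid re-downloading unchanged files at all:

```python
import hashlib

def is_new_document(pdf_bytes: bytes, seen_hashes: set) -> bool:
    """Return True (and record the hash) if this PDF has not been
    stored in a previous cycle. Hashing the content also catches
    files that were re-uploaded under a new name."""
    digest = hashlib.sha256(pdf_bytes).hexdigest()
    if digest in seen_hashes:
        return False
    seen_hashes.add(digest)
    return True

seen = set()
first = is_new_document(b"%PDF-1.4 fake minutes", seen)
repeat = is_new_document(b"%PDF-1.4 fake minutes", seen)
```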
## Development and Testing
- Prior to full deployment, the scraper will undergo a testing phase on a selection of websites to refine its operation, gradually scaling up to include the full range of targeted sites.
Timescale: a basic model for testing should be delivered as soon as possible; a number of weeks can be allowed for the full model, including deployment to the host web server and connection to the database.
About the Client
- Projects completed: 25 (100% completion rate)
- Freelancers worked with: 18
- Projects awarded: 15 (40% award rate)
- Last project: 8 Feb 2024
- Location: United Kingdom