Website scraping tool
- Budget: £363 (approx. $454)
- Proposals: 44
- Remote
- Project #4148435
- Status: Expired
Description
Experience Level: Expert
# Composite Use Case for myClerk.ai Web Scraping Tool Development
## Project Overview
The myClerk.ai project aims to automate the collection, organization, and monthly update of documents from approximately 10,827 UK council websites, including 10,450 parish and town councils and 377 larger councils. This initiative seeks to make council documents easily accessible and searchable, covering essential materials such as constitutional documents, terms of reference, minutes, and planning documents.
## Objectives
- **Automate Document Extraction:** Develop a scraping tool to automate the retrieval of PDF documents across varied council websites, accounting for the unique structure and content of each site.
- **Efficient Data Organization:** Utilize council reference codes to systematically organize documents on a web server.
- **Monthly Updates:** Implement a mechanism to capture new documents on a monthly basis without duplicating existing files.
- **Link Monitoring and Notifications:** Create a system to track and report broken links and facilitate updates or notifications to site administrators.
- **Data Categorization for Larger Councils:** Classify documents on larger council websites for more efficient retrieval and analysis.
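The objectives above hinge on a deterministic naming scheme keyed to the council reference codes. As a minimal sketch (the actual code format and storage layout are not specified in the brief, so `build_document_key` and its path segments are illustrative assumptions):

```python
from datetime import date

def build_document_key(council_ref: str, doc_type: str, filename: str,
                       retrieved: date) -> str:
    """Build a deterministic storage key so each council's documents
    land under its reference code, grouped by document type and by
    the monthly cycle in which they were retrieved."""
    return f"{council_ref}/{doc_type}/{retrieved:%Y-%m}/{filename}"

# Example: minutes for a parish council retrieved in the March cycle.
key = build_document_key("E04001234", "minutes", "2024-02-meeting.pdf",
                         date(2024, 3, 1))
```

A layout like this makes monthly runs idempotent: re-running a cycle writes to the same keys, so nothing is stored twice.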
## Database Structure
The development leverages a hybrid database approach:
- **Relational Database (PostgreSQL):** Hosts a comprehensive list of councils and their metadata, crucial for guiding the scraping tool to the correct websites for document extraction.
- **Vector Database:** Reserved for storing processed text extracted from the PDFs to support content-based search; this element is separate from the scraping-tool task.
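The relational side can be sketched as a single councils table that the scraper consults before each run. Production would use PostgreSQL on RDS as described; `sqlite3` is used here only to keep the example self-contained, and the column names are illustrative assumptions:

```python
import sqlite3

# Minimal sketch of the councils table that guides the crawler.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE councils (
        ref_code     TEXT PRIMARY KEY,  -- council reference code
        name         TEXT NOT NULL,
        council_type TEXT NOT NULL,     -- e.g. 'parish', 'town', 'larger'
        website_url  TEXT NOT NULL,
        last_scraped TEXT               -- ISO date of the last monthly cycle
    )
""")
conn.execute(
    "INSERT INTO councils VALUES (?, ?, ?, ?, ?)",
    ("E04001234", "Example Parish Council", "parish",
     "https://example-parish.gov.uk", None),
)

# The scraper selects the sites due for this month's cycle.
due = conn.execute(
    "SELECT ref_code, website_url FROM councils "
    "WHERE last_scraped IS NULL OR last_scraped < ?",
    ("2024-03-01",),
).fetchall()
```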
## Suggested Technologies
- **Web Scraping and Data Organization:** Python, with libraries such as BeautifulSoup, Scrapy, and Requests for web scraping and automation. AWS S3 for document storage and PostgreSQL on AWS RDS for data management.
- **Server and Hosting:** AWS Lambda for cost-effective routine downloading tasks and Amazon Aurora Serverless for RDS to dynamically adjust computational capacity.
- **Notification System:** AWS Lambda and SNS for monitoring and identifying broken links, sending notifications for action.
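For the notification piece, the link-check results can be folded into a message before publishing. The sketch below only builds the subject and body; actually sending it would use the boto3 SNS `publish` call inside a Lambda handler, which is omitted here, and the message format is an assumption:

```python
def summarize_broken_links(results):
    """Given (url, http_status) pairs from a link check, return the
    subject and body of a notification message. A status of None
    means the request failed entirely (timeout, DNS error)."""
    broken = [(url, status) for url, status in results
              if status is None or status >= 400]
    lines = [f"{url} -> {status if status is not None else 'no response'}"
             for url, status in broken]
    return {
        "subject": f"{len(broken)} broken council document link(s)",
        "body": "\n".join(lines),
    }

report = summarize_broken_links([
    ("https://example-council.gov.uk/minutes.pdf", 200),
    ("https://example-council.gov.uk/old-agenda.pdf", 404),
    ("https://example-council.gov.uk/offline.pdf", None),
])
```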
## Crawling and Scraping Process
- **Crawling:** Implement a depth-controlled crawler to navigate each council's website, identifying webpages with PDF links at all levels.
- **Scraping and Downloading:** After crawling, the tool will download the identified PDFs, checking each against previous downloads to avoid duplication. The tool must adapt to the diverse web structures of council sites to ensure comprehensive document retrieval.
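The link-identification step of the crawler can be sketched with the standard-library HTML parser. A real crawler would fetch each page (e.g. with Requests or Scrapy, as suggested above) and recurse into same-site links up to a configured depth; only the PDF-link extraction is shown here:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class PdfLinkExtractor(HTMLParser):
    """Collect absolute links to PDF files from a single page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if href.lower().endswith(".pdf"):
            # Resolve relative links against the page URL.
            self.pdf_links.append(urljoin(self.base_url, href))

page = """<html><body>
  <a href="/docs/minutes-2024-02.pdf">February minutes</a>
  <a href="/meetings/">Meetings index</a>
</body></html>"""
parser = PdfLinkExtractor("https://example-parish.gov.uk/")
parser.feed(page)
```

Non-PDF links such as `/meetings/` would be queued for further crawling rather than downloaded, which is how the depth-controlled traversal reaches PDFs at all levels of a site.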
## Monthly Update Cycle
- The tool will perform a complete cycle each month, identifying and downloading new or updated documents based on changes in file details, thereby keeping the database current without accumulating duplicates.
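One way to make the monthly cycle duplicate-free is to record a content hash for every downloaded file. This is a sketch under the assumption that hashes are persisted between cycles (in practice, in the PostgreSQL database); comparing `Content-Length` or `Last-Modified` headers first would avoid re-downloading unchanged files at all:

```python
import hashlib

def is_new_document(pdf_bytes: bytes, seen_hashes: set) -> bool:
    """Return True (and record the hash) if this PDF has not been
    stored in a previous cycle. Hashing the content also catches
    files that were re-uploaded under a new name."""
    digest = hashlib.sha256(pdf_bytes).hexdigest()
    if digest in seen_hashes:
        return False
    seen_hashes.add(digest)
    return True

seen = set()
first = is_new_document(b"%PDF-1.4 fake minutes", seen)
repeat = is_new_document(b"%PDF-1.4 fake minutes", seen)
```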
## Development and Testing
- Prior to full deployment, the scraper will undergo a testing phase on a selection of websites to refine its operation, gradually scaling up to include the full range of targeted sites.
Timescale: a basic model for testing should be delivered as soon as possible; a number of weeks can be allowed for the full model, including deployment to the host web server and connection to the database.
About the Client
- Projects completed: 25 (100% completion rate)
- Freelancers worked with: 18
- Projects awarded: 15 (40% award rate)
- Last project: 8 Feb 2024
- Location: United Kingdom