
Experienced Python Developers for large web scraping project
- £23/hr (approx. $31/hr)
- Proposals: 18
- Remote
- #3369517
- Expired
Description
Experience Level: Expert
Essential Skills
Python 3
Web Scraping
Git
pytest
Mechanize Python Package
Selenium webdriver / Headless Chrome (see the setup sketch after the skills lists)
Azure DevOps
Desirable Skills
Azure
Service Bus
Docker
Kubernetes (AKS)
REST
Microservice Architecture
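The essential skills above name Selenium with headless Chrome. As a point of reference, a minimal setup sketch; the target URL and CSS selector are illustrative, not from the real codebase:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless=new")  # run Chrome without a visible window
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.gov.uk/planning/search")  # illustrative URL
    # Real scrapers should use WebDriverWait for pages that render results late.
    rows = driver.find_elements(By.CSS_SELECTOR, "table.results tr")
    print(f"found {len(rows)} result rows")
finally:
    driver.quit()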
Start Date
ASAP
Full time
Duration of Project
Depends on the speed of the developer, but we expect somewhere between 2 and 6 weeks.
Desired Experience
Previous experience of scraping websites
Understanding of the DOM, HTML elements, and Browser Developer Tools
Building Python web APIs and services
Working with Microsoft Azure and Azure DevOps
Main Task
Fix Broken Web Scrapers.
We have approximately 430 web scrapers crawling British local council websites for planning applications. Currently, about 260 scrapers are working, and we need the remainder fixed within the next 4 to 6 weeks. We also expect the usual unit tests around the code, plus end-to-end tests for some happy-path scenarios.
Much of the code is based on the old repo: https://github.com/aspeakman/UKPlanning . Please review it to get a taste of the work involved. For some scrapers it's just a case of fixing a URL; others will need to be rewritten.
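Since unit tests and happy-path coverage are expected, here is a minimal pytest sketch of what a scraper test might look like. The module, function, and fixture file names are hypothetical, not from the actual codebase:

import pathlib

import pytest

from scrapers import sample_council  # hypothetical scraper module


@pytest.fixture
def results_html() -> str:
    # A saved council search-results page, so the unit test does not
    # depend on the live site being up.
    return pathlib.Path("tests/fixtures/sample_council.html").read_text()


def test_parse_applications_happy_path(results_html):
    applications = sample_council.parse_applications(results_html)
    assert applications, "expected at least one planning application"
    first = applications[0]
    # Each application should carry at least a reference, an address,
    # and a received date.
    assert first["reference"]
    assert first["address"]
    assert first["date_received"]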
Secondary Task
Refactor the code to meet best practices.
This isn't the first priority and shouldn't get in the way of completing the main task. However, new code should be written to best practices, and if there's time and budget we can refactor the existing code.
Software Architecture
The overall architecture can be broken down into two systems: the Crawlers and the Search Engine.
The Crawlers
A .NET Orchestrator service/API runs every day at 4 am and tells the crawlers to start crawling via a POST request. The Python Crawler crawls planning opportunities in the construction industry and publishes them to a topic. A .NET service then transforms the data into a normalised shape and publishes it to another topic. A third .NET service receives the transformed opportunity and writes it to the database. The Python developer is only expected to work on the Python Crawler, and no knowledge of .NET is necessary.
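For orientation, a minimal sketch of the publish step using the azure-servicebus package; the connection string, topic name, and message shape here are assumptions, not the project's actual configuration:

import json

from azure.servicebus import ServiceBusClient, ServiceBusMessage

CONNECTION_STR = "<service-bus-connection-string>"  # supplied by the project
TOPIC_NAME = "raw-opportunities"  # hypothetical topic name


def publish_opportunity(opportunity: dict) -> None:
    # One message per scraped opportunity; the .NET services downstream
    # handle normalisation and persistence.
    with ServiceBusClient.from_connection_string(CONNECTION_STR) as client:
        with client.get_topic_sender(topic_name=TOPIC_NAME) as sender:
            sender.send_messages(ServiceBusMessage(json.dumps(opportunity)))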
The Search Engine
The Search Engine indexes the Opportunities in Elasticsearch, which is made available through a .NET Web API and used in the front end.
The Python Crawler
The Python Crawler consists of three Docker containers: the API, the Redis queue, and the Worker. We can scale up the number of Workers so that we can crawl many websites in parallel.
The Orchestrator runs every day at 4 am (cron job) and sends a list of councils to the Python Crawler API to crawl.
The Python Crawler API adds each council as a Job to the Redis queue.
The Python Worker dequeues a Job from the Redis queue and crawls the specified council.
The Worker scrapes the planning opportunities and publishes them to a Service Bus topic.
The Worker also sends a LastCrawledTime to the Orchestrator API to track when it last crawled (it does this in checkpoints to help long-running jobs).
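A minimal sketch of that API-to-Worker handoff, assuming the Redis queue is driven by the rq library and the API is a small Flask app; the queue name, endpoint paths, and helper functions are all assumptions:

# api.py -- the Crawler API enqueues one Job per council.
from flask import Flask, jsonify, request
from redis import Redis
from rq import Queue

app = Flask(__name__)
queue = Queue("councils", connection=Redis(host="redis"))  # hypothetical names


@app.post("/crawl")
def crawl():
    councils = request.get_json()["councils"]
    for council in councils:
        # Each council becomes one Job; an rq worker picks it up.
        queue.enqueue("worker.crawl_council", council, job_timeout="2h")
    return jsonify(queued=len(councils))


# worker.py -- the Worker crawls a council and checkpoints progress.
from datetime import datetime, timezone

import requests

ORCHESTRATOR_URL = "http://orchestrator/api/last-crawled"  # hypothetical endpoint


def scrape_batches(council: str):
    """Placeholder: yield batches of scraped planning applications."""
    yield [{"council": council, "reference": "24/00001/FUL"}]


def publish_batch(batch: list) -> None:
    """Placeholder: publish a batch to the Service Bus topic (see sketch above)."""


def crawl_council(council: str) -> None:
    for batch in scrape_batches(council):
        publish_batch(batch)
        # Checkpoint after each batch so the Orchestrator can track
        # LastCrawledTime and long-running jobs can resume.
        requests.post(
            ORCHESTRATOR_URL,
            json={
                "council": council,
                "lastCrawledTime": datetime.now(timezone.utc).isoformat(),
            },
            timeout=10,
        )

Under these assumptions, a Worker container would simply run "rq worker councils" against the same Redis instance to start consuming Jobs.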
Marcus P.
Projects completed: 0% (0)
Freelancers worked with: -
Projects awarded: 50%
Last project: 21 Jan 2026
United Kingdom