
Python app on Linux to Scrape URLs for Content and Response Codes
- Posted:
- Proposals: 4
- Remote
- #437679
- Expired
Description
Experience Level: Intermediate
Estimated project duration: 3 - 4 weeks
Description of requirements/functionality:
Summary
Required is a Python application that can quickly check hundreds of thousands of URLs for headers and content, make decisions about the data it receives and ultimately store the results.
Technology
The platform is Ubuntu 12.04.03 and the preferred language is Python 2.7. RAM, CPU and bandwidth should not pose any significant barriers and can be upgraded on the server if they prove a bottleneck.
The source data is stored in a MySQL database, and the results will ultimately need to be written back to the same MySQL database.
Data
The list of URLs may contain anywhere from a single URL to thousands of URLs per domain. Data will be supplied for testing.
Requirements
1. Check whether the website is active. Ping is essential; however, some servers have ping disabled, so alternatives must be used when ping is unsuccessful.
2. Record the server response code. This should be established for every URL. Should the response be a 301 or 302 redirect, the new URL should be followed and used. Any redirected URLs will need to replace the original URL.
3. Establish whether a link is present in the page source. Using the provided unique elements (href, rel attribute and anchor text, or for image links href, rel attribute and alt text), the program should determine whether the link exists on the page for the current URL.
4. Scrape the content to identify all the external links on the page and record them along with the type of link (e.g. text or image) and the rel="nofollow" attribute if present.
5. Scrape the content to identify the location of the link on the page. Possible locations include: header, footer, sidebar or body. Most frameworks and CMSs have standard templates; others follow naming conventions. Additionally, a list of rules could be created for the wildcards.
I understand that this is the most challenging aspect of the script, and quotes without this part will still be considered.
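Requirements 1 and 2 can be sketched with the standard library alone. This is a minimal illustration, not the required implementation: the function names are hypothetical, and although the project specifies Python 2.7, the sketch uses Python 3 stdlib modules for clarity. A TCP connect stands in for ping when ICMP is disabled, and `urlopen` follows 301/302 redirects so the final URL can replace the original.

```python
import socket
import urllib.error
import urllib.request

def is_reachable(host, port=80, timeout=5):
    """Fallback liveness check: attempt a TCP connection when ICMP ping is disabled."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def fetch_status(url, timeout=10):
    """Return (final_url, status_code), following 301/302 redirects.

    When a redirect occurred, final_url is what should replace the
    original URL in the database, per requirement 2.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.geturl(), resp.getcode()
    except urllib.error.HTTPError as err:
        return err.geturl(), err.code  # 4xx/5xx responses still carry a code
    except (urllib.error.URLError, socket.timeout):
        return url, None  # unreachable: no response code available
```

A production version would add per-domain throttling and retry policy on top of these primitives.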
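For requirements 3 and 4, external links can be collected with the stdlib `html.parser` (a production version might prefer lxml or BeautifulSoup for malformed markup). The class and field names below are hypothetical; the sketch records each external link's href, rel attribute, type (text or image) and anchor/alt text.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collect external links with their type, rel attribute and anchor text."""

    def __init__(self, base_url):
        super().__init__()
        self.base = base_url
        self.base_host = urlparse(base_url).netloc
        self.links = []   # dicts: href, rel, kind ('text'/'image'), anchor
        self._open = None  # external <a> currently being parsed

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'a' and attrs.get('href'):
            href = urljoin(self.base, attrs['href'])
            if urlparse(href).netloc != self.base_host:  # external only
                self._open = {'href': href,
                              'rel': attrs.get('rel', ''),
                              'kind': 'text', 'anchor': ''}
        elif tag == 'img' and self._open is not None:
            self._open['kind'] = 'image'
            self._open['anchor'] = attrs.get('alt', '')  # alt stands in for anchor text

    def handle_data(self, data):
        if self._open is not None and self._open['kind'] == 'text':
            self._open['anchor'] += data

    def handle_endtag(self, tag):
        if tag == 'a' and self._open is not None:
            self._open['anchor'] = self._open['anchor'].strip()
            self.links.append(self._open)
            self._open = None

def extract_external_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

Requirement 3 then reduces to checking whether a provided (href, rel, anchor) triple appears in the extracted list for the current URL.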
Considerations
The script will be accessing websites that are not in English; therefore, the application should be able to handle any character set found on the web.
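One common approach to the character-set requirement is to honour the charset declared in the Content-Type header and fall back to UTF-8 with replacement characters, so that no single badly encoded page aborts the crawl. The helper name is hypothetical:

```python
import re

def decode_body(raw_bytes, content_type=''):
    """Decode a response body using the charset declared in the
    Content-Type header; fall back to UTF-8 with replacement so a
    bad byte never crashes the crawler."""
    match = re.search(r'charset=([\w\-]+)', content_type, re.I)
    if match:
        try:
            return raw_bytes.decode(match.group(1))
        except (LookupError, UnicodeDecodeError):
            pass  # unknown or incorrect charset declaration
    return raw_bytes.decode('utf-8', errors='replace')
```

A fuller solution would also inspect `<meta charset>` tags and possibly a detection library such as chardet.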
Speed is critical. The application should be multi-process and should process the URLs intelligently: for example, if a domain does not respond, the script should not attempt to check its URLs. Similarly, if the location of a link is site-wide, for example in the footer, it may be unnecessary to check every URL in as much detail; however, the server response code is essential for every URL. Database interactions should be reduced and replaced with bulk updates/inserts if they impact the speed of the application or database performance for other applications.
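The bulk-write consideration above maps directly onto the DB-API `executemany` call. This sketch uses sqlite3 as a stand-in so it is self-contained; the MySQLdb/PyMySQL cursors expose the same `executemany` method (with `%s` placeholders instead of `?`), and the table and column names here are hypothetical:

```python
import sqlite3

def bulk_store(conn, rows, batch_size=1000):
    """Write scrape results in batches via executemany rather than one
    INSERT per URL, keeping database round-trips to a minimum."""
    cur = conn.cursor()
    for start in range(0, len(rows), batch_size):
        cur.executemany(
            'INSERT INTO results (url, status_code) VALUES (?, ?)',
            rows[start:start + batch_size])
    conn.commit()  # one commit per call, not one per row
```

Worker processes could append result tuples to a shared queue, with a single writer draining it through `bulk_store` to avoid contending for the database.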
The application should not hog the system's resources, as the server is shared with other applications.
Accuracy is also vital.
The code may be expanded on by other developers; therefore, it should also be well commented and human readable following established coding conventions such as descriptive naming of variables and methods.
Specific technologies required: Python
OS requirements: Linux
Extra notes:
Sean H.
100% (2)
Projects completed: 9
Freelancers worked with: 8
Projects awarded: 24%
Last project: 16 May 2016
United Kingdom