Need a URL crawler built for large sites

  • Proposals: 6
  • Remote
  • #661312
  • Awarded
Alok A.
Saeb A., Ehsan K., Tony S., Lhassan B. + 1 other have already sent a proposal.

Description

Experience Level: Expert
General information for the business: URL Crawler
Kind of development: New program from scratch
Description of requirements/functionality: Looking for someone to build a fairly simple program to crawl sites and extract URL lists.

Basically the program will accept a starting URL as input (e.g. bbc.com). When started, it will extract all URLs on that page and add them to a list. It will then repeatedly take the top URL from the list, retrieve that page, extract its URLs and add them to the list, and so on.
The program will need to crawl some of the largest sites, such as bbc.co.uk (14.5 million pages), so the URL list must be saved to disk because it will not fit in RAM.
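
To make the crawl loop and the on-disk URL list concrete, here is a minimal sketch in Python. It is one possible approach, not a required design: it assumes SQLite as the disk store, and the names `LinkParser`, `extract_urls`, `DiskFrontier`, `crawl` and the `fetch` callable are all made up for illustration. `fetch(url)` is expected to return the page HTML as text, or None on failure.

```python
import sqlite3
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_urls(base_url, html):
    """Return absolute http(s) URLs found in html, with #fragments stripped."""
    parser = LinkParser()
    parser.feed(html)
    urls = []
    for href in parser.hrefs:
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute.startswith(("http://", "https://")):
            urls.append(absolute)
    return urls


class DiskFrontier:
    """URL list stored in SQLite (an assumption) so it can outgrow RAM."""

    def __init__(self, path):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS urls ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "url TEXT UNIQUE NOT NULL, "
            "done INTEGER NOT NULL DEFAULT 0)"
        )

    def add(self, url):
        # INSERT OR IGNORE keeps each URL in the list only once.
        self.db.execute("INSERT OR IGNORE INTO urls (url) VALUES (?)", (url,))
        self.db.commit()

    def pop(self):
        """Return the oldest unvisited URL and mark it done, or None if empty."""
        row = self.db.execute(
            "SELECT id, url FROM urls WHERE done = 0 ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        self.db.execute("UPDATE urls SET done = 1 WHERE id = ?", (row[0],))
        self.db.commit()
        return row[1]


def crawl(start_url, fetch, frontier, max_pages=1000):
    """Crawl from start_url; fetch(url) must return HTML text or None."""
    frontier.add(start_url)
    crawled = 0
    while crawled < max_pages:
        url = frontier.pop()
        if url is None:
            break
        html = fetch(url)
        crawled += 1
        if html is None:
            continue
        for link in extract_urls(url, html):
            frontier.add(link)
    return crawled
```

Usage would look like `crawl("https://example.com/", my_fetch, DiskFrontier("urls.db"))`, where `my_fetch` is whatever HTTP client wrapper is chosen. Committing after every insert is simple but slow; a real build for millions of pages would likely batch writes.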

Would also like the following options:
- ability to select how many parallel threads are running (e.g. 1-100)
- for each page, the ability to find out how many clicks away from the root URL it is
- for each page, the ability to find out how many internal links are pointing to it
- ability to export all external links pointing to 404 domains
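
The first three options above could be sketched roughly as follows, assuming a level-by-level (breadth-first) crawl so that the level number equals the click depth. `crawl_with_stats` and `get_links` are hypothetical names; `get_links(url)` is assumed to return the absolute URLs found on that page. The dictionaries are kept in memory here only for brevity, and "internal" is assumed to mean "same hostname as the root", which treats `www.` and non-`www.` hosts as different sites.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit


def crawl_with_stats(root, get_links, threads=10):
    """Level-by-level crawl; get_links(url) returns absolute URLs on that page.

    Returns (depth, inlinks, external):
      depth[url]   = clicks from root along the shortest path
      inlinks[url] = number of internal links found pointing at url
      external     = set of off-site URLs seen
    """
    host = urlsplit(root).hostname
    depth = {root: 0}
    inlinks = {root: 0}
    external = set()
    level = [root]
    clicks = 0
    with ThreadPoolExecutor(max_workers=threads) as pool:
        while level:
            clicks += 1
            next_level = []
            # Fetch every page of the current level in parallel threads.
            for links in pool.map(get_links, level):
                for link in links:
                    if urlsplit(link).hostname != host:
                        external.add(link)
                        continue
                    # Every occurrence counts, including repeats on one page.
                    inlinks[link] = inlinks.get(link, 0) + 1
                    if link not in depth:
                        depth[link] = clicks
                        next_level.append(link)
            level = next_level
    return depth, inlinks, external
```

The `threads` parameter maps directly onto the requested 1-100 setting. The `external` set is the input for the 404-domain export; how a "404 domain" is detected is left open here because the posting does not define it.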

As an additional function, the program can carry out a whois check on the 404 domains to see whether they are registered, and export the registered and non-registered 404 domains separately.
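
A rough sketch of the whois step, assuming the list of 404 domains has already been produced. It speaks the plain WHOIS protocol (a TCP connection to port 43, query followed by CRLF). The default server `whois.verisign-grs.com` only covers .com and .net, and the "not found" marker strings are assumptions that differ between registries, so both would need adjusting per TLD. All function names are illustrative.

```python
import socket
from urllib.parse import urlsplit


def external_domains(urls):
    """Return the sorted, de-duplicated hostnames used by the given URLs."""
    hosts = {urlsplit(u).hostname for u in urls}
    return sorted(h for h in hosts if h)


def whois_query(domain, server="whois.verisign-grs.com", timeout=10):
    """Send one WHOIS query over TCP port 43 and return the reply text."""
    with socket.create_connection((server, 43), timeout=timeout) as sock:
        sock.sendall(domain.encode("ascii") + b"\r\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def is_registered(whois_text, not_found_markers=("no match for", "not found")):
    """Guess registration status from the reply; the markers are assumptions."""
    text = whois_text.lower()
    return not any(marker in text for marker in not_found_markers)


def split_by_registration(domains, lookup=whois_query):
    """Return (registered, unregistered) lists; lookup(domain) returns text."""
    registered, unregistered = [], []
    for domain in domains:
        if is_registered(lookup(domain)):
            registered.append(domain)
        else:
            unregistered.append(domain)
    return registered, unregistered
```

The two returned lists correspond to the two requested exports. WHOIS servers rate-limit queries, so a large batch of domains would probably need throttling.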
