Need a URL crawler built for large sites
£150 (approx. $188)
- Proposals: 6
- Remote
- #661312
- Awarded
Description
Experience Level: Expert
General information for the business: URL Crawler
Kind of development: New program from scratch
Description of requirements/functionality: Looking for someone to build a fairly simple program to crawl sites and extract URL lists.
The program will accept a starting-point URL (e.g. bbc.com). When started, it will extract all URLs on that page and add them to a list, then repeatedly take the next URL from the list, extract the URLs on that page, add them to the list, and so on.
The program will need to be able to crawl some of the largest sites, such as bbc.co.uk (14.5 million pages), so the URL list will need to be stored on disk; it will not fit in RAM.
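As an illustration only (not part of the brief), the core crawl loop just described might be sketched with a SQLite-backed frontier so the URL list lives on disk rather than in RAM. The names `crawl`, `LinkExtractor`, and the pluggable `fetch` parameter are hypothetical; ordering the frontier by depth makes the walk breadth-first, and the stored `depth` doubles as the "clicks from root" count.

```python
import sqlite3
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, fetch, max_pages=1000, db_path=":memory:"):
    """Breadth-first crawl with a SQLite-backed frontier (pass a real
    file path for large sites so the URL list lives on disk).
    `fetch(url)` must return the page's HTML as a string, or None."""
    db = sqlite3.connect(db_path)
    db.execute("CREATE TABLE IF NOT EXISTS pages ("
               "url TEXT PRIMARY KEY, depth INTEGER, done INTEGER DEFAULT 0)")
    db.execute("INSERT OR IGNORE INTO pages (url, depth) VALUES (?, 0)",
               (start_url,))
    host = urlparse(start_url).netloc
    crawled = 0
    while crawled < max_pages:
        # Lowest depth first => breadth-first order.
        row = db.execute("SELECT url, depth FROM pages WHERE done = 0 "
                         "ORDER BY depth LIMIT 1").fetchone()
        if row is None:
            break
        url, depth = row
        db.execute("UPDATE pages SET done = 1 WHERE url = ?", (url,))
        html = fetch(url)
        crawled += 1
        if html is None:
            continue
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)
            if urlparse(target).netloc == host:  # internal links only
                db.execute("INSERT OR IGNORE INTO pages (url, depth) "
                           "VALUES (?, ?)", (target, depth + 1))
        db.commit()
    return {u: d for u, d in
            db.execute("SELECT url, depth FROM pages WHERE done = 1")}
```

Passing a `fetch` function keeps the sketch testable without network access; a real implementation would wrap an HTTP client with timeouts and robots.txt handling, and would add the thread pool the posting asks for.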
The following options would also be welcome:
- ability to select how many parallel threads run (e.g. 1-100)
- for each page, a count of how many clicks it is from the root URL
- for each page, a count of how many internal links point to it
- ability to export all external links that point to 404 domains
As an additional function, the program could run a WHOIS check on the 404 domains to determine whether they are registered, exporting registered and unregistered 404 domains separately.
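The optional WHOIS step could be sketched as a raw RFC 3912 query over port 43. `whois_query` and `looks_registered` are hypothetical names, and the "No match" heuristic is registry-specific (an assumption that holds for the Verisign .com/.net registry); a production tool would follow the referral server that whois.iana.org returns for each TLD.

```python
import socket

def whois_query(domain, server="whois.iana.org", timeout=10):
    """Send a raw WHOIS query (RFC 3912) and return the response text.
    whois.iana.org replies with a referral to the TLD's registry server,
    which a production tool would then query directly."""
    with socket.create_connection((server, 43), timeout=timeout) as sock:
        sock.sendall(domain.encode("ascii") + b"\r\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")

def looks_registered(response):
    """Crude heuristic: unregistered .com/.net domains come back with
    'No match' from the registry; other TLDs use different phrases."""
    return "no match" not in response.lower()
```

Splitting the 404 domains on `looks_registered` would give the two export lists the posting mentions (registered vs. unregistered).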
Extra notes:
Jack B.
- Rating: 99% (183)
- Projects completed: 265
- Freelancers worked with: 179
- Projects awarded: 57%
- Last project: 2 Apr 2024
- United Kingdom