PHP+JS spider development

  • Proposals: 3
  • Remote
  • #103894
  • Archived
Sasidhar D., Vladimir P., and Pradeep Y. have already sent a proposal.

Description

Experience Level: Expert
Create a JavaScript- and PHP-based spider generator for a Linux web server with cron jobs enabled. It is acceptable if the initial version works in Firefox only.

In broad outline:

We want a spider generator that lets someone with no knowledge of the DOM or HTML4/5 elements select parts of a page interactively. This could be done by manipulating the CSS border or background of the elements, as 'inspect element' and similar tools do in Firefox.

The selection process should then allow, through an overlaid form, the naming of items, the identification of repeating groups within a page, and the identification of links of a certain style as subpages that also need to be spidered, either for additional data for the current page or possibly for dependent data items.

Once selections are complete, a set of spidering rules is to be generated, along with the user's choice of frequency and timing, e.g. daily from 02:00.
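As an illustration only, the user's frequency and timing choice could be turned into a crontab line roughly as follows. This is a minimal sketch: the function name `toCronLine`, the `schedule` fields (`frequency`, `time`, `dayOfWeek`) and the command string are assumptions made for the example, not part of the specification.

```javascript
// Hypothetical helper: converts a schedule choice into a crontab line.
// The schedule shape { frequency, time, dayOfWeek } is an assumption.
function toCronLine(schedule, command) {
  // Accept only 24-hour HH:MM times such as "02:00".
  const match = /^([01]\d|2[0-3]):([0-5]\d)$/.exec(schedule.time);
  if (!match) {
    throw new Error('time must be HH:MM in 24-hour format');
  }
  const hour = Number(match[1]);
  const minute = Number(match[2]);

  // Cron field order: minute, hour, day of month, month, day of week.
  let fields;
  switch (schedule.frequency) {
    case 'hourly':
      fields = [minute, '*', '*', '*', '*'];
      break;
    case 'daily':
      fields = [minute, hour, '*', '*', '*'];
      break;
    case 'weekly': {
      // 0 = Sunday ... 6 = Saturday; defaults to Monday if not given.
      const dow = schedule.dayOfWeek === undefined ? 1 : schedule.dayOfWeek;
      if (!Number.isInteger(dow) || dow < 0 || dow > 6) {
        throw new Error('dayOfWeek must be an integer from 0 to 6');
      }
      fields = [minute, hour, '*', '*', dow];
      break;
    }
    default:
      throw new Error('unsupported frequency: ' + schedule.frequency);
  }
  return fields.join(' ') + ' ' + command;
}
```

For the "daily from 02:00" example, `toCronLine({ frequency: 'daily', time: '02:00' }, 'php spider.php')` returns `0 2 * * * php spider.php`, where `php spider.php` stands in for whatever command starts the spidering engine.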

A spidering engine is to be delivered that takes the spidering rules and spiders at the required time and frequency. All spidered pages are to be gzipped and saved to disk. Timing is to be handled by cron job.

The spider does not need to handle arbitrary websites. For the initial delivery we expect it to extract data from Sportinglife.com, racingpost.com, and pedigreequery.com only. We hold archived copies of pages from these sites, in slightly different formats, and will use them for our acceptance testing.

All code must be written clearly, with no obfuscation, so that we can maintain it ourselves if we do not commission further work from you as separate requests.

We suggest the following for some parts of the development:

Allow a URL to be pasted or typed into a form for page manipulation, along with a name for the MySQL table to be constructed, if one is needed at this stage; the page may exist only for inspection of linked pages. (1)

Load the page with extra JavaScript so that it operates in a similar fashion to the Web Developer plugin for Firefox (CSS -> View Style Information), at least as far as outlining moused-over elements.

As an element is moused over, an outline appears. When the element is clicked, the user chooses from an overlaid menu what the content is: a per-page data item, a per-page identifier that may be used to index the data, a repeating group, a data item in a repeating group, or a new page group, which opens a new iteration from (1) above.
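The mouse-over and click behaviour described above might be sketched as below. This is only an outline under stated assumptions: `attachHoverOutline` and the `onPick` callback are hypothetical names, the red outline is an arbitrary choice, and the overlaid menu itself is left to the callback.

```javascript
// Hypothetical sketch: outline the element under the mouse and report clicks.
// `doc` would be the loaded page's document; `onPick` would open the overlaid menu.
function attachHoverOutline(doc, onPick) {
  let current = null;     // element currently outlined
  let savedOutline = '';  // its original inline outline, restored on exit

  function clear() {
    if (current) {
      current.style.outline = savedOutline;
      current = null;
    }
  }

  // Capture phase (third argument true) so the page's own handlers do not interfere.
  doc.addEventListener('mouseover', function (event) {
    clear();
    current = event.target;
    savedOutline = current.style.outline;
    current.style.outline = '2px solid red';
  }, true);

  doc.addEventListener('mouseout', function () {
    clear();
  }, true);

  doc.addEventListener('click', function (event) {
    // Stop links and buttons from acting while the user is selecting.
    event.preventDefault();
    event.stopPropagation();
    onPick(event.target);
  }, true);
}

// In the browser this might be called as:
//   attachHoverOutline(document, function (el) { /* show the item-type menu for el */ });
```

Using `outline` rather than `border` is a design choice here, since an outline does not change the element's size and so does not shift the page layout while the user hovers.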

After the item type is selected, the item can be given a datafield name, which will be used as the MySQL database column name.
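Because the datafield name is typed by a non-technical user, it would probably need normalising before it is used as a column name. The sketch below shows one conservative approach; the function name `toColumnName` and the `f_` prefix for names starting with a digit are assumptions for the example. The 64-character cut-off matches MySQL's maximum identifier length.

```javascript
// Hypothetical helper: normalise a user-typed label into a safe column name.
function toColumnName(label) {
  let name = label
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '_')   // collapse anything else into underscores
    .replace(/^_+|_+$/g, '');      // drop leading and trailing underscores

  if (name === '') {
    throw new Error('label contains no usable characters');
  }
  // Assumption: prefix names that start with a digit, to avoid all-digit identifiers.
  if (/^[0-9]/.test(name)) {
    name = 'f_' + name;
  }
  // MySQL identifiers are limited to 64 characters.
  return name.slice(0, 64);
}
```

With this sketch, `toColumnName('Horse Name')` gives `horse_name` and `toColumnName('2nd Place (SP)')` gives `f_2nd_place_sp`.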

This will need to recognise repeating groups in a page: either table rows (which could be single-row or multiple-row groups) or other HTML elements such as P, DIV, …, ARTICLE and other HTML5 elements. The level of repeat may be identified by class, by id format (e.g. "tr1", "tr2", … "tr23"), or only by a common parent.
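For the id-format case, one possible way to spot a repeating group is to look for ids that share a prefix and differ only in a trailing number. The sketch below is an illustration under that assumption; `groupIdsByPrefix` and `escapeRegExp` are hypothetical helper names, and repeats identified by class or by common parent would need separate handling.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: group ids such as "tr1", "tr2", "tr23" by shared prefix.
// Returns one entry per prefix that occurs at least twice.
function groupIdsByPrefix(ids) {
  const groups = new Map();
  for (const id of ids) {
    // Split into a (possibly empty) prefix and a trailing run of digits.
    const match = /^(.*?)(\d+)$/.exec(id);
    if (!match) {
      continue; // ids without a numeric suffix are not candidates
    }
    const prefix = match[1];
    if (!groups.has(prefix)) {
      groups.set(prefix, []);
    }
    groups.get(prefix).push(id);
  }

  const result = [];
  for (const [prefix, members] of groups) {
    if (members.length >= 2) {
      result.push({
        prefix: prefix,
        pattern: '^' + escapeRegExp(prefix) + '\\d+$', // stored as a string for the rules
        members: members
      });
    }
  }
  return result;
}
```

Given the ids `tr1`, `tr2`, `header`, `tr23` and `nav1`, this returns a single group with prefix `tr` and pattern `^tr\d+$`; `nav1` is ignored because it has no siblings with the same prefix.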

The selected item can then be named and flagged as an identifier if it is to be used as one.

At the back end, the database needs to store the rules that define how to spider this page.
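One way the stored rules might look is a small record per selected item, serialised as JSON for a database column. Everything in this sketch is an assumption for illustration: the type names in `ITEM_TYPES` mirror the five menu choices described earlier, and the use of a CSS selector to locate each item is only one possible approach.

```javascript
// Hypothetical item types, mirroring the choices in the overlaid menu.
const ITEM_TYPES = [
  'page_item',        // per-page data item
  'page_identifier',  // per-page identifier used to index the data
  'repeating_group',  // a repeating group
  'group_item',       // data item inside a repeating group
  'page_group'        // linked pages that start a new iteration from (1)
];

// Build one rule; `selector` is assumed to be a CSS selector for the element.
function makeRule(selector, type, fieldName) {
  if (!ITEM_TYPES.includes(type)) {
    throw new Error('unknown item type: ' + type);
  }
  return { selector: selector, type: type, fieldName: fieldName };
}

// Serialise a full rule set so it can be stored as text in the database.
function serializeRuleSet(url, tableName, schedule, rules) {
  return JSON.stringify({
    url: url,
    tableName: tableName,
    schedule: schedule,
    rules: rules
  });
}
```

The spidering engine could then read this text back, parse it, and apply each rule to the fetched page.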

We can provide working code for spidering and gzipping results.
