Scrape the data from Google News, Yahoo News, and Bing News


Fixed Price

£200(approx. $267)

Posted: 12 years ago
Proposals: 5
Remote
#397085
Awarded


Description

Experience Level: Expert

Estimated project duration: 1 - 2 weeks

General information for the business: Big Data Analysis
Kind of development: Customization of existing program
Description of requirements/functionality: Requirement Spec.

I would like a Java project that can scrape data from Google News, Yahoo News, and Bing News for analysis/statistics.

1.1 Skill Set Required:
If you do not have the following skills, PLEASE do not apply!

1.1.1 JAVA
This is the language the application must be written in.
The source code must be delivered as an Eclipse Java project.
I plan to maintain the code myself and, as a statistician, my Java is not very good,
so please use OOA/OOD and the highest coding standards.

1.1.2 Selenium Webdriver - The Tool to be used for scraping.
1.1.3 HTML, CSS, XPATH - To select the nodes from html document.
1.1.4 REGEX - To further process the nodes for output once they have been selected.
1.1.5 XML - Output of processed data is going to be in XML.
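
To illustrate how skills 1.1.3 to 1.1.5 fit together, here is a minimal sketch of the select, clean, and output steps. It is not the required implementation: it uses the JDK's built-in XPath and XML classes on a small well-formed snippet rather than Selenium on a live page, and the class name `NewsItemExtractor`, the XPath `//h3/a`, and the output element names `results` and `item` are all assumptions made for illustration (the real output format is defined by the uploaded sample XML file).

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper: selects nodes with XPath, cleans text with a regex, emits XML. */
final class NewsItemExtractor {

    /** Made-up, well-formed markup standing in for a scraped results page. */
    static final String SAMPLE =
        "<div>"
        + "<h3><a href=\"http://example.com/a\">Brazil   beef\n exports rise</a></h3>"
        + "<h3><a href=\"http://example.com/b\"> Second  item </a></h3>"
        + "</div>";

    // Regex step: any run of whitespace is collapsed to a single space.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String toOutputXml(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document page = builder.parse(new InputSource(new StringReader(markup)));

            // XPath step: select every link nested in an h3 heading (assumed layout).
            NodeList links = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("//h3/a", page, XPathConstants.NODESET);

            // XML step: build the output document (element names are assumptions).
            Document out = builder.newDocument();
            Element root = out.createElement("results");
            out.appendChild(root);
            for (int i = 0; i < links.getLength(); i++) {
                Element link = (Element) links.item(i);
                Element item = out.createElement("item");
                item.setAttribute("link", link.getAttribute("href"));
                String title = WHITESPACE.matcher(link.getTextContent()).replaceAll(" ").trim();
                item.setTextContent(title);
                root.appendChild(item);
            }

            // Serialize the output document to a string without the XML declaration.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter writer = new StringWriter();
            transformer.transform(new DOMSource(out), new StreamResult(writer));
            return writer.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not process markup", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toOutputXml(SAMPLE));
    }
}
```

In the real application the nodes would come from the Selenium FirefoxDriver rather than from a string, since live news pages are rarely well-formed XML.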

2.1 Software Requirements:
Must tick off 'all' of the points in the Software Requirement.

2.1.1 Must be Java application.
2.1.2 Must be a Java console application (no fancy GUI).

2.2.1 Must output logs to standard output.
2.2.2 Must also output logs to file using log4j.

2.3.1 Must use the Selenium Server Standalone jar for scraping.
2.3.2 Must use the Selenium FirefoxDriver WebDriver.
2.3.3 Must be called SEScrape.
2.3.4 Starting point must be the Factbook Eclipse project you have already delivered, i.e. Factbook must still work!
2.3.5 Must inherit from the same base class as Factbook.

2.4.1 Must have following command line arguments:

Usage: SEScrape -option
where options include:

2.4.1.1 -url= url to be scraped
2.4.1.2 -list= filename with urls to be scraped
2.4.1.3 -pages= number of pages to scrape; if not present, 1 is assumed.
2.4.1.4 -version print product version and exit
2.4.1.5 -? -help print this help message

2.5.1 Must output the following components:

2.5.1.1 xml file called {domain}/se.{md5(url)}.xml for each url (see chapter 2.6.1)
2.5.1.2 image file called {domain}/se.{md5(url)}.{md5(link)}.{ext} for each url (image associated with search item, if any)
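
As a rough illustration of the naming scheme in 2.5.1, the file names could be built as follows. The class and method names are hypothetical, and two points are assumptions rather than part of the spec: that {domain} means the host name of the url, and that md5 is taken over the UTF-8 bytes of the string and written as 32 lowercase hex digits.

```java
import java.math.BigInteger;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper that builds the output file names described in 2.5.1. */
final class OutputNames {

    /** MD5 of the UTF-8 bytes of the text, as 32 lowercase hex digits. */
    static String md5Hex(String text) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(text.getBytes(StandardCharsets.UTF_8));
            // %032x keeps leading zeros that BigInteger would otherwise drop
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    /** Assumption: {domain} is the host part of the url. */
    static String domainOf(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            throw new IllegalArgumentException("No host in url: " + url);
        }
        return host;
    }

    /** {domain}/se.{md5(url)}.xml */
    static String xmlPath(String url) {
        return domainOf(url) + "/se." + md5Hex(url) + ".xml";
    }

    /** {domain}/se.{md5(url)}.{md5(link)}.{ext} */
    static String imagePath(String url, String link, String ext) {
        return domainOf(url) + "/se." + md5Hex(url) + "." + md5Hex(link) + "." + ext;
    }

    public static void main(String[] args) {
        String url = "https://www.google.co.uk/#q=brazil+beef&tbm=nws";
        System.out.println(xmlPath(url));
        System.out.println(imagePath(url, "http://example.com/image.jpg", "jpg"));
    }
}
```

Hashing the url keeps the file names a fixed length and free of characters that are not allowed in Windows file names.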

2.6.1 The uploaded file is the sample output XML produced by running "SEScrape -url 'https://www.google.co.uk/#q=brazil+beef&tbm=nws' -pages 5".
2.6.2 Please read through the sample output XML file (it has some comments) and the url, and if anything is unclear, please ask questions.

3.1. Deliverables
3.1.1 XML output from 1 random Google News search of my selection.
3.1.2 Once I am happy with 3.1.1, the same output for Yahoo News and Bing News.
3.1.3 Once I am happy with 3.1.2, the Java source code as an Eclipse project, with instructions on how to build and run it.
OS requirements: Windows
Extra notes:


Tony W.
