Scrape the data from google news, yahoo news,
£200 (approx. $251)
- Posted:
- Proposals: 5
- Remote
- #397085
- Awarded
Description
Experience Level: Expert
Estimated project duration: 1 - 2 weeks
General information for the business: Big Data Analysis
Kind of development: Customization of existing program
Description of requirements/functionality: Requirement Spec.
I would like a Java project that can scrape data from Google News, Yahoo News, and Bing News for analysis/statistics.
1.1 Skill Set Required:
If you do not have the following skills, PLEASE do not apply!
1.1.1 JAVA
This is the language the application must be written in.
The source code must be delivered as an Eclipse Java project.
I plan to maintain the code myself, and as a statistician my Java is not very good,
so please use OOA/OOD and the highest coding standards.
1.1.2 Selenium Webdriver - The Tool to be used for scraping.
1.1.3 HTML, CSS, XPATH - To select the nodes from html document.
1.1.4 REGEX - To further process the nodes for output once they have been selected.
1.1.5 XML - Output of processed data is going to be in XML.
2.1 Software Requirements:
The application must satisfy all of the points in the Software Requirements.
2.1.1 Must be a Java application.
2.1.2 Must be a Java console application (no fancy GUI).
2.2.1 Must output logs to standard output.
2.2.2 Must also output logs to a file using log4j.
2.3.1 Must use the Selenium Server Standalone jar for scraping.
2.3.2 Must use the Selenium FirefoxDriver WebDriver.
2.3.3 Must be called SEScrape.
2.3.4 Starting point must be the Factbook Eclipse project you have already delivered, i.e. Factbook must still work!
2.3.5 Must inherit from the same base class as Factbook.
2.4.1 Must have the following command line arguments:
Usage: SEScrape -option
where options include:
2.4.1.1 -url= URL to be scraped
2.4.1.2 -list= filename with URLs to be scraped
2.4.1.3 -pages= number of pages to scrape; if not present, 1 is assumed.
2.4.1.4 -version print product version and exit
2.4.1.5 -? -help print this help message
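The option handling described above could be sketched roughly as follows. This is only an illustration, not the required design: the class name `SEScrapeOptions` and its fields are hypothetical, and it assumes the `-name=value` form shown in the usage text (the sample command in 2.6.1 passes the URL as a separate argument, which this sketch does not handle).

```java
/** Hypothetical holder for the SEScrape command line options (illustrative names). */
final class SEScrapeOptions {
    String url;           // value of -url=, or null if not given
    String listFile;      // value of -list=, or null if not given
    int pages = 1;        // defaults to 1 when -pages= is absent
    boolean showVersion;  // set by -version
    boolean showHelp;     // set by -? or -help

    /** Parses arguments of the form -name=value; rejects unknown options. */
    static SEScrapeOptions parse(String[] args) {
        SEScrapeOptions o = new SEScrapeOptions();
        for (String arg : args) {
            if (arg.startsWith("-url=")) {
                o.url = arg.substring("-url=".length());
            } else if (arg.startsWith("-list=")) {
                o.listFile = arg.substring("-list=".length());
            } else if (arg.startsWith("-pages=")) {
                // Integer.parseInt throws NumberFormatException on non-numeric input
                o.pages = Integer.parseInt(arg.substring("-pages=".length()));
                if (o.pages < 1) {
                    throw new IllegalArgumentException("-pages must be at least 1");
                }
            } else if (arg.equals("-version")) {
                o.showVersion = true;
            } else if (arg.equals("-?") || arg.equals("-help")) {
                o.showHelp = true;
            } else {
                throw new IllegalArgumentException("Unknown option: " + arg);
            }
        }
        return o;
    }
}
```

A caller would typically check `showHelp` and `showVersion` first and exit before any browser is started.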
2.5.1 Must output the following components:
2.5.1.1 xml file called {domain}/se.{md5(url)}.xml for each url (see chapter 2.6.1)
2.5.1.2 image file called {domain}/se.{md5(url)}.{md5(link)}.{ext} for each url (image associated with search item, if any)
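One possible reading of the file naming scheme above, as a minimal sketch. It assumes that {domain} means the host part of the URL, that the MD5 is taken over the UTF-8 bytes of the string, and that the digest is written as 32 lowercase hex characters; none of this is stated in the spec, and the class and method names are hypothetical.

```java
import java.math.BigInteger;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper that builds the output file names (illustrative only). */
final class SEOutputNames {

    /** Returns the MD5 of the text as 32 lowercase hex characters (assumes UTF-8). */
    static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Assumption: {domain} is the host of the URL, e.g. www.google.co.uk. */
    static String domainOf(String url) {
        return URI.create(url).getHost();
    }

    /** {domain}/se.{md5(url)}.xml */
    static String xmlPath(String url) {
        return domainOf(url) + "/se." + md5Hex(url) + ".xml";
    }

    /** {domain}/se.{md5(url)}.{md5(link)}.{ext} */
    static String imagePath(String url, String link, String ext) {
        return domainOf(url) + "/se." + md5Hex(url) + "." + md5Hex(link) + "." + ext;
    }
}
```

Hashing the URL keeps file names a fixed length and free of characters that Windows does not allow in paths, such as `?` and `:`.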
2.6.1 Attached is the sample output XML file produced by running "SEScrape -url 'https://www.google.co.uk/#q=brazil+beef&tbm=nws' -pages 5".
2.6.2 Please review the sample output XML file (it has some comments) and the URL, and ask questions if anything is unclear.
3.1. Deliverables
3.1.1 XML output from one Google News search of my selection.
3.1.2 Once I am happy with 3.1.1, the same output for Yahoo News and Bing News.
3.1.3 Once I am happy with 3.1.2, the Java source code as an Eclipse project, with instructions on how to build and run it.
OS requirements: Windows
Extra notes:
Tony W.
- 100% (2)
- Projects Completed: 5
- Freelancers worked with: 4
- Projects awarded: 67%
- Last project: 10 Oct 2020
- United Kingdom