
Scrape a website and extract company data and save to CSV data files & download items
£50(approx. $67)
- Posted:
- Proposals: 3
- Remote
- #498125
- Awarded
Description
Experience Level: Intermediate
Estimated project duration: 1 - 2 weeks
General information for the business: Wine Merchant looking to source new products
Kind of development: New program from scratch
Scrape a specific website
To obtain the exhibitors' details and classification systems, then record them in CSV (comma and double-quote delimited) data files (removing any conflicting ")
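A minimal sketch of the CSV output described above, assuming every field is wrapped in double quotes and any double quote inside a value is simply dropped. The helper names `clean` and `write_rows` are illustrative and not part of the brief.

```python
import csv
import io


def clean(value):
    """Drop double quotes from a value so they cannot clash with the CSV quoting."""
    return str(value).replace('"', "").strip()


def write_rows(handle, header, rows):
    """Write a header and rows with every field double-quoted and comma separated."""
    writer = csv.writer(handle, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(header)
    for row in rows:
        writer.writerow([clean(value) for value in row])


# Example: an in-memory buffer stands in for a file such as base/table1.csv.
buffer = io.StringIO()
write_rows(buffer, ["Company Key", "Company Name"], [[12277, '13 Jul. "Plantaze"']])
csv_text = buffer.getvalue()
```

When writing to a real file, opening it with `newline=""` is the usual way to avoid doubled line endings on Windows.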
Description of requirements/functionality: To provide a script to scrape the following URL:
http://2014.londonwinefair.com/zone/ExhibitorList/Exhibitors/All
Ideally using Python and standard scraping libraries (Beautiful Soup etc), but willing to consider PHP.
To create a database as follows and download the linked files (logos and documents) to the local PC
To not overload the target website and to be generally sympathetic as a scraping script.
Please ask any questions about the specification and requirements once you have reviewed the database structure required and the target website. Project needed by 14th June 2014 at the latest.
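One way to keep the script sympathetic to the target site is to enforce a minimum gap between requests. The sketch below assumes a 2 second gap, which is an arbitrary illustrative value; the `Throttle` class and its parameters are hypothetical names, not something the brief specifies.

```python
import time


class Throttle:
    """Ensure at least min_interval seconds pass between successive requests."""

    def __init__(self, min_interval=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the timing logic can be tested
        self._sleep = sleep
        self._last = None        # time of the previous request, if any

    def wait(self):
        """Call immediately before each request; sleeps only if needed."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

In use, the script would create one `Throttle` and call `wait()` before every page or file download, so the gap applies across the whole run.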
thanks
CMS and Admin requirements: To log any errors to a separate "error_dd-mm-yy_hh;mm:ss.txt" file with full details of each error and the point in the code that caused or found it (e.g. 404, timeout, access denied).
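The error log could be sketched as below. One assumption to flag: Windows does not allow ":" in file names, so this sketch uses hyphens in the time part instead of the separators in the pattern above. The function names and the tab-separated line layout are illustrative only.

```python
import traceback
from datetime import datetime


def error_log_name(started):
    """Build the log file name from the run start time (hyphens keep it valid on Windows)."""
    return "error_" + started.strftime("%d-%m-%y_%H-%M-%S") + ".txt"


def log_error(log_path, url, exc):
    """Append one line with the time, URL, error and the code location that raised it."""
    frames = traceback.extract_tb(exc.__traceback__)
    where = f"{frames[-1].name}:{frames[-1].lineno}" if frames else "unknown"
    stamp = datetime.now().isoformat(timespec="seconds")
    line = f"{stamp}\t{url}\t{type(exc).__name__}: {exc}\tat {where}\n"
    with open(log_path, "a", encoding="utf-8") as handle:
        handle.write(line)
    return line
```

A caller would wrap each download in `try`/`except`, pass the caught exception to `log_error`, and carry on with the next exhibitor.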
To provide working code with basic annotation to allow user modification for future re-use
Specific technologies required: Ideally using Python and standard scraping libraries (Beautiful Soup etc), but willing to consider PHP.
OS requirements: Windows, Linux
Extra notes: Table 1 - Company Details
-Fields
=from main page
1. Company Key (use the 5-digit code used by the website, e.g. 12277 for "13 Jul. Plantaze")
2. Company Name
3. Company Logo Link full size (to be saved as Company key in Company directory under Base Directory)
4. Stand numbers (as 1 field "," separated)
5. LinkedIn URL
6. Twitter handle
7. Facebook
8. exhib-details (text field no html)
= from sub page
9. Contact Name (fn)
10. Telephone
11. Website (web)
12. email
Table 2 - tradeInformation
1. Info Key
2. Item Description as elements below (or any others in the dataset)
examples: "On trade supplier"; "Multiple retail supplier"; "The UK independent retail sector"; "Seeking Agent"; "Seeking Distributor"; "Meet the Winemaker"
Table 3 - Company to Trade Information
1. Company Key
2. Info Key
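Tables 2 and 3 could be built in one pass, as in the sketch below. It assumes Info Keys are simply numbered in order of first appearance, which the brief does not specify; `build_trade_tables` and the company key 12300 in the usage note are hypothetical.

```python
def build_trade_tables(company_items):
    """Turn {company_key: [descriptions]} into Table 2 and Table 3 rows.

    Table 2 rows are (info_key, description); Table 3 rows are
    (company_key, info_key). Keys are assigned in order of first appearance.
    """
    info_keys = {}   # description -> info key
    links = []       # Table 3 rows, in input order
    seen = set()     # guards against duplicate company/info pairs
    for company_key, descriptions in company_items.items():
        for description in descriptions:
            text = description.strip()
            if text not in info_keys:
                info_keys[text] = len(info_keys) + 1
            link = (company_key, info_keys[text])
            if link not in seen:
                seen.add(link)
                links.append(link)
    table2 = [(key, text) for text, key in info_keys.items()]
    return table2, links
```

For example, passing `{12277: ["Seeking Agent", "On trade supplier"], 12300: ["Seeking Agent"]}` gives two Table 2 rows and three Table 3 rows, with "Seeking Agent" stored once.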
Table 4 - Company Documents
1. Company Key
2. Document Name link
3. Document Description text
> All Documents to be saved as original name in own Company key directory under "Company" under Base Directory
Table 5 - Company Categories
1. Company Key
2. Category ID
Table 6 - Categories Table
1. Category ID
2. Category Name
> Can be obtained from http://2014.londonwinefair.com/zone/ExhibitorList (click all categories)
> need to remove "(" ... ")" from end of description
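Stripping the trailing bracketed text from a category name can be done with a small regular expression. This sketch assumes only a final "(...)" group should go and that brackets elsewhere in the name are kept; the category names in the usage note are made up for illustration.

```python
import re

# Matches one bracketed group, plus surrounding spaces, at the very end of the text.
_TRAILING_PARENS = re.compile(r"\s*\([^()]*\)\s*$")


def clean_category_name(raw):
    """Remove a trailing "(...)" from a category description."""
    return _TRAILING_PARENS.sub("", raw).strip()
```

With this, `clean_category_name("Sparkling Wine (42)")` returns `"Sparkling Wine"`, while a name with no trailing brackets is returned unchanged.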
Table 7 - Brands
1. Brand Key (as used by the website)
2. Brand Name
3. Brand Logo full size (named using ID)
Table 8 - Brand to Company Link
1. Company Key
2. Brand Key
I would accept the data being un-normalised for "tradeInformation" and "Categories" (i.e. without a new key), and the same for "Brands" provided the website BrandID is retained.
Alternatively, I would consider the data going straight into a MySQL/MariaDB database.
Directory structure:
Base Directory (definable in script by modification of a variable)
  "Company" Directory
    (each Company ID), e.g. 12277
  "Brand" Directory
    (each Brand ID)
  Database Tables 1-8
e.g.
base/company/12277/12277.png
base/company/12277/document names as original .pdf or whatever
base/brand/5846/5846.png
base/table1.csv
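The layout above could be produced by a few path helpers, sketched here with `pathlib`. The function names are illustrative, and two assumptions are made: the logo keeps the file extension found in its URL (falling back to ".png" if there is none), and documents keep the last path segment of their URL as their name.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import unquote, urlsplit

BASE_DIR = Path("base")  # change this one variable to relocate all output


def _url_file(url):
    """Return the decoded last path segment of a URL as a PurePosixPath."""
    return PurePosixPath(unquote(urlsplit(url).path))


def company_logo_path(base, company_key, logo_url):
    """base/company/<key>/<key>.<ext>, with the extension taken from the URL."""
    suffix = _url_file(logo_url).suffix or ".png"  # assumed fallback
    return Path(base) / "company" / str(company_key) / f"{company_key}{suffix}"


def document_path(base, company_key, document_url):
    """base/company/<key>/<original document name>."""
    return Path(base) / "company" / str(company_key) / _url_file(document_url).name


def brand_logo_path(base, brand_key, logo_url):
    """base/brand/<brand key>/<brand key>.<ext>."""
    suffix = _url_file(logo_url).suffix or ".png"  # assumed fallback
    return Path(base) / "brand" / str(brand_key) / f"{brand_key}{suffix}"


def table_path(base, number):
    """base/table<number>.csv for tables 1 to 8."""
    return Path(base) / f"table{number}.csv"
```

Before saving a file, calling `path.parent.mkdir(parents=True, exist_ok=True)` creates the company or brand directory on first use.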
Richard L.
Rating: 99% (23)
Projects completed: 10
Freelancers worked with: 17
Projects awarded: 75%
Last project: 20 Sep 2019
United Kingdom