Scrape a website, extract company data, save it to CSV data files and download linked items

Fixed Price

£50(approx. $67)

Posted: 12 years ago
Proposals: 3
Remote
#498125
Awarded

Description

Experience Level: Intermediate

Estimated project duration: 1 - 2 weeks

General information for the business: Wine Merchant looking to source new products
Kind of development: New program from scratch

Scrape a specific website
To obtain the exhibitors' details and classification systems, then record them in CSV data files (comma separated and double quote delimited, with any conflicting " characters removed)
Description of requirements/functionality: To provide a script to scrape the following URL:
http://2014.londonwinefair.com/zone/ExhibitorList/Exhibitors/All
Ideally using Python and standard scraping libraries (Beautiful Soup etc), but willing to consider PHP.
To create a database as follows and download the linked files (logos and documents) to the local PC
To not overload the target website and to be generally sympathetic as a scraping script.
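
One way to keep the script sympathetic to the target website is to enforce a minimum gap between requests. The following is only a minimal sketch of that idea: the class name `Throttle` and the choice of delay are assumptions for illustration and are not part of the brief.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests (assumed value)
        self._clock = clock               # injectable so the class can be tested
        self._sleep = sleep
        self._last = None                 # time of the previous request, None at start

    def wait(self):
        """Block until min_interval seconds have passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

In use, the script would create one `Throttle(2.0)` (or whatever delay is agreed) and call `wait()` immediately before every page or file download.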

Please ask any questions about the specification and requirements once you have reviewed the required database structure and the target website. Project needed by 14th June 2014 at the latest.

thanks
CMS and Admin requirements: To log any errors to a separate "error_dd-mm-yy_hh;mm:ss.txt" file with full details of each error and the point in the code that caused or found it (e.g. 404, time out, access denied).
To provide working code with basic annotation to allow user modification for future re-use.
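
The error log could be handled with Python's standard `logging` module, roughly as sketched below. This is an illustration under stated assumptions: the function names and the logger name `scraper.errors` are hypothetical, and the time part of the file name uses hyphens because Windows (one of the target operating systems) does not allow `:` in file names.

```python
import logging
from datetime import datetime


def error_log_name(now):
    """Build the log file name from a datetime (hyphens in the time are an assumption)."""
    return now.strftime("error_%d-%m-%y_%H-%M-%S.txt")


def make_error_logger(path):
    """Return a logger that appends ERROR records to the given file."""
    logger = logging.getLogger("scraper.errors")  # hypothetical logger name
    logger.setLevel(logging.ERROR)
    handler = logging.FileHandler(path, encoding="utf-8")
    # funcName and lineno record the point in the code that reported the error.
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(levelname)s %(funcName)s:%(lineno)d %(message)s"))
    logger.addHandler(handler)
    return logger
```

A call such as `logger.error("404 for %s", url)` inside the download function would then record the status, the URL and the line that detected the problem.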
Specific technologies required: Ideally using Python and standard scraping libraries (Beautiful Soup etc), but willing to consider PHP.
OS requirements: Windows, Linux
Extra notes: Table 1 - Company Details
-Fields
=from main page
1. Company Key (use the 5 digit code used by the website, e.g. 12277 for "13 Jul. Plantaze")
2. Company Name
3. Company Logo Link full size (to be saved as Company key in Company directory under Base Directory)
4. Stand numbers (as 1 field "," separated)
5. Linkedin URL
6. Twitter handle
7. Facebook
8. exhib-details (text field no html)
= from sub page
9. Contact Name (fn)
10. Telephone
11. Website (web)
12. email
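
The CSV requirement (every field wrapped in double quotes, with any " inside a value removed) maps fairly directly onto Python's `csv` module. The sketch below shows one possible approach; the helper names `clean` and `write_rows` are hypothetical, and the stand numbers in the sample row are made up for illustration.

```python
import csv
import io


def clean(value):
    """Drop embedded double quotes and surrounding whitespace from a field value."""
    return str(value).replace('"', "").strip()


def write_rows(handle, header, rows):
    """Write a header and data rows with every field wrapped in double quotes."""
    writer = csv.writer(handle, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(header)
    for row in rows:
        writer.writerow([clean(v) for v in row])


# Demonstration with an in-memory buffer; the stand numbers are invented.
buffer = io.StringIO()
write_rows(buffer,
           ["Company Key", "Company Name", "Stand numbers"],
           [["12277", '13 Jul. "Plantaze"', "A10,B12"]])
print(buffer.getvalue())
```

When writing to a real file, opening it with `open(path, "w", newline="", encoding="utf-8")` avoids doubled line endings on Windows.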

Table 2 - tradeInformation
1. Info Key
2. Item Description as elements below (or any others in the dataset)
examples: "On trade supplier"; "Multiple retail supplier"; "The UK independent retail sector"; "Seeking Agent"; "Seeking Distributor"; "Meet the Winemaker"

Table 3 - Company to Trade Information
1. Company Key
2. Info Key
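
Tables 2 and 3 form a lookup table plus a link table. A minimal sketch of building both from scraped (company, description) pairs follows; it assumes Info Keys are simply numbered in order of first appearance, which the brief does not specify, and the function name `build_lookup` and the company key 10001 are hypothetical.

```python
def build_lookup(pairs):
    """Split (company_key, description) pairs into a lookup table and a link table."""
    info_keys = {}  # description -> Info Key, numbered by first appearance (assumption)
    links = []      # (Company Key, Info Key) rows for Table 3, without duplicates
    for company_key, description in pairs:
        key = info_keys.setdefault(description, len(info_keys) + 1)
        if (company_key, key) not in links:
            links.append((company_key, key))
    table2 = [(key, description) for description, key in info_keys.items()]
    return table2, links


scraped = [
    ("12277", "Seeking Agent"),
    ("12277", "Seeking Distributor"),
    ("10001", "Seeking Agent"),  # hypothetical second company
]
table2, table3 = build_lookup(scraped)
```

The same pattern would apply to Tables 5 and 6 (categories) and Tables 7 and 8 (brands), except that brands keep the key already used by the website.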

Table 4 - Company Documents
1. Company Key
2. Document Name link
3. Document Description text
> All Documents to be saved as original name in own Company key directory under "Company" under Base Directory

Table 5 - Company Categories
1. Company Key
2. Category ID

Table 6 - Categories Table
1. Category ID
2. Category Name
> Can be obtained from http://2014.londonwinefair.com/zone/ExhibitorList (click all categories)
> need to remove "(" ... ")" from end of description
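
Removing the trailing "(" ... ")" from a category description can be done with a small regular expression. This is a sketch only: the function name `clean_category` and the sample category names used with it are assumptions, since the actual names on the site are not listed here.

```python
import re

# Matches one parenthesised group at the very end of the string.
_TRAILING_PARENS = re.compile(r"\s*\([^()]*\)\s*$")


def clean_category(name):
    """Strip a trailing "(...)" suffix; parentheses elsewhere are left alone."""
    return _TRAILING_PARENS.sub("", name).strip()
```

Only a group at the end of the text is removed, so a name that contains parentheses in the middle is left unchanged.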

Table 7 - Brands
1. Brand Key (as used by the website)
2. Brand Name
3. Brand Logo full size (named using ID)

Table 8 - Brand to Company Link
1. Company Key
2. Brand Key

I would consider the data to be un-normalised for "tradeInformation" and "Categories" without a new key, and the same for "Brands" but retaining the website BrandID.
Alternatively, I would consider the data going straight into a MySQL/MariaDB database.

Directory structure:
Base Directory (definable in script by modification of a variable)
    "Company" Directory
        (each Company ID)
        ie 12277
    "Brand" Directory
        (each Brand ID)
    Database Tables 1-8

ie
base/company/12277/12277.png
base/company/12277/document names as original .pdf or whatever
base/brand/5846/5846.png
base/table1.csv
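
The directory layout above could be produced with `pathlib`, keeping the base directory in a single variable as requested. The sketch below is one possible arrangement: the function names are hypothetical, and falling back to `.png` when a logo URL has no file extension is an assumption.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import unquote, urlsplit

BASE_DIR = Path("base")  # change this one variable to relocate all output


def _url_file(url):
    """Return the final path component of a URL as a PurePosixPath."""
    return PurePosixPath(PurePosixPath(unquote(urlsplit(url).path)).name)


def company_logo_path(company_key, logo_url):
    """Logo is saved under the Company Key, keeping the original extension."""
    ext = _url_file(logo_url).suffix or ".png"  # .png fallback is an assumption
    return BASE_DIR / "company" / str(company_key) / f"{company_key}{ext}"


def document_path(company_key, doc_url):
    """Documents keep their original name inside the company's own directory."""
    return BASE_DIR / "company" / str(company_key) / _url_file(doc_url).name


def brand_logo_path(brand_key, logo_url):
    """Brand logo is saved under the Brand Key used by the website."""
    ext = _url_file(logo_url).suffix or ".png"
    return BASE_DIR / "brand" / str(brand_key) / f"{brand_key}{ext}"
```

Before writing a file, calling `path.parent.mkdir(parents=True, exist_ok=True)` creates the company or brand directory if it does not exist yet.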

Richard L.
