Website crawl and extraction in JSON
- or -
Post a project like this2953
$$
- Posted:
- Proposals: 6
- Remote
- #1064807
- Awarded
I run Centu. I am a web and software developer, project manager, and law graduate.
Manchester
Data Mining/Web Scraping/Software Testing/Selenium Webdriver/SOAPUI/Excel Macro/VB/Rest API
Bangalore
685930743560985927100195810792051129426
Description
Experience Level: Intermediate
General information for the business: JSON to be used as reference for an app
Kind of development: New program from scratch
Description of every module: - Crawl and extraction of every outlet categorised under "Coffee & Tea Shops" across the entire UK on yelp.co.uk
- Extraction of 10 random or most recent image URLs from Instagram for each outlet.
- Export as JSON
Detail of exact data required can be found in the attached JSON.
Description of requirements/functionality: Deliverables...
Format as JSON, pre-defined and attached (compressed.json)
For any data points that are missing, leave empty.
For Yelp mine...
Search by location found here - http://www.yelp.co.uk/locations
Exclude all locations outside the UK.
Each page has an 'edit business' link where you will find a lot of data as editable fields, which should make it easier.
E.g. URL - https://www.yelp.co.uk/biz_attribute?biz_id=CDxeSJg5Ro-6GxCDqYlhdg
Remove any spaces in postcode.
Opening and closing times - format in 24hour clock, no colon if possible.
priceRange - record as on the website... £, ££, £££, ££££
category - take from first string under the div with class "category-name"
catType - take from the the second string in that div, following
starCount - take icon the second class name (e.g. stars_4 and stars_4_half)
All other data should be self-explanatory.
For Instagram mine...
Use the name of each outlet to define a hashtag search (e.g. https://www.instagram.com/explore/tags/figandsparrow/)
Omit all spaces from the outlet name in the search.
Omit any special characters from the outlet name in the search (!, $, %, ^, *, .)
Replace '&' and '+' with the word 'and' in the search.
Retrieve images with a minimum resolution of 612 x 612, up to 1080 x 1080 to avoid picking up profile images.
Retrieve direct URL (e.g. https://scontent-lhr3-1.cdninstagram.com/t51.2885-15/s640x640/sh0.08/e35/12750376_181821128856635_1693363454_n.jpg?ig_cache_key=MTE4OTE0NjQ4NjQ3MTA2NTkzOA%3D%3D.2)
Specific technologies required: Python / PHP / etc...
Extra notes: I have attached two JSON files.
compressed.json - please use this format for the deliverable.
type.json - I have defined the type for each object here.
Feel free to make any suggestions to the JSON format.
Kind of development: New program from scratch
Description of every module: - Crawl and extraction of every outlet categorised under "Coffee & Tea Shops" across the entire UK on yelp.co.uk
- Extraction of 10 random or most recent image URLs from Instagram for each outlet.
- Export as JSON
Detail of exact data required can be found in the attached JSON.
Description of requirements/functionality: Deliverables...
Format as JSON, pre-defined and attached (compressed.json)
For any data points that are missing, leave empty.
For Yelp mine...
Search by location found here - http://www.yelp.co.uk/locations
Exclude all locations outside the UK.
Each page has an 'edit business' link where you will find a lot of data as editable fields, which should make it easier.
E.g. URL - https://www.yelp.co.uk/biz_attribute?biz_id=CDxeSJg5Ro-6GxCDqYlhdg
Remove any spaces in postcode.
Opening and closing times - format in 24hour clock, no colon if possible.
priceRange - record as on the website... £, ££, £££, ££££
category - take from first string under the div with class "category-name"
catType - take from the the second string in that div, following
starCount - take icon the second class name (e.g. stars_4 and stars_4_half)
All other data should be self-explanatory.
For Instagram mine...
Use the name of each outlet to define a hashtag search (e.g. https://www.instagram.com/explore/tags/figandsparrow/)
Omit all spaces from the outlet name in the search.
Omit any special characters from the outlet name in the search (!, $, %, ^, *, .)
Replace '&' and '+' with the word 'and' in the search.
Retrieve images with a minimum resolution of 612 x 612, up to 1080 x 1080 to avoid picking up profile images.
Retrieve direct URL (e.g. https://scontent-lhr3-1.cdninstagram.com/t51.2885-15/s640x640/sh0.08/e35/12750376_181821128856635_1693363454_n.jpg?ig_cache_key=MTE4OTE0NjQ4NjQ3MTA2NTkzOA%3D%3D.2)
Specific technologies required: Python / PHP / etc...
Extra notes: I have attached two JSON files.
compressed.json - please use this format for the deliverable.
type.json - I have defined the type for each object here.
Feel free to make any suggestions to the JSON format.
Al F.
0% (0)Projects Completed
-
Freelancers worked with
-
Projects awarded
100%
Last project
26 Apr 2024
United Kingdom
New Proposal
Login to your account and send a proposal now to get this project.
Log inClarification Board Ask a Question
-
There are no clarification messages.
We collect cookies to enable the proper functioning and security of our website, and to enhance your experience. By clicking on 'Accept All Cookies', you consent to the use of these cookies. You can change your 'Cookies Settings' at any time. For more information, please read ourCookie Policy
Cookie Settings
Accept All Cookies