Automatic scraping in Drupal 7 (+ Python?)

Ends in (days)

2124

Fixed Price

Posted: 6 years ago
Proposals: 3
Remote
#2047716
Awarded

Description

Experience Level: Intermediate

The goal is to add and update Drupal nodes based on content from external websites.
This is essentially web scraping for Drupal: new nodes are created from various listings.

The import will be driven by two CSV files containing the parameters needed to import the correct fields:
- one file listing the startup list URLs we'd like to crawl, for example listing-1.csv (attached)
- one file describing which elements of each startup page to import and the corresponding Drupal node field, for example xpath-fields-1.csv
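
As a rough illustration of how the two parameter files might drive the import, the sketch below loads them with Python's csv module. The attached files are not reproduced in this brief, so the column names (`list_url`, `xpath`, `drupal_field`) and the sample XPath and field values are assumptions made up for the example.

```python
import csv
import io

# Hypothetical file contents: the real listing-1.csv and xpath-fields-1.csv
# are attachments, so these columns and values are assumed for illustration.
LISTING_CSV = """list_url
https://angel.co/companies?locations[]=1717-France&company_types[]=SaaS&company_types[]=Startup
"""

XPATH_FIELDS_CSV = """xpath,drupal_field
//h1,title
//div[@class='description'],field_description
"""


def load_listing_urls(handle):
    """Return the list URLs to crawl, one per non-empty CSV row."""
    return [
        row["list_url"].strip()
        for row in csv.DictReader(handle)
        if row["list_url"].strip()
    ]


def load_field_map(handle):
    """Return a dict mapping each XPath expression to a Drupal field name."""
    return {
        row["xpath"].strip(): row["drupal_field"].strip()
        for row in csv.DictReader(handle)
    }


urls = load_listing_urls(io.StringIO(LISTING_CSV))
field_map = load_field_map(io.StringIO(XPATH_FIELDS_CSV))
```

Keeping the URLs and the XPath-to-field mapping in separate files is what would let the same code be pointed at another startup list by swapping the CSVs.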

The development must be adaptable so that it can later be reused to import data from another startup list.

Example:
In our case, we need to get the startup list from 'https://angel.co/companies?locations[]=1717-France&company_types[]=SaaS&company_types[]=Startup' and periodically update the corresponding Drupal nodes automatically.

Data that will be provided:
- node template of a Drupal 'startup' node type
- URL to scrape periodically (cf. attached files with parameters):
https://angel.co/companies?locations[]=1717-France&company_types[]=SaaS&company_types[]=Startup

For each startup listed, we need to be able to import data into a Drupal node.

For example, for the startup https://angel.co/appsfire, the file xpath-fields-1.csv gives the data we need to import and the corresponding Drupal fields:
- Title
- Image
- Startup description
- City, tags, number of employees, URL and social network links. In our case these would be imported into separate fields: 'Paris', 'iOS · Mobile · Android · Mobile Advertising', '11-50 employees', appsfire.com, http://twitter.com/appsfire, https://www.facebook.com/appsfire, http://www.linkedin.com/company/appsfire.com
- Founder name: Ouriel Ohayon
- Funding: ideally we should sum all investments, for example 3 600 000$ + 1 000 000$ in the case of appsfire.
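
The funding total described above could be computed along these lines. This is only a minimal sketch: it assumes the scraped amounts are whole-dollar figures written out in digits (as in '3 600 000$'), and the function names are hypothetical. Abbreviated forms such as '$1.5M' would need extra handling.

```python
import re


def parse_amount(text):
    """Convert a string such as '3 600 000$' to an integer number of dollars."""
    digits = re.sub(r"\D", "", text)  # drop spaces, currency signs, separators
    return int(digits) if digits else 0


def total_funding(amounts):
    """Sum a list of scraped funding-round strings."""
    return sum(parse_amount(a) for a in amounts)


rounds = ["3 600 000$", "1 000 000$"]  # the appsfire example from the brief
print(total_funding(rounds))  # 4600000
```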

** IMPORTANT **
- Web scraping should include a delay so that it does not get blacklisted by websites (harvesting a startup list could be spread over multiple hours or days).

- For the scraping, we need a solution compatible with AJAX websites.
It seems that Python libraries (Selenium, Scrapy) can do the job, but we are open to suggestions.
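
One possible way to honour the delay requirement is to pause for a randomised interval between consecutive requests. The sketch below is an assumption about how this might look, not a prescribed design: `fetch` is a hypothetical callable that loads one page, and the default of 30 seconds plus up to 15 seconds of jitter is an arbitrary placeholder to be tuned.

```python
import random
import time


def polite_delays(count, base_seconds=30.0, jitter_seconds=15.0, rng=random):
    """Yield `count` wait times of base_seconds plus a random jitter."""
    for _ in range(count):
        yield base_seconds + rng.uniform(0, jitter_seconds)


def crawl(urls, fetch, sleep=time.sleep, **delay_options):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    delays = polite_delays(len(urls), **delay_options)
    for index, url in enumerate(urls):
        if index > 0:
            sleep(next(delays))  # no wait before the first request
        results.append(fetch(url))
    return results
```

For an AJAX-heavy site, `fetch` could for instance wrap a Selenium WebDriver (calling `driver.get(url)` and returning `driver.page_source`), while the `sleep` parameter makes the pacing easy to replace in tests.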

NOTES:
- Ideally, we'd like a solution based on existing Drupal modules, for example Feeds, to perform this task.
The developer should be able to set up their own test site autonomously.

Clarification Board

18 Jun 2018

Mathieu, please share website links.