Data extraction / scraping from site, manipulation and import into Salesforce.com
£240 (approx. $301)
- Posted:
- Proposals: 8
- Remote
- #2293634
- OPPORTUNITY
- Expired
Description
Experience Level: Expert
Estimated project duration: less than 1 week
This project is to extract and clean data from a legacy website and import it into Salesforce.com. An extractor will need to be built, and you must be able to manipulate large volumes of data (approx. 2.5 million rows).
Please read through all of the information below and review the example documents provided before responding.
The steps would be:
• Build a bot to extract data from the legacy site. We estimate there are approx. 2.5 million rows of data on the legacy site
• Clean and format the data, ensuring all formatting standards for Salesforce data imports are followed
• Use the unique identifier (member number) to find the new system record identifier (Salesforce record id)
• QA the data by cross-checking, with the unique identifier as the reference, that the ‘name’ and ‘email’ in the new system match the ‘name’ extracted from the legacy system. Indicate an error if either does not match.
• Run a formula to return the correct ‘Diary record’ value from the inputs that will be provided
• Batch the cleaned and formatted data into CSV files of 20,000 rows each that are ready for import into Salesforce. These files will need to be provided
• Create ‘diary’ records: any member with activities before June 2017 will need a diary record created in Salesforce
• Create activity records in Salesforce, associating them with the correct diary (more information in the Excel spreadsheet)
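The lookup and QA steps above could be sketched roughly as follows. This is only an illustration under assumptions: the field names ("member_number", "name", "email", "id"), the error wording and the idea of a Salesforce export keyed by member number are all hypothetical, and the comparison ignores case and extra whitespace, which may or may not be acceptable for the real check.

```python
def normalize(value):
    """Trim, collapse whitespace and lowercase so cosmetic differences are ignored."""
    return " ".join((value or "").split()).lower()


def qa_check(legacy_rows, salesforce_by_member):
    """Return (member_number, salesforce_id, error) for each legacy row.

    legacy_rows: dicts extracted from the legacy site (hypothetical field names).
    salesforce_by_member: dict mapping member number -> Salesforce record dict.
    error is '' when every compared field matches.
    """
    results = []
    for row in legacy_rows:
        member = row["member_number"].strip()
        sf = salesforce_by_member.get(member)
        if sf is None:
            # Assumed error text; the brief does not name this case.
            results.append((member, "", "Member number not in Salesforce export"))
            continue
        # Only compare fields that the legacy row actually carries.
        mismatched = [
            field
            for field in ("name", "email")
            if field in row and normalize(row[field]) != normalize(sf.get(field))
        ]
        error = "Mismatch: " + ", ".join(mismatched) if mismatched else ""
        results.append((member, sf["id"], error))
    return results
```

Rows that come back with a non-empty error can be written to a separate file for review instead of being included in the import batches.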
The output CSVs should follow the formatting standards for Bulk API processing in the Salesforce Data Loader. Examples of guidance from Salesforce on formatting (please note these are not exhaustive):
• https://developer.salesforce.com/docs/atlas.en-us.198.0.api_asynch.meta/api_asynch/datafiles_csv_valid_record_rows.htm
• https://help.salesforce.com/articleView?id=supported_data_types.htm&type=5
• All data must be trimmed and bad characters removed
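As a rough sketch of the trimming and batching requirements, the snippet below cleans each cell and writes UTF-8 CSV files of 20,000 rows. The definition of a "bad character" here (Unicode control and format characters other than newline and tab) is an assumption, as are the function names, the file naming pattern and the choice to quote every field; the Salesforce guidance linked above remains the authority on the exact format.

```python
import csv
import unicodedata
from itertools import islice
from pathlib import Path

BATCH_SIZE = 20_000  # rows per file, as stated in the brief


def clean_cell(value):
    """Trim the value and drop control/format characters (assumed 'bad characters')."""
    text = "" if value is None else str(value)
    kept = (
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch)[0] != "C"
    )
    return "".join(kept).strip()


def write_batches(rows, fieldnames, out_dir, prefix="activities", batch_size=BATCH_SIZE):
    """Write rows (an iterable of dicts) to numbered CSV files; return their paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    rows = iter(rows)
    paths = []
    while True:
        batch = list(islice(rows, batch_size))
        if not batch:
            break
        path = out_dir / f"{prefix}_{len(paths) + 1:04d}.csv"
        with open(path, "w", newline="", encoding="utf-8") as fh:
            writer = csv.DictWriter(fh, fieldnames=fieldnames, quoting=csv.QUOTE_ALL)
            writer.writeheader()
            for row in batch:
                writer.writerow({name: clean_cell(row.get(name)) for name in fieldnames})
        paths.append(path)
    return paths
```

Because the rows are consumed lazily, the full 2.5 million rows never need to be held in memory at once.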
Based on my assessment of the site, the extraction bot will probably need to do the following. I have attached a PDF showing each screen and the actions that would be required:
• Go to http://cpd.rcslt.org/admin/list_users.php?search=1
• Input credentials
• Click Login
• From the CSV dataset, input into search: Salesforce RCSLT database member number (loop through each row of this dataset for the following steps). There are approx. 30k member numbers, and not all are expected to return results.
• Click Search
• If member is not found in legacy database: record an error as ‘Member not found’ at this step
• If member is found in legacy database: the page returned will display between 1 and 3 links to separate 'Learning Diaries', each navigating to detail page(s) where the data sits.
o Loop through each ‘Learning diary link’
o If no diaries are found under the account, record error ‘No diaries found’ at this step
o If there are multiple pages to the diary, links to all pages are listed at the header and footer of the page (there is no forward/back pagination)
From the first page (which may be the only page), extract ‘special circumstances’
For all diaries, loop through each page and
• Export the ‘activities’ table
• Append membership number to each row
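The per-member loop described above might look something like the sketch below. It uses only the Python standard library and leaves login, search and page fetching to two caller-supplied functions (`search` and `fetch_page`, both hypothetical), because the real site's forms and markup are not described here. It also assumes that activity rows are plain `<td>` cells and that the activities table is the only table on each diary page. Extraction of ‘special circumstances’ from the first page is omitted because its markup is unknown.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the <tr> currently open, if any
        self._cell = None  # text fragments of the <td> currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # Join fragments and collapse whitespace inside the cell.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # skip header rows that contain no <td> cells
                self.rows.append(self._row)
            self._row = None


def parse_activities(html):
    """Return the data rows of the table(s) found in one diary page."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


def extract_member(member_number, search, fetch_page):
    """Return (rows, errors) for one member number.

    search(member_number) -> None if the member is not found, otherwise a list
        of diaries, each diary being a list of page URLs (hypothetical helper).
    fetch_page(url) -> HTML of that diary page (hypothetical helper).
    """
    diaries = search(member_number)
    if diaries is None:
        return [], [(member_number, "Member not found")]
    if not diaries:
        return [], [(member_number, "No diaries found")]
    rows = []
    for page_urls in diaries:
        for url in page_urls:
            for cells in parse_activities(fetch_page(url)):
                rows.append(cells + [member_number])  # append membership number
    return rows, []
```

Passing the fetchers in as arguments keeps the parsing and error logic testable offline; in a real run they would wrap an authenticated HTTP session or a browser automation tool.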
Please reference ‘New CPD Diary’ in your proposal to confirm that you have read through this information.
Example input and outputs are attached.
Siobhan D.
96% (5)
Projects Completed: 5
Freelancers worked with: 5
Projects awarded: 25%
Last project: 8 Nov 2021
United Kingdom
Clarification Board
-
Hello Siobhan,
Do you have direct access to the legacy database?
Is your budget flexible?
Regards,
M. C.
-
Hi Siobhan
The web portal must use a database in the back-end to store all the data that you want to migrate. Do you not have access to the source?
Regards
Ian
Constructive Force Ltd