Script needed to identify duplicate real estate entries in MySQL database

  • Posted:
  • Proposals: 4
  • Remote
  • #665823
  • Awarded

Description

Experience Level: Expert
General information for the business: gathering and presenting real estate data
Database management system (DBMS): MySQL
Description of requirements/functionality: I have data from 3 separate property sources. In total, there are around 500,000 rows of data.
There are duplicates within each source because the data is received three times a week.
There are around 10,000 unique entries from each source.
There is also crossover between the sources, so the same property may be listed on all three sources.

The aim of this project is to create an adaptable and scalable script which will identify duplicates and store the data in a manageable manner.

The database is MySQL, hosted on Amazon.
I would prefer PHP (or Python) as the programming language.
The scripts need to be able to run from the command line via a cronjob.
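As an illustration of the cronjob requirement, an entry like the following (the script path, interpreter, and schedule are placeholders) would run the import three times a week, matching the data delivery cadence:

```shell
# Run the import script at 02:00 on Mon/Wed/Fri; path and log location are placeholders
0 2 * * 1,3,5 /usr/bin/python3 /path/to/import_sources.py >> /var/log/import_sources.log 2>&1
```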

I have provided 1000 rows from each of the three sources.
In the future, more sources will be added so consideration needs to be given to that.


*** Part 1
Import the 3 CSV files into "table1".
Provide a script which will enable me to do this in future.
The script needs to be able to import these CSV files even if the columns are in a different order from those provided.
It needs to be adaptable in case I add extra columns in the future (e.g. via a config file).
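One way to meet the "column order doesn't matter" and "extra columns via config" requirements is a per-source mapping from table1 columns to CSV header names. A minimal Python sketch (the config keys and column names here are hypothetical, not taken from the actual feeds):

```python
import csv
import io

# Hypothetical config: table1 column -> CSV header name, per source.
# Adding a new column or source only requires editing this mapping.
COLUMN_MAP = {
    "source1": {"property_id": "id", "address1": "addr_1", "price": "asking_price"},
}

def rows_for_table1(source, csv_text):
    """Yield dicts keyed by table1 columns, regardless of CSV column order."""
    mapping = COLUMN_MAP[source]
    reader = csv.DictReader(io.StringIO(csv_text))  # keyed by header, not position
    for row in reader:
        yield {table_col: row[csv_col] for table_col, csv_col in mapping.items()}
```

The resulting dicts can then be inserted into table1 with a parameterized INSERT; because `csv.DictReader` keys rows by header name, reordering the CSV columns has no effect.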

*** Part 2
Create a second table ("table2").
Do a lookup of the data in table1 to find each unique property on each portal and make an entry in table2. E.g. if the property was on all 3 sources, then there will be 3 entries in table2 for that property.

To identify properties in table1, it will be possible to use:
for source1 - the id can be found from the image
for source2 - this needs to be done another way
for source3 - the id can be used directly
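The per-source identification above could be dispatched through small helper functions. The sketch below is an assumption about the data shapes: the image-filename pattern for source1 is hypothetical, and source2 is left unimplemented since the brief says it needs another method:

```python
import re

def source1_id_from_image(image_url):
    """Source1: assume the property id is embedded in the image filename.
    The filename pattern here is hypothetical and must be adjusted to the real feed."""
    m = re.search(r"/(\d+)_\d+\.jpg$", image_url)
    return m.group(1) if m else None

def source3_id(row):
    """Source3: the row's own id column can be used directly."""
    return row["id"]
```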

The aim of this table is to list when a property becomes available on a portal (using the crawldate column) and then when it is no longer advertised on the portal.

In addition to whatever other columns we need, table2 should have columns for
crawlsource
crawlsourceid
date start listing
date end listing
difference between start and end (in days)

I will take advice on how much we should duplicate the data from table1.

Note: The date end listing field should only be updated if it is more than 7 days since it was listed in a CSV.
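The dates logic above (the days-difference column and the 7-day rule for closing a listing) can be sketched as two small functions; the names are illustrative only:

```python
from datetime import date, timedelta

def listing_days(start, end):
    """Value for the 'difference between start and end (in days)' column."""
    return (end - start).days

def should_close_listing(last_seen, today):
    """Per the note above: only set 'date end listing' once the property has
    been absent from the CSVs for more than 7 days."""
    return (today - last_seen) > timedelta(days=7)
```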

*** Part 3
This is the trickiest part of the project. It involves identifying duplicates between sources.

Create "table3" and find only the unique properties across all crawlsources. E.g. if the property was listed on all 3 crawl sources, then it should have just one entry in table3.
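Since the brief doesn't specify how properties should be matched across sources, one plausible approach (an assumption, and real data would likely need fuzzier matching) is to group rows by a normalized key built from the postcode and a cleaned address:

```python
import re
from collections import defaultdict

def match_key(row):
    """Normalized dedup key: postcode with spaces removed plus the address
    stripped of punctuation and case. This matching rule is an assumption."""
    addr = re.sub(r"[^a-z0-9]", "", row["address1"].lower())
    return (row["postcode"].replace(" ", "").upper(), addr)

def group_unique_properties(rows):
    """Group cross-source rows so each unique property yields one table3 entry."""
    groups = defaultdict(list)
    for row in rows:
        groups[match_key(row)].append(row)
    return groups
```

Each resulting group becomes one table3 row, with its member rows recording which crawlsources carried the property.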

*** Part 4
Once the unique properties have been identified, fill in the missing data.
This can be a separate script, run separately.

Examples of data which we will need to fill in:
format the address correctly (using address1, address2, city, postcode fields)
latitude/longitude
postcode

Using the work in Parts 2 and 3, it will be possible to use data from other crawlsources (e.g. one source has the lat/long for all properties, so if a property was listed there and we can identify it in Part 3, then we can fill in the missing data).
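The borrow-from-another-source idea can be sketched as follows: within one matched group from Part 3 (the same property across several sources), copy lat/long from any entry that has it into the entries that lack it. Field names are illustrative:

```python
def fill_missing_latlong(group):
    """Within one matched property group, propagate lat/long from a source
    that has the values into rows where they are missing."""
    donor = next((r for r in group if r.get("latitude") and r.get("longitude")), None)
    if donor:
        for r in group:
            if not r.get("latitude"):
                r["latitude"] = donor["latitude"]
                r["longitude"] = donor["longitude"]
    return group
```

The same pattern would apply to the other gap-filling tasks (formatted address, postcode), with per-field rules for which source is trusted.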
Extra notes:
