Script needed to identify duplicate real estate entries in mySQL database

  • Posted:
  • Proposals: 1
  • Remote
  • #690388
  • Awarded
has already sent a proposal.
  • 1


Experience Level: Expert
General information for the business: gathering and presenting real estate data
Database management system (DBMS): MySQL
I have data from 4 sources, but we will start off just comparing source2 and source1 (2 sources). The system needs to scale easily when the other 2 sources are added.

The aim of this project is to
create an adaptable and scaleable script which will identify duplicates and store the data is a manageable manner.

Database is mysql hosted on Amazon
I would prefer PHP (or Python) programming languages
The scripts need to be able to run from command line via cronjob

I have already populated the database with all the data up to 4 Feb from source2 and source1 sources (the table has around 250k lines in it)

*** Part 1
Provide a script which will run via command line and import CSVs from sources into table called "rawdata".
The script needs to be able to import these csv files even if the columns are in a different order to those provided.
Needs to be adaptable in case I add extra columns in the future (eg in a config file)

UPDATE: all properties now have coordinates.

*** Part 2
UPDATE: this is easy now because I have allocated a "crawlsourceid" to every entry.

Create a second table ("table2") which contains each each unique property from each source. eg if a property has 10 entries in "rawdata" from source2, it will just have one entry in table2. If the same property was also in source1, then there will be another line in table2 for the source1 property.

The aim of this table is to list when a property becomes available on a portal (using the crawldate column) and then when it is no longer advertised on the portal.

In addition to whatever other columns we need, table2 should have columns for
1. date start listing : the first time the property appeared on the source (crawldate in rawdata table)
2. date end listing : the last date the property appeared on the source. Only to be updated 7 days after the last time it appeared. Eg, if todays date is 2015-02-07 and it last appeared in rawdata on date 2015-02-04, then do not update this field (as it appeared in rawdata only 3 days ago.
3. ttl column: the in days between 1 and 2 above.

When this table is updated, it should update records with any new information. For example, it is likely that data like image and price may be updated during the course of the property listing, so these would need updating.

*** Part 3
This part of the project involves identifying duplicates BETWEEN sources.

Create "table3" and populate it with only unique properties which are in table2. For example, the same property may be in multiple sources (so will have multiple rows in table2. We want table3 to only have 1 row per property (but creating some reference so it can be determined what properties have been grouped together).

To do this, I think a process along the lines of identifying likely duplicates based on price and beds and the start of the coordinates.
select * from rawdata where price = 750 and beds = 4 and latitude like '56.16%' and longitude like '-3.15%'

Then a series of conditionals could look at the parsing parts of the "agent" and "desciption" columns to see if they match. if one of the two matches, then we could make the assumption that we're confident they are the property.

Importantly, we need to log why we matched and, if we didn't match, what stage did we decide not to match (so we can improve hte logic in future) .

Table3 needs to have extra columns so that the correct address can be input (house, street, town, city, county, postcode)

Note: one set of coordinates will be slightly more accurate than any other source. Choose the most accurate coordinates (based on most decimal places)

*** Part 4
EDIT: changed from original.
Script which will query an API (to be selected - will take your recommendations) using the coordinates and fill in the address details (see above)

The crawlsourceid COULD be shared between source2 and source1 . They both number sequentially. So any queries should include both crawlsourceid and crawlsource otherwise problems will occur.

New Proposal

Create an account now and send a proposal now to get this job.

Sign up

Clarification Board Ask a Question

    There are no clarification messages.