I have two MySQL tables in a DB which aren't connected (they come from two different data sources), with around 500,000 rows in each. They share fields like Business Name, Business Address, City, and State, and each also has some specific information that I hope to populate a third table with once I've figured out how to 'link' the two together.
My initial thought was to try to link them together if the Business Name / City / Address were the same.
The issue is that, for a lot of rows, there isn't an exact match between the two tables. In one the business name might be 'XYZ' and in the other 'XYZ Incorporated'. The same goes for some addresses: it could be '123 NW 10th St.' or '123 Northwest Tenth Street'.
So now I'm trying to figure out a way to find a 'close enough' match, and that's where I'm lost. I've never done anything like that before and wouldn't know where to start. My thought was that I'd be able to set some kind of threshold where, if the match is close enough, I'd insert the two IDs into another table linking the two together for future processing.
(Databases are currently InnoDB / MySQL)
Thanks a lot for any help!
Answer 0 (score: 2)
I would suggest creating a View which combines the two tables. In order to find out which fields match most closely, I would recommend using either a "Levenshtein" distance, or something a bit smarter like "Jaro/Winkler". I went through something similar to this a while ago and I blogged about it.
http://dannykopping.com/blog/fuzzy-text-search-mysql-jaro-winkler
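To make the threshold idea concrete, here is a minimal sketch in Python, done outside the database rather than in a View. It is not the code from the blog post. The names `levenshtein`, `similarity`, and `link_candidates`, the `(id, name, city)` row shape, and the 0.8 threshold are all assumptions chosen for illustration. Comparing only rows whose city matches exactly is also an assumption, made here because scoring every pair across two 500,000-row tables would be far too slow.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale edit distance to 0.0 (nothing alike) .. 1.0 (identical)."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def link_candidates(rows_a, rows_b, threshold=0.8):
    """Return (id_a, id_b, score) for name pairs scoring at or above threshold.

    Each row is a hypothetical (id, name, city) tuple. Only rows with the
    same city (case-insensitive) are compared, to limit the number of pairs.
    """
    by_city = {}
    for id_b, name_b, city_b in rows_b:
        by_city.setdefault(city_b.strip().lower(), []).append((id_b, name_b))

    links = []
    for id_a, name_a, city_a in rows_a:
        for id_b, name_b in by_city.get(city_a.strip().lower(), []):
            score = similarity(name_a, name_b)
            if score >= threshold:
                links.append((id_a, id_b, score))
    return links
```

The tuples returned by `link_candidates` are what would be inserted into the linking table. Note that a plain edit-distance ratio scores 'XYZ' against 'XYZ Incorporated' quite low, because the long suffix counts as 13 edits out of 16 characters. That is one reason a prefix-weighted measure such as Jaro/Winkler may suit business names better, and why it may help to strip common suffixes or expand abbreviations before comparing.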