MySQL data matching: a better option?

Time: 2017-09-18 13:46:37

Tags: mysql record-linkage nosql

I have customers and leads coming from different sources, and I need to figure out whether a customer is already registered as a lead.

I use 12 fields for matching:

address1_clear
address2_clear
address_clear
contact_name_clear
email
invoice_mobile
invoice_phone
mobile
name_clear
phone
phone2
taxnum

The _clear suffix means the value has been lowercased and stripped of whitespace and punctuation.
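The exact rules used to produce the _clear columns are not given, so the following is only a minimal sketch of one plausible normalization. The function name `clear` is hypothetical, and treating "punctuation" as "anything that is not a letter or digit" is an assumption.

```python
def clear(value):
    """Lowercase a string and drop everything that is not a letter or digit.

    Assumption: "no whitespace and punctuation" is approximated by keeping
    only alphanumeric characters. None is passed through unchanged.
    """
    if value is None:
        return None
    return "".join(ch for ch in value.lower() if ch.isalnum())


example = clear("12 Main St., Apt #4")
```

Normalizing both sides the same way before comparing is what makes the exact-equality checks on the _clear fields meaningful.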

  • leads - 300k records
  • customers - 500k records
  • customers_leads - 460k records

This is the query that performs the matching:

SELECT l.id as lead_id, c.id as customer_id FROM lead l
INNER JOIN sync_settings s ON s.account_id = l.account_id
INNER JOIN customers c ON c.setting_id = s.id
LEFT JOIN customers_leads cl ON cl.customer_id = c.id AND cl.lead_id = l.id
WHERE cl.lead_id IS NULL AND
(
    (l.phone IS NOT NULL AND l.phone IN (c.phone, c.phone2, c.invoice_phone, c.invoice_mobile)) OR
    (l.mobile IS NOT NULL AND l.mobile != "" AND l.mobile IN (c.phone, c.phone2, c.invoice_phone, c.invoice_mobile)) OR
    (l.invoice_phone IS NOT NULL AND l.invoice_phone != "" AND l.invoice_phone IN (c.phone, c.phone2, c.invoice_phone, c.invoice_mobile)) OR
    (l.invoice_mobile IS NOT NULL AND l.invoice_mobile != "" AND l.invoice_mobile IN (c.phone, c.phone2, c.invoice_phone, c.invoice_mobile)) OR
    (l.email IS NOT NULL AND l.email != "" AND l.email = c.email) OR
    (l.taxnum IS NOT NULL AND l.taxnum != "" AND l.taxnum = c.taxnum) OR
    (l.contact_name_clear IS NOT NULL AND l.contact_name_clear != "" AND l.contact_name_clear = c.contact_name_clear) OR
    (l.address1_clear IS NOT NULL AND l.address1_clear != "" AND l.address1_clear = c.address_clear) OR
    (l.address2_clear IS NOT NULL AND l.address2_clear != "" AND l.address2_clear = c.address_clear) OR
    (l.name_clear IS NOT NULL AND l.name_clear != "" AND l.name_clear IN (c.contact_name_clear, c.name_clear))
)

It is very heavy, with a response time of about 4 minutes. Because of the ORs and the extra conditions, indexes do not help much.

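I'd like to know: is there a better way? Maybe using some NoSQL database to basically build a giant hash table, or some data matching technique that I failed to google?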
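For illustration, the "giant hash table" idea could look roughly like the sketch below: every non-empty customer value is indexed under an (account, kind, value) key, and each lead value is then looked up once. This is a sketch under assumptions, not a tested replacement for the query. The names `CUSTOMER_KEYS`, `LEAD_KEYS`, `build_index` and `match` are hypothetical, rows are assumed to be dicts, and each customer is assumed to already carry the `account_id` that the query resolves through sync_settings. Comparison here is exact, whereas MySQL's result also depends on the column collation.

```python
from collections import defaultdict

# Customer field -> key kinds it is indexed under (mirrors the query's comparisons).
CUSTOMER_KEYS = {
    "phone": ["phone"], "phone2": ["phone"],
    "invoice_phone": ["phone"], "invoice_mobile": ["phone"],
    "email": ["email"], "taxnum": ["taxnum"],
    "contact_name_clear": ["contact", "name"],
    "name_clear": ["name"],
    "address_clear": ["address"],
}
# Lead field -> key kind it probes.
LEAD_KEYS = {
    "phone": "phone", "mobile": "phone",
    "invoice_phone": "phone", "invoice_mobile": "phone",
    "email": "email", "taxnum": "taxnum",
    "contact_name_clear": "contact", "name_clear": "name",
    "address1_clear": "address", "address2_clear": "address",
}


def build_index(customers):
    """Map (account_id, kind, value) -> set of customer ids."""
    index = defaultdict(set)
    for c in customers:
        for field, kinds in CUSTOMER_KEYS.items():
            value = c.get(field)
            if value:  # skips both None and ""
                for kind in kinds:
                    index[(c["account_id"], kind, value)].add(c["id"])
    return index


def match(leads, index, already_linked=frozenset()):
    """Return new (lead_id, customer_id) pairs sharing at least one key."""
    pairs = set()
    for lead in leads:
        for field, kind in LEAD_KEYS.items():
            value = lead.get(field)
            if not value:
                continue
            key = (lead["account_id"], kind, value)
            for customer_id in index.get(key, ()):
                if (lead["id"], customer_id) not in already_linked:
                    pairs.add((lead["id"], customer_id))
    return pairs


# Made-up sample data, for illustration only.
customers = [
    {"id": 1, "account_id": 7, "phone2": "5550100", "email": "a@example.com"},
    {"id": 2, "account_id": 7, "contact_name_clear": "janedoe"},
    {"id": 3, "account_id": 8, "phone": "5550100"},
]
leads = [
    {"id": 10, "account_id": 7, "mobile": "5550100"},
    {"id": 11, "account_id": 7, "name_clear": "janedoe", "email": ""},
]
pairs = match(leads, build_index(customers))
```

Each lead costs a handful of dictionary lookups instead of a comparison against every customer of the account, which is the whole point of the approach. The trade-off is that the index has to fit in memory or live in an external key-value store.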

P.S. I know I could build a separate table just for the matching fields and it would be faster, but I'd still like to know my alternatives.
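A rough sketch of what such a separate matching table might look like, using Python's built-in sqlite3 module as a stand-in for MySQL so the example is self-contained. The table names `customer_keys` and `lead_keys`, their columns, and the sample rows are all hypothetical. The idea is to store one row per (kind, value) so that the OR chain becomes a single equality join that an index can serve.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_keys (account_id INT, kind TEXT, value TEXT, customer_id INT);
CREATE TABLE lead_keys     (account_id INT, kind TEXT, value TEXT, lead_id INT);
CREATE TABLE customers_leads (customer_id INT, lead_id INT);
CREATE INDEX customer_keys_lookup ON customer_keys (account_id, kind, value);
""")

# One row per non-empty matching value; all phone-like fields share kind 'phone'.
conn.executemany("INSERT INTO customer_keys VALUES (?, ?, ?, ?)", [
    (7, "phone", "5550100", 1),
    (7, "email", "a@example.com", 1),
    (7, "name", "janedoe", 2),
    (8, "phone", "5550100", 3),   # different account, must not match
])
conn.executemany("INSERT INTO lead_keys VALUES (?, ?, ?, ?)", [
    (7, "phone", "5550100", 10),
    (7, "email", "a@example.com", 10),
    (7, "name", "janedoe", 11),
])
conn.execute("INSERT INTO customers_leads VALUES (2, 11)")  # already linked

rows = conn.execute("""
    SELECT DISTINCT lk.lead_id, ck.customer_id
    FROM lead_keys lk
    JOIN customer_keys ck
      ON ck.account_id = lk.account_id
     AND ck.kind = lk.kind
     AND ck.value = lk.value
    LEFT JOIN customers_leads cl
      ON cl.customer_id = ck.customer_id AND cl.lead_id = lk.lead_id
    WHERE cl.lead_id IS NULL
    ORDER BY lk.lead_id, ck.customer_id
""").fetchall()
```

DISTINCT is needed because a pair can match on several keys at once, as lead 10 and customer 1 do here on both phone and email. The cost of this layout is keeping the key tables in sync whenever a lead or customer changes.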

2 Answers:

Answer 0: (score: 1)

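Another open-source project to consider is recordlinkage (Python Record Linkage Toolkit). The project's documentation includes an overview of the record linkage process, code examples for beginners, and API documentation.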

Answer 1: (score: 0)

The problem you are facing is called record linkage, and there is no database solution that solves it natively.

There are a number of open-source projects you can use, including Duke and dedupe (I am the main author of dedupe).