python-dedupe - How should I interpret the deduplicated database? - Thinbug

How should I interpret the deduplicated database?

Time: 2018-02-07 10:47:15

Tags: python-dedupe

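Two questions:

How should I interpret the 'confidence score' when a cluster has 3 rows and 3 confidence scores (0.98, 0.45, 0.45)? Where does this confidence score come from: the logistic regression, or somehow from the hierarchical clustering?
10,000 of my 16 million records are labeled as duplicates. Should I use all of them as training data, or are 10 positives and 10 negatives enough? What number would work better for quality and execution time?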

1 answer:

Answer 0: (score: 1)

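A record's confidence score is computed against the other records in its cluster: it is 1 - (square root of the average squared distance), where the distance for a pair of records is 1 - (predicted probability that the pair is coreferent). Each record in a cluster therefore gets its own score, which is why a 3-row cluster can show three different values.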
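As an illustration of the formula above, here is a minimal sketch in plain Python. It is not dedupe's own code: the function name `confidence_scores`, the `pair_probs` dictionary layout, and the probabilities in the example are all hypothetical, chosen only to show how one record that matches the others well ends up with a higher score than the rest.

```python
import math


def confidence_scores(pair_probs, n_records):
    """Per-record confidence for one cluster (illustrative sketch only).

    pair_probs: hypothetical dict mapping (i, j) with i < j to the predicted
                probability that records i and j are coreferent.
    n_records:  number of records in the cluster (must be at least 2).
    """
    scores = []
    for i in range(n_records):
        squared_distances = []
        for j in range(n_records):
            if i == j:
                continue
            prob = pair_probs[(min(i, j), max(i, j))]
            distance = 1.0 - prob  # distance = 1 - predicted probability
            squared_distances.append(distance ** 2)
        mean_sq = sum(squared_distances) / len(squared_distances)
        # confidence = 1 - sqrt(average squared distance to the other records)
        scores.append(1.0 - math.sqrt(mean_sq))
    return scores


# Made-up pairwise probabilities for a 3-record cluster.
example_probs = {(0, 1): 0.9, (0, 2): 0.9, (1, 2): 0.5}
scores = confidence_scores(example_probs, 3)
print(scores)
```

In this made-up cluster, record 0 is a strong match for both other records and scores highest, while records 1 and 2 match each other only weakly and share a lower score.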

For more details, see https://docs.dedupe.io/en/latest/API-documentation.html#dedupe.Dedupe.cluster