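I am trying to programmatically remove one of each pair of near-duplicate records from a dataset. My dataset is logically similar to the table below. As you can see, it contains two rows that a person can easily tell are related and were probably added by the same person.
My approach is to compare each field (name, address, phone number) separately using Levenshtein and compute a similarity ratio for each. I then average the ratios, which gives 0.77873. That similarity seems low for two rows that are so clearly related. My Python code looks like this: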
from Levenshtein import ratio
name = ratio("Game of ThOnes Books for selling","Selling Game of Thrones books")
address = ratio("George Washington street","George Washington st.")
phone = ratio("555-55-55","0(555)-55-55")
total_ratio = name+address+phone
print(total_ratio / 3)  # average ratio
My question is: what is the best way to compare two rows of data? Which algorithms or methods are needed to do this?
Answer 0: (score: 1)
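We can compute a distance matrix between the rows, form clusters, and select the cluster members as candidates for similar rows.
In R, the stringdistmatrix function from the stringdist package computes distances between string inputs.
The distance methods supported by stringdist are listed below. See the package manual for more details.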
#Method name; Description
#osa ; Optimal string alignment, (restricted Damerau-Levenshtein distance).
#lv ; Levenshtein distance (as in R's native adist).
#dl ; Full Damerau-Levenshtein distance.
#hamming ; Hamming distance (a and b must have same nr of characters).
#lcs ; Longest common substring distance.
#qgram ;q-gram distance.
#cosine ; cosine distance between q-gram profiles
#jaccard ; Jaccard distance between q-gram profiles
#jw ; Jaro, or Jaro-Winkler distance.
#soundex ; Distance based on soundex encoding (see below)
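For readers following along in Python, as in the question, two of the listed distances can be sketched in plain Python. This is only an illustration and not the stringdist implementation: the function names `levenshtein`, `qgrams`, `jaccard_distance`, and `distance_matrix` are made up for this example, and the Jaccard variant here works on sets of q-grams, which is an assumption about how one might approximate the q-gram profile comparison.

```python
def levenshtein(a, b):
    """Edit distance counting insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def qgrams(s, q=2):
    """Set of all substrings of length q (empty if s is shorter than q)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def jaccard_distance(a, b, q=2):
    """1 - |A & B| / |A | B| over the q-gram sets of a and b."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga and not gb:
        return 0.0
    return 1.0 - len(ga & gb) / len(ga | gb)


def distance_matrix(values, dist):
    """Full symmetric matrix of pairwise distances, as a list of lists."""
    return [[dist(x, y) for y in values] for x in values]


addresses = ["George Washington street", "George Washington st.", "Central Avenue"]
for row in distance_matrix(addresses, levenshtein):
    print(row)
```

Swapping `levenshtein` for `jaccard_distance` in the last call gives a matrix that is less sensitive to word order, which may matter for fields like the name column where the same words appear in a different sequence.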
Data:
library("stringdist")
#have modified the data slightly to include dissimilar datapoints
Date = c("07-Jan-17","06-Feb-17","03-Mar-17")
name = c("Game of ThOnes Books for selling","Selling Game of Thrones books","Harry Potter BlueRay")
address = c("George Washington street","George Washington st.","Central Avenue")
phone = c("555-55-55","0(555)-55-55","111-222-333")
DF = data.frame(Date,name,address,phone,stringsAsFactors=FALSE)
DF
#       Date                             name                  address        phone
#1 07-Jan-17 Game of ThOnes Books for selling George Washington street    555-55-55
#2 06-Feb-17    Selling Game of Thrones books    George Washington st. 0(555)-55-55
#3 03-Mar-17             Harry Potter BlueRay           Central Avenue  111-222-333
Hierarchical clustering:
rowLabels = sapply(DF[,"name"],function(x) paste0(head(unlist(strsplit(x," ")),2),collapse="_" ) )
#create string distance matrix, hierarchical cluster object and corresponding plot
nameDist = stringdistmatrix(DF[,"name"])
nameHC = hclust(nameDist)
plot(nameHC,labels = rowLabels ,main="HC plot : name")
addressDist = stringdistmatrix(DF[,"address"])
addressDistHC = hclust(addressDist)
plot(addressDistHC ,labels = rowLabels, main="HC plot : address")
phoneDist = stringdistmatrix(DF[,"phone"])
phoneHC = hclust(phoneDist)
plot(phoneHC ,labels = rowLabels, main="HC plot : phone" )
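Similar rows:
In this dataset the rows consistently form two clusters, whichever field is used. To identify the members of each cluster, we can do the following: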
clusterDF = data.frame(sapply(DF[,-1],function(x) cutree(hclust(stringdistmatrix(x)),2) ))
clusterDF$rowSummary = rowSums(clusterDF)
clusterDF
#  name address phone rowSummary
#1    1       1     1          3
#2    1       1     1          3
#3    2       2     2          6
#row frequency
rowFreq = table(clusterDF$rowSummary)
#3 6
#2 1
#we filter rows with frequency > 1
similarRowValues = as.numeric(names(which(rowFreq>1)))
#use %in% so the filter also works when more than one summary value repeats
DF[clusterDF$rowSummary %in% similarRowValues,]
#       Date                             name                  address        phone
#1 07-Jan-17 Game of ThOnes Books for selling George Washington street    555-55-55
#2 06-Feb-17    Selling Game of Thrones books    George Washington st. 0(555)-55-55
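This demo works for a simple toy dataset. On a real dataset you will have to adjust the string distance method, the number of clusters, and so on, but I hope this points you in the right direction.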
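Since the question was asked in Python, here is a rough standard-library sketch of the same idea. It is not a port of the R code above: it swaps hierarchical clustering with a fixed cluster count for a similarity threshold plus connected components, and it uses `difflib.SequenceMatcher` rather than Levenshtein. The helper names and the `0.5` threshold are assumptions chosen for this toy data and would need tuning on real records.

```python
from difflib import SequenceMatcher

# Toy rows mirroring the data above: (name, address, phone).
ROWS = [
    ("Game of ThOnes Books for selling", "George Washington street", "555-55-55"),
    ("Selling Game of Thrones books", "George Washington st.", "0(555)-55-55"),
    ("Harry Potter BlueRay", "Central Avenue", "111-222-333"),
]


def field_similarity(a, b):
    """Case-insensitive similarity in [0, 1] for one pair of field values."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def row_similarity(row_a, row_b):
    """Unweighted mean of the per-field similarities."""
    scores = [field_similarity(a, b) for a, b in zip(row_a, row_b)]
    return sum(scores) / len(scores)


def group_similar_rows(rows, threshold=0.5):
    """Link rows whose similarity reaches the threshold; return index groups."""
    parent = list(range(len(rows)))

    def find(i):
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            if row_similarity(rows[i], rows[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(rows)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


print(group_similar_rows(ROWS))  # [[0, 1], [2]]
```

Any group with more than one index is a candidate set of near duplicates. The pairwise loop is quadratic in the number of rows, so a larger dataset would likely need some form of blocking (for example, only comparing rows that share a normalized phone prefix) before this step.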