Question

My problem is simple to state, but I think it is hard to solve. I just want to match two store names:

library(stringdist)

a=c("Adidas", "macys","apple store", "VANS Store New York", "new wave","adsasds") 

distances = stringdist("ADIDAS STORE", a, method = 'jw')
res = data.frame(distances, a)
View(res)

Result set:

0.4853801 VANS Store New York
0.5833333 Adidas
0.5972222 new wave
0.6085859 apple store
1.0000000 macys
1.0000000 adsasds

I tried other methods and got different results, but none were better. I just want to detect that "ADIDAS STORE" is the same place as "Adidas".

Can anyone help me?

Thanks.

Answer 1

To choose the best distance, you can use this code to find which stringdist method works best on your example.

A function to apply for each stringdist method:

library(stringdist)
string_dist<-function(m,a=c("Adidas", "macys","apple store", "VANS Store New York", "new wave","adsasds") ){
  out<-stringdist(a="ADIDAS STORE",b = toupper(a),method=m)
  return(out)  
}
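Note that the function above uppercases the candidates with toupper() before comparing, which the code in the question does not do. A minimal sketch of the effect on a single pair, assuming the stringdist package is installed (the variable names here are only illustrative):

```r
library(stringdist)

target <- "ADIDAS STORE"
candidate <- "Adidas"

# Raw comparison: the metric is case-sensitive, so only the leading "A" matches
d_raw <- stringdist(target, candidate, method = "jw")

# Same comparison after uppercasing the candidate, as string_dist() does
d_upper <- stringdist(target, toupper(candidate), method = "jw")

print(c(raw = d_raw, upper = d_upper))
```

The raw value is the 0.5833333 reported for Adidas in the question; the uppercased one is the 0.1666667 that appears in the jw column of the output further down.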

Apply it to your data:

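# method names, taken from the column headers of the output below
methods<-c("osa","lv","dl","hamming","lcs","qgram","cosine","jaccard","jw","soundex")
out<-lapply(methods,FUN = string_dist)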
db<-as.data.frame(matrix(unlist(out), nrow=length(unlist(out[1]))))
colnames(db)<-paste("ADIDAS STORE",methods,sep="-")
store_name<-c("Adidas", "macys","apple store", "VANS Store New York", "new wave","adsasds")

Now you can choose the best method (I suggest cosine or jw):

db<-cbind(store_name,db)
db
store_name ADIDAS STORE-osa ADIDAS STORE-lv ADIDAS STORE-dl ADIDAS STORE-hamming ADIDAS STORE-lcs ADIDAS STORE-qgram
1              Adidas                6               6               6                  Inf                6                  6
2               macys               10              10              10                  Inf               13                 13
3         apple store                5               5               5                  Inf                9                  9
4 VANS Store New York               14              14              14                  Inf               15                 15
5            new wave               10              10              10                  Inf               16                 14
6             adsasds                7               7               7                  Inf                9                  7
  ADIDAS STORE-cosine ADIDAS STORE-jaccard ADIDAS STORE-jw ADIDAS STORE-soundex
1           0.1801084            0.5555556       0.1666667                    0
2           0.5783630            0.8333333       0.4777778                    1
3           0.3914194            0.3636364       0.3077201                    1
4           0.3625447            0.5000000       0.3040936                    1
5           0.6597931            0.7500000       0.5694444                    1
6           0.1996733            0.6666667       0.2698413                    1
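Rather than reading the table by eye, you could pull out the closest store for each method. This is only a sketch, assuming the stringdist package is installed; it uses three of the methods for brevity, and the names target, stores and best are illustrative:

```r
library(stringdist)

target <- "ADIDAS STORE"
stores <- c("Adidas", "macys", "apple store", "VANS Store New York",
            "new wave", "adsasds")
methods <- c("lv", "cosine", "jw")

# For each method, return the store whose uppercased name is closest to the target
best <- sapply(methods, function(m) {
  d <- stringdist(target, toupper(stores), method = m)
  stores[which.min(d)]
})

print(best)
```

This agrees with the table above: under lv the smallest distance (5) belongs to apple store, while cosine and jw both have their minimum at Adidas, which is why those two are the ones suggested here.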
