Question

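I have a data frame in R with 5 rows (records) and 3 attributes. Given a new record with the same 3 attributes, what is the best way to find which of the 5 rows it is most similar to, based on its contents (values)?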

Existing data

Age Occupation Nationality
23  Builder    German
29  Worker     British
45  Contractor Vietnamese
24  Engineer   German
28  Doctor     Indian

New data

23  Doctor German

Expected output

23  Builder    German

I want row 1 (the row shown above) to be returned, because two of its attributes match the new record.

Answer 1

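df <- data.frame(
  Age = c(23, 29, 45, 24, 28),
  Occupation = c("Builder", "Worker", "Contractor", "Engineer", "Doctor"),
  Nationality = c("German", "British", "Vietnamese", "German", "Indian"),
  stringsAsFactors = FALSE
)

# c() coerces every element to character, so 23 is stored as "23"
newdata <- c(23, "Doctor", "German")

# Count the matching fields in each row and return the row with the most matches
df[which.max(apply(df, 1, function(vec, dat) sum(vec == dat), newdata)), ]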

  Age Occupation Nationality
1  23    Builder      German

If there are ties, you can get all of the best matches:

# Number of matching fields for each row
detmatches <- apply(df, 1, function(vec, dat) sum(vec == dat), newdata)
# Keep every row that attains the maximum count
df[which(detmatches == max(detmatches)), ]
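
One caveat with apply() is that it first converts the data frame to a character matrix, so numeric columns are compared as formatted strings. A minimal sketch of a column-wise alternative follows, assuming newdata is stored as a list so that each value keeps its own type; the name matches is introduced here only for illustration.

```r
df <- data.frame(
  Age = c(23, 29, 45, 24, 28),
  Occupation = c("Builder", "Worker", "Contractor", "Engineer", "Doctor"),
  Nationality = c("German", "British", "Vietnamese", "German", "Indian"),
  stringsAsFactors = FALSE
)

# Assumption: the new record is a list, so 23 stays numeric
newdata <- list(Age = 23, Occupation = "Doctor", Nationality = "German")

# Compare each column with its corresponding new value (one logical vector
# per column), then add the vectors to get a match count per row
matches <- Reduce(`+`, Map(function(col, val) col == val, df, newdata))

# Keep every row that attains the maximum count (handles ties as well)
best <- df[matches == max(matches), ]
print(best)
```

Because the comparison is done per column, no coercion of the whole data frame to character takes place, and ties are returned in the same way as in the snippet above.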

Answer 2

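You can use stringdist from the stringdist package with method = "jaccard". Using Map, we compare the columns of df with the corresponding list elements of newdata: the Age column of df is passed to stringdist together with 23, Occupation together with Doctor, and so on. After applying the stringdist function, we get, for each list element, a numeric vector whose length equals nrow(df). We add (+) the corresponding values using Reduce and then find the minimum value with which.min (the output is the index of that row). This index is used to subset df.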
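
The steps described above can be sketched as follows. This is a minimal illustration rather than the answerer's exact code: it assumes the stringdist package is installed and that newdata is a list with one element per column of df. The names dists, total and closest are hypothetical.

```r
library(stringdist)

df <- data.frame(
  Age = c(23, 29, 45, 24, 28),
  Occupation = c("Builder", "Worker", "Contractor", "Engineer", "Doctor"),
  Nationality = c("German", "British", "Vietnamese", "German", "Indian"),
  stringsAsFactors = FALSE
)

# Assumption: one list element per column of df, in the same order
newdata <- list(23, "Doctor", "German")

# Jaccard distance between each column and its new value;
# every list element is a numeric vector of length nrow(df)
dists <- Map(
  function(col, val) {
    stringdist(as.character(col), as.character(val), method = "jaccard")
  },
  df, newdata
)

# Add the per-column distances to get one total distance per row
total <- Reduce(`+`, dists)

# which.min returns the position of the smallest total distance
closest <- df[which.min(total), ]
print(closest)
```

Unlike exact matching, this approach also rewards partial similarity between strings, so a near-miss spelling contributes a small distance instead of counting as a complete mismatch.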
