Question

I have a table with several thousand rows.

Sample data:

user_id  ZIP      City    email
105      100051   Lond.   jsmith@hotmail.com
382      251574           jgjefferson@gmail.com
225      0100051  London  john.smith@hotmail.com

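I need to compare each user against every other user so that I can tell which users are similar.

In the example given, users 105 and 225 are nearly identical, so the expected result is a new_id column that assigns both of them the same id, like this: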

user_id  ZIP      City    email                   new_id
105      100051   Lond.   jsmith@hotmail.com      105
382      251574           jgjefferson@gmail.com   382
225      0100051  London  john.smith@hotmail.com  105

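How can I compare each field against the same field of the other users, and which technique should I use for the comparison (clustering, for example)?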

Answer 1

Your emails:

email <- c("jsmith@hotmail.com", "jgjefferson@gmail.com", "john.smith@hotmail.com")

Distances between the emails:

library(stringdist)
dist <- stringdistmatrix(email, email, method = "jw")
diag(dist) <- 1  # ignore each email's zero distance to itself

Closest email for each email:

cbind(email,email_near=email[apply(dist, 1, which.min)],dist=apply(dist, 1, FUN=min))

     email                    email_near               dist               
[1,] "jsmith@hotmail.com"     "john.smith@hotmail.com" "0.208754208754209"
[2,] "jgjefferson@gmail.com"  "jsmith@hotmail.com"     "0.281746031746032"
[3,] "john.smith@hotmail.com" "jsmith@hotmail.com"     "0.208754208754209"

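After that, I suggest applying a threshold to dist to decide which emails count as close enough, and then computing new_id from those matches.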
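One possible way to go from the distances to new_id is sketched below. It is not necessarily what the answer had in mind: the 0.25 cutoff, the use of single-linkage clustering via hclust and cutree, and the rule "take the first user_id in each group" are all assumptions made for illustration, and the cutoff in particular would need tuning on real data.

```r
# Sketch only: the cutoff and the "first user_id in the group" rule are assumptions.
library(stringdist)

users <- data.frame(
  user_id = c(105, 382, 225),
  email   = c("jsmith@hotmail.com", "jgjefferson@gmail.com", "john.smith@hotmail.com"),
  stringsAsFactors = FALSE
)

# Pairwise distances with method = "jw", as in the answer above
d <- stringdistmatrix(users$email, users$email, method = "jw")

# Single-linkage clustering: two emails end up in the same group when a
# chain of pairs closer than the cutoff connects them
hc <- hclust(as.dist(d), method = "single")
threshold <- 0.25  # assumed cutoff, sits between 0.209 and 0.282 in the output above
grp <- cutree(hc, h = threshold)

# new_id: the first user_id that appears in each group
users$new_id <- ave(users$user_id, grp, FUN = function(x) x[1])
users
```

With the distances printed above (about 0.209 between the two hotmail addresses and at least 0.282 for the gmail address), this puts users 105 and 225 in one group with new_id 105 and leaves 382 on its own. The same idea extends to ZIP and City by computing one distance matrix per field and combining them before clustering.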
