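I want to compute the similarity between the rows of a table (similarity being a numerical measure of how alike two data objects are; in this case, how alike two rows are). The table looks like this: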
vhigh,vhigh,2,2,small,low,unacc
vhigh,vhigh,2,2,small,med,unacc
vhigh,vhigh,2,2,small,high,unacc
vhigh,vhigh,2,2,med,low,unacc
vhigh,vhigh,2,2,med,med,unacc
vhigh,vhigh,2,2,med,high,unacc
vhigh,vhigh,2,2,big,low,unacc
vhigh,vhigh,2,2,big,med,unacc
vhigh,vhigh,2,2,big,high,unacc
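I have tried many different approaches found online, but most of them compute the similarity between matrices. We can easily see that the first and second rows are "most similar", since they differ in only one variable, but I need a way to compare every row of the table in one go.
The result might look like: the similarity between the first and second rows is 0.983.
Answer 0: (score: 0)
This essentially amounts to computing the proportion of identical elements. First, I create the data frame: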
# Create data frame
data <- read.table(text = "vhigh,vhigh,2,2,small,low,unacc
vhigh,vhigh,2,2,small,med,unacc
vhigh,vhigh,2,2,small,high,unacc
vhigh,vhigh,2,2,med,low,unacc
vhigh,vhigh,2,2,med,med,unacc
vhigh,vhigh,2,2,med,high,unacc
vhigh,vhigh,2,2,big,low,unacc
vhigh,vhigh,2,2,big,med,unacc
vhigh,vhigh,2,2,big,high,unacc", sep = ",")
Next, I load dplyr.
# Load dplyr library
library(dplyr)
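This is the function that does all the work. It returns the fraction of columns in which rows x and y of df hold the same value.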
# Function for comparing rows
row_cf <- function(x, y, df) {
  sum(df[x, ] == df[y, ]) / ncol(df)
}
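As a quick sanity check, the function can be called on a couple of row pairs by hand. The sketch below is self-contained (it repeats the data and the function) and the names sim_1_2 and sim_3_4 are only illustrative. Rows 1 and 2 differ in one of seven columns, so the result should be 6/7; rows 3 and 4 differ in two columns, so the result should be 5/7.

```r
# Same nine rows as above; no header, comma-separated.
data <- read.table(text = "vhigh,vhigh,2,2,small,low,unacc
vhigh,vhigh,2,2,small,med,unacc
vhigh,vhigh,2,2,small,high,unacc
vhigh,vhigh,2,2,med,low,unacc
vhigh,vhigh,2,2,med,med,unacc
vhigh,vhigh,2,2,med,high,unacc
vhigh,vhigh,2,2,big,low,unacc
vhigh,vhigh,2,2,big,med,unacc
vhigh,vhigh,2,2,big,high,unacc", sep = ",")

# Fraction of columns in which rows x and y agree.
row_cf <- function(x, y, df) {
  sum(df[x, ] == df[y, ]) / ncol(df)
}

sim_1_2 <- row_cf(1, 2, data)  # rows differ only in column 6
sim_3_4 <- row_cf(3, 4, data)  # rows differ in columns 5 and 6

print(c(sim_1_2, sim_3_4))
```

Note that this measure treats every column equally and only checks for exact matches, so "low" vs "med" counts as just as different as "low" vs "high".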
Here it is applied to every pair of rows.
# 1) Create all possible row combinations
# 2) Rename the columns for readability
# 3) Run through each row
# 4) Calculate similarity
res <- expand.grid(1:nrow(data), 1:nrow(data)) %>%
  rename(row_1 = Var1, row_2 = Var2) %>%
  rowwise() %>%
  mutate(similarity = row_cf(row_1, row_2, data))
# Results
# row_1 row_2 similarity
# 1 1 1 1.0000000
# 2 2 1 0.8571429
# 3 3 1 0.8571429
# 4 4 1 0.8571429
# 5 5 1 0.7142857
# 6 6 1 0.7142857
# 7 7 1 0.8571429
# 8 8 1 0.7142857
# 9 9 1 0.7142857
# 10 1 2 0.8571429
# 11 2 2 1.0000000
# 12 3 2 0.8571429
# 13 4 2 0.7142857
# 14 5 2 0.8571429
# 15 6 2 0.7142857
# 16 7 2 0.7142857
# 17 8 2 0.8571429
# 18 9 2 0.7142857
# 19 1 3 0.8571429
# 20 2 3 0.8571429
# 21 3 3 1.0000000
# 22 4 3 0.7142857
# 23 5 3 0.7142857
# 24 6 3 0.8571429
# 25 7 3 0.7142857
# 26 8 3 0.7142857
# 27 9 3 0.8571429
# 28 1 4 0.8571429
# 29 2 4 0.7142857
# 30 3 4 0.7142857
# 31 4 4 1.0000000
# 32 5 4 0.8571429
# 33 6 4 0.8571429
# 34 7 4 0.8571429
# 35 8 4 0.7142857
# 36 9 4 0.7142857
# 37 1 5 0.7142857
# 38 2 5 0.8571429
# 39 3 5 0.7142857
# 40 4 5 0.8571429
# 41 5 5 1.0000000
# 42 6 5 0.8571429
# 43 7 5 0.7142857
# 44 8 5 0.8571429
# 45 9 5 0.7142857
# 46 1 6 0.7142857
# 47 2 6 0.7142857
# 48 3 6 0.8571429
# 49 4 6 0.8571429
# 50 5 6 0.8571429
# 51 6 6 1.0000000
# 52 7 6 0.7142857
# 53 8 6 0.7142857
# 54 9 6 0.8571429
# 55 1 7 0.8571429
# 56 2 7 0.7142857
# 57 3 7 0.7142857
# 58 4 7 0.8571429
# 59 5 7 0.7142857
# 60 6 7 0.7142857
# 61 7 7 1.0000000
# 62 8 7 0.8571429
# 63 9 7 0.8571429
# 64 1 8 0.7142857
# 65 2 8 0.8571429
# 66 3 8 0.7142857
# 67 4 8 0.7142857
# 68 5 8 0.8571429
# 69 6 8 0.7142857
# 70 7 8 0.8571429
# 71 8 8 1.0000000
# 72 9 8 0.8571429
# 73 1 9 0.7142857
# 74 2 9 0.7142857
# 75 3 9 0.8571429
# 76 4 9 0.7142857
# 77 5 9 0.7142857
# 78 6 9 0.8571429
# 79 7 9 0.8571429
# 80 8 9 0.8571429
# 81 9 9 1.0000000
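If a matrix is more convenient than a long table, the same proportion-of-matches measure can be sketched in base R alone. This is only one possible variant, not part of the approach above: the names sim, idx and pairs are made up for illustration, and it uses outer() with Vectorize() in place of the dplyr pipeline. Because the measure is symmetric and every row matches itself perfectly, the sketch also pulls out just the 36 unique pairs from the upper triangle.

```r
# Same nine rows as above; no header, comma-separated.
data <- read.table(text = "vhigh,vhigh,2,2,small,low,unacc
vhigh,vhigh,2,2,small,med,unacc
vhigh,vhigh,2,2,small,high,unacc
vhigh,vhigh,2,2,med,low,unacc
vhigh,vhigh,2,2,med,med,unacc
vhigh,vhigh,2,2,med,high,unacc
vhigh,vhigh,2,2,big,low,unacc
vhigh,vhigh,2,2,big,med,unacc
vhigh,vhigh,2,2,big,high,unacc", sep = ",")

n <- nrow(data)

# n x n matrix: entry [i, j] is the fraction of columns in which
# rows i and j agree. Vectorize() lets outer() call the scalar
# comparison once per (i, j) pair.
sim <- outer(seq_len(n), seq_len(n),
             Vectorize(function(i, j) mean(data[i, ] == data[j, ])))

# Keep each unordered pair once (upper triangle, diagonal excluded).
idx <- which(upper.tri(sim), arr.ind = TRUE)
pairs <- data.frame(row_1 = idx[, "row"],
                    row_2 = idx[, "col"],
                    similarity = sim[idx])

print(round(sim, 3))
print(head(pairs))
```

Like the expand.grid() version, this still makes one comparison per pair of rows, so the cost grows quadratically with the number of rows; for large tables a vectorized or package-based distance computation would likely be preferable.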