Question

First of all, I am new to R.

My data frame is:

df <-

column-1  column-2 column-3 column-4

vf34       bn56     qw34    mn569
vf34       cv34             mn569
           bn56     qw34    asder45
nght       cv34             asder45
vf34       cv34             mn569

Now I want to compute a similarity matrix like this:

Output1:
          vf34  nght  bn56  cv34  qw34   mn569  asder45     
vf34      0     0     1     2     1      3      0
nght      0     0     0     1     0      0      1
bn56      1     0     0     0     2      1      1
cv34      2     1     0     0     0      2      1
qw34      1     0     2     0     0      1      1
mn569     3     0     1     2     1      0      0
asder45   0     1     1     1     1      0      0

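So basically it should find all possible pairs in the data frame (or csv file) and build a matrix holding the number of times each pair occurs.

For example: the first row, sixth column is 3. This means the combination vf34 and mn569 occurs 3 times across the whole data.

Blank values in the data mean that the original data itself is missing there.

I can do this in Python using CountVectorizer and then multiplying the resulting matrix by its transpose. But I am new to R. Can someone help me with this?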

The Output2 that I need is:

1  1 3 2 1 0
and so on for all 5 rows.

Here 1; 1; 3; 2; 1; 0 are the numbers of times the combinations
(vf34 and bn56); (vf34 and qw34); (vf34 and mn569); (bn56 and qw34); (bn56 and mn569);
(qw34 and mn569) have occurred.
These values can be read off Output1 given above.

I need these values for all five rows. How can I do this?

Answer 1

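Here is one way to get the expected result. The workflow is:

Get the unique elements from the "dataset" (unique(unlist(df)))
Remove the empty strings ('')
Create the combinations of columns (combn(1:..))
split the "indx" by the columns of "indx"
Subset the "df" (df[x])
Remove the rows with empty strings
Change the "character" columns to "factor" class with levels as "Un1"
Use table and sum (+) the list elements to get the frequency.

The result (res) and the transpose of the result are then summed, so that the elements below and above the diagonal are the same.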

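# All distinct values in the data, without the empty string
Un <- unique(unlist(df))
Un1 <- Un[Un != '']
# Every pair of column indices, one pair per column of indx
indx <- combn(1:ncol(df), 2)
res <- Reduce(`+`, lapply(split(indx, col(indx)), function(x) {
    x1 <- df[x]                                  # the two columns of this pair
    x2 <- x1[!(x1[,1] == '' | x1[,2] == ''), ]   # drop rows with a blank in either
    x2[] <- lapply(x2, factor, levels = Un1)     # common levels give a 7 x 7 table
    table(x2)
}))

# Add the transpose so the matrix is symmetric
res1 <- res + t(res)
res1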
#           column.2
#column.1  vf34 nght bn56 cv34 qw34 mn569 asder45
# vf34       0    0    1    2    1     3       0
# nght       0    0    0    1    0     0       1
# bn56       1    0    0    0    2     1       1
# cv34       2    1    0    0    0     2       1
# qw34       1    0    2    0    0     1       1
# mn569      3    0    1    2    1     0       0
# asder45    0    1    1    1    1     0       0
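
Since the question mentions the CountVectorizer-then-transpose route in Python, here is a rough sketch of the same idea in R. It is an alternative, not the method used above: it builds a binary row-by-value incidence matrix and multiplies it by its transpose with crossprod. The names vals, rows, inc and co are made up for this illustration, and the data is copied from the "Data" section. It assumes a value does not appear twice in the same row; in that case it should agree with res1.

```r
# Example data, copied from the "Data" section of this answer
df <- data.frame(
  column.1 = c("vf34", "vf34", "", "nght", "vf34"),
  column.2 = c("bn56", "cv34", "bn56", "cv34", "cv34"),
  column.3 = c("qw34", "", "qw34", "", ""),
  column.4 = c("mn569", "mn569", "asder45", "asder45", "mn569"),
  stringsAsFactors = FALSE
)

vals <- unlist(df, use.names = FALSE)               # all cells, column by column
rows <- rep(seq_len(nrow(df)), times = ncol(df))    # row number of each cell
keep <- vals != ""                                  # ignore the blank cells
levs <- unique(vals[keep])                          # distinct values, in order of appearance

# Row-by-value incidence matrix: 1 if the value occurs in that row, else 0
inc <- unclass(table(rows[keep], factor(vals[keep], levels = levs)))
inc <- (inc > 0) * 1

# t(inc) %*% inc counts, for every pair of values, the rows containing both
co <- crossprod(inc)
diag(co) <- 0    # a value paired with itself is not of interest here
co
```

The diagonal of the raw product holds how many rows each value occurs in, which is why it is zeroed to match the layout of Output1.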

Update

Regarding "output2", it is not entirely clear, because the values do not match your expected result (possibly a typo?).

lapply(seq_len(nrow(df)), function(i) {
    x1 <- unlist(df[i,])
    x2 <- x1[x1 != '']
    i1 <- combn(x2, 2)
    diag(res1[i1[1,], i1[2,]])
})
#[[1]]
#[1] 1 1 3 2 1 1

#[[2]]
#[1] 2 3 2

#[[3]]
#[1] 2 1 1

#[[4]]
#[1] 1 1 1

#[[5]]
#[1] 2 3 2
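
To make it easier to see which pair each count belongs to, here is a possible variant that labels the values. It is only a sketch: pair_counts and the "a-b" label format are invented for this example, and res1 is rebuilt inside the block so it runs on its own. It looks the counts up with a two-column character matrix instead of diag(). The reason is that, as far as I can tell, diag() on a single number builds an identity matrix, so a row with exactly two non-blank values would not give a single count.

```r
# Example data, copied from the "Data" section of this answer
df <- data.frame(
  column.1 = c("vf34", "vf34", "", "nght", "vf34"),
  column.2 = c("bn56", "cv34", "bn56", "cv34", "cv34"),
  column.3 = c("qw34", "", "qw34", "", ""),
  column.4 = c("mn569", "mn569", "asder45", "asder45", "mn569"),
  stringsAsFactors = FALSE
)

# Rebuild the symmetric co-occurrence matrix as in the first part of the answer
Un1 <- setdiff(unique(unlist(df)), "")
indx <- combn(ncol(df), 2)
res <- Reduce(`+`, lapply(seq_len(ncol(indx)), function(j) {
  x2 <- df[indx[, j]]
  x2 <- x2[x2[[1]] != "" & x2[[2]] != "", ]
  x2[] <- lapply(x2, factor, levels = Un1)
  unclass(table(x2))
}))
res1 <- res + t(res)

# For each row, look up the count of every pair of its non-blank values
pair_counts <- lapply(seq_len(nrow(df)), function(i) {
  x <- unlist(df[i, ], use.names = FALSE)
  x <- x[x != ""]
  if (length(x) < 2) return(integer(0))   # no pair can be formed
  p <- combn(x, 2)                        # one pair per column
  out <- res1[cbind(p[1, ], p[2, ])]      # index by (row name, column name)
  names(out) <- paste(p[1, ], p[2, ], sep = "-")
  out
})
pair_counts[[1]]
```

With the names attached, the last value of the first row can be checked against Output1 directly: it is the qw34 and mn569 entry.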

Data

df <- structure(list(column.1 = c("vf34", "vf34", "", "nght", "vf34"
), column.2 = c("bn56", "cv34", "bn56", "cv34", "cv34"), column.3 = c("qw34", 
"", "qw34", "", ""), column.4 = c("mn569", "mn569", "asder45", 
"asder45", "mn569")), .Names = c("column.1", "column.2", "column.3", 
"column.4"), class = "data.frame", row.names = c(NA, -5L))
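
If the data lives in a csv file instead, something along these lines should give the same data frame. This is a sketch under assumptions: the file layout (comma-separated, header row, empty fields for the missing values) is a guess, since the question does not show the actual file, and csv_text stands in for it so the example is self-contained. With a real file you would pass its path to read.csv instead of text =.

```r
# Hypothetical csv content; the real file layout is not shown in the question
csv_text <- "column-1,column-2,column-3,column-4
vf34,bn56,qw34,mn569
vf34,cv34,,mn569
,bn56,qw34,asder45
nght,cv34,,asder45
vf34,cv34,,mn569"

# Empty fields in character columns are read as "" (not NA), which is what
# the code above expects; stringsAsFactors = FALSE keeps the columns as character
df <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# read.csv turns "column-1" into the syntactic name "column.1"
names(df)
df
```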

Computing the similarity between the elements of a csv file in R
