First of all, I am new to R.
My data frame is:
df <-
column-1  column-2  column-3  column-4
vf34      bn56      qw34      mn569
vf34      cv34                mn569
          bn56      qw34      asder45
nght      cv34                asder45
vf34      cv34                mn569
Now I want to compute the similarity matrix as:
Output1:
         vf34  nght  bn56  cv34  qw34  mn569  asder45
vf34        0     0     1     2     1      3        0
nght        0     0     0     1     0      0        1
bn56        1     0     0     0     2      1        1
cv34        2     1     0     0     0      2        1
qw34        1     0     2     0     0      1        1
mn569       3     0     1     2     1      0        0
asder45     0     1     1     1     1      0        0
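So basically it should find every possible pair in the data frame (or CSV file) and build a matrix holding the number of times each pair occurs.
For example, row 1, column 6 is 3, which means that the combination of vf34 and mn569 occurs 3 times in the whole data set.
Blank values in the data mean that the value is missing in the original data itself.
I can do this in Python by using CountVectorizer and then multiplying the resulting matrix by its transpose, but I am new to R. Can someone help me with this?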
The Output2 that I need is:
1 1 3 2 1 0
and so on for all 5 rows.
Here 1; 1; 3; 2; 1; 0 are the numbers of times that the combinations
(vf34 and bn56); (vf34 and qw34); (vf34 and mn569); (bn56 and qw34); (bn56 and mn569); (qw34 and mn569)
have occurred.
These values can be obtained from Output1 given above.
I need these values for all five rows. How can I do that?
Answer 0 (score: 3)
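Here is one way to get the expected result. The workflow is:
1. Get the unique values of the dataset (unique(unlist(df))).
2. Remove the '' elements.
3. Get the combinations of column indices taken two at a time (combn(1:..)).
4. split "indx" by column.
5. Subset the dataset with each "indx" (df[x]).
6. Drop the rows where either value is '' and convert both columns to factors with the same levels.
7. Get the frequency (table) and sum (+) the list elements.
The result (res) and the transpose of the result are then added together, so that the lower-triangle and upper-triangle elements are the same.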
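Un <- unique(unlist(df))         # every distinct value in the data
Un1 <- Un[Un != '']              # drop the blank placeholder
indx <- combn(1:ncol(df), 2)     # each column of indx is one pair of column numbers
res <- Reduce(`+`, lapply(split(indx, col(indx)), function(x) {
  x1 <- df[x]                                     # the two columns of this pair
  x2 <- x1[!(x1[, 1] == '' | x1[, 2] == ''), ]    # keep rows where both values are present
  x2[] <- lapply(x2, factor, levels = Un1)        # shared levels, so every table has the same shape
  tbl <- table(x2)                                # cross-tabulate the pair
}))
res1 <- res + t(res)             # make the matrix symmetric
res1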
# column.2
#column.1 vf34 nght bn56 cv34 qw34 mn569 asder45
# vf34 0 0 1 2 1 3 0
# nght 0 0 0 1 0 0 1
# bn56 1 0 0 0 2 1 1
# cv34 2 1 0 0 0 2 1
# qw34 1 0 2 0 0 1 1
# mn569 3 0 1 2 1 0 0
# asder45 0 1 1 1 1 0 0
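The question mentions building a count matrix and multiplying it by its transpose. A minimal sketch of that idea in base R is below. It assumes each value appears at most once per row (true for the sample data); the names terms, inc and cooc are only illustrative.

```r
# Same sample data as in this answer; "" marks a missing cell.
df <- data.frame(
  column.1 = c("vf34", "vf34", "", "nght", "vf34"),
  column.2 = c("bn56", "cv34", "bn56", "cv34", "cv34"),
  column.3 = c("qw34", "", "qw34", "", ""),
  column.4 = c("mn569", "mn569", "asder45", "asder45", "mn569"),
  stringsAsFactors = FALSE
)

# Distinct non-blank values, in order of first appearance (column by column).
terms <- setdiff(unique(unlist(df)), "")

# Row-by-term incidence matrix: inc[i, t] is 1 if term t occurs in row i.
inc <- sapply(terms, function(t) rowSums(df == t) > 0) * 1

# t(inc) %*% inc counts the rows in which two terms appear together.
cooc <- crossprod(inc)
diag(cooc) <- 0   # a term paired with itself is not a pair

cooc
```

For this data, cooc should hold the same counts as res1. If a value could repeat within a row, the two approaches would no longer be guaranteed to agree.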
As for "output2", it is not quite clear, because the values do not match your expected result (perhaps a typo?).
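lapply(seq_len(nrow(df)), function(i) {
  x1 <- unlist(df[i, ])            # values in row i
  x2 <- x1[x1 != '']               # drop the blanks
  i1 <- combn(x2, 2)               # all pairs of values in the row
  diag(res1[i1[1, ], i1[2, ]])     # look up the count of each pair in res1
})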
#[[1]]
#[1] 1 1 3 2 1 1
#[[2]]
#[1] 2 3 2
#[[3]]
#[1] 2 1 1
#[[4]]
#[1] 1 1 1
#[[5]]
#[1] 2 3 2
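The list above does not say which pair each count belongs to. A possible variant that labels every count and returns an empty vector for rows with fewer than two values is sketched below. pair_counts is a hypothetical helper, not part of any package, and the setup rebuilds the co-occurrence matrix in a compact way (assuming each value appears at most once per row) only so that the example runs on its own.

```r
df <- data.frame(
  column.1 = c("vf34", "vf34", "", "nght", "vf34"),
  column.2 = c("bn56", "cv34", "bn56", "cv34", "cv34"),
  column.3 = c("qw34", "", "qw34", "", ""),
  column.4 = c("mn569", "mn569", "asder45", "asder45", "mn569"),
  stringsAsFactors = FALSE
)

# Setup: symmetric co-occurrence matrix with a zero diagonal.
terms <- setdiff(unique(unlist(df)), "")
cooc <- crossprod(sapply(terms, function(t) rowSums(df == t) > 0) * 1)
diag(cooc) <- 0

# Hypothetical helper: named pair counts for the values of one row.
pair_counts <- function(row_values, cooc) {
  present <- row_values[row_values != ""]
  if (length(present) < 2) {
    return(setNames(numeric(0), character(0)))   # no pair can be formed
  }
  pairs <- combn(present, 2)                     # 2 x k matrix of value pairs
  counts <- cooc[cbind(pairs[1, ], pairs[2, ])]  # index the matrix by (row name, column name)
  setNames(counts, paste(pairs[1, ], pairs[2, ], sep = "-"))
}

out2 <- lapply(seq_len(nrow(df)), function(i) {
  pair_counts(unlist(df[i, ], use.names = FALSE), cooc)
})
out2[[1]]
```

Indexing with a two-column character matrix picks one cell per pair directly, so the diag() step is not needed here.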
df <- structure(list(column.1 = c("vf34", "vf34", "", "nght", "vf34"
), column.2 = c("bn56", "cv34", "bn56", "cv34", "cv34"), column.3 = c("qw34",
"", "qw34", "", ""), column.4 = c("mn569", "mn569", "asder45",
"asder45", "mn569")), .Names = c("column.1", "column.2", "column.3",
"column.4"), class = "data.frame", row.names = c(NA, -5L))
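Since the question also mentions a CSV file, here is a small sketch of reading such a file so that blank cells end up as '' like in the data above. The CSV text is made up from the sample data, and df_csv is just an illustrative name; with a real file you would pass the file path instead of text =.

```r
# Inline CSV standing in for a file on disk (assumed layout: header plus 5 rows).
csv_text <- "column.1,column.2,column.3,column.4
vf34,bn56,qw34,mn569
vf34,cv34,,mn569
,bn56,qw34,asder45
nght,cv34,,asder45
vf34,cv34,,mn569"

df_csv <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Guard: a column that is blank in every row would be read as NA,
# so coerce everything to character and turn NA into "".
df_csv[] <- lapply(df_csv, function(col) {
  col <- as.character(col)
  col[is.na(col)] <- ""
  col
})

str(df_csv)
```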