计算列表中元素之间的平均成对协方差

时间:2017-04-26 13:31:10

标签: r covariance

我有以下数据框:

# df1
id   cg_v
1       a
2       b
3     a b
4     b c
5   b c d
6       d

# df2
id  cg
1    a
2    b
3    a
3    b
4    b
4    c
5    b
5    c
5    d
6    d

我需要向df1添加一列,其中包含cg_v中每对元素计算的均值协方差。如果cg_v只包含一个元素,那么我希望新列包含其方差。

我可以通过cov(crossprod(table(df2)))

获得协方差矩阵
#          a         b          c          d
a  0.9166667 0.0000000 -0.5833333 -0.6666667
b  0.0000000 2.0000000  1.0000000  0.0000000
c -0.5833333 1.0000000  0.9166667  0.3333333
d -0.6666667 0.0000000  0.3333333  0.6666667

我从这里做什么?

最终结果应该是这样的:

# df1
id   cg_v      cg_cov
1       a   0.9166667
2       b   2.0000000
3     a b   0.0000000
4     b c   1.0000000
5   b c d   0.4444444  # This is equal to (1.0000000 + 0.3333337 + 0.0000000)/3
6       d   0.6666667

生成df1df2的代码:

df1 <- structure(list(id = c(1L, 2L, 3L, 4L, 5L, 6L),
                      cg_v = c("a", "b", "a b", "b c", "b c d", "d")),
                 .Names = c("id", "cg_v"),
                 class = "data.frame", row.names = c(NA, -6L))

df2 <- structure(list(id = c(1L, 2L, 3L, 3L, 4L, 4L, 5L, 5L, 5L, 6L),
                      cg = c("a", "b", "a", "b", "b", "c", "b", "c", "d", "d")),
                 .Names = c("id", "cg"),
                 class = "data.frame", row.names = c(NA, -10L))

1 个答案:

答案 0 :(得分:1)

我认为我使用data.tables和reshape找到了解决这个问题的方法。你想用三个字母b c d做什么?我假设你想得到前两个字母的协方差:

        require(reshape)
        require(data.table)
        dt1 <- data.table(id = c(1L, 2L, 3L, 4L, 5L, 6L),
                          cg_v = c("a", "b", "a b", "b c", "b c d", "d"))
        dt2 <- data.table(id = c(1L, 2L, 3L, 3L, 4L, 4L, 5L, 5L, 5L, 6L),
                              cg = c("a", "b", "a", "b", "b", "c", "b", "c", "d", "d"))
        cov_dt <- data.table(melt(cov(crossprod(table(df2)))))
        dt1 <- cbind(dt1, t(sapply(strsplit(as.character(df1$cg_v), " "), function(x)x[1:2])))
        #replace the na with the first colomn
        dt1[is.na(V2), V2 := V1]

        # Merge them on two columns
        setkey(dt1, "V1", "V2")
        setkey(cov_dt, "X1", "X2")
        result <- cov_dt[dt1]
> result[,.(id, cg_v, value)]
   id  cg_v     value
1:  1     a 0.9166667
2:  3   a b 0.0000000
3:  2     b 2.0000000
4:  4   b c 1.0000000
5:  5 b c d 1.0000000
6:  6     d 0.6666667

Variant,如果有超过2个字母(不是最有效的代码)也可以使用:

require(reshape)
require(combinat)
df1 <- data.frame(id = c(1L, 2L, 3L, 4L, 5L, 6L),
                  cg_v = c("a", "b", "a b", "b c", "b c d", "d"))
df2 <- data.frame(id = c(1L, 2L, 3L, 3L, 4L, 4L, 5L, 5L, 5L, 6L),
                      cg = c("a", "b", "a", "b", "b", "c", "b", "c", "d", "d"))
cov_dt <- cov(crossprod(table(df2)))
mat <- sapply(strsplit(as.character(df1$cg_v), " "), function(x) if(length(x) == 1){c(x,x)} else(x))
# Should be all minimal 2 
sapply(mat, length) > 1
mat <- sapply(mat, function(x) matrix(combn(x,2), nrow = 2))
df1$cg_cov <- sapply(mat, function(x) mean(apply(x,2, function(x) cov_dt[x[1],x[2]])))
> df1
  id  cg_v    cg_cov
1  1     a 0.9166667
2  2     b 2.0000000
3  3   a b 0.0000000
4  4   b c 1.0000000
5  5 b c d 0.4444444
6  6     d 0.6666667