Question

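I have a question about data.table's melt and dcast with multiple columns. I have browsed Stack Overflow, but many of the similar posts are not what I am looking for. I explain below.

First, the data is about the causes of problems and the value amounts. Here is part of my data: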

ID   Type    Problem1    Value1     Problem2    Value2    Problem3    Value3
1    A       X           500        Y           1000      Z           400
2    A       X           600        Z           700       
3    B       Y           700        Z           100
4    B       W           200        V           200
5    C       Z           500        V           500       
6    C       X           1000       W           100       V           900

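Second, ID is unique. Type has three values (A, B and C). There are 5 problems.

Take ID == 1 as an example. It is Type A and has 3 problems (X, Y and Z). Its Problem X has Value 500, Problem Y has Value 1000, and Problem Z has Value 400. Take ID == 5 as another example. It is Type C and has 2 problems (Z and V). Its Problem Z has Value 500 and Problem V has Value 500.

Third, the columns ID, Type, Problem1, Problem2 and Problem3 are character. Value1, Value2 and Value3 are numeric.

The result I want is: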

Type    X     Y     Z     W     V
A       1100  1000  1100  0     0   
B       0     700   100   200   200
C       1000  0     500   100   1400

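I don't know how to explain this properly. I want to group by Type and then sum the values for each problem. I think this is a long-to-wide reshape. I found references here and here. The second one might be useful. However, I don't know where to start. Any suggestions?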

# data
library(data.table)
dt <- fread("
ID   Type    Problem1    Value1     Problem2    Value2    Problem3    Value3
1    A       X           500        Y           1000      Z           400
2    A       X           600        Z           700       
3    B       Y           700        Z           100
4    B       W           200        V           200
5    C       Z           500        V           500       
6    C       X           1000       W           100       V           900", fill = TRUE)

Answer 1

We can first melt to 'long' format by specifying patterns in measure, and then dcast with sum as the fun.aggregate.

dcast(melt(dt, measure = patterns("^Value", "^Problem"), 
    value.name = c("Value", "Problem"))[Problem != ""
     ][, Problem := factor(Problem, levels = c("X", "Y", "Z", "W", "V"))], 
     Type ~Problem, value.var = "Value", sum, na.rm = TRUE)
#   Type    X    Y    Z   W    V
#1:    A 1100 1000 1100   0    0
#2:    B    0  700  100 200  200
#3:    C 1000    0  500 100 1400

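The melt from data.table can take multiple patterns in the measure argument. So when we say "^Value", it matches all the columns whose names start (^) with "Value", and similarly for "Problem", and creates two 'value' columns. In the code above, we name those columns 'Value' and 'Problem' with the value.name argument. As the dataset has some blanks, the long format also contains blank elements, which we remove with Problem != "". The next step matters only if we need the columns in a specific order: we change 'Problem' to factor class and specify the levels in that order. Now the melt part is complete. The long format is then changed to 'wide' with dcast by specifying the formula, the value.var column and the fun.aggregate (here it is sum).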
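The two stages described above can be looked at separately. The following is a minimal sketch, assuming the question's data is rebuilt with data.table() so that missing problems are "" and missing values are NA; the intermediate names long and wide are introduced here only for illustration.

```r
library(data.table)

# The question's data, built explicitly (assumption: blanks are "" / NA)
dt <- data.table(
  ID       = as.character(1:6),
  Type     = c("A", "A", "B", "B", "C", "C"),
  Problem1 = c("X", "X", "Y", "W", "Z", "X"),
  Value1   = c(500, 600, 700, 200, 500, 1000),
  Problem2 = c("Y", "Z", "Z", "V", "V", "W"),
  Value2   = c(1000, 700, 100, 200, 500, 100),
  Problem3 = c("Z", "", "", "", "", "V"),
  Value3   = c(400, NA, NA, NA, NA, 900)
)

# Stage 1: melt both column families at once; one pattern per output column
long <- melt(dt, measure.vars = patterns("^Value", "^Problem"),
             value.name = c("Value", "Problem"))

# Drop the rows that came from blank Problem cells
long <- long[Problem != ""]

# Fix the column order of the wide result through the factor levels
long[, Problem := factor(Problem, levels = c("X", "Y", "Z", "W", "V"))]

# Stage 2: cast back to wide, summing Value within each Type/Problem cell
wide <- dcast(long, Type ~ Problem, value.var = "Value",
              fun.aggregate = sum, na.rm = TRUE)
print(wide)
```

Inspecting long before the dcast call is a convenient way to check that each Value still sits next to the Problem it belongs to.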

Answer 2

A dumb and direct way, but it still works (hopefully someone can help improve my solution).

library(magrittr)
rbind(
    dt[, .(Type, P = Problem1, V = Value1)],
    dt[, .(Type, P = Problem2, V = Value2)],
    dt[, .(Type, P = Problem3, V = Value3)]) %>%
    .[P != ""] %>%
    dcast(Type ~ P, value.var = "V", sum)

Edit: improved by following akrun's code (passing a function to dcast).

Answer 3

This can be done easily with dplyr / tidyr:

library("dplyr")
library("tidyr")

# assume x is your dataframe
bind_rows(
  select(x, ID, Type, Problem = Problem1, Value = Value1),
  select(x, ID, Type, Problem = Problem2, Value = Value2),
  select(x, ID, Type, Problem = Problem3, Value = Value3)
  ) %>%
filter(!(is.na(Problem))) %>%
group_by(Type, Problem) %>%
summarise(Value = sum(Value)) %>%
spread(Problem, Value, fill = 0)

Output

# A tibble: 3 x 6
# Groups:   Type [3]
   Type     V     W     X     Y     Z
* <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1     A     0     0  1100  1000  1100
2     B   200   200     0   700   100
3     C  1400   100  1000     0   500

If the order of the columns V-Z matters, it can easily be fixed by adding a final select statement.
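As a small sketch of that last step, assuming the wide result shown above is stored in a tibble (the names wide and ordered are hypothetical and used only here), the columns can be listed in the wanted order:

```r
library(dplyr)

# The wide result from the output above, typed in by hand for illustration
wide <- tibble(
  Type = c("A", "B", "C"),
  V = c(0, 200, 1400),
  W = c(0, 200, 100),
  X = c(1100, 0, 1000),
  Y = c(1000, 700, 0),
  Z = c(1100, 100, 500)
)

# select() returns the columns in the order they are named
ordered <- wide %>% select(Type, X, Y, Z, W, V)
print(ordered)
```

In the pipeline itself this would simply be one more `%>% select(Type, X, Y, Z, W, V)` after the spread call.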

Answer 4

Here is a method that uses the melt call as performed by akrun, and then uses matrix subsetting to return the desired result.

# melt and aggregate the data
temp <- melt(dt, measure = patterns("^Value", "^Problem"),
             value.name = c("Value", "Problem"))[
        !is.na(Value), .(Value=sum(Value)), by=.(Type, Problem)]

# set up the storage matrix
dimNames <- list(sort(unique(temp$Type)), unique(temp$Problem))
myMat <- matrix(0, length(dimNames[[1]]), length(dimNames[[2]]), dimnames=dimNames)

# fill in the matrix with the desired values
myMat[cbind(temp$Type, temp$Problem)] <- temp$Value

This returns the matrix

myMat
     X    Y   W    Z    V
A 1100 1000   0 1100    0
B    0  700 200  100  200
C 1000    0 100  500 1400
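The fill step works because base R lets a matrix with dimnames be indexed by a two-column character matrix, where each row of the index names one (row, column) cell. A tiny sketch with made-up values (m and idx are illustrative names, not part of the answer):

```r
# A 2 x 2 matrix of zeros with named rows and columns
m <- matrix(0, nrow = 2, ncol = 2,
            dimnames = list(c("A", "B"), c("X", "Y")))

# Each row of idx is a (row name, column name) pair
idx <- cbind(c("A", "B"), c("Y", "X"))

# Assign one value per addressed cell; all other cells stay 0
m[idx] <- c(10, 20)
print(m)
```

Cells that are never addressed keep their initial 0, which is why the missing Type/Problem combinations show up as 0 in myMat.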

To return a data.table, you can do

data.table(myMat, keep.rownames=TRUE)
   rn    X    Y   W    Z    V
1:  A 1100 1000   0 1100    0
2:  B    0  700 200  100  200
3:  C 1000    0 100  500 1400
