aggregate behaving differently on seemingly identical tasks (R)

Date: 2019-06-19 07:59:19

Tags: r aggregate

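I have been banging my head against a brick wall over this for days. I would like to know whether anyone can see what is wrong with my code, or tell me whether I am overlooking something obvious.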

I have this data.frame, in which most columns are vectors (numeric or character) and one column is a list of character vectors:

t0g2 <- structure(list(P = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 
4, 4, 5, 5, 5, 5), ID = c(8, 10, 7, 9, 5, 2, 3, 4, 8, 9, 1, 2, 
8, 1, 4, 10, 4, 10, 2, 7), SC = c("A", "D", "A", "B", "B", "A", 
"A", "E", "A", "B", "D", "A", "A", "D", "E", "D", "E", "D", "A", 
"A"), FP = list(`40,41,37,8,11` = c("40", "41", "37", "8", "11"
), `49,28,16,41` = c("49", "28", "16", "41"), `15,49` = c("15", 
"49"), `27,12,20,35,45` = c("27", "12", "20", "35", "45"), `1,34,43,37` = c("1", 
"34", "43", "37"), `41,7,30,2,34,43` = c("41", "7", "30", "2", 
"34", "43"), `22,35,31,10,3` = c("22", "35", "31", "10", "3"), 
    `29,6,15` = c("29", "6", "15"), `40,41,37,8,11` = c("40", 
    "41", "37", "8", "11"), `27,12,20,35,45` = c("27", "12", 
    "20", "35", "45"), `10,49,28` = c("10", "49", "28"), `41,7,30,2,34,43` = c("41", 
    "7", "30", "2", "34", "43"), `40,41,37,8,11` = c("40", "41", 
    "37", "8", "11"), `10,49,28` = c("10", "49", "28"), `29,6,15` = c("29", 
    "6", "15"), `49,28,16,41` = c("49", "28", "16", "41"), `29,6,15` = c("29", 
    "6", "15"), `49,28,16,41` = c("49", "28", "16", "41"), `41,7,30,2,34,43` = c("41", 
    "7", "30", "2", "34", "43"), `15,49` = c("15", "49"))), class = "data.frame", row.names = c("8", 
"10", "7", "9", "5", "2", "3", "4", "81", "91", "1", "21", "82", 
"11", "41", "101", "42", "102", "22", "71"))

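I want to aggregate it by one column, where the function applied to the other columns is simply the concatenation of their unique values. [Yes, I know this can be done with many ad hoc packages, but I need to use base R.]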

If I choose the numeric column "ID" as the column to aggregate on, this works perfectly well:

aggregate(x=t0g2[, !(colnames(t0g2) %in% c("ID"))], by=list(ID=t0g2[["ID"]]), 
          FUN=function(y) unique(unlist(y)))
#  ID       P SC                   FP
#1  1    3, 4  D           10, 49, 28
#2  2 2, 3, 5  A 41, 7, 30, 2, 34, 43
#3  3       2  A    22, 35, 31, 10, 3
#4  4 2, 4, 5  E            29, 6, 15
#5  5       2  B        1, 34, 43, 37
#6  7    1, 5  A               15, 49
#7  8 1, 3, 4  A    40, 41, 37, 8, 11
#8  9    1, 3  B   27, 12, 20, 35, 45
#9 10 1, 4, 5  D       49, 28, 16, 41

Or the character column "SC":

aggregate(x=t0g2[, !(colnames(t0g2) %in% c("SC"))], by=list(SC=t0g2[["SC"]]), 
          FUN=function(y) unique(unlist(y)))
#  SC             P         ID                                                             FP
#1  A 1, 2, 3, 4, 5 8, 7, 2, 3 40, 41, 37, 8, 11, 15, 49, 7, 30, 2, 34, 43, 22, 35, 31, 10, 3
#2  B       1, 2, 3       9, 5                              27, 12, 20, 35, 45, 1, 34, 43, 37
#3  D    1, 3, 4, 5      10, 1                                             49, 28, 16, 41, 10
#4  E       2, 4, 5          4                                                      29, 6, 15

However, if I try it with "P" (which, as far as I can tell, is just another numeric column), this is what I get:

aggregate(x=t0g2[, !(colnames(t0g2) %in% c("P"))], by=list(P=t0g2[["P"]]), 
          FUN=function(y) unique(unlist(y)))
#   P ID.1 ID.2 ID.3 ID.4 SC.1 SC.2 SC.3                                                                  FP
#1  1    8   10    7    9    A    D    B               40, 41, 37, 8, 11, 49, 28, 16, 15, 27, 12, 20, 35, 45
#2  2    5    2    3    4    B    A    E           1, 34, 43, 37, 41, 7, 30, 2, 22, 35, 31, 10, 3, 29, 6, 15
#3  3    8    9    1    2    A    B    D 40, 41, 37, 8, 11, 27, 12, 20, 35, 45, 10, 49, 28, 7, 30, 2, 34, 43
#4  4    8    1    4   10    A    D    E                        40, 41, 37, 8, 11, 10, 49, 28, 29, 6, 15, 16
#5  5    4   10    2    7    E    D    A                         29, 6, 15, 49, 28, 16, 41, 7, 30, 2, 34, 43

Does anyone know what is going on, and why this happens? This thing is literally driving me crazy...


EDIT: As requested by jay.sf, added an example of the desired output when aggregating on "P".

#  P          ID      SC                                                                  FP
#1 1 8, 10, 7, 9 A, D, B               40, 41, 37, 8, 11, 49, 28, 16, 15, 27, 12, 20, 35, 45
#2 2  5, 2, 3, 4 B, A, E           1, 34, 43, 37, 41, 7, 30, 2, 22, 35, 31, 10, 3, 29, 6, 15
#3 3  8, 9, 1, 2 A, B, D 40, 41, 37, 8, 11, 27, 12, 20, 35, 45, 10, 49, 28, 7, 30, 2, 34, 43
#4 4 8, 1, 4, 10 A, D, E                        40, 41, 37, 8, 11, 10, 49, 28, 29, 6, 15, 16
#5 5 4, 10, 2, 7 E, D, A                         29, 6, 15, 49, 28, 16, 41, 7, 30, 2, 34, 43

Actually, I found that setting simplify=F in aggregate gives me what I want.
I hope this does not backfire.


EDIT 2: It backfired...

I do not want every column to become a list when it could be a plain vector, but with simplify = F they all do become lists:

sapply(aggregate(x=t0g2[,!(colnames(t0g2) %in% c("P"))],by=list(P=t0g2[["P"]]),FUN=function(y) unique(unlist(y)), simplify = F),class)
#        P        ID        SC        FP 
#"numeric"    "list"    "list"    "list" 

sapply(aggregate(x=t0g2[,!(colnames(t0g2) %in% c("ID"))],by=list(ID=t0g2[["ID"]]),FUN=function(y) unique(unlist(y)), simplify = T),class)
#         ID           P          SC          FP 
#  "numeric"      "list" "character"      "list" 

sapply(aggregate(x=t0g2[,!(colnames(t0g2) %in% c("ID"))],by=list(ID=t0g2[["ID"]]),FUN=function(y) unique(unlist(y)), simplify = F),class)
#       ID         P        SC        FP 
#"numeric"    "list"    "list"    "list" 

So I still do not have a solution... :(


EDIT 3: Perhaps a workable (if clumsy) solution?

t0g2_by_ID <- aggregate(x=t0g2[,!(colnames(t0g2) %in% c("ID"))],by=list(ID=t0g2[["ID"]]),FUN=function(y) unique(unlist(y)), simplify = F)

sapply(t0g2_by_ID,class)
#       ID         P        SC        FP 
#"numeric"    "list"    "list"    "list" 

for (i in seq_len(ncol(t0g2_by_ID))) {
  y <- t0g2_by_ID[[i]]
  # flatten a list column only when every element has length 1
  if (is.list(y) && length(y) == length(unlist(y))) t0g2_by_ID[[i]] <- unlist(y)
}

sapply(t0g2_by_ID,class)
#       ID           P          SC          FP 
#"numeric"      "list" "character"      "list" 

I tried to replace the inelegant loop with sapply, but then any cbind operation takes me back to a data.frame of lists.

This is the best I could come up with.

If anyone can suggest a better way to do this using only base R, that would be great.
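A possible loop-free variant of the clean-up step in EDIT 3, offered only as a sketch. It uses a small made-up data frame (`toy`, not the `t0g2` data above) and a hypothetical helper name, `flatten_if_scalar`. The idea is to assign the result of `lapply` back with `agg[] <-`, which keeps the data.frame structure instead of rebuilding it with `cbind`.

```r
# Toy data (hypothetical): each group has 2 unique x values but 1 unique y value
toy <- data.frame(g = c(1, 1, 2, 2),
                  x = c(10, 11, 12, 13),
                  y = c("a", "a", "b", "b"),
                  stringsAsFactors = FALSE)

# simplify = FALSE turns every aggregated column into a list column
agg <- aggregate(x = toy[, c("x", "y")], by = list(g = toy[["g"]]),
                 FUN = function(v) unique(unlist(v)), simplify = FALSE)

# Hypothetical helper: flatten a list column only if all its elements have length 1
flatten_if_scalar <- function(col) {
  if (is.list(col) && all(lengths(col) == 1L)) unlist(col) else col
}

# Assigning into agg[] replaces the columns but keeps the data.frame
agg[] <- lapply(agg, flatten_if_scalar)

print(sapply(agg, class))
```

Here `y` should come back as a character vector while `x` stays a list column, because each group still holds two unique `x` values.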

2 Answers:

Answer 0: (score: 1)

aggregate apparently tries to return a matrix whenever possible. See the following example:

# data
n <- 10
df <- data.frame(id= rep(1:2, each= n/2),
             value= 1:n)

length(unique(df$value[df$id == 1])) == length(unique(df$value[df$id == 2]))
TRUE

The number of unique values is the same for every id, so aggregate returns a matrix:

aggregate(x= df[, "value"], by=list(id=df[, "id"]), 
      FUN=function(y) unique(unlist(y)))
   id x.1 x.2 x.3 x.4 x.5
1  1   1   2   3   4   5
2  2   6   7   8   9  10

Now we change the data so that the number of unique values differs between the ids:

df$value[2] <- 1
length(unique(df$value[df$id == 1])) == length(unique(df$value[df$id == 2]))
FALSE

In this case, we get an output whose values are separated by ",":

aggregate(x= df[, "value"], by=list(id=df[, "id"]), 
      FUN=function(y) unique(unlist(y)))
  id              x
1  1     1, 3, 4, 5
2  2 6, 7, 8, 9, 10

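In your case, each value of P has exactly 4 unique ID values and exactly 3 unique SC values, so aggregate returns those results as matrices. This does not hold for FP: the number of unique FP values differs between the P groups, so aggregate cannot build a matrix there and we get the values separated by ",".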
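The condition described above can be checked directly before calling aggregate. The following is a minimal sketch on the same `df` as in this answer; `same_length_per_group` is a hypothetical helper name, not a base R function.

```r
# Hypothetical helper (not part of base R): TRUE when FUN returns results of one
# common length for every group, which is the situation in which
# aggregate(..., simplify = TRUE) collapses the results into a vector or matrix
same_length_per_group <- function(values, groups, FUN) {
  lens <- lengths(lapply(split(values, groups), FUN))
  length(unique(lens)) == 1L
}

n <- 10
df <- data.frame(id = rep(1:2, each = n / 2), value = 1:n)

# 5 unique values in each group
before <- same_length_per_group(df$value, df$id, unique)

# group 1 now has only 4 unique values
df$value[2] <- 1
after <- same_length_per_group(df$value, df$id, unique)

print(c(before = before, after = after))
```

`before` is TRUE (the matrix case) and `after` is FALSE (the list-column case), matching the two aggregate outputs shown in this answer.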

Answer 1: (score: 0)

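The simplify argument of aggregate defaults to TRUE, which means results are simplified to a vector or matrix whenever possible. Every group in P yields the same number of unique values (n = 4 for ID), so your aggregated data were simplified to a matrix. Just set simplify = FALSE to change this behavior: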

aggregate(x=t0g2[, !(colnames(t0g2) %in% c("P"))], by=list(P=t0g2[["P"]]), 
          FUN=function(y) unique(unlist(y)), simplify = F)

#### OUTPUT ####

  P          ID      SC                                                                  FP
1 1 8, 10, 7, 9 A, D, B               40, 41, 37, 8, 11, 49, 28, 16, 15, 27, 12, 20, 35, 45
2 2  5, 2, 3, 4 B, A, E           1, 34, 43, 37, 41, 7, 30, 2, 22, 35, 31, 10, 3, 29, 6, 15
3 3  8, 9, 1, 2 A, B, D 40, 41, 37, 8, 11, 27, 12, 20, 35, 45, 10, 49, 28, 7, 30, 2, 34, 43
4 4 8, 1, 4, 10 A, D, E                        40, 41, 37, 8, 11, 10, 49, 28, 29, 6, 15, 16
5 5 4, 10, 2, 7 E, D, A                         29, 6, 15, 49, 28, 16, 41, 7, 30, 2, 34, 43
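To see what the simplification actually produces, it can help to inspect the column itself rather than the printed data frame. A minimal sketch on made-up data (`toy` is hypothetical, not the `t0g2` data from the question):

```r
# Toy data (hypothetical): each group has exactly 2 unique x values
toy <- data.frame(g = c(1, 1, 2, 2), x = c(10, 11, 12, 13))
uniq <- function(v) unique(unlist(v))

# Default simplify = TRUE: x is a single matrix column, printed as x.1 and x.2
res_matrix <- aggregate(x = toy["x"], by = list(g = toy[["g"]]), FUN = uniq)
print(ncol(res_matrix))        # 2 (g and x), not 3
print(is.matrix(res_matrix$x)) # TRUE
print(dim(res_matrix$x))       # 2 2

# simplify = FALSE: x is a list column with one vector per group
res_list <- aggregate(x = toy["x"], by = list(g = toy[["g"]]), FUN = uniq,
                      simplify = FALSE)
print(is.list(res_list$x))     # TRUE
```

So the columns printed as ID.1 to ID.4 in the question are one matrix-valued column named ID, not four separate columns.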