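I have been banging my head against a brick wall on this for days. I would like to know whether anyone can see what is wrong with my code, or tell me if I am overlooking something obvious.
I have this data.frame, in which most columns are vectors (numeric or character) and one column is a list of character vectors: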
t0g2 <- structure(list(P = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4,
4, 4, 5, 5, 5, 5), ID = c(8, 10, 7, 9, 5, 2, 3, 4, 8, 9, 1, 2,
8, 1, 4, 10, 4, 10, 2, 7), SC = c("A", "D", "A", "B", "B", "A",
"A", "E", "A", "B", "D", "A", "A", "D", "E", "D", "E", "D", "A",
"A"), FP = list(`40,41,37,8,11` = c("40", "41", "37", "8", "11"
), `49,28,16,41` = c("49", "28", "16", "41"), `15,49` = c("15",
"49"), `27,12,20,35,45` = c("27", "12", "20", "35", "45"), `1,34,43,37` = c("1",
"34", "43", "37"), `41,7,30,2,34,43` = c("41", "7", "30", "2",
"34", "43"), `22,35,31,10,3` = c("22", "35", "31", "10", "3"),
`29,6,15` = c("29", "6", "15"), `40,41,37,8,11` = c("40",
"41", "37", "8", "11"), `27,12,20,35,45` = c("27", "12",
"20", "35", "45"), `10,49,28` = c("10", "49", "28"), `41,7,30,2,34,43` = c("41",
"7", "30", "2", "34", "43"), `40,41,37,8,11` = c("40", "41",
"37", "8", "11"), `10,49,28` = c("10", "49", "28"), `29,6,15` = c("29",
"6", "15"), `49,28,16,41` = c("49", "28", "16", "41"), `29,6,15` = c("29",
"6", "15"), `49,28,16,41` = c("49", "28", "16", "41"), `41,7,30,2,34,43` = c("41",
"7", "30", "2", "34", "43"), `15,49` = c("15", "49"))), class = "data.frame", row.names = c("8",
"10", "7", "9", "5", "2", "3", "4", "81", "91", "1", "21", "82",
"11", "41", "101", "42", "102", "22", "71"))
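I want to aggregate it by one column, where the function applied to the other columns is simply the concatenation of their unique values. [Yes, I know this can be done with a number of ad hoc packages, but I need to do it in base R.]
This works perfectly well if I choose the numeric column "ID" as the column to aggregate on: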
aggregate(x=t0g2[, !(colnames(t0g2) %in% c("ID"))], by=list(ID=t0g2[["ID"]]),
FUN=function(y) unique(unlist(y)))
# ID P SC FP
#1 1 3, 4 D 10, 49, 28
#2 2 2, 3, 5 A 41, 7, 30, 2, 34, 43
#3 3 2 A 22, 35, 31, 10, 3
#4 4 2, 4, 5 E 29, 6, 15
#5 5 2 B 1, 34, 43, 37
#6 7 1, 5 A 15, 49
#7 8 1, 3, 4 A 40, 41, 37, 8, 11
#8 9 1, 3 B 27, 12, 20, 35, 45
#9 10 1, 4, 5 D 49, 28, 16, 41
Or the character column "SC":
aggregate(x=t0g2[, !(colnames(t0g2) %in% c("SC"))], by=list(SC=t0g2[["SC"]]),
FUN=function(y) unique(unlist(y)))
# SC P ID FP
#1 A 1, 2, 3, 4, 5 8, 7, 2, 3 40, 41, 37, 8, 11, 15, 49, 7, 30, 2, 34, 43, 22, 35, 31, 10, 3
#2 B 1, 2, 3 9, 5 27, 12, 20, 35, 45, 1, 34, 43, 37
#3 D 1, 3, 4, 5 10, 1 49, 28, 16, 41, 10
#4 E 2, 4, 5 4 29, 6, 15
But if I try "P" (which, as far as I can tell, is just another numeric column), this is what I get:
aggregate(x=t0g2[, !(colnames(t0g2) %in% c("P"))], by=list(P=t0g2[["P"]]),
FUN=function(y) unique(unlist(y)))
# P ID.1 ID.2 ID.3 ID.4 SC.1 SC.2 SC.3 FP
#1 1 8 10 7 9 A D B 40, 41, 37, 8, 11, 49, 28, 16, 15, 27, 12, 20, 35, 45
#2 2 5 2 3 4 B A E 1, 34, 43, 37, 41, 7, 30, 2, 22, 35, 31, 10, 3, 29, 6, 15
#3 3 8 9 1 2 A B D 40, 41, 37, 8, 11, 27, 12, 20, 35, 45, 10, 49, 28, 7, 30, 2, 34, 43
#4 4 8 1 4 10 A D E 40, 41, 37, 8, 11, 10, 49, 28, 29, 6, 15, 16
#5 5 4 10 2 7 E D A 29, 6, 15, 49, 28, 16, 41, 7, 30, 2, 34, 43
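Does anyone know what is going on and why this happens? I am literally stuck on this...
EDIT: as requested by jay.sf, here is an example of the desired output when aggregating on "P".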
# P ID SC FP
#1 1 8, 10, 7, 9 A, D, B 40, 41, 37, 8, 11, 49, 28, 16, 15, 27, 12, 20, 35, 45
#2 2 5, 2, 3, 4 B, A, E 1, 34, 43, 37, 41, 7, 30, 2, 22, 35, 31, 10, 3, 29, 6, 15
#3 3 8, 9, 1, 2 A, B, D 40, 41, 37, 8, 11, 27, 12, 20, 35, 45, 10, 49, 28, 7, 30, 2, 34, 43
#4 4 8, 1, 4, 10 A, D, E 40, 41, 37, 8, 11, 10, 49, 28, 29, 6, 15, 16
#5 5 4, 10, 2, 7 E, D, A 29, 6, 15, 49, 28, 16, 41, 7, 30, 2, 34, 43
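Actually, I found that setting simplify=F in aggregate achieves what I want.
I hope this doesn't backfire.
EDIT 2: it did backfire...
I don't want every column to become a list when it could be a vector, but with simplify = F they all do become lists: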
sapply(aggregate(x=t0g2[,!(colnames(t0g2) %in% c("P"))],by=list(P=t0g2[["P"]]),FUN=function(y) unique(unlist(y)), simplify = F),class)
# P ID SC FP
#"numeric" "list" "list" "list"
sapply(aggregate(x=t0g2[,!(colnames(t0g2) %in% c("ID"))],by=list(ID=t0g2[["ID"]]),FUN=function(y) unique(unlist(y)), simplify = T),class)
# ID P SC FP
# "numeric" "list" "character" "list"
sapply(aggregate(x=t0g2[,!(colnames(t0g2) %in% c("ID"))],by=list(ID=t0g2[["ID"]]),FUN=function(y) unique(unlist(y)), simplify = F),class)
# ID P SC FP
#"numeric" "list" "list" "list"
So I still don't have a solution... :(
EDIT 3: perhaps a workable (if clunky) solution?
t0g2_by_ID <- aggregate(x=t0g2[,!(colnames(t0g2) %in% c("ID"))],by=list(ID=t0g2[["ID"]]),FUN=function(y) unique(unlist(y)), simplify = F)
sapply(t0g2_by_ID,class)
# ID P SC FP
#"numeric" "list" "list" "list"
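# Turn a list column back into a plain vector when unlisting it
# yields exactly as many values as there are rows
for (i in seq_len(NCOL(t0g2_by_ID))) {
  y <- t0g2_by_ID[[i]]
  if (is.list(y) && length(y) == length(unlist(y))) t0g2_by_ID[[i]] <- unlist(y)
}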
sapply(t0g2_by_ID,class)
# ID P SC FP
#"numeric" "list" "character" "list"
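I tried to get rid of the inelegant loop using sapply, but then any cbind operation turned the result back into a data.frame of lists.
This is the best I can come up with.
If anyone can suggest a better way to do this using only base R, that would be great.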
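One possible way to drop the explicit loop, sketched here on a small made-up data frame rather than on t0g2: aggregate with simplify = FALSE, then replace the columns in place with `by_ID[] <- lapply(...)`, which keeps the data.frame class instead of going through cbind. The names `toy`, `by_ID` and `collapse_if_scalar` are illustrative assumptions, not part of the original code.

```r
# Toy data (hypothetical, much smaller than t0g2) with one list column, FP
toy <- data.frame(P  = c(1, 1, 2, 2),
                  ID = c(8, 10, 8, 9),
                  SC = c("A", "D", "A", "B"),
                  stringsAsFactors = FALSE)
toy$FP <- list(c("40", "41"), "49", c("40", "41"), c("27", "12", "20"))

# Aggregate on ID; simplify = FALSE keeps every aggregated column as a list
by_ID <- aggregate(x = toy[, setdiff(names(toy), "ID")],
                   by = list(ID = toy[["ID"]]),
                   FUN = function(y) unique(unlist(y)),
                   simplify = FALSE)

# Hypothetical helper: unlist a column only if every element has length 1
collapse_if_scalar <- function(col) {
  if (is.list(col) && all(lengths(col) == 1L)) {
    unlist(col, use.names = FALSE)
  } else {
    col
  }
}

# Assigning into by_ID[] replaces the columns but keeps the data.frame
by_ID[] <- lapply(by_ID, collapse_if_scalar)

sapply(by_ID, class)
```

On this toy input, SC should come back as a character vector while P and FP stay lists, since some IDs occur with more than one P or FP value.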
Answer 0 (score: 1)
aggregate evidently tries to return a matrix whenever it can. See the following example:
# data
n <- 10
df <- data.frame(id= rep(1:2, each= n/2),
value= 1:n)
length(unique(df$value[df$id == 1])) == length(unique(df$value[df$id == 2]))
TRUE
The number of unique values is the same for every id, so aggregate returns a matrix:
aggregate(x= df[, "value"], by=list(id=df[, "id"]),
FUN=function(y) unique(unlist(y)))
id x.1 x.2 x.3 x.4 x.5
1 1 1 2 3 4 5
2 2 6 7 8 9 10
Now we change the data so that the number of unique values differs between the ids:
df$value[2] <- 1
length(unique(df$value[df$id == 1])) == length(unique(df$value[df$id == 2]))
FALSE
In this case we get an output in which the values are separated by commas:
aggregate(x= df[, "value"], by=list(id=df[, "id"]),
FUN=function(y) unique(unlist(y)))
id x
1 1 1, 3, 4, 5
2 2 6, 7, 8, 9, 10
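In your case, for each value of P there are exactly 4 unique ID values and exactly 3 unique SC values, so aggregate returns those results as matrices (which is why the printout shows ID.1 to ID.4 and SC.1 to SC.3). That is not the case for FP: there aggregate cannot build a matrix, so we get the values separated by commas.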
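To see that the x.1 ... x.5 headers are a single matrix-valued column rather than five separate columns, one can inspect the result directly. This is a minimal sketch reusing the example data from this answer; `res_equal` and `res_unequal` are names made up for illustration, and the data is passed as `df["value"]` so that the result column keeps the name value.

```r
df <- data.frame(id = rep(1:2, each = 5), value = 1:10)

# Equal number of unique values per id: the result column is a matrix
res_equal <- aggregate(x = df["value"], by = list(id = df$id),
                       FUN = function(y) unique(unlist(y)))
ncol(res_equal)             # counts id and value only
is.matrix(res_equal$value)  # the "value.1 ... value.5" block is one column
dim(res_equal$value)

# Unequal number of unique values per id: the result column is a list
df$value[2] <- 1
res_unequal <- aggregate(x = df["value"], by = list(id = df$id),
                         FUN = function(y) unique(unlist(y)))
is.list(res_unequal$value)
lengths(res_unequal$value)
```

The same check, for example `sapply(result, is.matrix)`, can be run on the output of aggregating t0g2 by P to confirm which columns were simplified.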
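Answer 1 (score: 0)
aggregate's simplify argument defaults to TRUE, which means it simplifies the result to a vector or matrix whenever possible. All groups in P have n = 4 (and FUN returns results of equal length for ID and for SC in every group), so your aggregated data was simplified to matrices. Just set simplify = FALSE to change this behaviour: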
aggregate(x=t0g2[, !(colnames(t0g2) %in% c("P"))], by=list(P=t0g2[["P"]]),
FUN=function(y) unique(unlist(y)), simplify = F)
#### OUTPUT ####
P ID SC FP
1 1 8, 10, 7, 9 A, D, B 40, 41, 37, 8, 11, 49, 28, 16, 15, 27, 12, 20, 35, 45
2 2 5, 2, 3, 4 B, A, E 1, 34, 43, 37, 41, 7, 30, 2, 22, 35, 31, 10, 3, 29, 6, 15
3 3 8, 9, 1, 2 A, B, D 40, 41, 37, 8, 11, 27, 12, 20, 35, 45, 10, 49, 28, 7, 30, 2, 34, 43
4 4 8, 1, 4, 10 A, D, E 40, 41, 37, 8, 11, 10, 49, 28, 29, 6, 15, 16
5 5 4, 10, 2, 7 E, D, A 29, 6, 15, 49, 28, 16, 41, 7, 30, 2, 34, 43
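If the aggregated values are only needed as display strings, an alternative that avoids list columns altogether is to collapse each group into a single string with toString, so that FUN always returns a length-1 result and the default simplify = TRUE yields plain character vectors. This is a sketch under that assumption, on made-up data (`toy` and `by_P` are illustrative names); note that numeric columns such as ID become character as well.

```r
# Toy data (hypothetical) with the same column layout as t0g2
toy <- data.frame(P  = c(1, 1, 2, 2),
                  ID = c(8, 10, 8, 9),
                  SC = c("A", "D", "A", "B"),
                  stringsAsFactors = FALSE)
toy$FP <- list(c("40", "41"), "49", c("40", "41"), c("27", "12", "20"))

# toString() pastes the unique values together with ", ", so every group
# gives exactly one string and aggregate simplifies to character vectors
by_P <- aggregate(x = toy[, setdiff(names(toy), "P")],
                  by = list(P = toy[["P"]]),
                  FUN = function(y) toString(unique(unlist(y))))

sapply(by_P, class)
by_P
```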