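I'm trying to do something similar to what's answered here, which gets me 80% of the way. I have a data frame with one ID column and multiple information columns. I'd like to roll up all of the other columns so that there is only one row per ID, with multiple entries separated by, for example, a semicolon. Here is an example of what I have and what I want.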
HAVE:
     ID  info1          info2
1 id101    one          first
2 id102   twoA second alias A
3 id102   twoB second alias B
4 id103 threeA  third alias A
5 id103 threeB  third alias B
6 id104   four         fourth
7 id105   five          fifth
WANT:
     ID          info1                          info2
1 id101            one                          first
2 id102     twoA; twoB second alias A; second alias B
3 id103 threeA; threeB   third alias A; third alias B
4 id104           four                         fourth
5 id105           five                          fifth
Here is the code used to generate these:
have <- data.frame(ID=paste0("id", c(101, 102, 102, 103, 103, 104, 105)),
                   info1=c("one", "twoA", "twoB", "threeA", "threeB", "four", "five"),
                   info2=c("first", "second alias A", "second alias B", "third alias A", "third alias B", "fourth", "fifth"),
                   stringsAsFactors=FALSE)
# data.frame() rather than data_frame(): the latter would treat stringsAsFactors as an extra column
want <- data.frame(ID=paste0("id", c(101:105)),
                   info1=c("one", "twoA; twoB", "threeA; threeB", "four", "five"),
                   info2=c("first", "second alias A; second alias B", "third alias A; third alias B", "fourth", "fifth"),
                   stringsAsFactors=FALSE)
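This question asks basically the same thing, but with only a single "info" column. I have multiple other columns and would like to do this for all of them.
Bonus points for doing this with dplyr.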
Answer 0 (score: 15)
Here is an option using summarise_each (which makes it easy to apply a change to all columns except the grouping variables) and toString:
require(dplyr)
have %>%
group_by(ID) %>%
summarise_each(funs(toString))
#Source: local data frame [5 x 3]
#
#     ID          info1                          info2
#1 id101            one                          first
#2 id102     twoA, twoB second alias A, second alias B
#3 id103 threeA, threeB   third alias A, third alias B
#4 id104           four                         fourth
#5 id105           five                          fifth
Or, if you want the entries separated by semicolons, you can use:
have %>%
group_by(ID) %>%
summarise_each(funs(paste(., collapse = "; ")))
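In newer dplyr releases, summarise_each() and funs() are deprecated. A minimal sketch of the same idea, assuming dplyr 1.0.0 or later (where across() is available), could look like the following. The name `rolled` is only illustrative and does not come from the original answer.

```r
library(dplyr)

# Example data from the question
have <- data.frame(
  ID    = paste0("id", c(101, 102, 102, 103, 103, 104, 105)),
  info1 = c("one", "twoA", "twoB", "threeA", "threeB", "four", "five"),
  info2 = c("first", "second alias A", "second alias B",
            "third alias A", "third alias B", "fourth", "fifth"),
  stringsAsFactors = FALSE
)

# Inside a grouped summarise(), across(everything()) skips the grouping
# column (ID), so only info1 and info2 are collapsed.
rolled <- have %>%
  group_by(ID) %>%
  summarise(across(everything(), ~ paste(.x, collapse = "; ")),
            .groups = "drop")

print(rolled)
```

Replacing the lambda with `toString` should give the comma-separated variant instead.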
Answer 1 (score: 11)
Good old aggregate does this just fine:
aggregate(have[,2:3], by=list(have$ID), paste, collapse=";")
The question is: does it scale?
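As far as the number of columns goes, one possible variation on the aggregate call above selects every non-ID column by name instead of hard-coding positions 2:3, and names the `by` list so the grouping column comes back as ID rather than Group.1. This is only a sketch; `info_cols` and `rolled` are illustrative names, and it says nothing about speed on large data.

```r
# Example data from the question
have <- data.frame(
  ID    = paste0("id", c(101, 102, 102, 103, 103, 104, 105)),
  info1 = c("one", "twoA", "twoB", "threeA", "threeB", "four", "five"),
  info2 = c("first", "second alias A", "second alias B",
            "third alias A", "third alias B", "fourth", "fifth"),
  stringsAsFactors = FALSE
)

# Every column except the ID column, however many there are
info_cols <- setdiff(names(have), "ID")

# Naming the element of `by` keeps the column name "ID" in the result;
# collapse = "; " is passed through to paste()
rolled <- aggregate(have[info_cols], by = list(ID = have$ID),
                    FUN = paste, collapse = "; ")

print(rolled)
```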
Answer 2 (score: 8)
Here is a data.table solution.
library(data.table)
setDT(have)[, lapply(.SD, paste, collapse = "; "), by = ID]
#       ID          info1                          info2
# 1: id101            one                          first
# 2: id102     twoA; twoB second alias A; second alias B
# 3: id103 threeA; threeB   third alias A; third alias B
# 4: id104           four                         fourth
# 5: id105           five                          fifth
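Note that setDT() converts `have` to a data.table by reference. If the original data frame should stay untouched, or only some columns should be collapsed, a variant along these lines may help. It is a sketch only: `dt`, `info_cols` and `rolled` are illustrative names, not part of the original answer.

```r
library(data.table)

# Example data from the question
have <- data.frame(
  ID    = paste0("id", c(101, 102, 102, 103, 103, 104, 105)),
  info1 = c("one", "twoA", "twoB", "threeA", "threeB", "four", "five"),
  info2 = c("first", "second alias A", "second alias B",
            "third alias A", "third alias B", "fourth", "fifth"),
  stringsAsFactors = FALSE
)

# as.data.table() makes a copy, so `have` remains a plain data.frame
dt <- as.data.table(have)

# .SDcols restricts .SD to the listed columns
info_cols <- c("info1", "info2")
rolled <- dt[, lapply(.SD, paste, collapse = "; "), by = ID, .SDcols = info_cols]

print(rolled)
```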
Answer 3 (score: 4)
Here is an SQL solution^1:
library(sqldf)
#Static solution
sqldf("
SELECT ID,
GROUP_CONCAT(info1,';') as info1,
GROUP_CONCAT(info2,';') as info2
FROM have
GROUP BY ID")
#Dynamic solution
concat_cols <- colnames(have)[2:ncol(have)]
group_concat <-
paste(paste0("GROUP_CONCAT(",concat_cols,",';') as ", concat_cols),
collapse = ",")
sqldf(
paste("
SELECT ID,",
group_concat,"
FROM have
GROUP BY ID"))
# Same output for both static and dynamic solutions
#      ID         info1                         info2
# 1 id101           one                         first
# 2 id102     twoA;twoB second alias A;second alias B
# 3 id103 threeA;threeB   third alias A;third alias B
# 4 id104          four                        fourth
# 5 id105          five                         fifth
^1 - The data.table solution probably performs better on millions of rows; luckily, we don't have that many genes yet :)
Answer 4 (score: 1)
library(stringr)
library(dplyr)
have %>% tbl_df %>% group_by(ID) %>% summarise_each(funs(str_c(., collapse="; ")))
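Edit 1: tbl_df is probably not needed, and instead of str_c from the stringr package you can use paste (in base). The code above groups by the ID column and then applies the str_c (or paste) function to every remaining column within each group.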
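To make the "group by ID, then collapse each remaining column" step concrete, here is a rough base R sketch of the same logic without any packages. It is meant as an illustration of the idea, not as what dplyr does internally; `groups`, `rows` and `rolled` are illustrative names.

```r
# Example data from the question
have <- data.frame(
  ID    = paste0("id", c(101, 102, 102, 103, 103, 104, 105)),
  info1 = c("one", "twoA", "twoB", "threeA", "threeB", "four", "five"),
  info2 = c("first", "second alias A", "second alias B",
            "third alias A", "third alias B", "fourth", "fifth"),
  stringsAsFactors = FALSE
)

# Step 1: split the non-ID columns into one small data frame per ID
groups <- split(have[setdiff(names(have), "ID")], have$ID)

# Step 2: within each group, collapse every column into a single string
rows <- lapply(groups, function(g) {
  vapply(g, paste, character(1), collapse = "; ")
})

# Step 3: stack the per-group results back into one data frame
rolled <- data.frame(ID = names(rows), do.call(rbind, rows),
                     row.names = NULL, stringsAsFactors = FALSE)

print(rolled)
```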
Edit 2: Another solution, using the data.table package:
library(data.table)
dtbl <- as.data.table(have)
dtbl[,lapply(.SD, function(x) paste(x,collapse=";")), by=ID]
The above may be faster, especially if you set a key:
setkey(dtbl, ID)
A "hybrid" solution: you can use dplyr syntax on data.tables! For example:
dtbl %>% tbl_dt %>%
group_by(ID) %>%
summarise_each(funs(paste(., collapse="; ")))