Cumulative concatenation of a column by group

Time: 2016-01-13 22:39:08

Tags: r string aggregation

Suppose I have this input:

             ID     date_1      date_2     str
1            1    2010-07-04  2008-01-20   A
2            2    2015-07-01  2011-08-31   C
3            3    2015-03-06  2013-01-18   D
4            4    2013-01-10  2011-08-30   D
5            5    2014-06-04  2011-09-18   B
6            5    2014-06-04  2011-09-18   B
7            6    2012-11-22  2011-09-28   C
8            7    2014-06-17  2013-08-04   A
10           7    2014-06-17  2013-08-04   B
11           7    2014-06-17  2013-08-04   B

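I want to cumulatively concatenate the values of the str column within each group defined by ID, as shown in the following output: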

             ID     date_1      date_2     str
1            1    2010-07-04  2008-01-20   A
2            2    2015-07-01  2011-08-31   C
3            3    2015-03-06  2013-01-18   D
4            4    2013-01-10  2011-08-30   D
5            5    2014-06-04  2011-09-18   B
6            5    2014-06-04  2011-09-18   B,B
7            6    2012-11-22  2011-09-28   C
8            7    2014-06-17  2013-08-04   A
10           7    2014-06-17  2013-08-04   A,B
11           7    2014-06-17  2013-08-04   A,B,B

I tried using the ave() function with this code:

within(table, {
  Emp_list <- ave(str, ID, FUN = function(x) paste(x, collapse = ","))
})

But it gives the following output, which is not what I want:

         ID      date_1     date_2      str
1         1    2010-07-04 2008-01-20     A
2         2    2015-07-01 2011-08-31     C
3         3    2015-03-06 2013-01-18     D
4         4    2013-01-10 2011-08-30     D
5         5    2014-06-04 2011-09-18     B,B
6         5    2014-06-04 2011-09-18     B,B
7         6    2012-11-22 2011-09-28     C
8         7    2014-06-17 2013-08-04     A,B,B
10        7    2014-06-17 2013-08-04     A,B,B
11        7    2014-06-17 2013-08-04     A,B,B

Of course, I want to avoid loops, because I am working with a large database.

2 answers:

Answer 0: (score: 9)

How about ave() with Reduce()? The Reduce() function allows us to accumulate results as they are calculated. So if we run it with paste(), we can accumulate the pasted strings.

f <- function(x) {
  Reduce(function(...) paste(..., sep = ", "), x, accumulate = TRUE)
}
df$str <- with(df, ave(as.character(str), ID, FUN = f))

This gives the updated data frame df:

   ID     date_1     date_2     str
1   1 2010-07-04 2008-01-20       A
2   2 2015-07-01 2011-08-31       C
3   3 2015-03-06 2013-01-18       D
4   4 2013-01-10 2011-08-30       D
5   5 2014-06-04 2011-09-18       B
6   5 2014-06-04 2011-09-18    B, B
7   6 2012-11-22 2011-09-28       C
8   7 2014-06-17 2013-08-04       A
10  7 2014-06-17 2013-08-04    A, B
11  7 2014-06-17 2013-08-04 A, B, B

Note: function(...) paste(..., sep = ", ") could also be written as a two-argument function such as function(x, y) paste(x, y, sep = ", "). (Thanks to Pierre Lafortune)
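
To see what Reduce() with accumulate = TRUE contributes, it may help to run it on a single group first and then hand the same helper to ave(). This is only a minimal sketch: the small data frame, the helper name cumpaste, and the column name cum_str are made up for illustration and are not part of the question's data.

```r
# Toy values standing in for the str column of one ID group (illustrative).
x <- c("A", "B", "B")

# accumulate = TRUE keeps every intermediate result of the left fold;
# length-one results are simplified to a plain character vector.
acc <- Reduce(function(a, b) paste(a, b, sep = ", "), x, accumulate = TRUE)
print(acc)

# Hypothetical helper wrapping the same fold so ave() can apply it per group.
cumpaste <- function(v) {
  Reduce(function(a, b) paste(a, b, sep = ", "), v, accumulate = TRUE)
}

# Small made-up data frame with one singleton group and two larger groups.
df <- data.frame(
  ID  = c(1, 5, 5, 7, 7, 7),
  str = c("A", "B", "B", "A", "B", "B"),
  stringsAsFactors = FALSE
)

# ave() splits str by ID, applies cumpaste to each piece, and puts the
# results back in the original row order.
df$cum_str <- ave(df$str, df$ID, FUN = cumpaste)
print(df)
```

Because the helper returns one value per input element, ave() can place the results back row by row. The paste(x, collapse = ",") call in the question returns a single string per group, which ave() then recycles across all rows of that group.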

Answer 1: (score: 8)

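Here is a possible solution combining data.table with an inner tapply that seems to give you what you need (you could use paste instead of toString if you like; toString just looks cleaner to me).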

library(data.table)
setDT(df)[, Str := tapply(str[sequence(1:.N)], rep(1:.N, 1:.N), toString), by = ID]
df
#     ID     date_1     date_2 str     Str
#  1:  1 2010-07-04 2008-01-20   A       A
#  2:  2 2015-07-01 2011-08-31   C       C
#  3:  3 2015-03-06 2013-01-18   D       D
#  4:  4 2013-01-10 2011-08-30   D       D
#  5:  5 2014-06-04 2011-09-18   B       B
#  6:  5 2014-06-04 2011-09-18   B    B, B
#  7:  6 2012-11-22 2011-09-28   C       C
#  8:  7 2014-06-17 2013-08-04   A       A
#  9:  7 2014-06-17 2013-08-04   B    A, B
# 10:  7 2014-06-17 2013-08-04   B A, B, B

You could improve this a bit by building the index vector only once:

setDT(df)[, Str := {Len <- 1:.N ; tapply(str[sequence(Len)], rep(Len, Len), toString)}, by = ID]
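
The indexing trick inside the tapply() call can be hard to read at first, so here is a base R sketch of what it does for one group of size 3. The vector x and the variable names idx, grp and res are illustrative only; inside the data.table call, .N plays the role of n.

```r
# Illustrative values for one ID group; n corresponds to .N in data.table.
x <- c("A", "B", "B")
n <- length(x)

# sequence(1:n) concatenates 1, 1:2, ..., 1:n, so each prefix of x is
# listed in full: positions 1 | 1 2 | 1 2 3.
idx <- sequence(1:n)
print(idx)

# rep(1:n, 1:n) labels every position with the prefix it belongs to:
# 1 | 2 2 | 3 3 3.
grp <- rep(1:n, 1:n)
print(grp)

# tapply() then collapses each prefix into one string with toString().
res <- tapply(x[idx], grp, toString)
print(res)
```

Note that the expanded vector x[idx] has n * (n + 1) / 2 elements for a group of size n, so for very large groups the Reduce() approach may use less memory. That trade-off is an observation about the sketch, not something measured here.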