data.table left join and aggregate / concatenate / group_concat

Time: 2016-10-05 21:21:08

Tags: r data.table

I have the following tables:

x = data.table(Id=c(1,1,2,3,3,4), Name=c("A", "A", "B", "C", "C", "D"), TxId=c(10, 11, 20, 30, 31, 40))
#Id Name TxId
#1:  1    A   10
#2:  1    A   11
#3:  2    B   20
#4:  3    C   30
#5:  3    C   31
#6:  4    D   40

y = data.table(Name=c("A", "B", "B", "C"), Family=c("A-alpha", "B-beta", "B-gamma", "C-delta"))
#   Name  Family
#1:    A A-alpha
#2:    B  B-beta
#3:    B B-gamma
#4:    C C-delta

I can do the left join and the concatenation, but I want only one output row for each row in x.

# Left join X to Y on Name column
xy = y[x, on="Name"]
#   Name  Family Id TxId
#1:    A A-alpha  1   10
#2:    A A-alpha  1   11
#3:    B  B-beta  2   20
#4:    B B-gamma  2   20
#5:    C C-delta  3   30
#6:    C C-delta  3   31
#7:    D      NA  4   40

# Concatenate Family column
xy[, Family:=paste0(Family, collapse=", "), by=c("Name", "TxId")]
#   Name          Family Id TxId
#1:    A         A-alpha  1   10
#2:    A         A-alpha  1   11
#3:    B B-beta, B-gamma  2   20
#4:    B B-beta, B-gamma  2   20
#5:    C         C-delta  3   30
#6:    C         C-delta  3   31
#7:    D              NA  4   40

How do I get rid of the extra row for B? I want the result to be unique on Id / TxId, i.e.:

#   Name          Family Id TxId
#1:    A         A-alpha  1   10
#2:    A         A-alpha  1   11
#3:    B B-beta, B-gamma  2   20
#5:    C         C-delta  3   30
#6:    C         C-delta  3   31
#7:    D              NA  4   40

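If I do what eddi suggested in a comment:

xy[, .(Family=paste0(Family, collapse=", ")), by=c("Name", "TxId")]

I get the correct result. But if I try to add the other columns, it doesn't work (I get the same result as with the := version):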

xy[, .(Id, Family=paste0(Family, collapse=", ")), by=c("Name", "TxId")]
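
For reference, the difference between the two calls above can be reproduced with the short sketch below. It rebuilds x, y and xy exactly as defined earlier; the names without_id and with_id are only illustrative and do not appear in the original post.

```r
library(data.table)

x <- data.table(Id = c(1, 1, 2, 3, 3, 4),
                Name = c("A", "A", "B", "C", "C", "D"),
                TxId = c(10, 11, 20, 30, 31, 40))
y <- data.table(Name = c("A", "B", "B", "C"),
                Family = c("A-alpha", "B-beta", "B-gamma", "C-delta"))

# Left join x to y on Name; B matches two rows of y, so it is duplicated.
xy <- y[x, on = "Name"]

# Only the aggregate in j: one row per (Name, TxId) group, but no Id column.
without_id <- xy[, .(Family = paste0(Family, collapse = ", ")),
                 by = c("Name", "TxId")]

# Id added to j: within the B group Id has length 2, so the length-1
# Family string is recycled and both rows come back.
with_id <- xy[, .(Id, Family = paste0(Family, collapse = ", ")),
              by = c("Name", "TxId")]

nrow(without_id)
nrow(with_id)
```

The first call returns 6 rows and the second returns 7, which matches the behaviour described above.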

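1 answer:

Answer 0: (score: 1)

Please try

xy[, .(Family = paste0(Family, collapse = ", ")), by = c("Id", "Name", "TxId")]

An attempt at an explanation:
If Id is part of the grouping, it appears only once for each unique value of Id (more precisely, once for each unique combination of Id, Name and TxId). If Id is instead included in the j expression, i.e. .(Id, Family = paste0(Family, collapse = ", ")), then the Id of every row is included in the result set, even though Family is being aggregated.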
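
A minimal, self-contained sketch of the grouping approach follows, using the tables from the question. It also shows one possible alternative that is not part of this answer: collapsing y to one row per Name before joining, so the join cannot duplicate rows of x in the first place. The names res_by, y_agg and res_pre are hypothetical and used only for illustration.

```r
library(data.table)

x <- data.table(Id = c(1, 1, 2, 3, 3, 4),
                Name = c("A", "A", "B", "C", "C", "D"),
                TxId = c(10, 11, 20, 30, 31, 40))
y <- data.table(Name = c("A", "B", "B", "C"),
                Family = c("A-alpha", "B-beta", "B-gamma", "C-delta"))

# Approach from the answer: join first, then put every column that should
# be kept into `by`, leaving only the aggregate in j.
xy <- y[x, on = "Name"]
res_by <- xy[, .(Family = paste0(Family, collapse = ", ")),
             by = c("Id", "Name", "TxId")]

# Alternative (an assumption, not from the answer): aggregate y to one row
# per Name first, then left join; each row of x matches at most one row.
y_agg <- y[, .(Family = paste0(Family, collapse = ", ")), by = "Name"]
res_pre <- y_agg[x, on = "Name"]

nrow(res_by)
nrow(res_pre)
```

Both results have one row per row of x. One difference to be aware of: in res_by the Family value for D is the string "NA", because paste0() converts a missing value to text, whereas in res_pre it stays a true NA because D never reaches paste0().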