R: Aggregate from two data frames on a condition

Time: 2015-01-18 14:26:04

Tags: r dataframe conditional-statements aggregate multiple-columns

I have a data frame called "e" that contains posts from a platform, each with a unique entry_id and a member_id:

row.    member_id   entry_id        timestamp
1       1            a              2008-06-09 12:41:00
2       1            b              2008-07-14 18:41:00
3       1            c              2010-07-17 15:40:00
4       2            d              2008-06-09 12:41:00
5       2            e              2008-09-18 10:22:00
6       3            f              2008-10-03 13:36:00

I have another data frame called "c" that contains comments:

row.    member_id   comment_id      timestamp
1       1            I              2007-06-09 12:41:00
2       1            II             2007-07-14 18:41:00
3       1            III            2009-07-17 15:40:00
4       2            IV             2007-06-09 12:41:00
5       2            V              2009-09-18 10:22:00
6       3            VI             2010-10-03 13:36:00

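For each entry, I want to count all the comments the member wrote before posting that entry. So the data frame "e" should look like this. To read this example, just focus on the years. However, the solution should also take minutes into account: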

row.    member_id   entry_id    prev_comment_count  timestamp
1       1            a              2              2008-06-09 12:41:00
2       1            b              2              2008-07-14 18:41:00
3       1            c              3              2010-07-17 15:40:00
4       2            d              1              2008-06-09 12:41:00
5       2            e              1              2008-09-18 10:22:00
6       3            f              0              2008-10-03 13:36:00

I already tried the following function:

functionPrevComments <- function(givE)  nrow(subset
(c, (as.character(givE["member_id"]) == c["member_id"]) & 
(c["timestamp"] <= givE["timestamp"])))

But when I try to apply it, I get the error

"Incompatible methods ("Ops.data.frame", "Ops.factor") for "<=""

When I instead used the "$" operator to refer to the columns I need, I got

"$ operator is invalid for atomic vectors "

How can I apply my function correctly, or is there a better way to solve my problem?

Best regards,

Nicholas

2 answers:

Answer 0: (score: 1)

Here is a slightly different option. Make sure you convert both "timestamp" columns to class POSIXct before running the code.

e$prev_comment_count <- sapply(seq_len(nrow(e)), function(i) {
  nrow(c[c$member_id == e$member_id[i] & c$timestamp < e$timestamp[i], ])
})

e
#  row. member_id entry_id           timestamp prev_comment_count
#1    1         1        a 2008-06-09 12:41:00                  2
#2    2         1        b 2008-07-14 18:41:00                  2
#3    3         1        c 2010-07-17 15:40:00                  3
#4    4         2        d 2008-06-09 12:41:00                  1
#5    5         2        e 2008-09-18 10:22:00                  1
#6    6         3        f 2008-10-03 13:36:00                  0
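
The conversion step mentioned at the top of this answer might look like the sketch below. It rebuilds the example data from the question (without the "row." column) and converts both "timestamp" columns to POSIXct. The "GMT" time zone and the `as.character()` call (a guard in case the columns were read in as factors) are assumptions, not something the question specifies.

```r
# Rebuild the example data from the question; timestamps start out as text.
e <- data.frame(
  member_id = c(1, 1, 1, 2, 2, 3),
  entry_id  = c("a", "b", "c", "d", "e", "f"),
  timestamp = c("2008-06-09 12:41:00", "2008-07-14 18:41:00",
                "2010-07-17 15:40:00", "2008-06-09 12:41:00",
                "2008-09-18 10:22:00", "2008-10-03 13:36:00"),
  stringsAsFactors = FALSE
)
c <- data.frame(
  member_id  = c(1, 1, 1, 2, 2, 3),
  comment_id = c("I", "II", "III", "IV", "V", "VI"),
  timestamp  = c("2007-06-09 12:41:00", "2007-07-14 18:41:00",
                 "2009-07-17 15:40:00", "2007-06-09 12:41:00",
                 "2009-09-18 10:22:00", "2010-10-03 13:36:00"),
  stringsAsFactors = FALSE
)

# Convert both columns to POSIXct so that "<" compares real date-times.
# as.character() guards against factor columns; tz = "GMT" is an assumption.
e$timestamp <- as.POSIXct(as.character(e$timestamp),
                          format = "%Y-%m-%d %H:%M:%S", tz = "GMT")
c$timestamp <- as.POSIXct(as.character(c$timestamp),
                          format = "%Y-%m-%d %H:%M:%S", tz = "GMT")

# Both columns should now be date-time objects.
stopifnot(inherits(e$timestamp, "POSIXct"), inherits(c$timestamp, "POSIXct"))
```

With the columns converted, the `sapply` call above compares date-times down to the second, so minutes are taken into account as the question requires.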

Answer 1: (score: 1)

e$type <- "entry"
c$type <- "comment"

names(e) <- c("row", "member_id", "action_id", "timestamp", "type")
names(c) <- c("row", "member_id", "action_id", "timestamp", "type")

DF <- rbind(e,c)
DF$timestamp <- as.POSIXct(DF$timestamp, 
                           format = "%Y-%m-%d %H:%M:%S", tz = "GMT")
DF <- DF[order(DF$member_id, DF$timestamp),]
DF$count <- as.integer(ave(DF$type, 
                           DF$member_id, 
                           FUN = function(x) cumsum(x == "comment")))
DF[DF$type == "entry",]

#  row member_id action_id           timestamp  type count
#1   1         1         a 2008-06-09 12:41:00 entry     2
#2   2         1         b 2008-07-14 18:41:00 entry     2
#3   3         1         c 2010-07-17 15:40:00 entry     3
#4   4         2         d 2008-06-09 12:41:00 entry     1
#5   5         2         e 2008-09-18 10:22:00 entry     1
#6   6         3         f 2008-10-03 13:36:00 entry     0

If this is not fast enough, it can be improved with data.table or dplyr.
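
As a rough sketch of what a data.table version of the same stack-sort-cumsum idea could look like (the table names `e_dt`, `c_dt`, and `actions` are made up for illustration, and the data.table package is assumed to be installed):

```r
library(data.table)

# Example data from the question, with timestamps already parsed as POSIXct.
# tz = "GMT" is an assumption, matching the base R code above.
e_dt <- data.table(
  member_id = c(1, 1, 1, 2, 2, 3),
  entry_id  = c("a", "b", "c", "d", "e", "f"),
  timestamp = as.POSIXct(c("2008-06-09 12:41:00", "2008-07-14 18:41:00",
                           "2010-07-17 15:40:00", "2008-06-09 12:41:00",
                           "2008-09-18 10:22:00", "2008-10-03 13:36:00"),
                         tz = "GMT")
)
c_dt <- data.table(
  member_id  = c(1, 1, 1, 2, 2, 3),
  comment_id = c("I", "II", "III", "IV", "V", "VI"),
  timestamp  = as.POSIXct(c("2007-06-09 12:41:00", "2007-07-14 18:41:00",
                            "2009-07-17 15:40:00", "2007-06-09 12:41:00",
                            "2009-09-18 10:22:00", "2010-10-03 13:36:00"),
                          tz = "GMT")
)

# Stack entries and comments into one table with a common id column.
actions <- rbindlist(list(
  e_dt[, .(member_id, action_id = entry_id,   timestamp, type = "entry")],
  c_dt[, .(member_id, action_id = comment_id, timestamp, type = "comment")]
))

# Sort by member and time, then keep a running count of comments per member.
setorder(actions, member_id, timestamp)
actions[, count := cumsum(type == "comment"), by = member_id]

# Keep only the entry rows; count is the number of earlier comments.
result <- actions[type == "entry"]
print(result)
```

One caveat that applies to this sketch: if a comment and an entry by the same member share the exact same timestamp, whether that comment is counted depends on the row order after sorting, so ties may need an explicit rule.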