I have a data frame named "e" that contains posts from a platform, each with a unique entry_id and a member_id:
row. member_id entry_id timestamp
1 1 a 2008-06-09 12:41:00
2 1 b 2008-07-14 18:41:00
3 1 c 2010-07-17 15:40:00
4 2 d 2008-06-09 12:41:00
5 2 e 2008-09-18 10:22:00
6 3 f 2008-10-03 13:36:00
I have another data frame named "c" that contains comments:
row. member_id comment_id timestamp
1 1 I 2007-06-09 12:41:00
2 1 II 2007-07-14 18:41:00
3 1 III 2009-07-17 15:40:00
4 2 IV 2007-06-09 12:41:00
5 2 V 2009-09-18 10:22:00
6 3 VI 2010-10-03 13:36:00
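For each entry, I want to count all the comments the member had written before posting that entry. So data frame "e" should end up looking like this. To read this example it is enough to look at the years; the solution, however, should also take the time down to the minute into account: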
row. member_id entry_id prev_comment_count timestamp
1 1 a 2 2008-06-09 12:41:00
2 1 b 2 2008-07-14 18:41:00
3 1 c 3 2010-07-17 15:40:00
4 2 d 1 2008-06-09 12:41:00
5 2 e 1 2008-09-18 10:22:00
6 3 f 0 2008-10-03 13:36:00
I already tried the following function:
functionPrevComments <- function(givE) nrow(subset
(c, (as.character(givE["member_id"]) == c["member_id"]) &
(c["timestamp"] <= givE["timestamp"])))
But when I try to apply it, I get the error
"Incompatible methods ("Ops.data.frame", "Ops.factor") for "<=""
When I instead use the "$" operator to refer to the columns I need, I get
"$ operator is invalid for atomic vectors "
How can I apply my function correctly, or is there another, better solution to my problem?
Best regards,
Nicolas
Answer 0 (score: 1)
Here is a slightly different option. Make sure both "timestamp" columns are converted to class POSIXct before running the code.
e$prev_comment_count <- sapply(seq_len(nrow(e)), function(i) {
  nrow(c[c$member_id == e$member_id[i] & c$timestamp < e$timestamp[i], ])
})
e
# row. member_id entry_id timestamp prev_comment_count
#1 1 1 a 2008-06-09 12:41:00 2
#2 2 1 b 2008-07-14 18:41:00 2
#3 3 1 c 2010-07-17 15:40:00 3
#4 4 2 d 2008-06-09 12:41:00 1
#5 5 2 e 2008-09-18 10:22:00 1
#6 6 3 f 2008-10-03 13:36:00 0
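The POSIXct conversion mentioned above is the step that is easy to miss, so here is a minimal, self-contained sketch of it. It rebuilds the example data from the question with timestamps stored as text, converts both columns, and then runs the same counting expression. The `tz = "GMT"` setting and the `fmt` helper variable are assumptions made for illustration; use whatever time zone your data was recorded in.

```r
# Example data from the question; timestamps start out as plain text.
e <- data.frame(
  member_id = c(1, 1, 1, 2, 2, 3),
  entry_id  = c("a", "b", "c", "d", "e", "f"),
  timestamp = c("2008-06-09 12:41:00", "2008-07-14 18:41:00",
                "2010-07-17 15:40:00", "2008-06-09 12:41:00",
                "2008-09-18 10:22:00", "2008-10-03 13:36:00"),
  stringsAsFactors = FALSE
)
c <- data.frame(
  member_id  = c(1, 1, 1, 2, 2, 3),
  comment_id = c("I", "II", "III", "IV", "V", "VI"),
  timestamp  = c("2007-06-09 12:41:00", "2007-07-14 18:41:00",
                 "2009-07-17 15:40:00", "2007-06-09 12:41:00",
                 "2009-09-18 10:22:00", "2010-10-03 13:36:00"),
  stringsAsFactors = FALSE
)

# Convert both timestamp columns to POSIXct so that "<" compares
# points in time rather than strings or factor levels.
fmt <- "%Y-%m-%d %H:%M:%S"  # assumed input format
e$timestamp <- as.POSIXct(e$timestamp, format = fmt, tz = "GMT")
c$timestamp <- as.POSIXct(c$timestamp, format = fmt, tz = "GMT")

# For every entry, count the same member's comments with an earlier timestamp.
e$prev_comment_count <- sapply(seq_len(nrow(e)), function(i) {
  nrow(c[c$member_id == e$member_id[i] & c$timestamp < e$timestamp[i], ])
})
```

Because the comparison is strict (`<`), a comment carrying exactly the same timestamp as an entry is not counted; switch to `<=` if such comments should be included.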
Answer 1 (score: 1)
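# Tag each row with its origin so entries and comments can share one table.
e$type <- "entry"
c$type <- "comment"
# Give both data frames identical column names so rbind() can stack them.
names(e) <- c("row", "member_id", "action_id", "timestamp", "type")
names(c) <- c("row", "member_id", "action_id", "timestamp", "type")
DF <- rbind(e, c)
DF$timestamp <- as.POSIXct(DF$timestamp,
                           format = "%Y-%m-%d %H:%M:%S", tz = "GMT")
# Sort each member's actions chronologically.
DF <- DF[order(DF$member_id, DF$timestamp), ]
# Running number of comments per member; on an entry row this is the
# number of comments written before that entry.
DF$count <- as.integer(ave(DF$type,
                           DF$member_id,
                           FUN = function(x) cumsum(x == "comment")))
DF[DF$type == "entry", ]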
# row member_id action_id timestamp type count
#1 1 1 a 2008-06-09 12:41:00 entry 2
#2 2 1 b 2008-07-14 18:41:00 entry 2
#3 3 1 c 2010-07-17 15:40:00 entry 3
#4 4 2 d 2008-06-09 12:41:00 entry 1
#5 5 2 e 2008-09-18 10:22:00 entry 1
#6 6 3 f 2008-10-03 13:36:00 entry 0
If this is not fast enough, it can be improved using data.table or dplyr.
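Before adding a package dependency, a base-R variant may already be fast enough. The sketch below is not the data.table or dplyr approach; it swaps in a different technique, `findInterval()` on each member's sorted comment times, which replaces the per-row subsetting with one binary search per entry. The function name `count_prev_comments` is hypothetical, and the sketch assumes both timestamp columns are already POSIXct and that your R version supports the `left.open` argument of `findInterval()`.

```r
# Example data from the question, with timestamps already as POSIXct
# (tz = "GMT" is an assumption).
e <- data.frame(
  member_id = c(1, 1, 1, 2, 2, 3),
  entry_id  = c("a", "b", "c", "d", "e", "f"),
  timestamp = as.POSIXct(c("2008-06-09 12:41:00", "2008-07-14 18:41:00",
                           "2010-07-17 15:40:00", "2008-06-09 12:41:00",
                           "2008-09-18 10:22:00", "2008-10-03 13:36:00"),
                         tz = "GMT")
)
c <- data.frame(
  member_id  = c(1, 1, 1, 2, 2, 3),
  comment_id = c("I", "II", "III", "IV", "V", "VI"),
  timestamp  = as.POSIXct(c("2007-06-09 12:41:00", "2007-07-14 18:41:00",
                            "2009-07-17 15:40:00", "2007-06-09 12:41:00",
                            "2009-09-18 10:22:00", "2010-10-03 13:36:00"),
                          tz = "GMT")
)

# Hypothetical helper: for each row of `entries`, count the rows of
# `comments` from the same member with a strictly earlier timestamp.
count_prev_comments <- function(entries, comments) {
  # Sorted numeric comment times, one vector per member.
  comment_times <- lapply(
    split(as.numeric(comments$timestamp), comments$member_id), sort
  )
  # Row indices of the entries, grouped by member.
  rows_by_member <- split(seq_len(nrow(entries)), entries$member_id)

  counts <- integer(nrow(entries))
  for (m in names(rows_by_member)) {
    times <- comment_times[[m]]
    if (is.null(times)) next  # member never commented: count stays 0
    rows <- rows_by_member[[m]]
    # left.open = TRUE returns the number of comment times strictly
    # below each entry time.
    counts[rows] <- findInterval(as.numeric(entries$timestamp[rows]),
                                 times, left.open = TRUE)
  }
  counts
}

e$prev_comment_count <- count_prev_comments(e, c)
```

As in the answers above, a comment with exactly the same timestamp as an entry is not counted. If this is still too slow on your data, the data.table or dplyr route remains the next thing to try.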