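I have a large data frame of web forum comments and their associated metadata. I want to merge all rows where both (a) the username and (b) the thread ID are the same (i.e., I want each row to represent one user's total participation in a given thread). I'd like to keep the earliest date; the dates in my data frame are currently in DD-MM-YY format. The formatting and word order of the combined text don't need to look pretty for the kind of analysis I'm doing.
For example,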
old <- rbind(c(1, "hello", "bob", "Sept1"), c(1, "world", "bob", "Sept2"), c(1, "hey there", "mary", "Sept1"), c(2, "to be or", "ted", "Aug1"), c(2, "sample text", "mary", "Aug1"), c(2, "not to be", "ted", "Sept3"))
colnames(old) <- c("thread", "comment", "user", "date")
old
thread comment user date
[1,] "1" "hello" "bob" "Sept1"
[2,] "1" "world" "bob" "Sept2"
[3,] "1" "hey there" "mary" "Sept1"
[4,] "2" "to be or" "ted" "Aug1"
[5,] "2" "sample text" "mary" "Aug1"
[6,] "2" "not to be" "ted" "Sept3"
needs to look like:
thread comment user date
[1,] "1" "hello world" "bob" "Sept1"
[2,] "1" "hey there" "mary" "Sept1"
[3,] "2" "to be or not to be" "ted" "Aug1"
[4,] "2" "sample text" "mary" "Aug1"
Thanks!
Answer 0 (score: 3)
Using data.table:
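library(data.table)
old <- data.table(old)
# Collapse each user's comments within a thread and keep the smallest date value
# (date is still character here, so min() compares strings, not calendar dates)
print(old[j = .(comment = paste(comment, collapse = ' '),
                date = min(date)),
          by = .(user, thread)])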
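Because the real data stores dates as DD-MM-YY strings, taking min() of the raw text can pick the wrong row: "01-09-15" sorts before "02-08-15" as text even though 2 August is earlier than 1 September. The following is only a sketch of the same aggregation with the dates parsed first; the sample table dt and its values are made up for illustration and are not the asker's actual data.

```r
library(data.table)

# Hypothetical sample data with DD-MM-YY date strings, as described in the question
dt <- data.table(
  thread  = c(1, 1, 1, 2, 2, 2),
  comment = c("hello", "world", "hey there", "to be or", "sample text", "not to be"),
  user    = c("bob", "bob", "mary", "ted", "mary", "ted"),
  date    = c("01-09-15", "02-09-15", "01-09-15", "01-08-15", "01-08-15", "03-09-15")
)

# Parse the strings so that min() compares calendar dates rather than text
dt[, date := as.Date(date, format = "%d-%m-%y")]

# One row per (user, thread): comments pasted together, earliest date kept
res_dt <- dt[, .(comment = paste(comment, collapse = " "),
                 date    = min(date)),
             by = .(user, thread)]
print(res_dt)
```

If the real column uses a different layout, the format string passed to as.Date() would need to change accordingly.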
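Answer 1 (score: 3)
In my experience, the only base R aggregation function flexible enough to perform a heterogeneous two-column aggregation is by(). Unfortunately, because by() returns its result as a list, this requires the obnoxious do.call(rbind, ...) trick, and the whole thing ends up rather ugly and slow. But for those committed to avoiding add-on packages, here is how it can be done: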
df <- data.frame(thread=c(1,1,1,2,2,2),comment=c('hello','world','hey there','to be or','sample text','not to be'),user=c('bob','bob','mary','ted','mary','ted'),date=c('01-09-15','02-09-15','01-09-15','01-08-15','01-08-15','03-09-15'),stringsAsFactors=FALSE); ## define data.frame input
df$date <- as.Date(df$date,'%d-%m-%y'); ## coerce to Date type
df <- df[order(df$date),]; ## ensure sorted by date
keys <- c('thread','user'); ## precompute key columns
res <- do.call(rbind,by(df,df[keys],function(g)
cbind(g[1L,keys],comment=paste(collapse=' ',g$comment),date=g$date[1L])
));
res;
## thread user comment date
## 1 1 bob hello world 2015-09-01
## 3 1 mary hey there 2015-09-01
## 5 2 mary sample text 2015-08-01
## 4 2 ted to be or not to be 2015-08-01
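For comparison, one possible base R workaround is to aggregate each column separately with aggregate() and then join the pieces with merge(). This is only a sketch under the same sample data as the answer; the names df2, comments, dates and res2 are chosen here for illustration, and this is not claimed to be faster or better than the by() approach.

```r
# Sample data mirroring the answer's input, with dates already parsed
df2 <- data.frame(
  thread  = c(1, 1, 1, 2, 2, 2),
  comment = c("hello", "world", "hey there", "to be or", "sample text", "not to be"),
  user    = c("bob", "bob", "mary", "ted", "mary", "ted"),
  date    = as.Date(c("01-09-15", "02-09-15", "01-09-15",
                      "01-08-15", "01-08-15", "03-09-15"), "%d-%m-%y"),
  stringsAsFactors = FALSE
)

# One aggregate() call per column, each with its own summary function
comments <- aggregate(comment ~ thread + user, data = df2, FUN = paste, collapse = " ")
dates    <- aggregate(date ~ thread + user, data = df2, FUN = min)

# Join the two partial results on the grouping keys
res2 <- merge(comments, dates, by = c("thread", "user"))
res2
```

The cost of this variant is that the data is grouped twice, once per summarised column.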
Answer 2 (score: 1)
With dplyr,
library(dplyr)
data.frame(old) %>%
# parse dates to useful format
mutate(date = as.Date(paste(substr(date, 1, 3),
gsub('[^0-9]', '', date),
'2016'),
'%b %d %Y')) %>%
group_by(thread, user) %>%
summarise(comment = paste(comment, collapse = ' '),
date = min(date))
# Source: local data frame [4 x 4]
# Groups: thread [?]
#
# thread user comment date
# (fctr) (fctr) (chr) (date)
# 1 1 bob hello world 2016-09-01
# 2 1 mary hey there 2016-09-01
# 3 2 mary sample text 2016-08-01
# 4 2 ted to be or not to be 2016-08-01
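Note that you need to turn old into a data.frame, because it is currently a character matrix, which is what you get from rbind-ing other [partially coerced] character vectors.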
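The coercion can be checked directly. The snippet below is a small illustration using two rows of the question's sample data; old_df is a name introduced here, and converting thread back to numeric is an assumption about what the asker would want.

```r
# c() coerces mixed numbers and strings to character,
# so rbind() of those vectors yields a character matrix
old <- rbind(c(1, "hello", "bob", "Sept1"),
             c(1, "world", "bob", "Sept2"))
colnames(old) <- c("thread", "comment", "user", "date")

is.matrix(old)   # TRUE
typeof(old)      # "character"

# Convert to a data.frame and restore the numeric type of the thread column
old_df <- as.data.frame(old, stringsAsFactors = FALSE)
old_df$thread <- as.numeric(old_df$thread)
str(old_df)
```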