Question

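I have a large data frame of web forum comments and their associated metadata. I'd like to merge all rows where (a) the username and (b) the thread ID are the same (i.e., I want each row to represent a user's total participation in a given thread). For the merged row I'd want the earliest date; the dates in my data frame are currently in DD-MM-YY format. The formatting and word order of the merged text don't need to look pretty for the type of analysis I'm doing.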

For example,

old <- rbind(c(1, "hello", "bob", "Sept1"), c(1, "world", "bob", "Sept2"), c(1, "hey there", "mary", "Sept1"), c(2, "to be or", "ted", "Aug1"), c(2, "sample text", "mary", "Aug1"), c(2, "not to be", "ted", "Sept3"))
colnames(old) <- c("thread", "comment", "user", "date")
old

     thread comment       user   date   
[1,] "1"    "hello"       "bob"  "Sept1"
[2,] "1"    "world"       "bob"  "Sept2"
[3,] "1"    "hey there"   "mary" "Sept1"
[4,] "2"    "to be or"    "ted"  "Aug1" 
[5,] "2"    "sample text" "mary" "Aug1" 
[6,] "2"    "not to be"   "ted"  "Sept3"

needs to look like:

     thread comment              user   date   
[1,] "1"    "hello world"        "bob"  "Sept1"
[2,] "1"    "hey there"          "mary" "Sept1"
[3,] "2"    "to be or not to be" "ted"  "Aug1" 
[4,] "2"    "sample text"        "mary" "Aug1"

Thanks!

Answer 1

Using data.table:

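library(data.table)
old <- data.table(old)  # convert the character matrix from the question
# Paste each user's comments within a thread together and keep the smallest date.
# Note: date is still character here, so min() compares text; parse real dates first.
print(old[j  = .(comment = paste(comment, collapse = ' '),
             date = min(date)),
          by = .(user, thread)])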
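
One caveat worth illustrating: if the date column is still character, min() compares the values as text, which does not generally match chronological order for DD-MM-YY strings. The following is a minimal base R sketch; the two dates are made-up values chosen only to show the difference, not data from the question.

```r
# Two hypothetical DD-MM-YY dates: 15 Dec 2015 and 2 Jan 2016
d <- c("15-12-15", "02-01-16")

# Compared as text, "02-01-16" sorts first even though it is the later date
text_min <- min(d)

# Parse to Date so that min() compares chronologically
parsed   <- as.Date(d, format = "%d-%m-%y")
date_min <- min(parsed)

print(text_min)  # "02-01-16"
print(date_min)  # "2015-12-15"
```

Converting the column with as.Date(date, format = "%d-%m-%y") before grouping should make min(date) return the earliest date in each group.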

Answer 2

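In my experience, the only base R aggregation function that can flexibly perform heterogeneous multi-column aggregation (a different summary for each column) is by(). Unfortunately, because by() returns its result as a list, this requires the abominable do.call(rbind,...) trick, and the whole thing ends up rather ugly and slow. But for those committed to avoiding add-on packages, here is how it can be done: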

df <- data.frame(thread=c(1,1,1,2,2,2),comment=c('hello','world','hey there','to be or','sample text','not to be'),user=c('bob','bob','mary','ted','mary','ted'),date=c('01-09-15','02-09-15','01-09-15','01-08-15','01-08-15','03-09-15'),stringsAsFactors=F); ## define data.frame input
df$date <- as.Date(df$date,'%d-%m-%y'); ## coerce to Date type
df <- df[order(df$date),]; ## ensure sorted by date
keys <- c('thread','user'); ## precompute key columns
res <- do.call(rbind,by(df,df[keys],function(g)
    cbind(g[1L,keys],comment=paste(collapse=' ',g$comment),date=g$date[1L])
));
res;
##   thread user            comment       date
## 1      1  bob        hello world 2015-09-01
## 3      1 mary          hey there 2015-09-01
## 5      2 mary        sample text 2015-08-01
## 4      2  ted to be or not to be 2015-08-01
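
For comparison, here is a sketch of another base R route that may be easier to read: run one aggregate() call per column, each with its own summary function, then join the two results with merge(). This is not part of the answer above; the names comments, dates and res2 are made up for illustration, and the approach makes one pass over the data per aggregated column.

```r
# Same sample data as above, with dates already parsed
df <- data.frame(
  thread  = c(1, 1, 1, 2, 2, 2),
  comment = c("hello", "world", "hey there", "to be or", "sample text", "not to be"),
  user    = c("bob", "bob", "mary", "ted", "mary", "ted"),
  date    = as.Date(c("01-09-15", "02-09-15", "01-09-15",
                      "01-08-15", "01-08-15", "03-09-15"), "%d-%m-%y"),
  stringsAsFactors = FALSE
)

# One aggregate() per column, each with its own summary function
comments <- aggregate(comment ~ thread + user, data = df, FUN = paste, collapse = " ")
dates    <- aggregate(date ~ thread + user, data = df, FUN = min)

# Join the two summaries on the grouping keys
res2 <- merge(comments, dates, by = c("thread", "user"))
print(res2)
```

Comments are pasted in the row order of df, so sort by date first if the order of the merged text matters.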

Answer 3

With dplyr:

library(dplyr)

data.frame(old) %>% 
    # parse dates to useful format
    mutate(date = as.Date(paste(substr(date, 1, 3), 
                                gsub('[^0-9]', '', date), 
                                '2016'), 
                          '%b %d %Y')) %>%
    group_by(thread, user) %>% 
    summarise(comment = paste(comment, collapse = ' '), 
              date = min(date))

# Source: local data frame [4 x 4]
# Groups: thread [?]
# 
#   thread   user            comment       date
#   (fctr) (fctr)              (chr)     (date)
# 1      1    bob        hello world 2016-09-01
# 2      1   mary          hey there 2016-09-01
# 3      2   mary        sample text 2016-08-01
# 4      2    ted to be or not to be 2016-08-01

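Note that you need to convert old to a data.frame, because it is currently a character matrix, which is what you get from rbind-ing [partially coerced] character vectors.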
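
A short sketch of that coercion, using only the first two rows of the question's example (old_df is a name chosen here for illustration):

```r
# c() coerces the number 1 to "1", so each row is a character vector
old <- rbind(c(1, "hello", "bob", "Sept1"),
             c(1, "world", "bob", "Sept2"))
colnames(old) <- c("thread", "comment", "user", "date")

print(is.matrix(old))  # TRUE
print(typeof(old))     # "character"

# Convert to a data frame so that column-wise tools can be used
old_df <- as.data.frame(old, stringsAsFactors = FALSE)
print(is.data.frame(old_df))  # TRUE
print(names(old_df))
```

Every column of old_df is still character at this point, including thread, so convert types as needed before further analysis.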

Merge rows in an R data frame by certain columns
