Question

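Stuck again, and hoping someone more clued-up can provide some pointers ;o)

I have a dataset: 3,270 rows of datePublished (2013-04-01 to 2014-03-31) and domain (coindesk, forbes, mashable, nytimes, reuters, techcrunch, thenextweb & theverge). A copy of it is here.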

> df <- read.csv("dplyr_summary_example.csv")
> head(df)
  datePublished  domain
1 2013-04-01     coindesk
2 2013-04-01     coindesk
3 2013-04-13     coindesk
4 2013-04-15     coindesk
5 2013-04-15     coindesk

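df basically has one date/domain row for every story that was published.

What I want to do is create a new data frame that looks a bit like this (the numbers are made up for illustration)...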

datePublished  coindeskStories  forbesStories... thevergeStories totalStories
2013-04-01     2                1                1               4 
2013-04-13     1                1                0               2
2013-04-15     2                0                1               3

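So for every date in df I would like a column of story counts for each domain, and finally a total-of-totals column (the total of totals is easy).

I've been looking at dplyr, and it certainly looks like it can do the job, but so far I haven't managed to do this in one step.

Doing it for each domain and then joining the results is fairly straightforward: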

library(dplyr)

daily        <- group_by(df,datePublished) # group stories by date

cnt.nytimes  <- filter(daily, domain=="nytimes")  # filter just the nytimes ones
cnt.nytimes  <- summarise(cnt.nytimes,nytimesStories=n()) # give table of stories by date

cnt.mashable <- filter(daily, domain=="mashable")
cnt.mashable <- summarise(cnt.mashable,mashableStories=n())

df.Stories   <- full_join(cnt.nytimes,cnt.mashable,by="datePublished") # join cnt. dataframes by datePublished
df.Stories   <- arrange(df.Stories,datePublished) #sort by datePublished

df.Stories$totalStories <- apply(df.Stories[c(2:3)],1,sum,na.rm=TRUE) #add a totals column

But doing this for every domain and then joining the results seems inefficient.

Can anyone suggest a simpler route?

Answer 1

What about reshape2::dcast?

require(reshape2)
res <- dcast(df, datePublished ~ domain, value.var = "domain", fun.aggregate = length)

Result:

> head(res)
  datePublished coindesk forbes mashable nytimes reuters techcrunch thenextweb theverge
1    2013-04-01        2      2        0       0       0          1          0        2
2    2013-04-02        0      1        1       0       0          0          0        0
3    2013-04-03        0      3        1       0       0          2          0        0
4    2013-04-04        0      0        0       0       0          1          1        1
5    2013-04-05        0      1        0       0       0          1          1        1
6    2013-04-07        0      1        0       1       0          1          0        0
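The question also asks for a totalStories column. A minimal sketch of adding one to a wide count table shaped like res, using rowSums. The toy data below is made up, and base R's table() is used here only as a dependency-free stand-in for the dcast call, so treat it as an illustration rather than part of the original answer:

```r
# Toy data in the same shape as the question's df (values are made up).
df <- data.frame(
  datePublished = c("2013-04-01", "2013-04-01", "2013-04-01",
                    "2013-04-13", "2013-04-15"),
  domain        = c("coindesk", "coindesk", "forbes",
                    "coindesk", "theverge")
)

# Cross-tabulate date x domain; as.data.frame.matrix keeps the wide layout.
counts <- as.data.frame.matrix(table(df$datePublished, df$domain))

# Move the dates from the row names into a column, like dcast's output.
res <- data.frame(datePublished = rownames(counts), counts, row.names = NULL)

# Row-wise total over every column except datePublished.
res$totalStories <- rowSums(res[, -1])
res
```

The same rowSums line should work unchanged on the dcast result, since its first column is datePublished and the remaining columns are the per-domain counts.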

Comment: if you want datePublished to be a Date rather than a factor, run this after read.csv:

df$datePublished <- as.Date(as.character(df$datePublished))

Answer 2

To change to wide format you need to use tidyr in addition to dplyr. Something like:

library(dplyr)
library(tidyr)

df %>% 
    group_by(datePublished, domain) %>%
    summarise(nstories = n()) %>%
    spread(domain, nstories)
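Note that spread() leaves NA where a domain has no stories on a given date, and that tidyr has since superseded spread() with pivot_wider(). A hedged sketch of the same idea with the newer verbs, plus the totalStories column from the question. It assumes a reasonably recent dplyr and tidyr (for the name argument of count() and a scalar values_fill), and the toy data is made up:

```r
library(dplyr)
library(tidyr)

# Toy data in the same shape as the question's df (values are made up).
df <- data.frame(
  datePublished = c("2013-04-01", "2013-04-01", "2013-04-01",
                    "2013-04-13", "2013-04-15"),
  domain        = c("coindesk", "coindesk", "forbes",
                    "coindesk", "theverge")
)

wide <- df %>%
  # One row per date/domain pair with its story count.
  count(datePublished, domain, name = "nstories") %>%
  # One column per domain; missing combinations become 0 instead of NA.
  pivot_wider(names_from = domain, values_from = nstories, values_fill = 0L)

# Row-wise total over every column except datePublished.
wide$totalStories <- rowSums(wide[, -1])
wide
```

Filling with 0 rather than NA means the total can be computed without na.rm = TRUE.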

Answer 3

Why not use ?aggregate and ?summary?

I couldn't download your data. However, this might help you:

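set.seed(12)
n <- 10000
date <- sample(1:100, n, replace=TRUE)
# factor() so that summary() returns per-level counts (needed since R 4.0,
# where data.frame() no longer converts strings to factors)
type <- factor(sample(letters[1:5], n, replace=TRUE))
dat <- data.frame(date=date, type=type) # named dat to avoid masking sample()

temp <- dat[dat$date==1,] # rows for a single date, for inspection
aggregate(type ~ date, data=dat, FUN=summary) # counts of each type per date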
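The aggregate() call above returns the counts as a single matrix column rather than as separate columns. A small sketch of flattening it into ordinary columns and adding a row total follows; the names dat, agg, flat and total are illustrative, and type is assumed to be a factor so that summary() yields per-level counts:

```r
set.seed(12)
n <- 10000
# Made-up sample data, as in the answer above.
dat <- data.frame(
  date = sample(1:100, n, replace = TRUE),
  type = factor(sample(letters[1:5], n, replace = TRUE))
)

agg <- aggregate(type ~ date, data = dat, FUN = summary)

# agg$type is a matrix column; do.call(data.frame, ...) expands it
# into separate columns named type.a ... type.e.
flat <- do.call(data.frame, agg)

# Total per date across all types.
flat$total <- rowSums(agg$type)
head(flat)
```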

dplyr? - Looking for a more efficient way of summarising data
