What is a better way to summarise this data frame?

Time: 2016-01-01 03:05:27

Tags: r dplyr tidyr magrittr

I have a data frame like this:

  message.id sender recipient
1          1      A         B
2          1      A         C
3          2      A         B
4          3      B         C
5          3      B         D
6          3      B         Q

I want to summarise it by counting how often each value appears in the sender and recipient columns:

  address messages.sent messages.received
1       A             3                 0
2       B             3                 2
3       C             0                 2
4       D             0                 1
5       Q             0                 1

I have working code, but it's messy, and I'm hoping there's a way to do it all in a single magrittr chain rather than what I've done below:

library(dplyr)  # provides %>%, group_by, summarise, mutate, select

df <- data.frame(message.id = c(1,1,2,3,3,3),
                 sender = c("A","A","A","B","B","B"),
                 recipient = c("B","C","B","C","D","Q"))
sent <- df %>% 
  group_by(sender) %>%
  summarise(messages.sent = n()) %>%
  mutate(address = sender) %>%
  select(address, messages.sent)

received <- df %>% 
  group_by(recipient) %>%
  summarise(messages.received = n()) %>%
  mutate(address = recipient) %>%
  select(address, messages.received)

df_summary <- merge(sent, received, all = TRUE) %>%
  replace(is.na(.), 0)

4 answers:

Answer 0: (score: 6)

We can use melt/dcast

library(reshape2)
# df1 is the data frame from the question (called df there)
dcast(melt(df1, id.var='message.id'), value~variable, 
                 value.var='message.id', length)

Or use the wrapper recast

recast(df1, id.var='message.id', value~variable, length)
#    value sender recipient
#1     A      3         0
#2     B      3         2
#3     C      0         2
#4     D      0         1
#5     Q      0         1

If we need to use dplyr/tidyr

library(dplyr)
library(tidyr)
gather(df1, messages, address, 2:3) %>%
          group_by(messages, address) %>%
          summarise(n=n()) %>% 
          spread(messages, n, fill=0)
#     address sender recipient
#     (chr)  (dbl)     (dbl)
#1       A      3         0
#2       B      3         2
#3       C      0         2
#4       D      0         1
#5       Q      0         1
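
In current tidyr, gather() and spread() are superseded by pivot_longer() and pivot_wider(). A minimal sketch of the same chain with the newer functions, assuming tidyr 1.0.0 or later and the data from the question; the final select() that renames the columns to messages.sent and messages.received is an addition here, not part of the answer above:

```r
library(dplyr)
library(tidyr)

# Data from the question
df <- data.frame(message.id = c(1, 1, 2, 3, 3, 3),
                 sender = c("A", "A", "A", "B", "B", "B"),
                 recipient = c("B", "C", "B", "C", "D", "Q"),
                 stringsAsFactors = FALSE)

df_summary <- df %>%
  # Stack sender and recipient into one address column, keeping the role
  pivot_longer(c(sender, recipient), names_to = "role", values_to = "address") %>%
  # One row per (address, role) pair with its count in column n
  count(address, role) %>%
  # Back to wide form; pairs that never occur are filled with 0
  pivot_wider(names_from = role, values_from = n, values_fill = list(n = 0)) %>%
  # Rename and order the columns as requested in the question
  select(address, messages.sent = sender, messages.received = recipient)

df_summary
```

This keeps everything in one chain, which is what the question asked for.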

Answer 1: (score: 3)

If you are doing some kind of network analysis, then igraph might be useful

library(igraph)

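# dat is the data frame from the question (called df there);
# columns 2:3 are sender and recipient, read as directed edges
g <- graph_from_data_frame(dat[c(2:3)])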

data.frame(address = V(g)$name,
           sent    = degree(g, mode="out"),
           rec     = degree(g, mode="in"))

#   address sent rec
# A       A    3   0
# B       B    3   2
# C       C    0   2
# D       D    0   1
# Q       Q    0   1

igraph also supports pipes, if you like that sort of thing.

There is also a base R effort (I know it's not what you asked for)

lvs <- unique(unlist(dat[2:3])) 
sapply(dat[2:3], function(x) table(factor(x, levels=lvs)))
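
The sapply call above returns a matrix of counts rather than a data frame. One way to get from there to the layout in the question is sketched below; the names counts and res are made up for illustration, and stringsAsFactors = FALSE is set explicitly so the result is the same on older R versions:

```r
# Data from the question (named dat to match the answer above)
dat <- data.frame(message.id = c(1, 1, 2, 3, 3, 3),
                  sender = c("A", "A", "A", "B", "B", "B"),
                  recipient = c("B", "C", "B", "C", "D", "Q"),
                  stringsAsFactors = FALSE)

# Every address that appears as a sender or a recipient
lvs <- sort(unique(unlist(dat[2:3])))

# Count each column against the full set of levels, so that
# addresses missing from a column are counted as 0
counts <- sapply(dat[2:3], function(x) as.vector(table(factor(x, levels = lvs))))

# Assemble the requested layout
res <- data.frame(address = lvs,
                  messages.sent = counts[, "sender"],
                  messages.received = counts[, "recipient"],
                  stringsAsFactors = FALSE)
res
```

Using a shared set of factor levels is what makes the zero counts appear without a separate fill step.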

Answer 2: (score: 2)

Using dplyr and tidyr, you can do the following:

library(dplyr)
library(tidyr)
df <- data.frame(message.id = c(1,1,2,3,3,3),
                 sender = c("A","A","A","B","B","B"),
                 recipient = c("B","C","B","C","D","Q"), stringsAsFactors = FALSE)
df %>%
  gather(sender, recipient, -message.id) %>%
  group_by(recipient) %>%
  summarise(messages.sent = sum(sender == 'sender'),
            messages.received = sum(sender == 'recipient'))

Source: local data frame [5 x 3]

  recipient messages.sent messages.received
      (chr)         (int)             (int)
1         A             3                 0
2         B             3                 2
3         C             0                 2
4         D             0                 1
5         Q             0                 1

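You can change the first column name to the desired one as follows. This must be applied to the summarised result, not to df itself (where it would rename message.id), so the result above is assumed to have been assigned to a variable, here called res:

names(res)[1] <- 'address'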

Answer 3: (score: 0)

An alternative using aggregate and merge from base R. Finally, we replace the NAs with 0 and rename the columns with the desired names.

summary <- merge(aggregate(message.id ~ sender, data = df, length), 
                  aggregate(message.id ~ recipient, data = df, length), 
                  by.x = "sender", 
                  by.y = "recipient", 
                  all = TRUE)
summary[is.na(summary)] <- 0
colnames(summary) <- c("address", "sent", "received")
summary

Output:

  address sent received
1       A    3        0
2       B    3        2
3       C    0        2
4       D    0        1
5       Q    0        1
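
The all = TRUE argument is what keeps addresses that appear on only one side. A small sketch with the question's data, comparing it with the default inner join; the names sent, received, inner and outer are made up for illustration:

```r
# Data from the question
df <- data.frame(message.id = c(1, 1, 2, 3, 3, 3),
                 sender = c("A", "A", "A", "B", "B", "B"),
                 recipient = c("B", "C", "B", "C", "D", "Q"),
                 stringsAsFactors = FALSE)

# Per-address counts of sent and received messages
sent <- aggregate(message.id ~ sender, data = df, length)
received <- aggregate(message.id ~ recipient, data = df, length)

# Default merge is an inner join: only addresses present on both sides survive
inner <- merge(sent, received, by.x = "sender", by.y = "recipient")

# all = TRUE gives a full outer join: one-sided addresses are kept with NA counts
outer <- merge(sent, received, by.x = "sender", by.y = "recipient", all = TRUE)

nrow(inner)
nrow(outer)
```

With this data only B both sends and receives, so the inner join drops the other four addresses, while the outer join keeps them and leaves NA where a count is missing.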