Get the sum of a variable if the combination of values in two other columns is unique

Time: 2015-02-06 06:52:28

Tags: r

I have data on senders and receivers, along with the number of emails sent. Toy example:

senders <- c("Mable","Beth", "Beth","Susan","Susan")
receivers <- c("Beth", "Mable", "Susan", "Mable","Beth")
num_email <- c(1,1,2,1,1)

df <- data.frame(senders, receivers, num_email)

senders receivers num_email
Mable      Beth          1
Beth       Mable         1
Beth       Susan         2
Susan      Mable         1
Susan      Beth          1

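I would like to get a data.frame containing the total number of messages for each unique pair. For example, the connection Mable | Beth has a value of 2, because Mable sent Beth one message and Beth sent Mable one message. The resulting data.frame should have only one row for each unique combination of emailers (e.g., only Mable | Beth or Beth | Mable, not both).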

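I have tried various approaches with reshape and data.table, but have had no luck. I would like to avoid creating a unique string such as BethMable and merging on it. Many thanks.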

2 answers:

Answer 0: (score: 4)

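With a base R approach, we can first sort the first two columns row-wise. We use apply with MARGIN=1 to do that, transpose the output, and convert it to a 'data.frame' to create 'df1'. We then use the formula method of aggregate to get the sum of 'num_email' grouped by the first two columns of the transformed dataset.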

df1 <- data.frame(t(apply(df[1:2], 1, sort)), df[3])
aggregate(num_email~., df1, FUN=sum)

#      X1    X2 num_email
# 1  Beth Mable         2
# 2  Beth Susan         3
# 3 Mable Susan         1
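
To see why the row-wise sort makes the two directions of a pair collapse into one group, it can help to inspect the intermediate matrix before aggregating. This is only a sketch on the question's toy data; `stringsAsFactors = FALSE` and the name `sorted_pairs` are my additions, not part of the answer.

```r
# Toy data from the question; stringsAsFactors = FALSE keeps the names as character
df <- data.frame(
  senders   = c("Mable", "Beth", "Beth", "Susan", "Susan"),
  receivers = c("Beth", "Mable", "Susan", "Mable", "Beth"),
  num_email = c(1, 1, 2, 1, 1),
  stringsAsFactors = FALSE
)

# apply() passes each row to sort() and returns the results as columns,
# so t() is needed to get back one row per original row
sorted_pairs <- t(apply(df[1:2], 1, sort))

# After sorting, (Mable, Beth) and (Beth, Mable) are the same row,
# so only three distinct pairs remain
unique(sorted_pairs)
```

Once every row is in the same order, any ordinary group-by (here `aggregate`) treats both directions as one pair.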

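Or, using data.table, we convert the first two columns to character, use unname so that their column names become the defaults 'V1', 'V2', and convert the result to a 'data.table'. Using the lexicographic ordering of the character columns, we create a logical index in i (V1 > V2), assign (:=) the columns in reversed order (.(V2, V1)) for the rows that meet the condition, and get the sum of 'num_email' grouped by 'V1', 'V2'.
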
library(data.table)
dt = do.call(data.table, c(lapply(unname(df[1:2]), as.character), df[3]))
dt[V1 > V2, c("V1", "V2") := .(V2, V1)]
dt[, .(num_email = sum(num_email)), by= .(V1, V2)]

#       V1    V2 num_email
# 1:  Beth Mable         2
# 2:  Beth Susan         3
# 3: Mable Susan         1
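
The conditional swap itself does not depend on data.table. Here is a rough base R sketch of the same idea, assuming plain character vectors; the names `swap`, `tmp` and `totals` are hypothetical and chosen only for illustration.

```r
V1 <- c("Mable", "Beth", "Beth", "Susan", "Susan")   # senders
V2 <- c("Beth", "Mable", "Susan", "Mable", "Beth")   # receivers
num_email <- c(1, 1, 2, 1, 1)

# Logical index of rows where the pair is in "descending" string order
swap <- V1 > V2

# Swap the two names on those rows only, using a temporary copy
tmp      <- V1[swap]
V1[swap] <- V2[swap]
V2[swap] <- tmp

# Every pair is now stored with V1 <= V2, so a plain group-by sums both directions
totals <- aggregate(num_email ~ V1 + V2,
                    data = data.frame(V1, V2, num_email),
                    FUN = sum)
totals
```

String comparison with `>` follows the collation of the current locale, so the direction in which a given pair ends up may differ between systems, but each pair is still stored in only one direction.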

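Or, using dplyr, we convert the columns to character class with mutate_each, then use pmin and pmax to put each pair in a consistent order, group by 'V1', 'V2', and get the sum of 'num_email'. Note that mutate_each() and funs() are deprecated in later dplyr releases; mutate(across(c(senders, receivers), as.character)) is the current equivalent.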

library(dplyr)
df %>%
  mutate_each(funs(as.character), senders, receivers) %>%
  mutate( V1 = pmin(senders, receivers), 
          V2 = pmax(senders, receivers) ) %>%
  group_by(V1, V2) %>%
  summarise(num_email=sum(num_email))

#      V1    V2 num_email
#   (chr) (chr)     (dbl)
# 1  Beth Mable         2
# 2  Beth Susan         3
# 3 Mable Susan         1
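
The step that does the work here is that pmin and pmax operate element-wise on character vectors. A small base R sketch on the question's names, with no dplyr involved; the vector names `first` and `second` are my own.

```r
senders   <- c("Mable", "Beth", "Beth", "Susan", "Susan")
receivers <- c("Beth", "Mable", "Susan", "Mable", "Beth")

# Element-wise "smaller" and "larger" name of each sender/receiver pair
first  <- pmin(senders, receivers)
second <- pmax(senders, receivers)

# Each row now lists the pair in one consistent order
cbind(first, second)
```

Because both vectors must be character for this comparison, the answer converts the columns with as.character first; factor columns would not work with pmin and pmax in the same way.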

Note: the data.table solution was updated by @Frank.

Answer 1: (score: 0)

Another solution:

senders <- c("Mable","Beth", "Beth","Susan","Susan")
receivers <- c("Beth", "Mable", "Susan", "Mable","Beth")
num_email <- c(1,1,2,1,1)

df <- data.frame(senders, receivers, num_email)

# finding unique users
users <- unique(c(senders, receivers))
# generate combinations without repetitions
user_combi <- gtools::combinations(v=users, n=length(users), r=2)

# count the number of mails for each combination
counts <- apply(user_combi, MARGIN=1, FUN=function(x) 
                     sum(df$num_email[ (df$senders %in% x) & (df$receivers %in% x)])
               )

# wrap up in a data.frame
df2 <- data.frame(user_combi, counts)

This gives:

> df2
     X1    X2 counts
1  Beth Mable      2
2  Beth Susan      3
3 Mable Susan      1
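
If installing gtools is not an option, the same pair enumeration can probably be done with combn from the utils package that ships with R. This is a sketch under that assumption, not the answer's own code; it reuses the answer's variable names.

```r
senders   <- c("Mable", "Beth", "Beth", "Susan", "Susan")
receivers <- c("Beth", "Mable", "Susan", "Mable", "Beth")
num_email <- c(1, 1, 2, 1, 1)

# Sorted unique users, then all unordered pairs (one pair per row after t())
users      <- sort(unique(c(senders, receivers)))
user_combi <- t(combn(users, 2))

# For each pair, sum the emails whose sender and receiver both belong to the pair
counts <- apply(user_combi, 1, function(x)
  sum(num_email[senders %in% x & receivers %in% x]))

df2 <- data.frame(user_combi, counts)
df2
```

Unlike the grouping approaches, this enumerates every possible pair of users, so pairs that never exchanged email appear with a count of 0, and the work grows with the number of pairs times the number of rows.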