I have data on email senders and receivers, along with the number of emails sent. A toy example:
senders <- c("Mable","Beth", "Beth","Susan","Susan")
receivers <- c("Beth", "Mable", "Susan", "Mable","Beth")
num_email <- c(1,1,2,1,1)
df <- data.frame(senders, receivers, num_email)
senders receivers num_email
Mable   Beth      1
Beth    Mable     1
Beth    Susan     2
Susan   Mable     1
Susan   Beth      1
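I want a data.frame containing the total number of messages for each unique pair. For example, the connection Mable | Beth has the value 2, because Mable sent Beth one message and Beth sent Mable one message. The resulting data.frame should have only one row per unique combination of people (e.g. only Mable | Beth or Beth | Mable, not both).
I have tried various approaches with reshape and data.table, but without luck. I would like to avoid building a unique string such as BethMable and merging on that. Many thanks.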
Answer 0 (score: 4)
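One option is base R. We first sort the first two columns row by row, using apply with MARGIN=1, then transpose the output and convert it to a 'data.frame' to create 'df1'. After that, the formula method of aggregate gives the sum of 'num_email' grouped by the first two columns of the transformed dataset.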
df1 <- data.frame(t(apply(df[1:2], 1, sort)), df[3])
aggregate(num_email~., df1, FUN=sum)
# X1 X2 num_email
# 1 Beth Mable 2
# 2 Beth Susan 3
# 3 Mable Susan 1
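To make the row-wise sorting step more concrete, here is a small sketch of what t(apply(df[1:2], 1, sort)) returns for the toy data. The names sorted_pairs, same_key_12 and same_key_35 are introduced here for illustration only and are not part of the answer above.

```r
# Toy data from the question
senders   <- c("Mable", "Beth", "Beth", "Susan", "Susan")
receivers <- c("Beth", "Mable", "Susan", "Mable", "Beth")
num_email <- c(1, 1, 2, 1, 1)
df <- data.frame(senders, receivers, num_email)

# apply() with MARGIN = 1 calls sort() on each row and returns one column per
# input row, so t() is needed to get back to one row per email record.
sorted_pairs <- t(apply(df[1:2], 1, sort))

# Both directions of a pair now share the same key:
# rows 1 and 2 are both ("Beth", "Mable"), rows 3 and 5 both ("Beth", "Susan").
same_key_12 <- identical(unname(sorted_pairs[1, ]), unname(sorted_pairs[2, ]))
same_key_35 <- identical(unname(sorted_pairs[3, ]), unname(sorted_pairs[5, ]))
```

Once every row has its two names in alphabetical order, grouping on those two columns no longer distinguishes sender from receiver, which is what lets aggregate add up both directions.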
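Alternatively, with data.table: we convert the first two columns to character class, use unname so that their column names fall back to the defaults 'V1', 'V2', and convert the result to a 'data.table'. Relying on the lexicographic ordering of character columns, we build a logical index in i (V1 > V2), assign (:=) the two columns in reversed order (.(V2, V1)) for the rows that meet the condition, and then get the sum of 'num_email' grouped by 'V1', 'V2'.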
library(data.table)
dt = do.call(data.table, c(lapply(unname(df[1:2]), as.character), df[3]))
dt[V1 > V2, c("V1", "V2") := .(V2, V1)]
dt[, .(num_email = sum(num_email)), by= .(V1, V2)]
# V1 V2 num_email
# 1: Beth Mable 2
# 2: Beth Susan 3
# 3: Mable Susan 1
Or, with dplyr: we use mutate_each to convert the columns to character class, then use pmin and pmax to put each pair in a consistent order, group by 'V1', 'V2', and get the sum of 'num_email'.
library(dplyr)
df %>%
  mutate_each(funs(as.character), senders, receivers) %>%
  mutate(V1 = pmin(senders, receivers),
         V2 = pmax(senders, receivers)) %>%
  group_by(V1, V2) %>%
  summarise(num_email = sum(num_email))
# V1 V2 num_email
# (chr) (chr) (dbl)
# 1 Beth Mable 2
# 2 Beth Susan 3
# 3 Mable Susan 1
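In later dplyr releases, mutate_each and funs are deprecated. A minimal sketch of the same idea, assuming dplyr 1.0.0 or newer (for across() and the .groups argument), might look like the following. The result name res is hypothetical, and this is a variation on the answer rather than the answerer's own code.

```r
library(dplyr)

# Toy data from the question
df <- data.frame(
  senders   = c("Mable", "Beth", "Beth", "Susan", "Susan"),
  receivers = c("Beth", "Mable", "Susan", "Mable", "Beth"),
  num_email = c(1, 1, 2, 1, 1)
)

res <- df %>%
  # make sure both name columns are character (they may be factors in older R)
  mutate(across(c(senders, receivers), as.character)) %>%
  # V1 is the alphabetically smaller name of each pair, V2 the larger one
  mutate(V1 = pmin(senders, receivers),
         V2 = pmax(senders, receivers)) %>%
  group_by(V1, V2) %>%
  # .groups = "drop" returns an ungrouped tibble
  summarise(num_email = sum(num_email), .groups = "drop")
```

The pmin/pmax pair plays the same role as the row-wise sort in the base R version, but works on whole columns at once.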
Note: the data.table solution was updated by @Frank.
Answer 1 (score: 0)
Another solution:
senders <- c("Mable","Beth", "Beth","Susan","Susan")
receivers <- c("Beth", "Mable", "Susan", "Mable","Beth")
num_email <- c(1,1,2,1,1)
df <- data.frame(senders, receivers, num_email)
# finding unique users
users <- unique(c(senders, receivers))
# generate combinations without repetitions
user_combi <- gtools::combinations(v=users, n=length(users), r=2)
# count the number of mails for each combination
counts <- apply(user_combi, MARGIN=1, FUN=function(x)
sum(df$num_email[ (df$senders %in% x) & (df$receivers %in% x)])
)
# wrap up in a data.frame
df2 <- data.frame(user_combi, counts)
This gives:
> df2
X1 X2 counts
1 Beth Mable 2
2 Beth Susan 3
3 Mable Susan 1
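If adding gtools as a dependency is not desirable, the same enumeration can probably be done with combn from base R. The following is a sketch under that assumption, not the answerer's code; it reuses the names users, user_combi, counts and df2 from above.

```r
# Toy data from the question
df <- data.frame(
  senders   = c("Mable", "Beth", "Beth", "Susan", "Susan"),
  receivers = c("Beth", "Mable", "Susan", "Mable", "Beth"),
  num_email = c(1, 1, 2, 1, 1),
  stringsAsFactors = FALSE
)

# sorted unique users, so each pair comes out in alphabetical order
users <- sort(unique(c(df$senders, df$receivers)))

# combn() returns one pair per column; transpose to get one pair per row
user_combi <- t(combn(users, 2))

# for each pair, add up the emails whose sender and receiver are both in it
counts <- apply(user_combi, MARGIN = 1, FUN = function(x)
  sum(df$num_email[(df$senders %in% x) & (df$receivers %in% x)])
)

df2 <- data.frame(user_combi, counts)
```

Because every possible pair is enumerated, a pair that never exchanged email would appear with a count of 0, unlike in the grouping-based answers.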