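I need to sort a data table by the user who sent the messages. At the moment the data looks like this:
I want to rearrange the rows so that I can see how many messages two users exchanged with each other. If one user sent a message but the other did not respond, I need the value 0 in the Messages_sent column.
As a next step, I need to calculate the length of the conversation between two users, that is, sum Messages_sent over every two rows.
Please advise how I can rearrange the data table!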
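Answer 0: (score: 0)
With dplyr, this code should produce the table given in the description. If you only want the totals in both directions, though, the first statement (the merge) already contains all the data you need.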
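library(dplyr)

# Self-merge with from_id and to_id swapped, so each row holds both directions
df <- merge(df, df,
            by.x = c("from_id", "to_id"), by.y = c("to_id", "from_id"),
            all.x = TRUE, all.y = TRUE)
# A missing count means no message was sent in that direction
df <- mutate(df, Messages_sent.x = coalesce(Messages_sent.x, 0),
             Messages_sent.y = coalesce(Messages_sent.y, 0))
# Remember the merged row order so both directions of a row stay adjacent
df$row <- 1:nrow(df)
# Stack the two directions, then interleave them by the saved row number
rbind(select(df, -Messages_sent.y) %>%
        rename(Messages_sent = Messages_sent.x),
      select(df, -Messages_sent.x) %>%
        rename(Messages_sent = Messages_sent.y, from_id = to_id, to_id = from_id)
      ) %>% arrange(row) %>% select(-row)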
Answer 1: (score: 0)
Here are the steps using base R functions:
df <- data.frame(from_id=c(624227,624227,624227,624227,624227,624227,667255,667255,667255,7134655,713465),
to_id = c(352731,693915,184455,771100,503940,91558,626814,857601,862512,156874,419242),
message_sent=c(1,6,2,1,1,1,2,7,3,1,1))
# merge dataset together with itself swapping from_id and to_id columns
df.full <- merge(df,df, by.x=c("from_id","to_id"), by.y=c("to_id","from_id"),suffixes = c(".x",".y"), all=TRUE)
# fill missing values with 0
# those records will correspond to all the pairs where
# someone did not send any messages back
df.full[is.na(df.full)] <- 0
# calculate total number of messages for each pair:
df.full$total <- df.full$message_sent.x + df.full$message_sent.y
head(df.full)
# from_id to_id message_sent.x message_sent.y total
# 1 91558 624227 0 1 1
# 2 156874 7134655 0 1 1
# 3 184455 624227 0 2 2
# 4 352731 624227 0 1 1
# 5 419242 713465 0 1 1
# 6 503940 624227 0 1 1
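The sample data above contains no pair in which both users wrote to each other, so every row shows a 0 on one side. As a minimal sketch with hypothetical toy data (the `toy` data frame and its values are made up for illustration), the same base R steps behave like this when one pair is reciprocal and another is not:

```r
# Hypothetical toy data: users 1 and 2 write to each other,
# user 1 also writes to user 3, who never replies.
toy <- data.frame(from_id = c(1, 2, 1),
                  to_id = c(2, 1, 3),
                  message_sent = c(4, 5, 2))

# Self-merge with the key columns swapped, keeping unmatched rows.
toy.full <- merge(toy, toy,
                  by.x = c("from_id", "to_id"),
                  by.y = c("to_id", "from_id"),
                  suffixes = c(".x", ".y"), all = TRUE)

# A missing count means no message was sent in that direction.
toy.full[is.na(toy.full)] <- 0
toy.full$total <- toy.full$message_sent.x + toy.full$message_sent.y
print(toy.full)
```

In this sketch the pair 1 and 2 gets a total of 9 in both of its rows, while the row from user 3 to user 1 has message_sent.x set to 0 because user 3 never sent anything.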
Base R functions can be slow on very large datasets; in that case you can look at the dplyr library, which has similar functions for most of these steps:
library(dplyr)
df.full.2 <- merge(df,df # merge dataframe and switched one
,by.x=c("from_id","to_id"),by.y=c("to_id","from_id")
,all.x=TRUE,all.y=TRUE) %>%
mutate(message_sent.x=coalesce(message_sent.x,0), # replace NAs with 0
message_sent.y=coalesce(message_sent.y,0)) %>%
mutate(total=rowSums(.[3:4])) # calculate total number of messages
head(df.full.2)
# from_id to_id message_sent.x message_sent.y total
#1 91558 624227 0 1 1
#2 156874 7134655 0 1 1
#3 184455 624227 0 2 2
#4 352731 624227 0 1 1
#5 419242 713465 0 1 1
#6 503940 624227 0 1 1
If it matters that the two records of each pair sit next to each other, you can also add the following code:
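df.full.3 <- df.full.2 %>%
  # zero-pad both IDs to 7 digits (the widest ID in this sample) so that
  # the string key sorts in numeric order
  mutate(pair.id = sprintf("%07d%07d", pmin(from_id, to_id),
                           pmax(from_id, to_id))) %>%
  arrange(pair.id) %>% select(-pair.id)
head(df.full.3)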
# from_id to_id message_sent.x message_sent.y total
#1 91558 624227 0 1 1
#2 624227 91558 1 0 1
#3 156874 7134655 0 1 1
#4 7134655 156874 1 0 1
#5 184455 624227 0 2 2
#6 624227 184455 2 0 2
There is also the data.table package, which is likewise very efficient on very large datasets:
library(data.table)
# convert dataframe to datatable
setDT(df)
df.full <- merge(df,df, by.x=c("from_id","to_id"), by.y=c("to_id","from_id"),
suffixes = c(".x",".y"), all=TRUE)
# substitute NAs with zeros
for (j in 3:4) set(df.full, which(is.na(df.full[[j]])), j, 0)
# calculate the total number of messages
df.full[, total:=message_sent.x+message_sent.y]
head(df.full)
# from_id to_id message_sent.x message_sent.y total
# 1: 91558 624227 0 1 1
# 2: 156874 7134655 0 1 1
# 3: 184455 624227 0 2 2
# 4: 352731 624227 0 1 1
# 5: 419242 713465 0 1 1
# 6: 503940 624227 0 1 1
Depending on the size of the dataset, one of these approaches may be more efficient than the others.
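The merged tables above list every pair twice, once per direction. If only the conversation length per pair is needed, as in the second step of the question, one possible base R sketch is to build an order-independent key with pmin and pmax and sum over it. The `msgs` data frame and the column names `user_a`, `user_b` and `conversation_length` are hypothetical and chosen only for illustration:

```r
# Hypothetical sample in the same shape as the data frame used above.
msgs <- data.frame(from_id = c(1, 2, 1, 4),
                   to_id = c(2, 1, 3, 1),
                   message_sent = c(4, 5, 2, 7))

# Order-independent pair key: the smaller ID always goes first.
msgs$user_a <- pmin(msgs$from_id, msgs$to_id)
msgs$user_b <- pmax(msgs$from_id, msgs$to_id)

# Sum the counts of both directions into one row per conversation.
conversations <- aggregate(message_sent ~ user_a + user_b,
                           data = msgs, FUN = sum)
names(conversations)[3] <- "conversation_length"
print(conversations)
```

This gives one row per pair of users, so pairs with a one-sided conversation need no explicit zero-fill: the missing direction simply contributes nothing to the sum.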