Arranging rows in pairs in R based on two columns

Time: 2018-05-23 13:19:04

Tags: r datatable

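I need to sort a data table by the users who sent messages. At the moment, the data looks like this: data

I would like to rearrange the rows so that I can see how many messages each pair of users exchanged with one another. If one user sent a message but the other user never replied, the Messages_sent column should contain 0 for the missing direction.

table

As a next step I need to compute the length of the conversation between two users, that is, sum Messages_sent over every two rows.

Please advise how I can rearrange the data table!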

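2 answers:

Answer 0: (score: 0)

With dplyr, the following code should produce the table given in the description. If what you want is the total across both directions, though, the result of the first line (the merge) already contains all the data you need.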

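library(dplyr)

# Self-merge with from_id/to_id swapped, so each row carries both directions
df <- merge(df,df
  ,by.x=c("from_id","to_id"),by.y=c("to_id","from_id")
  ,all.x=TRUE,all.y=TRUE)
# A missing direction means nobody replied: count it as 0
df <- mutate(df,Messages_sent.x=coalesce(Messages_sent.x,0),
                Messages_sent.y=coalesce(Messages_sent.y,0))
df$row <- 1:nrow(df)
# Stack the two directions and keep the rows of each pair next to each other
rbind(select(df,-Messages_sent.y) %>%
        rename(Messages_sent=Messages_sent.x),
      select(df,-Messages_sent.x) %>% 
        rename(Messages_sent=Messages_sent.y,from_id=to_id,to_id=from_id)
     ) %>% arrange(row) %>% select(-row) %>%
  distinct()  # the full merge already lists each pair twice, so drop repeats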
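
The same rearrangement can also be sketched without dplyr. The block below is only an illustration in base R under stated assumptions: the data frame `msgs` and its values are made up (they are not the data from the question), and each ordered pair is assumed to occur at most once.

```r
# Hypothetical sample data: users 1 and 2 write to each other,
# user 1 writes to user 3 and never gets a reply.
msgs <- data.frame(from_id = c(1, 2, 1),
                   to_id = c(2, 1, 3),
                   Messages_sent = c(5, 3, 2))

# Every ordered pair that occurs in either direction
pairs <- unique(rbind(msgs[, c("from_id", "to_id")],
                      data.frame(from_id = msgs$to_id, to_id = msgs$from_id)))

# Left join the observed counts; directions without messages become 0
paired <- merge(pairs, msgs, by = c("from_id", "to_id"), all.x = TRUE)
paired$Messages_sent[is.na(paired$Messages_sent)] <- 0

# Sort so that the two directions of a pair end up on adjacent rows
lo <- pmin(paired$from_id, paired$to_id)
hi <- pmax(paired$from_id, paired$to_id)
paired <- paired[order(lo, hi, paired$from_id), ]
rownames(paired) <- NULL
paired
```

With this sample, the rows come out as 1 -> 2 (5 messages), 2 -> 1 (3), 1 -> 3 (2) and 3 -> 1 (0), so the unanswered direction is filled with 0.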

Answer 1: (score: 0)

Here are the steps using base R functions:

df <- data.frame(from_id=c(624227,624227,624227,624227,624227,624227,667255,667255,667255,7134655,713465),
                 to_id = c(352731,693915,184455,771100,503940,91558,626814,857601,862512,156874,419242),
                 message_sent=c(1,6,2,1,1,1,2,7,3,1,1))

# merge dataset together with itself swapping from_id and to_id columns 
df.full <- merge(df,df, by.x=c("from_id","to_id"), by.y=c("to_id","from_id"),suffixes = c(".x",".y"), all=TRUE)

# fill missing values with 0
# those records will correspond to all the pairs where 
# someone did not send any messages back
df.full[is.na(df.full)] <- 0

# calculate total number of messages for each pair:
df.full$total <- df.full$message_sent.x + df.full$message_sent.y

head(df.full)
#   from_id   to_id message_sent.x message_sent.y total
# 1   91558  624227              0              1     1
# 2  156874 7134655              0              1     1
# 3  184455  624227              0              2     2
# 4  352731  624227              0              1     1
# 5  419242  713465              0              1     1
# 6  503940  624227              0              1     1
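
In the merged table every pair shows up once per direction, with the same total on both rows. If a single row per conversation is enough, one possible sketch is to keep only the direction with the smaller from_id. The sample values and the names `conv`, `user_a`, `user_b` and `conversation_length` are hypothetical, and the sketch assumes that nobody messages themselves and that each ordered pair occurs at most once.

```r
# Hypothetical sample data using the same column names as df above
df <- data.frame(from_id = c(1, 2, 1),
                 to_id = c(2, 1, 3),
                 message_sent = c(5, 3, 2))

# Same self-merge as above: each pair appears once per direction
df.full <- merge(df, df, by.x = c("from_id", "to_id"),
                 by.y = c("to_id", "from_id"),
                 suffixes = c(".x", ".y"), all = TRUE)
df.full[is.na(df.full)] <- 0
df.full$total <- df.full$message_sent.x + df.full$message_sent.y

# Keep one row per unordered pair: the direction with the smaller from_id
conv <- df.full[df.full$from_id < df.full$to_id, c("from_id", "to_id", "total")]
names(conv) <- c("user_a", "user_b", "conversation_length")
rownames(conv) <- NULL
conv
```

For this sample the result has two rows: users 1 and 2 with a conversation length of 8, and users 1 and 3 with a length of 2.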

For very large datasets, base R functions can be slow. In that case you can look at the dplyr library, which has similar functions for most of these steps:

library(dplyr)
df.full.2 <- merge(df,df               # merge dataframe and switched one
            ,by.x=c("from_id","to_id"),by.y=c("to_id","from_id")
            ,all.x=TRUE,all.y=TRUE) %>%
  mutate(message_sent.x=coalesce(message_sent.x,0),     # replace NAs with 0
         message_sent.y=coalesce(message_sent.y,0)) %>%
  mutate(total=rowSums(.[3:4]))        # calculate total number of messages

head(df.full.2)
#  from_id   to_id message_sent.x message_sent.y total
#1   91558  624227              0              1     1
#2  156874 7134655              0              1     1
#3  184455  624227              0              2     2
#4  352731  624227              0              1     1
#5  419242  713465              0              1     1
#6  503940  624227              0              1     1

If it matters that the records appear in pairs, you can also add the following code:

df.full.3 <- df.full.2 %>% 
  mutate(pair.id=sprintf("%06d%6d",pmin(from_id,to_id ),
                                   pmax(from_id,to_id ))) %>%
  arrange(pair.id) %>% select(-pair.id)

head(df.full.3)
#  from_id   to_id message_sent.x message_sent.y total
#1   91558  624227              0              1     1
#2  624227   91558              1              0     1
#3  156874 7134655              0              1     1
#4 7134655  156874              1              0     1
#5  184455  624227              0              2     2
#6  624227  184455              2              0     2
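
The string key built with sprintf() relies on the smaller ID of each pair having at most six digits; a longer ID would no longer line up and could sort out of place. A possible alternative, sketched here in base R, is to order directly on the numeric minimum and maximum of the two IDs. The data frame `merged` is a hypothetical stand-in for the merged table, using a few ID values borrowed from the output above.

```r
# Hypothetical stand-in for the merged table (two pairs, both directions)
merged <- data.frame(from_id = c(91558, 156874, 624227, 7134655),
                     to_id = c(624227, 7134655, 91558, 156874),
                     total = c(1, 1, 1, 1))

# Order by the smaller and then the larger ID of each pair, numerically,
# so no zero-padded string key is needed
lo <- pmin(merged$from_id, merged$to_id)
hi <- pmax(merged$from_id, merged$to_id)
paired <- merged[order(lo, hi), ]
rownames(paired) <- NULL
paired
```

Because both directions of a pair share the same minimum and maximum, they land on adjacent rows regardless of how many digits the IDs have.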

There is also the data.table package, which is very efficient for very large datasets as well:

library(data.table)
# convert dataframe to datatable
setDT(df)
df.full <- merge(df,df, by.x=c("from_id","to_id"), by.y=c("to_id","from_id"),
                 suffixes = c(".x",".y"), all=TRUE)

# substitute NAs with zeros
for (j in 3:4)set(df.full,which(is.na(df.full[[j]] )),j,0)

# calculate the total number of messages
df.full[, total:=message_sent.x+message_sent.y]
head(df.full)
#    from_id   to_id message_sent.x message_sent.y total
# 1:   91558  624227              0              1     1
# 2:  156874 7134655              0              1     1
# 3:  184455  624227              0              2     2
# 4:  352731  624227              0              1     1
# 5:  419242  713465              0              1     1
# 6:  503940  624227              0              1     1

Depending on the size of your dataset, one of these approaches may be more efficient than the others.
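
One way to find out which approach suits your data is simply to time it. The sketch below times only the base R self-merge on synthetic data; the size `n` and the random IDs are assumptions made for illustration, and the dplyr or data.table variants could be wrapped in system.time() the same way.

```r
set.seed(1)  # reproducible synthetic data
n <- 10000   # hypothetical size; adjust to resemble your own data
df <- data.frame(from_id = sample(1e6, n),
                 to_id = sample(1e6, n),
                 message_sent = sample(10, n, replace = TRUE))

# Time the base R self-merge used in the steps above
elapsed <- system.time(
  res <- merge(df, df, by.x = c("from_id", "to_id"),
               by.y = c("to_id", "from_id"),
               suffixes = c(".x", ".y"), all = TRUE)
)["elapsed"]

elapsed    # seconds taken by the merge
nrow(res)  # number of rows in the merged table
```

Timings depend on the machine and on the data, so it is worth repeating the measurement a few times before drawing conclusions.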