Finding paired events by reshaping

Time: 2015-02-07 23:24:58

Tags: r reshape reshape2

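I have a list of users and the items they bought at certain times, and I would like to generate a list of item pairs per user from the raw data. While I can, and probably will, write a small Python script to do this, I have a nagging feeling that the reshape (or more likely reshape2) package could do it in a few lines.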

In code, I would like to transform the df data frame below into the resdf data frame:

df <- data.frame(user=c("u1","u2","u1","u3","u2","u4","u5","u4"),
                 item=c("i1","i1","i2","i3","i2","i3","i3","i4"),
                 time=c(1,1,2,3,4,4,5,6))
> df
  user item time
1   u1   i1    1
2   u2   i1    1
3   u1   i2    2
4   u3   i3    3
5   u2   i2    4
6   u4   i3    4
7   u5   i3    5
8   u4   i4    6
> 

### some reshape code here

resdf <- data.frame(user=c("u1","u2","u4"),
                    item1=c("i1","i1","i3"),
                    item2=c("i2","i2","i4"),
                    time=c(1,1,4),
                    delt=c(1,3,2))
> resdf
  user item1 item2 time delt
1   u1    i1    i2    1    1
2   u2    i1    i2    1    3
3   u4    i3    i4    4    2

Is there a reshape wizard out there who can help me?

3 answers:

Answer 0 (score: 6)

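If you merge the rows whose user value is a repeat back onto the rows where each user first appears, you get the information you need. A little massaging then produces the desired arrangement: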

> merge(df[!duplicated(df$user), ], df[duplicated(df$user), ], by="user")
  user item.x time.x item.y time.y
1   u1     i1      1     i2      2
2   u2     i1      1     i2      4
3   u4     i3      4     i4      6
> inter <- merge(df[!duplicated(df$user), ], df[duplicated(df$user), ], by="user")
> inter$delt <- inter$time.y-inter$time.x
> inter[ , c(1,2,4,3,6)]
  user item.x item.y time.x delt
1   u1     i1     i2      1    1
2   u2     i1     i2      1    3
3   u4     i3     i4      4    2
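The merge approach above relies on each user's first row being the earliest purchase, and it leaves the columns named item.x, item.y and time.x. A minimal self-contained sketch, assuming the data should first be sorted by user and time and that the question's column names (item1, item2, time, delt) are wanted, could look like this. The intermediate names `first_buy`, `later_buy` and `both` are illustrative, not part of the original answer:

```r
df <- data.frame(user = c("u1","u2","u1","u3","u2","u4","u5","u4"),
                 item = c("i1","i1","i2","i3","i2","i3","i3","i4"),
                 time = c(1,1,2,3,4,4,5,6),
                 stringsAsFactors = FALSE)

# Assumption: sort so that the first row seen for each user is the earliest one
df <- df[order(df$user, df$time), ]

dup       <- duplicated(df$user)
first_buy <- df[!dup, ]   # first purchase of every user
later_buy <- df[dup, ]    # any later purchase

# suffixes = c("1", "2") yields item1/time1 and item2/time2 directly
both <- merge(first_buy, later_buy, by = "user", suffixes = c("1", "2"))

resdf <- data.frame(user  = both$user,
                    item1 = both$item1,
                    item2 = both$item2,
                    time  = both$time1,
                    delt  = both$time2 - both$time1,
                    stringsAsFactors = FALSE)
resdf
```

Users with a single purchase (u3 and u5 here) drop out because merge performs an inner join by default.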

Answer 1 (score: 2)

Here is my attempt using the data.table package (which also has a dcast function):

library(data.table)
setkey(setDT(df), user, time) # sorting by user and time so `head` and `diff` will work
df[, `:=`(indx = paste0("item", seq_len(.N)), # Creating all the needed variables
          indx2 = .N,
          time2 = head(time, 1),
          delt = diff(time)), 
     user]

dcast(df[indx2 > 1L], # Decasting by the modified item column
              user + time2 + delt ~ indx, 
              value.var = "item")

#    user time2 delt item1 item2
# 1:   u1     1    1    i1    i2
# 2:   u2     1    3    i1    i2
# 3:   u4     4    2    i3    i4
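The same "number the purchases within each user, then cast to wide" idea can also be sketched without extra packages, using base R's `reshape()`. This is a rough translation under the assumption that only users with exactly two purchases are of interest. The helper columns `idx` and `n` are illustrative names, not taken from the answer:

```r
df <- data.frame(user = c("u1","u2","u1","u3","u2","u4","u5","u4"),
                 item = c("i1","i1","i2","i3","i2","i3","i3","i4"),
                 time = c(1,1,2,3,4,4,5,6),
                 stringsAsFactors = FALSE)

df <- df[order(df$user, df$time), ]

# Position of each purchase within its user (1, 2, ...) and purchases per user
df$idx <- ave(seq_along(df$user), df$user, FUN = seq_along)
df$n   <- ave(seq_along(df$user), df$user, FUN = length)

# Long -> wide: one row per user, columns item.1, time.1, item.2, time.2
wide <- reshape(df[df$n == 2, c("user", "item", "time", "idx")],
                idvar = "user", timevar = "idx", direction = "wide")

resdf <- data.frame(user  = wide$user,
                    item1 = wide$item.1,
                    item2 = wide$item.2,
                    time  = wide$time.1,
                    delt  = wide$time.2 - wide$time.1,
                    stringsAsFactors = FALSE)
resdf
```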

Answer 2 (score: 2)

Here is a solution using dplyr:

library(dplyr)

df %>%
  group_by(user) %>%
  filter(n() == 2) %>%
  arrange(time) %>%
  summarise(
    item1 = first(item),
    item2 = last(item),
    delt = last(time) - first(time),
    time = first(time)
    ) %>%
  select(user, item1, item2, time, delt)
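Note that `filter(n() == 2)` keeps only users with exactly two purchases. If a user with three or more purchases should instead contribute one row per consecutive pair (an assumption that goes beyond what the question asks), a dependency-free base R sketch might compare each row with the next one after sorting. The extra purchase added for u1 below is hypothetical data, included only to show the effect:

```r
df <- data.frame(user = c("u1","u2","u1","u3","u2","u4","u5","u4"),
                 item = c("i1","i1","i2","i3","i2","i3","i3","i4"),
                 time = c(1,1,2,3,4,4,5,6),
                 stringsAsFactors = FALSE)

# Hypothetical third purchase for u1, not part of the original data
df <- rbind(df, data.frame(user = "u1", item = "i3", time = 7,
                           stringsAsFactors = FALSE))

df <- df[order(df$user, df$time), ]
n  <- nrow(df)

prev <- df[-n, ]   # rows 1 .. n-1
nxt  <- df[-1, ]   # rows 2 .. n
same_user <- prev$user == nxt$user   # TRUE where two adjacent rows share a user

consec <- data.frame(user  = prev$user[same_user],
                     item1 = prev$item[same_user],
                     item2 = nxt$item[same_user],
                     time  = prev$time[same_user],
                     delt  = (nxt$time - prev$time)[same_user],
                     stringsAsFactors = FALSE)
consec
```

With the hypothetical row, u1 contributes two pairs (i1 to i2, then i2 to i3), while u2 and u4 each contribute one as before.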