Looking to combine multiple rows of data based on common column data

Asked: 2018-04-12 16:50:52

Tags: r preprocessor

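I am trying to merge pairs of consecutive rows that share the same values in their common columns. Essentially, I want to turn the first table below into the second:
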
 UserID Geography   Login  Logout

   user1  East      0:00:22    -       
   user1  East         -     0:01:29
   user2  West      0:03:57    -    
   user2  West         -     0:48:10
   user3  South     0:59:25    -    
   user3  South        -     1:08:21

  UserID Geography   Login  Logout

   user1  East      0:00:22  0:01:29       
   user2  West      0:03:57  0:48:10    
   user3  South     0:59:25  1:08:21   

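Apologies in advance for the formatting. I should mention that the data contains many rows like these for user1, user2, and so on, so aggregate functions such as MAX or MIN won't work. I'm looking for a solution in R, but a solution in any other language is also most welcome.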

Thanks in advance, Gopal

1 Answer:

Answer 0: (score: 1)

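This can be done with the dplyr and tidyr packages. Essentially, we gather the login and logout times into a single column, drop the empty values (the '-' placeholders), and spread the login and logout events back into their own columns.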

df1 <- read.table(text = 'UserID Geography   Login  Logout

              user1  East      0:00:22    -       
              user1  East         -     0:01:29
              user2  West      0:03:57    -    
              user2  West         -     0:48:10
              user3  South     0:59:25    -    
              user3  South        -     1:08:21', header = TRUE)

  UserID Geography   Login  Logout
1  user1      East 0:00:22       -
2  user1      East       - 0:01:29
3  user2      West 0:03:57       -
4  user2      West       - 0:48:10
5  user3     South 0:59:25       -
6  user3     South       - 1:08:21

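library(dplyr)
library(tidyr)
df2 <- df1 %>% 
  gather(action, time, -UserID, -Geography) %>%  # stack Login/Logout into one 'time' column
  filter(time != '-') %>%                        # drop the '-' placeholders
  spread(action, time)                           # one column per action again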

  UserID Geography   Login  Logout
1  user1      East 0:00:22 0:01:29
2  user2      West 0:03:57 0:48:10
3  user3     South 0:59:25 1:08:21
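
In tidyr 1.0.0 and later, gather() and spread() are superseded by pivot_longer() and pivot_wider(). The following is a minimal sketch of the same reshape with the newer functions, assuming a recent tidyr is installed. The name df2_pivot is only illustrative, and the input is the same sample data as above.

```r
library(dplyr)
library(tidyr)

# Same sample data as above; '-' marks a missing time
df1 <- read.table(text = 'UserID Geography Login Logout
user1 East 0:00:22 -
user1 East - 0:01:29
user2 West 0:03:57 -
user2 West - 0:48:10
user3 South 0:59:25 -
user3 South - 1:08:21', header = TRUE, stringsAsFactors = FALSE)

df2_pivot <- df1 %>%
  # stack Login and Logout into an action/time pair of columns
  pivot_longer(c(Login, Logout), names_to = "action", values_to = "time") %>%
  # drop the '-' placeholders
  filter(time != "-") %>%
  # one row per UserID/Geography, one column per action
  pivot_wider(names_from = action, values_from = time)
```

As with the gather()/spread() version, this only works when each UserID/Geography combination has at most one login and one logout; otherwise pivot_wider() cannot place the values in single cells.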

Handling multiple sessions

In the OP's original dataset, each user can log in multiple times:

df <- read.table(text = 'UserID Geography   EventType ChannelType   Time 
user4   South   Log-in  Web 0:00:10 
user1   East    Log-in  Web 0:00:22 
user4   South   Log-out Mobile  0:00:44 
user1   East    Log-out Web 0:01:29 
user5   East    Log-in  Web 0:02:01 
user1   East    Log-in  Mobile 0:03:57 
user16  South   Log-in  Mobile  0:04:36 
user15  North   Log-in  Mobile  0:05:42 
user3   North   Log-in  Web 0:05:59 
user8   South   Log-in  Mobile  0:07:09 
user19  North   Log-in  Mobile  0:09:22 
user11  North   Log-in  Web 0:12:39 
user8   South   Log-out Web 0:18:32 
user8   South   Log-in  Web 0:19:35', header = TRUE, stringsAsFactors = FALSE)

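The key is to use dplyr to group the logins and logouts by user and then number them within each group. This assumes that a user's n-th log-in belongs with that user's n-th log-out. Each login/logout pairing is now uniquely identified, so the data can be reshaped: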

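df2 <- df %>% 
  arrange(UserID, Time) %>%              # Time is character, so this is a text sort
  group_by(UserID, EventType) %>% 
  mutate(EventNum = row_number()) %>%    # n-th log-in / n-th log-out per user
  ungroup() %>% 
  select(-ChannelType) %>%               # channel can differ within a session
  spread(EventType, Time, fill = '-') %>% 
  arrange(`Log-in`)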

   UserID Geography EventNum `Log-in` `Log-out`
    <chr>     <chr>    <int>    <chr>     <chr>
 1  user4     South        1  0:00:10   0:00:44
 2  user1      East        1  0:00:22   0:01:29
 3  user5      East        1  0:02:01         -
 4  user1      East        2  0:03:57         -
 5 user16     South        1  0:04:36         -
 6 user15     North        1  0:05:42         -
 7  user3     North        1  0:05:59         -
 8  user8     South        1  0:07:09   0:18:32
 9 user19     North        1  0:09:22         -
10 user11     North        1  0:12:39         -
11  user8     South        2  0:19:35         -
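
One caveat with the code above: Time is stored as text, so arrange() sorts it as text. That happens to work for the sample data, where every time starts with a single-digit hour, but it would misorder values such as "9:59:59" and "10:00:00". The following is a minimal base R sketch of one way around this, assuming all times are in H:MM:SS form. The helper to_seconds is hypothetical and not part of dplyr or tidyr.

```r
# Hypothetical helper: convert "H:MM:SS" strings to a number of seconds
to_seconds <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) sum(as.numeric(p) * c(3600, 60, 1)), numeric(1))
}

times <- c("10:00:00", "9:59:59", "0:00:22")

# Order by the numeric value instead of the text
chronological <- times[order(to_seconds(times))]
chronological
# "0:00:22" "9:59:59" "10:00:00"
```

In the pipeline above, the same idea would amount to sorting with arrange(UserID, to_seconds(Time)) instead of arrange(UserID, Time).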