Merge multiple data frames and sort their events by each data frame's timestamp

Time: 2017-09-07 12:45:31

Tags: mysql r

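Forgive me if this is not entirely clear, but I will do my best to describe this challenging problem. I have multiple data frames.

Each data frame has the columns hashed_user_id, server_timestamp and event. An example with three data frames is shown below: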

Data Frame 1

hashed_user_id  server_timestamp    event
user1          2017-04-27 15:25:12   AS
user2          2017-04-29 19:34:19   AS
user3          2017-05-01 21:28:17   AS
user4          2017-05-03 23:01:16   AS

Data Frame 2

hashed_user_id  server_timestamp    event
user1          2017-04-27 16:25:12   AV1
user2          2017-04-29 20:34:19   AV1
user5          2017-05-01 22:19:17   AV1
user6          2017-05-03 14:01:16   AV1

Data Frame 3

hashed_user_id  server_timestamp    event
user1          2017-04-27 17:25:12   AV2
user2          2017-04-29 15:34:19   AV2
user5          2017-05-01 21:28:17   AV2
user6          2017-05-03 23:01:16   AV2

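The resulting table I would like should merge all users into a single table and list each user's events ordered by server_timestamp. The expected new data frame would therefore look like this: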

Expected result:

hashed_user_id  sorted_event1   sorted_event2   sorted_event3
user1             AS                 AV1             AV2
user2             AV2                AS              AV1
user3             AS                 NA              NA
user4             AS                 NA              NA
user5             AV2                AV1             NA
user6             AV1                AV2             NA

Thanks a lot!

2 answers:

Answer 0 (score: 2)

library(tibble)
library(dplyr)  # provides %>%, mutate(), group_by(), arrange(), select()
library(tidyr)

# read your data 
dt1 <- tribble(
  ~hashed_user_id,~server_timestamp, ~event,
  "user1", "2017-04-27 15:25:12", "AS",
  "user2", "2017-04-29 19:34:19", "AS",
  "user3", "2017-05-01 21:28:17", "AS",
  "user4", "2017-05-03 23:01:16", "AS"
)

dt2 <- tribble(
  ~hashed_user_id,~server_timestamp, ~event,
  "user1", "2017-04-27 16:25:12", "AV1",
  "user2", "2017-04-29 20:34:19", "AV1",
  "user5", "2017-05-01 22:19:17", "AV1",
  "user6", "2017-05-03 14:01:16", "AV1"
)

dt3 <- tribble(
  ~hashed_user_id,~server_timestamp, ~event,
  "user1", "2017-04-27 17:25:12", "AV2",
  "user2", "2017-04-29 15:34:19", "AV2",
  "user5", "2017-05-01 21:28:17", "AV2",
  "user6", "2017-05-03 23:01:16", "AV2"
)

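# solution: stack the three tables, order the events in time,
# number them per user, then reshape to one column per event rank
dt <- rbind(dt1, dt2, dt3) %>% 
  # parse the timestamps so that sorting is chronological
  mutate(server_timestamp = as.POSIXct(server_timestamp, format = "%Y-%m-%d %H:%M:%S")) %>%
  group_by(hashed_user_id) %>%
  arrange(server_timestamp) %>%
  # label each user's events sorted_event1, sorted_event2, ...
  mutate(sorted_event_id = paste0("sorted_event", row_number())) %>%
  select(-server_timestamp) %>%
  # one row per user; users with fewer events get NA in the extra columns
  spread(sorted_event_id, event) %>%
  ungroup()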
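
As a side note on the pipeline above: in tidyr 1.0.0 and later, spread() is superseded by pivot_wider(). The following is a minimal sketch of the same reshaping under that assumption. The small `events` tibble is a hypothetical subset of the question's rows, already stacked, and not the asker's full data.

```r
library(tibble)
library(dplyr)
library(tidyr)

# Hypothetical miniature of the stacked data frames (subset of the question's rows)
events <- tribble(
  ~hashed_user_id, ~server_timestamp,     ~event,
  "user1",         "2017-04-27 15:25:12", "AS",
  "user1",         "2017-04-27 16:25:12", "AV1",
  "user2",         "2017-04-29 19:34:19", "AS",
  "user2",         "2017-04-29 15:34:19", "AV2",
  "user3",         "2017-05-01 21:28:17", "AS"
)

wide <- events %>%
  # parse timestamps; the UTC time zone is an assumption made for this sketch
  mutate(server_timestamp = as.POSIXct(server_timestamp,
                                       format = "%Y-%m-%d %H:%M:%S",
                                       tz = "UTC")) %>%
  arrange(hashed_user_id, server_timestamp) %>%
  group_by(hashed_user_id) %>%
  # rank each user's events in chronological order
  mutate(sorted_event_id = paste0("sorted_event", row_number())) %>%
  ungroup() %>%
  select(-server_timestamp) %>%
  # one column per rank; users with fewer events are filled with NA
  pivot_wider(names_from = sorted_event_id, values_from = event)

print(wide)
```

By default pivot_wider() creates the new columns in order of first appearance, so sorting the rows before reshaping also determines the column order here.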

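Answer 1 (score: 0)

This is not a solution in the strict sense, since it does not produce your expected output, but it is better to avoid spreading sorted data across separate columns padded with NAs like that.

If you still have to work with the result in R afterwards, you will have some dirty work to do.

Consider putting each user's sorted events in a vector and storing that in a data.frame / tibble.

First, put those data.frames in a list! :)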

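library(dplyr)
library(tidyr)

# df1, df2, df3 are the three data frames from the question
res <- list(df1,df2,df3) %>%
  bind_rows() %>%
  arrange(server_timestamp) %>%
  select(-server_timestamp) %>%
  # in tidyr >= 1.0.0 this is spelled nest(sorted_events = event)
  nest(event,.key="sorted_events")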
# A tibble: 6 x 2
#   hashed_user_id    sorted_events
#            <chr>           <list>
# 1          user1 <tibble [3 x 1]>
# 2          user2 <tibble [3 x 1]>
# 3          user3 <tibble [1 x 1]>
# 4          user5 <tibble [2 x 1]>
# 5          user6 <tibble [2 x 1]>
# 6          user4 <tibble [1 x 1]>
res$sorted_events[[4]]
# A tibble: 2 x 1
#   event
#   <chr>
# 1   AV2
# 2   AV1
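
The suggestion above to keep each user's sorted events "in a vector" can also be taken literally, with a list column of plain character vectors instead of nested tibbles. This is only a sketch of that variant, not the answerer's code; the `events` tibble is a hypothetical subset of the question's rows, and `res_vec` is a name made up for the illustration.

```r
library(tibble)
library(dplyr)

# Hypothetical miniature of the stacked data frames (subset of the question's rows)
events <- tribble(
  ~hashed_user_id, ~server_timestamp,     ~event,
  "user1",         "2017-04-27 15:25:12", "AS",
  "user1",         "2017-04-27 16:25:12", "AV1",
  "user2",         "2017-04-29 19:34:19", "AS",
  "user2",         "2017-04-29 15:34:19", "AV2",
  "user3",         "2017-05-01 21:28:17", "AS"
)

res_vec <- events %>%
  # timestamps in "YYYY-MM-DD HH:MM:SS" form sort chronologically even as strings
  arrange(hashed_user_id, server_timestamp) %>%
  group_by(hashed_user_id) %>%
  # collapse each user's events into a single character vector
  summarise(sorted_events = list(event))

# events of the second user (user2), in chronological order
res_vec$sorted_events[[2]]
```

Each element of `sorted_events` is then an ordinary character vector, so functions such as length() or paste() can be applied to it directly with lapply() or sapply().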