Question

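I am using R and the mongolite package to fetch data from MongoDB. The result contains many nested lists that cannot be simplified into a data frame with unlist, rbindlist, or dplyr's bind_rows (at least I have not managed to do so).

After a lot of trial and error, I found a way to get the data into the shape I want, using reshape2's melt function together with dplyr and tidyr. However, the melt step takes a long time (up to 15 minutes per list, and I have 6 lists).

Do you have any ideas on how to make this faster? (Of course, another possible solution would be to query MongoDB properly so that it does not return lists but something closer to my target data frame.)

The code below creates a dummy data set with similar properties, the target form of the data set, and the solution I arrived at.

Dummy data: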

dummy_data <- list(
  list(actions = list(list(action_type = "link_clicks", value = 30), 
                      list(action_type = "post_likes", value = 3)), 
       date = '2015-08-04'), 
  list(actions = list(list(action_type = "link_clicks", value = 10), 
                      list(action_type = "post_likes", value = 2),
                      list(action_type = "page_engagement", value = 5)), 
       date = '2015-08-02')
  )

Target table:

final_data = data.frame(c(30, 10), c(3, 2), c(NA, 5), c('2015-08-04', '2015-08-02'))
names(final_data) = c('actions: link_clicks', 'actions: post_likes', 'actions: page_engagement', 'date')
final_data

Interim solution:

library(dplyr)  # select(), mutate(), %>%
library(tidyr)  # spread()

# melt() flattens the nested list into long form; L1..L4 index the nesting levels
Solution <- reshape2::melt(dummy_data)
Solution <- Solution %>% 
  select(L1, L2, L3, L4, value) %>%
  mutate(L4 = ifelse(is.na(L4), L2, L4)) %>% 
  spread(key = L4, value = value) %>%
  mutate(L2 = ifelse(!is.na(action_type), paste0(L2, ": ", action_type), L2),
         value = ifelse(!is.na(value), value, date)) %>%
  select(L1, L2, value) %>%
  spread(key = L2, value = value) %>% 
  select(-L1)

If you have any suggestions regarding the mongolite query, here is the simplest query I use:

M_DB <- mongolite::mongo(collection = "name", url = "url")
M_DB_List <- M_DB$iterate()$batch(size = 100000)

Many thanks.

**Edit:** A more complex data structure, since this is closer to my actual problem:

dummy_data_complex <- list(
  list(actions = list(list(action_type = "link_clicks", value = 30), 
                      list(action_type = "post_likes", value = 3)), 
       date = '2015-08-04',
       currency = 'EUR'), 
  list(actions = list(list(action_type = "link_clicks", value = 10), 
                      list(action_type = "post_likes", value = 2),
                      list(action_type = "page_engagement", value = 5)), 
       date = '2015-08-02',
       demographics = list(gender = "female", 
                           list(actions = list(action_type = "link_clicks", value = 10)))
  ))

Answer 1

Here is an option using the tidyverse:

library(tidyverse)

dummy_data %>%
  # one tibble per record: `actions` becomes a list-column, `date` is recycled
  map_df(~ .x %>%
           as_tibble(.) %>%
           mutate(actions = map(actions, as_tibble)) %>%
           unnest) %>%
  group_by(date, action_type) %>%
  mutate(n = row_number()) %>%
  # one column per action type
  spread(action_type, value) %>%
  select(-n)
# A tibble: 2 x 4
# Groups: date [2]
#   date       link_clicks page_engagement post_likes
#* <chr>            <dbl>           <dbl>      <dbl>
#1 2015-08-02        10.0            5.00       2.00
#2 2015-08-04        30.0           NA          3.00
