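I am using R and the mongolite package to fetch data from MongoDB. The result contains many nested lists that cannot be simplified to a data frame with unlist, rbindlist, or dplyr's bind_rows (at least I have not managed to do so).
After a lot of trial and error I found a way to get the data into the form I want, using melt from reshape2 followed by dplyr and tidyr. However, the melting takes a long time (up to 15 minutes per list, and I have 6 of them).
Do you have any ideas how I could make this faster? (Another possible solution, of course, would be to query MongoDB properly so that it returns something closer to my target data frame instead of lists.)
The following code creates a dummy dataset with similar properties, the target form of the dataset, and the solution I arrived at.
Dummy data: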
dummy_data <- list(
  list(actions = list(list(action_type = "link_clicks", value = 30),
                      list(action_type = "post_likes", value = 3)),
       date = '2015-08-04'),
  list(actions = list(list(action_type = "link_clicks", value = 10),
                      list(action_type = "post_likes", value = 2),
                      list(action_type = "page_engagement", value = 5)),
       date = '2015-08-02')
)
Target form:
final_data = data.frame(c(30, 10), c(3, 2), c(NA, 5), c('2015-08-04', '2015-08-02'))
names(final_data) = c('actions: link_clicks', 'actions: post_likes', 'actions: page_engagement', 'date')
final_data
Interim solution:
library(dplyr)  # select(), mutate(), %>%
library(tidyr)  # spread()
# melt() turns the nested list into a long table with level columns L1..L4
Solution <- reshape2::melt(dummy_data)
Solution <- Solution %>%
  select(L1, L2, L3, L4, value) %>%
  mutate(L4 = ifelse(is.na(L4), L2, L4)) %>%
  spread(key = L4, value = value) %>%
  mutate(L2 = ifelse(!is.na(action_type), paste0(L2, ": ", action_type), L2),
         value = ifelse(!is.na(value), value, date)) %>%
  select(L1, L2, value) %>%
  spread(key = L2, value = value) %>%
  select(-L1)
If you have any suggestions regarding the mongolite query, here is the simplest query I use:
M_DB <- mongolite::mongo(collection = "name", url = "url")
M_DB_List <- M_DB$iterate()$batch(size = 100000)
Thank you very much.
**Edit:** A more complex data structure, since this is closer to my actual problem:
dummy_data_complex <- list(
  list(actions = list(list(action_type = "link_clicks", value = 30),
                      list(action_type = "post_likes", value = 3)),
       date = '2015-08-04',
       currency = 'EUR'),
  list(actions = list(list(action_type = "link_clicks", value = 10),
                      list(action_type = "post_likes", value = 2),
                      list(action_type = "page_engagement", value = 5)),
       date = '2015-08-02',
       demographics = list(gender = "female",
                           list(actions = list(action_type = "link_clicks", value = 10)))
  ))
Answer 0 (score: 0)
Here is an approach using the tidyverse:
library(tidyverse)
dummy_data %>%
  map_df(~ .x %>%
           as_tibble(.) %>%
           mutate(actions = map(actions, as_tibble)) %>%
           unnest) %>%
  group_by(date, action_type) %>%
  mutate(n = row_number()) %>%
  spread(action_type, value) %>%
  select(-n)
# A tibble: 2 x 4
# Groups:   date [2]
#   date       link_clicks page_engagement post_likes
# * <chr>            <dbl>           <dbl>      <dbl>
# 1 2015-08-02        10.0            5.00       2.00
# 2 2015-08-04        30.0           NA          3.00
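If the melt step is the bottleneck, it may also be worth building one flat row per record directly and binding the rows once at the end. The following is only a minimal base R sketch under the assumption that every record looks like the simple dummy_data (an actions list plus a date); flatten_record is a hypothetical helper name, not part of any package, and I have not benchmarked this against the real data.

```r
# Dummy data from the question, repeated so the block runs on its own.
dummy_data <- list(
  list(actions = list(list(action_type = "link_clicks", value = 30),
                      list(action_type = "post_likes", value = 3)),
       date = "2015-08-04"),
  list(actions = list(list(action_type = "link_clicks", value = 10),
                      list(action_type = "post_likes", value = 2),
                      list(action_type = "page_engagement", value = 5)),
       date = "2015-08-02")
)

# Hypothetical helper: turn one record into a flat named list with one
# element per action type (prefixed "actions: ") plus the date.
flatten_record <- function(rec) {
  types  <- vapply(rec$actions, function(a) a$action_type, character(1))
  values <- vapply(rec$actions, function(a) as.numeric(a$value), numeric(1))
  row <- as.list(stats::setNames(values, paste0("actions: ", types)))
  row$date <- rec$date
  row
}

rows <- lapply(dummy_data, flatten_record)

# Pad every row to the union of all column names with NA, then bind once.
all_cols <- unique(unlist(lapply(rows, names)))
padded <- lapply(rows, function(r) {
  r[setdiff(all_cols, names(r))] <- NA
  as.data.frame(r[all_cols], check.names = FALSE, stringsAsFactors = FALSE)
})
flat <- do.call(rbind, padded)
flat
```

For many records, the padding and do.call(rbind, ...) step could probably be replaced by data.table::rbindlist(rows, fill = TRUE), which fills missing columns with NA. The more complex structure from the edit (currency, nested demographics) would need extra handling in the helper.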