Extracting a data frame from a multi-level list generated from JSON, with occasionally missing elements

Date: 2019-01-02 17:08:41

Tags: r purrr

I am fetching football data through an API. The resulting JSON is returned as a list; here is a sample from dput:

list(list(id = 10332894L, league_id = 8L, season_id = 12962L, 
aggregate_id = NULL, venue_id = 201L, localteam_id = 51L, 
visitorteam_id = 27L, weather_report = list(code = "drizzle", 
    temperature = list(temp = 53.92, unit = "fahrenheit"), 
    clouds = "90%", humidity = "87%", wind = list(speed = "12.75 m/s", 
        degree = 200L)), attendance = 25098L, leg = "1/1", 
deleted = FALSE, referee = list(data = list(id = 15267L, 
    common_name = "L. Probert", fullname = "Lee Probert", 
    firstname = "Lee", lastname = "Probert"))), list(id = 10332895L, 
league_id = 8L, season_id = 12962L, aggregate_id = NULL, 
venue_id = 340L, localteam_id = 251L, visitorteam_id = 78L, 
weather_report = list(code = "drizzle", temperature = list(
    temp = 50.07, unit = "fahrenheit"), clouds = "90%", humidity = "93%", 
    wind = list(speed = "6.93 m/s", degree = 160L)), attendance = 22973L, 
leg = "1/1", deleted = FALSE, referee = list(data = list(
    id = 15273L, common_name = "M. Oliver", fullname = "Michael Oliver", 
    firstname = "Michael", lastname = "Oliver"))))

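I am currently doing the extraction with a for loop. The reprex shows 2 top-level list items, while the full data has hundreds. The main drawback of the loop is that values are occasionally missing, and that stops the loop. I would like to move this to purrr, but I am struggling to extract the second-level nested items with at_depth / modify_depth. There are also nests within nests, which really adds to the complexity.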
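For context, here is a minimal sketch of how purrr can tolerate a missing value instead of stopping. It assumes the typed map functions are given a path (a list of names) and a `.default`; the `matches` list is a hypothetical cut-down sample, not the real API response.

```r
library(purrr)

# Hypothetical sample: the second match has no referee and no attendance
matches <- list(
  list(id = 1L, attendance = 25098L,
       referee = list(data = list(fullname = "Lee Probert"))),
  list(id = 2L, attendance = NULL, referee = NULL)
)

# A list of names is treated as a path into each element;
# .default is returned when any step along the path is NULL or absent
referee_name <- map_chr(matches, list("referee", "data", "fullname"),
                        .default = NA_character_)
attendance <- map_int(matches, "attendance", .default = NA_integer_)
```

Using typed defaults such as `NA_character_` keeps each result a plain vector of the expected type, so it can be used directly as a data frame column.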

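The end state should be a tidy data frame. From this sample the df would have only 2 rows, but it would have many columns, one per item, regardless of where that item is nested in the list. If something is missing, it should be an NA value.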

An acceptable solution, even if not very elegant, would produce one data frame per level/nested item, which could then be bound together.
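To make the "one data frame per nested item" idea concrete, here is a rough sketch under the assumption that the fields of interest are known in advance. The `matches` list and the column names (`weather_code`, `referee_fullname`, and so on) are made up for illustration.

```r
library(purrr)
library(tibble)
library(dplyr)

# Hypothetical cut-down sample with some values missing in the second match
matches <- list(
  list(id = 1L, venue_id = 201L,
       weather_report = list(code = "drizzle", clouds = "90%"),
       referee = list(data = list(id = 15267L, fullname = "Lee Probert"))),
  list(id = 2L, venue_id = NULL,
       weather_report = list(code = "clear", clouds = NULL),
       referee = NULL)
)

# One tibble per level / nested item, each with one row per match
top_df <- tibble(
  id       = map_int(matches, "id", .default = NA_integer_),
  venue_id = map_int(matches, "venue_id", .default = NA_integer_)
)
weather_df <- tibble(
  weather_code   = map_chr(matches, list("weather_report", "code"),
                           .default = NA_character_),
  weather_clouds = map_chr(matches, list("weather_report", "clouds"),
                           .default = NA_character_)
)
referee_df <- tibble(
  referee_id       = map_int(matches, list("referee", "data", "id"),
                             .default = NA_integer_),
  referee_fullname = map_chr(matches, list("referee", "data", "fullname"),
                             .default = NA_character_)
)

# Rows are in the same order in every piece, so they can be bound side by side
result <- bind_cols(top_df, weather_df, referee_df)
```

This is verbose for hundreds of fields, but each column keeps its proper type and a missing item simply becomes NA.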

Thanks.

1 Answer:

Answer 0: (score: 1)

Step 1: Replace NULL with NA, using the function from a community wiki answer:

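# Recursively apply fn to every non-list element of a nested list
simple_rapply <- function(x, fn) {
  if (is.list(x)) {
    lapply(x, simple_rapply, fn)
  } else {
    fn(x)
  }
}
# l is the list from the question; NULL entries become NA
non.null.l <- simple_rapply(l, function(x) if (is.null(x)) NA else x)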
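To see why this step matters, here is a small illustration on a toy list (the `toy` object is hypothetical). `unlist()` silently drops NULL entries, so without the replacement the flattened rows would have different lengths and names.

```r
# Same helper as above, repeated so this snippet runs on its own
simple_rapply <- function(x, fn) {
  if (is.list(x)) lapply(x, simple_rapply, fn) else fn(x)
}

# Hypothetical toy list with NULLs at two different depths
toy <- list(a = 1L, b = NULL, c = list(d = NULL, e = "x"))

cleaned <- simple_rapply(toy, function(x) if (is.null(x)) NA else x)

# unlist() drops the NULLs from toy but keeps the NAs in cleaned
flat_raw     <- unlist(toy)      # 2 elements: a, c.e
flat_cleaned <- unlist(cleaned)  # 4 elements: a, b, c.d, c.e
```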

Step 2:

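library(purrr)
library(dplyr)  # bind_rows() comes from dplyr, not purrr
# Flatten each match to a named vector, then bind the vectors as rows
map_df(map(non.null.l, unlist), bind_rows)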
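One caveat worth noting: `unlist()` coerces every value in a match to character, so all columns come back as text. Below is a sketch of both steps on a reduced sample, followed by a type-conversion pass. The `matches` list is a hypothetical subset of the question's data, and the use of base R's `type.convert()` is my own addition, not part of the steps above.

```r
library(purrr)
library(dplyr)

simple_rapply <- function(x, fn) {
  if (is.list(x)) lapply(x, simple_rapply, fn) else fn(x)
}

# Hypothetical reduced version of the two matches from the question
matches <- list(
  list(id = 10332894L, aggregate_id = NULL, attendance = 25098L,
       weather_report = list(code = "drizzle",
                             temperature = list(temp = 53.92))),
  list(id = 10332895L, aggregate_id = NULL, attendance = 22973L,
       weather_report = list(code = "drizzle",
                             temperature = list(temp = 50.07)))
)

# Step 1: NULL -> NA, so unlist() keeps a slot for every field
non_null <- simple_rapply(matches, function(x) if (is.null(x)) NA else x)

# Step 2: one named character vector per match, bound as rows;
# nested names are joined with dots, e.g. weather_report.temperature.temp
flat <- map_df(map(non_null, unlist), bind_rows)

# Extra step (assumption): let base R guess column types from the text
typed <- type.convert(flat, as.is = TRUE)
```

With `as.is = TRUE`, text columns stay character instead of becoming factors.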