Filtering in a nested data frame

Date: 2016-07-17 08:43:17

Tags: json r filter nested dplyr

I am working with the Yelp dataset and want to filter the set of businesses by category.

I imported the JSON file into R with:

library(jsonlite)  # provides stream_in()
yelp_business = stream_in(file("yelp_academic_dataset_business.json"))

This produces the following data frame:

  'data.frame': 77445 obs. of  15 variables:
 $ business_id  : chr  "5UmKMjUEUNdYWqANhGckJw" "UsFtqoBl7naz8AVUBZMjQQ" "3eu6MEFlq2Dg7bQh8QbdOg" "cE27W9VPgO88Qxe4ol6y_g" ...
 $ full_address : chr  "4734 Lebanon Church Rd\nDravosburg, PA 15034" "202 McClure St\nDravosburg, PA 15034" "1 Ravine St\nDravosburg, PA 15034" "1530 Hamilton Rd\nBethel Park, PA 15234" ...
 $ hours        :'data.frame':  77445 obs. of  7 variables:
  ..$ Friday   :'data.frame':   77445 obs. of  2 variables:
  .. ..$ close: chr  "21:00" NA NA NA ...
  .. ..$ open : chr  "11:00" NA NA NA ...
  ..$ Tuesday  :'data.frame':   77445 obs. of  2 variables:
  .. ..$ close: chr  "21:00" NA NA NA ...
  .. ..$ open : chr  "11:00" NA NA NA ...
  ..$ Thursday :'data.frame':   77445 obs. of  2 variables:
  .. ..$ close: chr  "21:00" NA NA NA ...
  .. ..$ open : chr  "11:00" NA NA NA ...
  ..$ Wednesday:'data.frame':   77445 obs. of  2 variables:
  .. ..$ close: chr  "21:00" NA NA NA ...
  .. ..$ open : chr  "11:00" NA NA NA ...
  ..$ Monday   :'data.frame':   77445 obs. of  2 variables:
  .. ..$ close: chr  "21:00" NA NA NA ...
  .. ..$ open : chr  "11:00" NA NA NA ...
  ..$ Sunday   :'data.frame':   77445 obs. of  2 variables:
  .. ..$ close: chr  NA NA NA NA ...
  .. ..$ open : chr  NA NA NA NA ...
  ..$ Saturday :'data.frame':   77445 obs. of  2 variables:
  .. ..$ close: chr  NA NA NA NA ...
  .. ..$ open : chr  NA NA NA NA ...
 $ open         : logi  TRUE TRUE TRUE FALSE TRUE TRUE ...
 $ categories   :List of 77445
  ..$ : chr  "Fast Food" "Restaurants"
  ..$ : chr "Nightlife"
  ..$ : chr  "Auto Repair" "Automotive"
  ..$ : chr  "Active Life" "Mini Golf" "Golf"
  ..$ : chr  "Shopping" "Home Services" "Internet Service Providers" "Mobile Phones" ...
  ..$ : chr  "Bars" "American (New)" "Nightlife" "Lounges" ...
  ..$ : chr  "Active Life" "Trainers" "Fitness & Instruction"
  ..$ : chr  "Bars" "American (Traditional)" "Nightlife" "Restaurants"
  ..$ : chr  "Auto Repair" "Automotive" "Tires"
  ..$ : chr  "Active Life" "Mini Golf"
  ..$ : chr  "Home Services" "Contractors"
  ..$ : chr  "Veterinarians" "Pets"
  ..$ : chr  "Libraries" "Public Services & Government"
  ..$ : chr  "Automotive" "Auto Parts & Supplies"

I now want to filter the rows by business category and keep every business whose category list contains a category with "food" in it.

However, if I simply try this:

input ="food"
engage = filter(yelp_business, grepl(input, categories))

I get the following error:

Error: data_frames can only contain 1d atomic vectors and lists

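My first suspicion was that the nested structure is the cause. However, using tidyjson did not help, because categories is a list rather than a data frame inside the main data frame.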

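Does anyone know how to solve this? I only need a list of the business IDs of all food businesses, so that I can then filter the Yelp review JSON file and extract the written reviews.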

Any help is really appreciated. Thanks a lot!

1 answer:

Answer 0: (score: 0)

tidyjson does not yet support ndjson, and I am not sure how to make it work well with stream_in().

However, you can read the file directly and process it naturally with tidyjson. I am using the development version, installed with devtools::install_github('jeremystan/tidyjson').

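document.id identifies each JSON object nicely, so I look up the document.id values that have "food" in one of their "categories". From there, we filter and do whatever additional data analysis is needed.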

library(dplyr)
library(stringr)
library(tidyjson)

j <- readLines("yelp_academic_dataset_business.json")

raw <- j %>% as.tbl_json()

## pull out the categories for filtering
prep <- raw %>% enter_object("categories") %>% 
  gather_array() %>% append_values_string()

## filter to 'food' categories (use document.id to identify json objects)
keepids <- prep[str_detect(str_to_lower(prep$string), "food"), ]$document.id %>% 
  unique()

## filter and do any further data analysis you want to do
raw %>% filter(document.id %in% keepids) %>% 
 spread_values(
  name = json_chr(name), 
  city = json_chr(city),
  state = json_chr(state), 
  stars = json_chr(stars))
#> # A tbl_json: 21 x 5 tibble with a "JSON" attribute
#>         `attr(., "JSON")` document.id                      name       city
#>                     <chr>       <int>                     <chr>      <chr>
#>  1 "{\"business_id\":..."           2             Cut and Taste  Las Vegas
#>  2 "{\"business_id\":..."           8                 Taco Bell Scottsdale
#>  3 "{\"business_id\":..."          10           Sehne Backwaren  Stuttgart
#>  4 "{\"business_id\":..."          20   Graceful Cake Creations       Mesa
#>  5 "{\"business_id\":..."          26    Chipotle Mexican Grill    Toronto
#>  6 "{\"business_id\":..."          30  Carrabba's Italian Grill   Glendale
#>  7 "{\"business_id\":..."          32             I Deal Coffee    Toronto
#>  8 "{\"business_id\":..."          34 Lo-Lo's Chicken & Waffles    Phoenix
#>  9 "{\"business_id\":..."          38              Kabob Palace  Las Vegas
#> 10 "{\"business_id\":..."          43              Tea Shop 168    Markham
#> # ... with 11 more rows, and 2 more variables: state <chr>, stars <chr>

Note: I only processed the first 100 records of the yelp_academic_dataset_business.json file.
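
If you would rather stay with the data frame from stream_in(), a base R approach may also work. The following is only a minimal sketch under stated assumptions: the two toy data frames and their values are made up to stand in for the real business and review files, and has_category() is a hypothetical helper, not part of any package. It tests each element of the categories list column separately and matches case-insensitively, because the categories are written as "Food" and "Fast Food" rather than "food".

```r
# Toy stand-in for the data frame returned by stream_in(); values are made up.
yelp_business <- data.frame(
  business_id = c("b1", "b2", "b3"),
  stringsAsFactors = FALSE
)
# categories is a list column: one character vector per business.
yelp_business$categories <- list(
  c("Fast Food", "Restaurants"),
  "Nightlife",
  c("Food", "Coffee & Tea")
)

# Toy stand-in for the review file (assumed to share the business_id key).
yelp_review <- data.frame(
  business_id = c("b1", "b2", "b3", "b3"),
  text = c("Quick burgers", "Loud music", "Great espresso", "Nice pastries"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE when any category of a business contains
# the pattern, ignoring case. Businesses without categories give FALSE.
has_category <- function(categories, pattern) {
  vapply(
    categories,
    function(x) any(grepl(pattern, x, ignore.case = TRUE)),
    logical(1)
  )
}

# One logical value per business, then the matching business IDs.
keep <- has_category(yelp_business$categories, "food")
food_ids <- yelp_business$business_id[keep]

# Keep only the reviews that belong to the selected businesses.
food_reviews <- yelp_review[yelp_review$business_id %in% food_ids, ]
```

Base R subsetting is used here on purpose. As far as I can tell, the dplyr error in the question comes from the nested data frame column hours rather than from the categories list, so flattening the data frame first with jsonlite::flatten() would probably let filter() work as well, but I have not checked that against the full dataset.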