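I am working with the Yelp dataset and want to filter the set of businesses by category.
I imported the JSON file into R with
yelp_business = stream_in(file("yelp_academic_dataset_business.json"))
which produces the following data frame: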
'data.frame': 77445 obs. of 15 variables:
$ business_id : chr "5UmKMjUEUNdYWqANhGckJw" "UsFtqoBl7naz8AVUBZMjQQ" "3eu6MEFlq2Dg7bQh8QbdOg" "cE27W9VPgO88Qxe4ol6y_g" ...
$ full_address : chr "4734 Lebanon Church Rd\nDravosburg, PA 15034" "202 McClure St\nDravosburg, PA 15034" "1 Ravine St\nDravosburg, PA 15034" "1530 Hamilton Rd\nBethel Park, PA 15234" ...
$ hours :'data.frame': 77445 obs. of 7 variables:
..$ Friday :'data.frame': 77445 obs. of 2 variables:
.. ..$ close: chr "21:00" NA NA NA ...
.. ..$ open : chr "11:00" NA NA NA ...
..$ Tuesday :'data.frame': 77445 obs. of 2 variables:
.. ..$ close: chr "21:00" NA NA NA ...
.. ..$ open : chr "11:00" NA NA NA ...
..$ Thursday :'data.frame': 77445 obs. of 2 variables:
.. ..$ close: chr "21:00" NA NA NA ...
.. ..$ open : chr "11:00" NA NA NA ...
..$ Wednesday:'data.frame': 77445 obs. of 2 variables:
.. ..$ close: chr "21:00" NA NA NA ...
.. ..$ open : chr "11:00" NA NA NA ...
..$ Monday :'data.frame': 77445 obs. of 2 variables:
.. ..$ close: chr "21:00" NA NA NA ...
.. ..$ open : chr "11:00" NA NA NA ...
..$ Sunday :'data.frame': 77445 obs. of 2 variables:
.. ..$ close: chr NA NA NA NA ...
.. ..$ open : chr NA NA NA NA ...
..$ Saturday :'data.frame': 77445 obs. of 2 variables:
.. ..$ close: chr NA NA NA NA ...
.. ..$ open : chr NA NA NA NA ...
$ open : logi TRUE TRUE TRUE FALSE TRUE TRUE ...
$ categories :List of 77445
..$ : chr "Fast Food" "Restaurants"
..$ : chr "Nightlife"
..$ : chr "Auto Repair" "Automotive"
..$ : chr "Active Life" "Mini Golf" "Golf"
..$ : chr "Shopping" "Home Services" "Internet Service Providers" "Mobile Phones" ...
..$ : chr "Bars" "American (New)" "Nightlife" "Lounges" ...
..$ : chr "Active Life" "Trainers" "Fitness & Instruction"
..$ : chr "Bars" "American (Traditional)" "Nightlife" "Restaurants"
..$ : chr "Auto Repair" "Automotive" "Tires"
..$ : chr "Active Life" "Mini Golf"
..$ : chr "Home Services" "Contractors"
..$ : chr "Veterinarians" "Pets"
..$ : chr "Libraries" "Public Services & Government"
..$ : chr "Automotive" "Auto Parts & Supplies"
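I now want to filter the rows by business category, keeping every business whose category list contains a food-related category.
However, when I simply try this: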
input ="food"
engage = filter(yelp_business, grepl(input, categories))
I get the following error:
Error: data_frames can only contain 1d atomic vectors and lists
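At first I suspected that the nested structure was the cause. However, tidyjson did not help, because categories is a list rather than a data frame inside the main data frame.
Does anyone know how to solve this? I only need a list of the business IDs of all food businesses, so that I can then filter the Yelp review JSON file and extract the written reviews.
I would really appreciate any help. Many thanks!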
Answer 0 (score: 0)
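tidyjson does not yet support ndjson, and I am not sure how well it cooperates with stream_in(). However, you can read the file directly and process it naturally with tidyjson. I am using the development version, installed with devtools::install_github('jeremystan/tidyjson').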
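document.id identifies each JSON object nicely, so I look for the document.id values that have "food" in one of their "categories". From there, we filter and do whatever further data analysis is needed.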
library(dplyr)
library(stringr)
library(tidyjson)

j <- readLines("yelp_academic_dataset_business.json")
raw <- j %>% as.tbl_json()

## pull out the categories for filtering
prep <- raw %>%
  enter_object("categories") %>%
  gather_array() %>%
  append_values_string()

## filter to 'food' categories (use document.id to identify json objects)
keepids <- prep[str_detect(str_to_lower(prep$string), "food"), ]$document.id %>%
  unique()

## filter and do any further data analysis you want to do
raw %>%
  filter(document.id %in% keepids) %>%
  spread_values(
    name = json_chr(name),
    city = json_chr(city),
    state = json_chr(state),
    stars = json_chr(stars))
#> # A tbl_json: 21 x 5 tibble with a "JSON" attribute
#> `attr(., "JSON")` document.id name city
#> <chr> <int> <chr> <chr>
#> 1 "{\"business_id\":..." 2 Cut and Taste Las Vegas
#> 2 "{\"business_id\":..." 8 Taco Bell Scottsdale
#> 3 "{\"business_id\":..." 10 Sehne Backwaren Stuttgart
#> 4 "{\"business_id\":..." 20 Graceful Cake Creations Mesa
#> 5 "{\"business_id\":..." 26 Chipotle Mexican Grill Toronto
#> 6 "{\"business_id\":..." 30 Carrabba's Italian Grill Glendale
#> 7 "{\"business_id\":..." 32 I Deal Coffee Toronto
#> 8 "{\"business_id\":..." 34 Lo-Lo's Chicken & Waffles Phoenix
#> 9 "{\"business_id\":..." 38 Kabob Palace Las Vegas
#> 10 "{\"business_id\":..." 43 Tea Shop 168 Markham
#> # ... with 11 more rows, and 2 more variables: state <chr>, stars <chr>
Note: I only processed the first 100 records of the yelp_academic_dataset_business.json file.
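If you would rather keep the data frame that stream_in() produced, base R indexing on the categories list column may be enough. The error message says that lists are allowed, so it appears to come from the nested data-frame columns such as hours rather than from categories, and plain indexing avoids that check altogether. Below is a minimal sketch on toy data: the three businesses, the reviews data frame, and its text column are made up for illustration, and I am assuming the review file carries a business_id field that matches the business file. Note that grepl() is case-sensitive by default, so "food" would not match "Fast Food" without ignore.case = TRUE.

```r
# Toy stand-in for the stream_in() result: one list column of categories.
yelp_business <- data.frame(
  business_id = c("a1", "b2", "c3"),
  stringsAsFactors = FALSE
)
yelp_business$categories <- list(
  c("Fast Food", "Restaurants"),
  c("Nightlife"),
  c("Food", "Coffee & Tea")
)

input <- "food"

# One TRUE/FALSE per business: does any of its categories match `input`?
has_food <- vapply(
  yelp_business$categories,
  function(cats) any(grepl(input, cats, ignore.case = TRUE)),
  logical(1)
)

# Business IDs to keep
food_ids <- yelp_business$business_id[has_food]

# Hypothetical reviews table (assumed columns: business_id, text)
reviews <- data.frame(
  business_id = c("a1", "b2", "c3", "a1"),
  text = c("Great fries", "Loud music", "Nice espresso", "Slow service"),
  stringsAsFactors = FALSE
)

# Keep only the reviews that belong to the food businesses
food_reviews <- reviews[reviews$business_id %in% food_ids, ]
food_reviews$text
```

vapply() returns FALSE for a business with an empty category vector, because any() of an empty logical vector is FALSE, so such rows are simply dropped. On the real data you would replace the toy objects with your own yelp_business and the reviews you read in.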