Scraping or processing the results of a SQL API call

Date: 2015-09-04 13:46:45

Tags: r api sqldf opendata

I am trying to download data from the following open data web page and run some analysis on it:

http://data.ci.newark.nj.us/dataset/new-jersey-education-indicators/resource/d7b23f97-cba5-4c15-997c-37a696395d66

They give some examples, such as this sample query (via a SQL statement):

http://data.ci.newark.nj.us/api/action/datastore_search_sql?sql=SELECT * from "d7b23f97-cba5-4c15-997c-37a696395d66" WHERE title LIKE 'jones' 

I tried to read the data with the sqldf package but could not get it to work.

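2 Answers:

Answer 0: (score: 1)

You can use their API directly rather than resorting to rvest and scraping. Their SQL example errors out, as I said, but it works without the WHERE… part (example below). Here are the building blocks for a repeatable process using either plain search or SQL search: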

library(jsonlite)
library(httr)

# for passing in a SQL statement
query_nj_sql <- function(sql=NULL) {
  if (is.null(sql)) return(NULL)
  res <- GET("http://data.ci.newark.nj.us/api/action/datastore_search_sql",
             query=list(sql=sql))
  stop_for_status(res) # catches errors
  fromJSON(content(res, as="text"))
}

# for their plain search syntax
query_nj_search <- function(resource_id=NULL, query=NULL, offset=NULL) {
  if (is.null(resource_id)) return(NULL)
  res <- GET("http://data.ci.newark.nj.us/api/action/datastore_search",
             query=list(resource_id=resource_id,
                        offset=offset,
                        q=query))
  stop_for_status(res) # catches errors
  fromJSON(content(res, as="text"))  
}

# this SQL does not error out
sql_dat <- query_nj_sql('SELECT * from "d7b23f97-cba5-4c15-997c-37a696395d66"')

search_dat <- query_nj_search(resource_id="d7b23f97-cba5-4c15-997c-37a696395d66")

As I said, this SQL query does not error out.
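httr percent-encodes the values passed in query=, which is why the raw SQL string can be handed over as-is. If you build the URL by hand instead, the statement has to be encoded first, since it contains spaces, double quotes and an asterisk. A minimal sketch with base R's utils::URLencode; the helper name build_sql_url is made up for illustration:

```r
# Hypothetical helper: build a datastore_search_sql URL with the SQL
# statement percent-encoded.
build_sql_url <- function(sql,
                          base = "http://data.ci.newark.nj.us/api/action/datastore_search_sql") {
  # reserved = TRUE also encodes characters such as *, " and =.
  # repeated = TRUE forces encoding even when the input already looks
  # percent-encoded (for example a LIKE pattern such as '%ab%').
  encoded <- utils::URLencode(sql, reserved = TRUE, repeated = TRUE)
  paste0(base, "?sql=", encoded)
}

sql <- 'SELECT * from "d7b23f97-cba5-4c15-997c-37a696395d66"'
url <- build_sql_url(sql)
```

The resulting url can then be passed to GET() without a query argument.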

Both calls return a slightly complex list structure, which you can inspect with:

str(sql_dat)
str(search_dat)

But the records are in there:

dplyr::glimpse(sql_dat$result$records)

## Observations: 545
## Variables: 40
## $ Total population 25 years and over                 (chr) "6389.0", "68.0", "4197.0", "389.0", "1211.0", "4...
## $ Male - Associate's degree                          (chr) "286.0", "0.0", "63.0", "6.0", "69.0", "31.0", "7...
## $ Male - Master's degree                             (chr) "148.0", "29.0", "379.0", "17.0", "79.0", "24.0",...
## $ Male - 7th and 8th grade                           (chr) "49.0", "0.0", "16.0", "2.0", "14.0", "0.0", "0.0...
## $ Female - High school graduate, GED, or alternative (chr) "915.0", "0.0", "426.0", "46.0", "174.0", "30.0",...
## $ Male - 11th grade                                  (chr) "88.0", "0.0", "12.0", "0.0", "3.0", "0.0", "0.0"...
## $ Male - Bachelor's degree                           (chr) "561.0", "0.0", "878.0", "93.0", "137.0", "58.0",...
## $ Male - Some college, 1 or more years, no degree    (chr) "403.0", "0.0", "179.0", "23.0", "39.0", "0.0", "...
… (this goes on a while)
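Every column shown in that output is character ((chr)), so a conversion step is needed before any numeric analysis. A rough sketch in base R, using a tiny stand-in data frame rather than the real sql_dat$result$records; the second column and the helper name to_numeric_if_possible are invented for illustration:

```r
# Stand-in for sql_dat$result$records. The first column reuses values shown
# in the glimpse() output; the second column is made up.
records <- data.frame(
  `Total population 25 years and over` = c("6389.0", "68.0", "4197.0"),
  `Some label column` = c("a", "b", "c"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Convert a character vector to numeric only if every non-NA value parses
# as a number; otherwise return it unchanged.
to_numeric_if_possible <- function(x) {
  if (!is.character(x)) return(x)
  converted <- suppressWarnings(as.numeric(x))
  if (all(is.na(converted) == is.na(x))) converted else x
}

# Assigning into records[] keeps the data frame shape and column names.
records[] <- lapply(records, to_numeric_if_possible)
```

Columns that hold free text are left alone, so the same call should be safe to run across all 40 variables.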

The API looks like it may be paginated, so you may have to deal with that (hence the offset parameter).
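One way to handle paging is to keep requesting with an increasing offset until a page comes back empty. The sketch below assumes the endpoint honours offset and returns no records once the data is exhausted; fetch_all_records and the fake data source are hypothetical names used so the loop can be run offline:

```r
# Hypothetical helper: collect all pages from an offset-based endpoint.
# `fetch_page` is any function(offset) that returns a data frame of records
# (zero rows, an empty list, or NULL once the data is exhausted).
fetch_all_records <- function(fetch_page, max_pages = 1000) {
  pages <- list()
  offset <- 0
  for (i in seq_len(max_pages)) {   # max_pages guards against endless loops
    page <- fetch_page(offset)
    if (is.null(page) || NROW(page) == 0) break
    pages[[length(pages) + 1]] <- page
    offset <- offset + NROW(page)
  }
  if (length(pages) == 0) return(data.frame())
  do.call(rbind, pages)
}

# Offline demonstration: a fake 5-row source served 2 rows at a time.
fake_source <- data.frame(id = 1:5, value = letters[1:5], stringsAsFactors = FALSE)
fake_fetch <- function(offset, limit = 2) {
  idx <- seq_len(nrow(fake_source))
  fake_source[idx > offset & idx <= offset + limit, , drop = FALSE]
}
all_rows <- fetch_all_records(fake_fetch)
```

Against the real API, fetch_page would be a small function that issues the datastore_search request with the given offset and returns result$records. The max_pages cap is there in case the server ignores offset and keeps returning the same page.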

Since the NJ Edu API supports OData queries, you could also use the RSocrata package.

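Answer 1: (score: 0)

It looks like their SQL example does not work. But I don't think you even need sqldf; you can use the RCurl package to pull the data.

If you want to try other examples, you can use the URL-based API call they provide: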

library(RCurl)
web <- "http://data.ci.newark.nj.us/api/action/datastore_search?resource_id=d7b23f97-cba5-4c15-997c-37a696395d66&q=jones"
page <- getURL(web)

Then parse the response (it is JSON rather than HTML) to make the content easier to work with.
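As a sketch of that parsing step, jsonlite::fromJSON can turn the returned string into R objects. The JSON below is an invented stand-in for the value of page, so the snippet runs offline; only the result / records nesting is taken from what the other answer reads, and the field names inside each record are assumptions:

```r
library(jsonlite)

# Invented stand-in for the string returned by getURL(web).
page <- '{"result": {"records": [{"title": "jones", "value": "1.0"}, {"title": "smith", "value": "2.0"}]}}'

parsed <- fromJSON(page)

# jsonlite simplifies an array of flat objects into a data frame.
records <- parsed$result$records
```

With the real response, str(parsed) is a quick way to see what else the API returns alongside the records.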