我正在尝试阅读有关Greyhound总线时序的HTML数据。 An example can be found here.我主要关注从表中获取计划和状态数据,但是当我执行以下代码时:
library(XML)
url<-"http://bustracker.greyhound.com/routes/4511/I/Chicago_Amtrak_IL-Cincinnati_OH/4511/10-26-2016"
greyhound<-readHTMLTable(url)
greyhound<-greyhound[[2]]
这只产生下表:
我不确定为什么它会抓取甚至不在页面上的数据,而不是
答案 0 :(得分:1)
您无法使用readHTMLTable
检索数据,因为轨迹结果是以javascript脚本的形式发送的。因此,您应该选择该脚本并解析它以提取正确的信息。
她的解决方案,这样做:
代码看起来可能很短但是非常紧凑(生产它需要一个小时)!
library(XML)
library(httr)
library(jsonlite)
library(data.table)
dc <- htmlParse(GET(url))
script <- xpathSApply(dc,"//script/text()",xmlValue)[[5]]
res <- strsplit(script,"stopArray.push({",fixed=TRUE)[[1]][-1]
dcast(point~name,data=rbindlist(Map(function(x,y){
x <- paste('{',sub(');|);.*docum.*',"",x))
dx <- unlist(fromJSON(x))
data.frame(point=y,name=names(dx),value=dx)
},res,seq_along(res))
,fill=TRUE)[name!="polyline"])
表格结果:
point category direction id lat linkName lon
1: 1 2 empty 562310 41.878589630127 Chicago_Amtrak_IL -87.6398544311523
2: 2 2 empty 560252 41.8748474121094 Chicago_IL -87.6435165405273
3: 3 1 empty 561627 41.7223281860352 Chicago_95th_&_Dan_Ryan_IL -87.6247329711914
4: 4 2 empty 260337 41.6039199829102 Gary_IN -87.3386917114258
5: 5 1 empty 260447 40.4209785461426 Lafayette_e_IN -86.8942031860352
6: 6 2 empty 260392 39.7617835998535 Indianapolis_IN -86.161018371582
7: 7 2 empty 250305 39.1079406738281 Cincinnati_OH -84.5041427612305
name shortName ticketName
1: Chicago Amtrak: 225 S Canal St, IL 60606 Chicago Amtrak, IL CHD
2: Chicago: 630 W Harrison St, IL 60607 Chicago, IL CHD
3: Chicago 95th & Dan Ryan: 14 W 95th St, IL 60628 Chicago 95th & Dan Ryan, IL CHD
4: Gary: 100 W 4th Ave, IN 46402 Gary, IN GRY
5: Lafayette (e): 401 N 3rd St, IN 47901 Lafayette (e), IN XIN
6: Indianapolis: 350 S Illinois St, IN 46225 Indianapolis, IN IND
7: Cincinnati: 1005 Gilbert Ave, OH 45202 Cincinnati, OH CIN
答案 1 :(得分:0)
正如@agstudy所说,数据呈现为HTML;它不是直接从服务器通过HTML传递的。因此,您可以(a)使用类似RSelenium
的内容来抓取呈现的内容,或者(b)从包含数据的<script>
标记中提取数据。
为了解释@ agstudy的工作,我们观察到数据包含在(许多)脚本标签之一的一系列stopArray.push()
命令中。例如:
stopArray.push({
"id" : "562310",
"name" : "Chicago Amtrak: 225 S Canal St, IL 60606",
"shortName" : "Chicago Amtrak, IL",
"ticketName" : "CHD",
"category" : 2,
"linkName" : "Chicago_Amtrak_IL",
"direction" : "empty",
"lat" : 41.87858963012695,
"lon" : -87.63985443115234,
"polyline" : "elr~Fnb|uOmC@@nG?XBdH@rC?f@?P?V@`AlAAn@A`CCzBC~BE|CEdCA^Ap@A"
});
现在,这是每个函数调用中包含的json
数据。我倾向于认为,如果有人以机器可读的格式进行格式化数据的工作,那么我们应该很感激它!
tidyverse
解决此问题的方法如下:
rvest
包下载页面。script
表达式标识要使用的相应xpath
标记,该表达式搜索包含字符串script
的所有url =
标记。stopArray.push()
来电中的所有内容。[]
围绕字符串以指示json
列表。jsonlite::fromJSON
转换为data.frame
。请注意,我将polyline
列隐藏在最后,因为它太大而不适合以前。
library(tidyverse)
library(rvest)
library(stringr)
library(jsonlite)
url <- "http://bustracker.greyhound.com/routes/4511/I/Chicago_Amtrak_IL-Cincinnati_OH/4511/10-26-2016"
page <- read_html(url)
page %>%
html_nodes(xpath = '//script[contains(text(), "url = ")]') %>%
html_text() %>%
str_extract_all(regex("(?<=stopArray.push\\().+?(?=\\);)", multiline = T, dotall = T), F) %>%
unlist() %>%
paste(collapse = ",") %>%
sprintf("[%s]", .) %>%
fromJSON() %>%
select(-polyline) %>%
head()
#> id name
#> 1 562310 Chicago Amtrak: 225 S Canal St, IL 60606
#> 2 560252 Chicago: 630 W Harrison St, IL 60607
#> 3 561627 Chicago 95th & Dan Ryan: 14 W 95th St, IL 60628
#> 4 260337 Gary: 100 W 4th Ave, IN 46402
#> 5 260447 Lafayette (e): 401 N 3rd St, IN 47901
#> 6 260392 Indianapolis: 350 S Illinois St, IN 46225
#> shortName ticketName category
#> 1 Chicago Amtrak, IL CHD 2
#> 2 Chicago, IL CHD 2
#> 3 Chicago 95th & Dan Ryan, IL CHD 1
#> 4 Gary, IN GRY 2
#> 5 Lafayette (e), IN XIN 1
#> 6 Indianapolis, IN IND 2
#> linkName direction lat lon
#> 1 Chicago_Amtrak_IL empty 41.87859 -87.63985
#> 2 Chicago_IL empty 41.87485 -87.64352
#> 3 Chicago_95th_&_Dan_Ryan_IL empty 41.72233 -87.62473
#> 4 Gary_IN empty 41.60392 -87.33869
#> 5 Lafayette_e_IN empty 40.42098 -86.89420
#> 6 Indianapolis_IN empty 39.76178 -86.16102