Question

https://safer.fmcsa.dot.gov/query.asp?searchtype=ANY&query_type=queryCarrierSnapshot&query_param=USDOT&original_query_param=NAME&query_string=2249709&original_query_string=ARKANSAS%20BEST%20LOGISTICS%20INC

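I need to parse the Inspections/Crashes tables on this US web page into R data frames. The parsing technique that works for some tables on the site does not work for others.

I was able to parse the Inspections table with the following code: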

    inspections <- carrier %>%
      html_node('.querylabel+ center table') %>%
      html_table(fill = TRUE)

However, when I try to parse the Crashes table, which sits below the Inspections table, I get an error:

    Error in UseMethod("html_table") : 
      no applicable method for 'html_table' applied to an object of class 
    "xml_missing"

I used the following code:

    crashes <- carrier %>%
      html_node('center:nth-child(19) table') %>%
      html_table(fill = TRUE)

I used SelectorGadget to pick the table, and it gave me the CSS selector "center:nth-child(19) table". I also tried html_node() with an XPath:

    crashes <- carrier %>%
      html_node(xpath = '//center[(((count(preceding-sibling::*) + 1) = 
     19) and parent::*)]//table') %>%
      html_table(fill = TRUE)

That didn't work either. I'm very new to web scraping, so I apologize if this has a simple fix.

carrier is the page read from the URL:

    carrier <- read_html("https://safer.fmcsa.dot.gov/query.asp?searchtype=ANY&query_type=queryCarrierSnapshot&query_param=USDOT&original_query_param=NAME&query_string=2249709&original_query_string=ARKANSAS%20BEST%20LOGISTICS%20INC")

Answer 1

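There are two "Inspections" tables and two "Crashes" tables, one each for the US and Canada. Here are two ways to approach this:

Use the preceding links ("Inspections:", "Crashes:") to identify the center element that immediately follows each link. Then look for the table node inside it and parse that.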

    library(rvest)
    dot_url <- 
      "https://safer.fmcsa.dot.gov/query.asp?searchtype=ANY&query_type=queryCarrierSnapshot&query_param=USDOT&original_query_param=NAME&query_string=2249709&original_query_string=ARKANSAS%20BEST%20LOGISTICS%20INC" %>% 
      read_html()

    dot_url %>% 
      html_node("a[href$='#Inspections'] + center") %>% 
      html_node("table") %>% 
      html_table()

                  Inspection Type Vehicle Driver Hazmat IEP
    1                 Inspections       0      0      0   0
    2              Out of Service       0      0      0   0
    3            Out of Service %      0%     0%     0%  0%
    4 Nat'l Average %(2009- 2010)  20.72%  5.51%  4.50% N/A

    dot_url %>% 
      html_node("a[href$='#Accidents'] + center") %>% 
      html_node("table") %>% 
      html_table()

         Type Fatal Injury Tow Total
    1 Crashes     0      0   0     0

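You can also do this for Canada with a[href$='#InspectionsCA'] ..., but the page's markup is not ideal (the "Crashes:" link there has the same href value). (Note that href$= means the link's href ends with the given text; see https://www.w3.org/TR/2011/REC-css3-selectors-20110929/#selectors.)

Use the tables' summary attribute to get the set of "Inspections" tables and the set of "Crashes" tables, then name them (for example with purrr::set_names and c("US", "Canada")), or discard the ones you don't need (using [[):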
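To see the selector at work without hitting the live site, here is a minimal sketch against a small, made-up HTML fragment that only imitates the link-then-center layout (the fragment, its hrefs, and its cell values are assumptions for illustration, not the real page). It also shows what happens when a selector matches nothing: html_node() returns an object of class "xml_missing", which is what html_table() was complaining about in the question.

```r
library(rvest)

# Hypothetical, heavily simplified markup; the real page is much larger.
page <- read_html('
<html><body><div>
  <a href="/help.asp#Inspections">Inspections:</a>
  <center><table summary="Inspections">
    <tr><th>Inspection Type</th><th>Vehicle</th></tr>
    <tr><td>Inspections</td><td>0</td></tr>
  </table></center>
  <a href="/help.asp#Accidents">Crashes:</a>
  <center><table summary="Crashes">
    <tr><th>Type</th><th>Total</th></tr>
    <tr><td>Crashes</td><td>0</td></tr>
  </table></center>
</div></body></html>')

# a[href$=...] matches a link whose href ends with the given text;
# "+ center" then picks the center element directly after that link.
crashes <- page %>%
  html_node("a[href$='#Accidents'] + center") %>%
  html_node("table") %>%
  html_table()

# A selector that matches nothing yields an "xml_missing" node
# instead of raising an error at that step.
missing_node <- page %>% html_node("center:nth-child(19) table")
is_missing <- inherits(missing_node, "xml_missing")
```

Checking inherits(node, "xml_missing") before calling html_table() is one way to tell a selector problem apart from a parsing problem.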

    dot_url %>% 
      html_nodes("table[summary='Inspections']") %>% 
      html_table()

    [[1]]
                  Inspection Type Vehicle Driver Hazmat IEP
    1                 Inspections       0      0      0   0
    2              Out of Service       0      0      0   0
    3            Out of Service %      0%     0%     0%  0%
    4 Nat'l Average %(2009- 2010)  20.72%  5.51%  4.50% N/A

    [[2]]
       Inspection Type Vehicle Driver
    1      Inspections       0      0
    2   Out of Service       0      0
    3 Out of Service %      0%     0%

    dot_url %>% 
      html_nodes("table[summary='Crashes']") %>% 
      html_table()

    [[1]]
         Type Fatal Injury Tow Total
    1 Crashes     0      0   0     0

    [[2]]
         Type Fatal Injury Tow Total
    1 Crashes     0      0   0     0
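The naming step mentioned above could look roughly like the sketch below. It runs against a made-up fragment with two tables that share a summary attribute, and the Total values (1 and 2) are placeholders chosen only so the two tables can be told apart. It assumes the US table comes first in document order, which you should verify on the real page.

```r
library(rvest)
library(purrr)

# Hypothetical fragment: two tables with the same summary attribute.
page <- read_html('
<html><body><div>
  <table summary="Crashes">
    <tr><th>Type</th><th>Total</th></tr>
    <tr><td>Crashes</td><td>1</td></tr>
  </table>
  <table summary="Crashes">
    <tr><th>Type</th><th>Total</th></tr>
    <tr><td>Crashes</td><td>2</td></tr>
  </table>
</div></body></html>')

# html_nodes() returns every match; html_table() then gives a list of tables.
# Assumption: the first match is the US table, the second the Canadian one.
crashes <- page %>%
  html_nodes("table[summary='Crashes']") %>%
  html_table() %>%
  set_names(c("US", "Canada"))

# Keep only the table you need with [[.
us_crashes <- crashes[["US"]]
```

Naming the list makes later code less fragile than indexing by position, though the names are only as reliable as the assumed ordering.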
