Question

I'm new to HTML, and I'm currently working on a project that uses RSelenium to scrape data from HTML tables. I can use the following code:

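# remDR is the RSelenium remote driver; URLs holds the page addresses
for (i in 1:50) {
  remDR$navigate(URLs[i])
  # Parse the rendered page and pull the fifth table inside the container
  CPSHList[[i]] <- remDR$getPageSource()[[1]] %>%
    read_html() %>%
    html_nodes(xpath = "//*[@id=\"dcCSScontentContainer\"]/div/table[5]") %>%
    html_table() %>%
    data.frame(stringsAsFactors = FALSE)
}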

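The problem I'm running into is that each page contains several tables, and some of those tables appear on some pages but not on others. As a result, the XPath of the specific table I want changes from page to page, depending on which other tables are present. After some initial research, I think I may be able to select the table by whether it contains a particular cell, based on its td tag. The table looks like this: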

<table class="dcCSStableLight" border="1" cellspacing="0" cellpadding="1" 
 bordercolor="#ececd7">
 <tbody>
  <tr>
    <td class="dark" align="left" colspan="8" bgcolor="#B0C4DE"><b>Current 
     Prison Sentence History:</b>
     </td>
   </tr>
  <tr bgcolor="#B0C4DE">
    <th><b>Offense Date</b>
      </th> 
    <th><b>Offense</b>
      </th>
    <th><b>Sentence Date</b>
      </th>
    <th><b>County</b>
      </th>
    <th><b>Case No.</b>
      </th>
    <th><b>Prison Sentence Length</b>
      </th>
  </tr>
 <tr valign="top" bgcolor="#FFFFFF">
  <td>06/14/2015</td><td>BURG/DWELL/OCCUP.CONVEY</td>
  <td>08/04/2016</td><td>ST. JOHNS</td><td>1501553</td>
  <td nowrap="">5Y 0M 0D </td>
   </tr>
 </tbody>
</table>

I came up with this:

"//div/table[contains(td, \"Current Prison Sentence History:\"]"

However, it returns an invalid expression error in R:

"Invalid expression [1207]xmlXPathEval: evaluation failed"

Thanks!

Answer 1

I'm not familiar with R, but you are passing an XPath expression through the css argument, which is wrong. Replace:

html_nodes(css = "//*[@id=\"dcCSScontentContainer\"]/div/table[5]")%>%

with:

html_nodes(xpath = "//*[@id=\"dcCSScontentContainer\"]/div/table[5]")%>%

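Your XPath can also be rewritten.

Instead of:

//div/table[contains(td, \"Current Prison Sentence History:\")]

use this:

//table[contains(.//b, 'Current Prison Sentence History:')]

The .// is needed because, in the markup shown, the b element is a descendant of the table (table > tbody > tr > td > b), not a direct child.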
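
As a rough check of how such predicates behave, the sketch below runs two expressions against a cut-down copy of the markup. It assumes the rvest package is installed; the first table and its "Aliases:" caption are invented purely for illustration. Note also that the expression in the question is missing the closing parenthesis of contains(), which on its own is enough to trigger the xmlXPathEval failure.

```r
library(rvest)

# Cut-down, hypothetical page: the first table and its "Aliases:" caption are
# invented for illustration; the second mirrors the table in the question.
html <- '<div id="dcCSScontentContainer"><div>
  <table><tbody><tr><td><b>Aliases:</b></td></tr></tbody></table>
  <table class="dcCSStableLight"><tbody>
    <tr><td colspan="8"><b>Current Prison Sentence History:</b></td></tr>
    <tr><td>06/14/2015</td><td>BURG/DWELL/OCCUP.CONVEY</td></tr>
  </tbody></table>
</div></div>'
page <- read_html(html)

# Child axis: td is not a direct child of table, so the node-set is empty
# and the predicate is false for every table.
child_xpath <- "//table[contains(td, 'Current Prison Sentence History:')]"

# Descendant axis: look for the caption anywhere inside the table.
# normalize-space() guards against line breaks inside the caption text.
desc_xpath <- "//table[.//b[contains(normalize-space(.), 'Current Prison Sentence History:')]]"

n_child <- length(html_nodes(page, xpath = child_xpath))
n_desc  <- length(html_nodes(page, xpath = desc_xpath))
target_class <- html_attr(html_nodes(page, xpath = desc_xpath), "class")
```

Here n_child and n_desc hold the number of tables each expression matches, so the two can be compared directly on a real page source as well.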

Answer 2

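Change the XPath as follows:

//table[.//td[contains(., 'Current Prison Sentence History:')]]

This returns the table itself rather than the td cell. The predicate uses . instead of text() because the caption sits inside a b element, so the td has no matching text node of its own.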

The XPath and CSS selector of the specific table change on every page.
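
Since the table's position varies and it may be missing on some pages, one option is to look it up by its caption and return NULL when it is absent. The following is only a sketch under stated assumptions: extract_sentence_table is a hypothetical helper name, the two page strings are made-up stand-ins for remDR$getPageSource()[[1]], and html_table() is assumed to come from a recent rvest (1.0 or later), which pads ragged rows automatically.

```r
library(rvest)

# Select the table by its caption instead of by position.
sentence_xpath <- paste0(
  "//table[.//td[contains(normalize-space(.), ",
  "'Current Prison Sentence History:')]]"
)

# Hypothetical helper: returns the first matching table as a data frame,
# or NULL when the page does not contain that table.
extract_sentence_table <- function(page_source) {
  tables <- html_nodes(read_html(page_source), xpath = sentence_xpath)
  if (length(tables) == 0) {
    return(NULL)
  }
  html_table(tables[[1]])
}

# Made-up page sources standing in for remDR$getPageSource()[[1]].
with_table <- '<html><body><table>
  <tr><td><b>Current Prison Sentence History:</b></td></tr>
  <tr><td>06/14/2015</td></tr>
</table></body></html>'
without_table <- '<html><body><table>
  <tr><td>Some other table</td></tr>
</table></body></html>'

# lapply keeps NULL entries, so positions still line up with the pages.
results <- lapply(c(with_table, without_table), extract_sentence_table)
```

Inside the original for loop, the result would need to be stored with CPSHList[i] <- list(extract_sentence_table(...)), because assigning NULL with CPSHList[[i]] <- NULL removes the element instead of storing it.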
