Question

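I have been working on a project that extracts an HTML table containing specific text ("Current Prison Sentence History:") from multiple URLs, where the URL changes based on a person's ID. I first tried a CSS selector, but because some pages have more tables than others, the selector that matches the table changes from page to page. So I thought I could use XPath to pick out the table I want based on its text content. The HTML is below.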

<table class="dcCSStableLight" border="1" cellspacing="0" cellpadding="1" 
 bordercolor="#ececd7">
  <tbody>
   <tr>
      <td class="dark" align="left" colspan="8" bgcolor="#B0C4DE">
        <b>Current Prison Sentence History:</b>
      </td>
   </tr>
   <tr bgcolor="#B0C4DE">
     <th><b>Offense Date</b>
     </th> 
     <th><b>Offense</b>
     </th>
     <th><b>Sentence Date</b>
     </th>
     <th><b>County</b>
     </th>
     <th><b>Case No.</b>
     </th>
     <th><b>Prison Sentence Length</b>
     </th>
   </tr>
   <tr valign="top" bgcolor="#FFFFFF">
     <td>06/14/2015</td>
     <td>BURG/DWELL/OCCUP.CONVEY</td>
     <td>08/04/2016</td><td>ST. JOHNS</td>
     <td>1501553</td>
     <td nowrap="">5Y 0M 0D </td>
   </tr>
  </tbody>
</table>

I came up with the following XPath to pull the table:

//*[@id='dcCSScontentContainer']/div/table/tbody/tr/td/b[contains(text(),"Current")]/ancestor::table

When I test the XPath in Chrome Developer Tools it returns the table I want, but in my RSelenium code it returns an empty list.

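library(RSelenium)
library(rvest)

# remDR (an open remote driver session), URLs and CPSHList are created earlier
for (i in 1:2) {
  remDR$navigate(URLs[i])
  remDR$screenshot(display = TRUE)
  remDR$setImplicitWaitTimeout(10000)  # milliseconds
  # Parse the rendered page source and pull the table by XPath
  CPSHList[[i]] <- remDR$getPageSource()[[1]] %>%
    read_html() %>%
    html_nodes(xpath = "//*[@id='dcCSScontentContainer']/div/table/tbody/tr/td/b[contains(text(),'Current')]/ancestor::table") %>%
    html_table() %>%
    data.frame(stringsAsFactors = FALSE)
}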

Answer 1

You should look for the table that contains a b element with this text:

//table[.//b[contains(text(), 'Current')]]
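
To see why this expression is less fragile, here is a minimal sketch that runs it with rvest against a trimmed copy of the table from the question. The wrapper div, the extra "Aliases" table and the variable names `page` and `target` are assumptions made up for illustration, not taken from the real site.

```r
library(rvest)  # re-exports read_html() from xml2

# Trimmed copy of the question's table, plus a made-up decoy table
page <- read_html('
<div id="dcCSScontentContainer">
  <table><tr><td><b>Aliases:</b></td></tr><tr><td>NONE</td></tr></table>
  <table class="dcCSStableLight">
    <tr><td colspan="2"><b>Current Prison Sentence History:</b></td></tr>
    <tr><th><b>Offense Date</b></th><th><b>County</b></th></tr>
    <tr><td>06/14/2015</td><td>ST. JOHNS</td></tr>
  </table>
</div>')

# Select any table that has a <b> descendant whose text contains "Current",
# regardless of how deeply the <b> is nested or where the table sits
target <- html_nodes(page, xpath = "//table[.//b[contains(text(), 'Current')]]")

length(target)              # number of matching tables
html_attr(target, "class")  # class attribute of the match
```

Because the predicate only asks for a matching b somewhere inside the table, it does not depend on the exact chain of div, tbody, tr and td elements above or below it, so it should keep working when the surrounding markup differs between pages.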

XPath works in Chrome dev tools but not in RSelenium
