Question

I am using R to scrape a website, and while parsing the HTML I have the following code:

    <div class="line">
        <h2 class="clearfix">
            <span class="property">Number<div>number extra</div></span>
            <span class="value">3</span>
        </h2>
    </div>
    <div class="line">
        <h2 class="clearfix">
            <span class="property">Surface</span>
            <span class="value">72</span>
        </h2>
    </div>

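Now I want to get some values out of this code.

How can I identify the span by its XML value "Number" and get that node, so that I can extract "number extra"? I know how to use xpathApply to identify a node and get its xmlValue or some attribute (e.g. href with xmlGetAttr). But I don't know how to identify a node when all I know is its xmlValue.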
```
xpathApply(page, '//span[@class="property"]',xmlValue)
```
If I want to get the "value" 72 for the "property" Surface, what is the most efficient way to do it?

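I started by doing this: first, I extract all the "property" spans:

    xpathApply(page, '//span[@class="property"]', xmlValue)

Then I extract all the "value" spans:

    xpathApply(page, '//span[@class="value"]', xmlValue)

Then I build a list or matrix so that I can look up the value for "Surface", i.e. 72. The problem is that sometimes a span with class="property" is not followed by a span with class="value" inside the h2, so the two results do not line up and I cannot build a proper list.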

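Could this be the most efficient way: identify the span with class="property", then identify the h2 that contains this span, and then identify the span with class="value"?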

Answer 1

Make your HTML well-formed by adding a single root element,

    <?xml version="1.0" encoding="UTF-8"?>
    <r>
      <div class="line">
        <h2 class="clearfix">
          <span class="property">Number
            <div>number extra</div>
          </span>
          <span class="value">3</span>
        </h2>
      </div>
      <div class="line">
        <h2 class="clearfix">
          <span class="property">Surface</span>
          <span class="value">72</span>
        </h2>
      </div>
    </r>

(A) This XPath expression,

    //span[@class='property' and starts-with(., 'Number')]/div/text()

will return

    number extra

as requested.

(B) This XPath expression,

    //h2[span[@class='property' and . = 'Surface']]/span[@class='value']/text()

will return

    72

as requested.
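
For the R side of the question, the two expressions can be run with the XML package the asker already uses. This is only a sketch: the variable names (`xml`, `doc`, `extra`, `surface`) are made up here, and the markup is a compact copy of the well-formed document above.

```r
library(XML)

# Compact copy of the well-formed document above (root element <r> added).
xml <- '<r>
  <div class="line"><h2 class="clearfix">
    <span class="property">Number<div>number extra</div></span>
    <span class="value">3</span>
  </h2></div>
  <div class="line"><h2 class="clearfix">
    <span class="property">Surface</span>
    <span class="value">72</span>
  </h2></div>
</r>'

# Parse the string (not a file path) into an internal document.
doc <- xmlParse(xml, asText = TRUE)

# (A) text of the div nested in the span whose content starts with "Number"
extra <- xpathSApply(
  doc,
  "//span[@class='property' and starts-with(., 'Number')]/div/text()",
  xmlValue
)

# (B) text of the value span that shares an h2 with the "Surface" property span
surface <- xpathSApply(
  doc,
  "//h2[span[@class='property' and . = 'Surface']]/span[@class='value']/text()",
  xmlValue
)

print(extra)
print(surface)
```

For real pages that are not well-formed XML, `htmlParse()` from the same package is probably the better entry point, with the same XPath expressions.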

Answer 2

XPath can match on the content of a tag with its own text() node test. Using rvest for simplicity:

    library(rvest)

    html <- '<div class="line">
    <h2 class="clearfix">
    <span class="property">Number<div>number extra</div></span>
    <span class="value">3</span>
    </h2>
    </div>
    <div class="line">
    <h2 class="clearfix">
    <span class="property">Surface</span>
    <span class="value">72</span>
    </h2>
    </div>'

    html %>% read_html() %>%    # read html
        html_nodes(xpath = '//span[text()="Number"]/*') %>%    # select node
        html_text()    # get text contents of node
    # [1] "number extra"
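
One detail that is easy to trip over: `text()` compares the span's own text nodes, while `.` compares its whole string value, which includes the text of nested elements. The sketch below (variable names are made up for illustration) counts the matches for each form on the "Number" span.

```r
library(rvest)

# The span's string value is "Numbernumber extra" because of the nested div.
page <- read_html('<span class="property">Number<div>number extra</div></span>')

# Whole string value compared with "Number": no match.
n_dot <- length(html_nodes(page, xpath = '//span[. = "Number"]'))

# Own text node compared with "Number": matches.
n_text <- length(html_nodes(page, xpath = '//span[text() = "Number"]'))

# Prefix test on the whole string value: matches.
n_starts <- length(html_nodes(page, xpath = '//span[starts-with(., "Number")]'))

print(c(dot = n_dot, text = n_text, starts_with = n_starts))
```

This is why an exact `. = 'Surface'` test works for the Surface span but the Number span needs either `text()` or `starts-with()`.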

XPath also has selectors to follow family axes, in this case following:: :

    html %>% read_html() %>%    # read html
        html_nodes(xpath = '//span[text()="Surface"]/following::*') %>%    # select node
        html_text()    # get text contents of node
    # [1] "72"
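
Note that `following::*` selects every element after the span in document order, so it returns only "72" here because the value span happens to be the last element. For the case raised in the question, where some property spans have no value span, a per-h2 lookup may be more robust. The sketch below is one possible approach, not the answerer's method: the third "Rooms" block is hypothetical input added to show the missing case, and it relies on `html_node()` returning one result per input node, with `html_text()` giving `NA` where nothing matched.

```r
library(rvest)

# Hypothetical input: the third block ("Rooms") has no value span.
html <- paste0(
  '<div class="line"><h2 class="clearfix">',
  '<span class="property">Number<div>number extra</div></span>',
  '<span class="value">3</span></h2></div>',
  '<div class="line"><h2 class="clearfix">',
  '<span class="property">Surface</span>',
  '<span class="value">72</span></h2></div>',
  '<div class="line"><h2 class="clearfix">',
  '<span class="property">Rooms</span></h2></div>'
)

page <- read_html(html)

# One node per h2 that contains a property span.
headers <- html_nodes(page, xpath = '//h2[span[@class="property"]]')

# html_node() (singular) keeps one entry per h2, so the columns stay aligned.
props <- data.frame(
  property = html_text(html_node(headers, xpath = './span[@class="property"]/text()')),
  value    = html_text(html_node(headers, xpath = './span[@class="value"]')),
  stringsAsFactors = FALSE
)

print(props)

# Look up a value by property name.
surface <- props$value[props$property == "Surface"]
print(surface)
```

Because each row comes from a single h2, a property without a value shows up as `NA` instead of shifting the values of later rows.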

How do I identify a node by its XML value in XPath?
