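How to return NA when XPath finds nothing?

Time: 2017-03-03 21:27:04

Tags: html r xpath web-scraping html-parsing

The question is hard to phrase, but an example makes it easy to understand.

I am using R to parse HTML.

Below, I have an HTML string named html, and I try to extract all the values in //span[@class="number"] as well as all the values in //span[@class="surface"]: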

html <- '<div class="line">
<span class="number">Number 1</span>
<span class="surface">Surface 1</span>
</div>
<div class="line">
<span class="surface">Surface 2</span>
</div>' 

library(XML)  # provides htmlTreeParse, xpathApply and xmlValue

page = htmlTreeParse(html, useInternalNodes = TRUE, encoding = "UTF-8")

number = unlist(xpathApply(page,'//span[@class="number"]',xmlValue))
surface = unlist(xpathApply(page,'//span[@class="surface"]',xmlValue))

The output for number is:

[1] "Number 1"

The output for surface is:

[1] "Surface 1" "Surface 2"

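Then, when I try to cbind these two vectors, I cannot, because they do not have the same length.

So my question is: what can I do to get this output for number:

[1] "Number 1" NA

so that I can then combine number and surface?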

2 answers:

Answer 0: (score: 1)

library( 'XML' )  # load library
doc = htmlParse( html )  # parse html
# define xpath expression. div contains class = line, within which span has classes number and surface
xpexpr <- '//div[ @class = "line" ]'  

a1 <- lapply( getNodeSet( doc, xpexpr ), function( x ) { # loop through nodeset
      y <- xmlSApply( x, xmlValue, trim = TRUE )  # get xmlvalue
      names(y) <- xmlApply( x, xmlAttrs ) # get xmlattributes and assign it as names to y
      y   # return y
    } )

Loop through a1, extract the values of number and surface, and set the names accordingly. Then column-bind the number and surface values:

nm <- c( 'number', 'surface' )
do.call( 'cbind', lapply( a1, function( x ) setNames( x[ nm ], nm ) ) )
#                [,1]        [,2]       
# number  "Number 1"  NA         
# surface "Surface 1" "Surface 2"
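
A variation on the same idea, as a minimal sketch rather than the answerer's code: evaluate a relative XPath inside each div and fall back to NA when nothing matches. The helper name first_or_na is hypothetical, and the sketch assumes each div holds at most one span of each class.

```r
library(XML)

html <- '<div class="line">
<span class="number">Number 1</span>
<span class="surface">Surface 1</span>
</div>
<div class="line">
<span class="surface">Surface 2</span>
</div>'

doc <- htmlParse(html, asText = TRUE)

# Hypothetical helper: text of the first node matching a relative XPath,
# or NA when the XPath matches nothing inside `node`.
first_or_na <- function(node, xpath) {
  hits <- xpathSApply(node, xpath, xmlValue)
  if (length(hits) == 0) NA_character_ else hits[[1]]
}

# One entry per enclosing div, so both vectors always have the same length.
lines <- getNodeSet(doc, '//div[@class="line"]')
number <- vapply(lines, first_or_na, character(1),
                 xpath = './span[@class="number"]')
surface <- vapply(lines, first_or_na, character(1),
                  xpath = './span[@class="surface"]')

result <- cbind(number, surface)
```

The leading ./ matters here: it keeps the search inside the current div, whereas a path starting with // would search the whole document again.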

Data:

html <- '<div class="line">
<span class="number">Number 1</span>
<span class="surface">Surface 1</span>
</div>
<div class="line">
<span class="surface">Surface 2</span>
</div>' 

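Answer 1: (score: 1)

It is easier to select the enclosing tag of each record (here, div) and then look for each tag within it. With rvest and purrr, which I find simpler: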

library(rvest)
library(purrr)

html %>% read_html() %>% 
    html_nodes('.line') %>% 
    map_df(~list(number = .x %>% html_node('.number') %>% html_text(), 
                 surface = .x %>% html_node('.surface') %>% html_text()))

#> # A tibble: 2 × 2
#>     number   surface
#>      <chr>     <chr>
#> 1 Number 1 Surface 1
#> 2     <NA> Surface 2
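
The same result can probably be had without purrr. The sketch below assumes rvest 1.0.0 or later, where html_elements() and html_element() are available; html_element() returns exactly one result per input node, with a missing node where nothing matches, and html_text() turns that into NA. The variable names rows and result are illustrative only.

```r
library(rvest)

html <- '<div class="line">
<span class="number">Number 1</span>
<span class="surface">Surface 1</span>
</div>
<div class="line">
<span class="surface">Surface 2</span>
</div>'

# Select every enclosing div once.
rows <- read_html(html) %>% html_elements('.line')

# html_element() (singular) keeps one slot per div, so a div without a
# matching span contributes NA instead of being dropped.
result <- data.frame(
  number = rows %>% html_element('.number') %>% html_text(),
  surface = rows %>% html_element('.surface') %>% html_text(),
  stringsAsFactors = FALSE
)
```

Because each column is built from the same set of div nodes, the columns line up row by row without any padding step.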