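The problem is hard to state in the abstract, but an example makes it easy to understand.
I am using R to parse HTML.
Below I have an HTML string named html, and I try to extract all the values in //span[@class="number"] as well as all the values in //span[@class="surface"]: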
html <- '<div class="line">
<span class="number">Number 1</span>
<span class="surface">Surface 1</span>
</div>
<div class="line">
<span class="surface">Surface 2</span>
</div>'
library(XML)
page = htmlTreeParse(html, asText = TRUE, useInternalNodes = TRUE, encoding = "UTF-8")
number = unlist(xpathApply(page, '//span[@class="number"]', xmlValue))
surface = unlist(xpathApply(page, '//span[@class="surface"]', xmlValue))
The output of number is:
[1] "Number 1"
The output of surface is:
[1] "Surface 1" "Surface 2"
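When I then try to cbind these two vectors, I can't, because they have different lengths.
So my question is: what can I do to get the following output for number:
[1] "Number 1" NA
so that I can then combine number and surface?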
Answer 0 (score: 1)
library( 'XML' ) # load library
doc = htmlParse( html ) # parse html
# define xpath expression. div contains class = line, within which span has classes number and surface
xpexpr <- '//div[ @class = "line" ]'
a1 <- lapply( getNodeSet( doc, xpexpr ), function( x ) { # loop through nodeset
  y <- xmlSApply( x, xmlValue, trim = TRUE ) # get xmlvalue
  names(y) <- xmlApply( x, xmlAttrs ) # get xmlattributes and assign it as names to y
  y # return y
} )
Loop through a1, extract the number and surface values, and set the names accordingly. Then column-bind the number and surface values:
nm <- c( 'number', 'surface' )
do.call( 'cbind', lapply( a1, function( x ) setNames( x[ nm ], nm ) ) )
#         [,1]        [,2]
# number  "Number 1"  NA
# surface "Surface 1" "Surface 2"
Data:
html <- '<div class="line">
<span class="number">Number 1</span>
<span class="surface">Surface 1</span>
</div>
<div class="line">
<span class="surface">Surface 2</span>
</div>'
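The same per-div idea can also be written with relative XPath lookups, which makes the NA padding explicit. This is only a minimal sketch, assuming the XML package is installed; the helper name first_or_na is hypothetical and is not part of the package.

```r
library(XML)

html <- '<div class="line">
<span class="number">Number 1</span>
<span class="surface">Surface 1</span>
</div>
<div class="line">
<span class="surface">Surface 2</span>
</div>'

doc <- htmlParse(html, asText = TRUE)

# Hypothetical helper: value of the first match of a relative XPath
# under `node`, or NA when the node has no matching child.
first_or_na <- function(node, xpath) {
  vals <- xpathSApply(node, xpath, xmlValue)
  if (length(vals) == 0) NA_character_ else vals[[1]]
}

# One entry per enclosing div, so both vectors always have the same length.
lines <- getNodeSet(doc, '//div[@class="line"]')
number  <- vapply(lines, first_or_na, character(1), xpath = './span[@class="number"]')
surface <- vapply(lines, first_or_na, character(1), xpath = './span[@class="surface"]')

result <- cbind(number, surface)
```

Because each div contributes exactly one value per column, cbind receives two vectors of equal length, with NA where the second div has no number span.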
Answer 1 (score: 1)
It is easier to select the enclosing tag of each record (here, div) and then look up each tag inside it. I find this simpler with rvest and purrr:
library(rvest)
library(purrr)
html %>% read_html() %>%
  html_nodes('.line') %>%
  map_df(~list(number = .x %>% html_node('.number') %>% html_text(),
               surface = .x %>% html_node('.surface') %>% html_text()))
#> # A tibble: 2 × 2
#>   number   surface
#>   <chr>    <chr>
#> 1 Number 1 Surface 1
#> 2 <NA>     Surface 2
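This works because html_node() returns exactly one result per input node, using a "missing" node when nothing matches, and html_text() turns a missing node into NA. As a hedged sketch of the same idea without purrr, assuming only that rvest is installed, the lookup can be applied to the whole node set at once:

```r
library(rvest)

html <- '<div class="line">
<span class="number">Number 1</span>
<span class="surface">Surface 1</span>
</div>
<div class="line">
<span class="surface">Surface 2</span>
</div>'

# One node per enclosing div.
lines <- html_nodes(read_html(html), '.line')

# html_node() is vectorised over the node set: one result per div,
# so both columns have the same length and a missing span becomes NA.
result <- data.frame(
  number  = html_text(html_node(lines, '.number')),
  surface = html_text(html_node(lines, '.surface')),
  stringsAsFactors = FALSE
)
```

The result is a plain data.frame rather than a tibble, with one row per div.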