Question

I have a set of HTML pages. I want to extract all table nodes that have the attribute border = 1. Here is an example:

<table border="1" cellspacing="0" cellpadding="5">
   <tbody><tr><td>
    <table border="0" cellpadding="2" cellspacing="0">
      <tbody><tr>
        <td bgcolor="#ff9999"><strong><font size="+1">CASEID</font></strong></td>
      </tr></tbody>
    </table>
   <tr><td>[tbody]
</table>

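In this example, I want to select the table node with border = 1, not the one with border = 0. I am using html_nodes() from rvest, but I don't know how to add the attribute condition: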

html_nodes(x, "table")

Answer 1

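There are two main ways to find nodes in HTML and similar documents: CSS selectors and XPath. CSS is usually easier to use but cannot handle more complex use cases, whereas XPath provides functions for things such as searching for text inside a node. Which one to use is always up for debate, but I think both are worth trying.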

library(rvest)

with_css <- html_nodes(x, css = "table[border='1']")
with_css
#> {xml_nodeset (1)}
#> [1] <table border="1" cellspacing="0" cellpadding="5"><tbody>\n<tr><td>\n     ...

Verify that this is the right table:

html_table(with_css, fill = TRUE)
#> [[1]]
#>        X1     X2
#> 1  CASEID CASEID
#> 2  CASEID   <NA>
#> 3 [tbody]   <NA>

The equivalent XPath gets the same table.

with_xpath <- html_nodes(x, xpath = "//table[@border=1]")
with_xpath
#> {xml_nodeset (1)}
#> [1] <table border="1" cellspacing="0" cellpadding="5"><tbody>\n<tr><td>\n     ...
html_table(with_xpath, fill = TRUE)
#> [[1]]
#>        X1     X2
#> 1  CASEID CASEID
#> 2  CASEID   <NA>
#> 3 [tbody]   <NA>
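
As a rough illustration of the text-searching ability mentioned above, the sketch below combines an attribute test with XPath's contains() function. It assumes a tidied-up copy of the question's sample stands in for x (the original x is not shown), and the object names are hypothetical.

```r
library(rvest)

# Assumption: a cleaned-up copy of the question's sample stands in for `x`.
sample_html <- '
<table border="1" cellspacing="0" cellpadding="5">
  <tbody><tr><td>
    <table border="0" cellpadding="2" cellspacing="0">
      <tbody><tr>
        <td bgcolor="#ff9999"><strong><font size="+1">CASEID</font></strong></td>
      </tr></tbody>
    </table>
  </td></tr></tbody>
</table>'
x <- read_html(sample_html)

# Tables with border="1" whose text (including nested nodes) contains "CASEID".
bordered_with_text <- html_nodes(
  x, xpath = "//table[@border='1'][contains(., 'CASEID')]"
)

# Confirm which attribute value was matched.
html_attr(bordered_with_text, "border")

# Cells that carry a bgcolor attribute and contain the text "CASEID".
caseid_cells <- html_nodes(x, xpath = "//td[@bgcolor][contains(., 'CASEID')]")
html_text(caseid_cells, trim = TRUE)
```

A plain CSS selector can express the attribute tests here but not the text condition, which is the kind of case where XPath tends to be the more practical choice.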

Answer 2

Check out the CSS3 selectors documentation linked from the html_nodes documentation. It gives a thorough explanation of the CSS selector syntax.

For your case, you want

html_nodes(x, "tag[attribute]")

to select all tag elements that have attribute set, or

html_nodes(x, "tag[attribute=value]")

to select all tag elements whose attribute is set to value.
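
Applied to the question's sample, the two forms might look like the sketch below. It assumes a tidied-up copy of the sample HTML stands in for x, and the object names are hypothetical.

```r
library(rvest)

# Assumption: a cleaned-up copy of the question's sample stands in for `x`.
x <- read_html('
<table border="1" cellspacing="0" cellpadding="5">
  <tbody><tr><td>
    <table border="0" cellpadding="2" cellspacing="0">
      <tbody><tr><td bgcolor="#ff9999">CASEID</td></tr></tbody>
    </table>
  </td></tr></tbody>
</table>')

# tag[attribute]: every <table> that has a border attribute, whatever its value.
has_border <- html_nodes(x, "table[border]")
length(has_border)   # 2: both the outer and the nested table

# tag[attribute=value]: only <table> elements whose border is exactly "1".
border_one <- html_nodes(x, "table[border='1']")
length(border_one)   # 1: the outer table only

# Inspect the attribute values that were matched.
html_attr(has_border, "border")
html_attr(border_one, "border")
```

Quoting the value, as in table[border='1'], is the safer habit because an unquoted value has to be a valid CSS identifier, and a bare digit is not one.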

How to select nodes with "attribute = x" using html_nodes in R?
