Question

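I am using R (rvest) to scrape articles from different sites, which often use different structures, and I would like to use XPath to extract all HTML nodes whose descendants contain some text (without duplicates).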

Simplified, the structure could look like the sample below (whitespace introduced for readability). I have tried several different XPath expressions, but they always seem to select duplicate nodes.

<html>
<body>
    <a name="SomeMarker">
            <font style="FONT-SIZE: 12pt;"><b>Sports article</b></font>
    </a>
<div>
<b>This is possibly an article heading</b>
<font style="FONT-SIZE: 10pt;"> This is the <i>body</i> of the article.</font>
<font style="FONT-SIZE: 10pt;"> It could have <i><b>interesting tags</b></i> embedded in the text</font>
</div>

<p id="SomeId"><b>This is another article heading</b>
    <font style="FONT-SIZE: 10pt;"> This is the <i>body</i> of the second article</font>
    <p><font style="FONT-SIZE: 10pt;"> It could have further <i><b><u>interesting tags</u></b></i> embedded in the text</font></p>
</p>

</body>
</html>

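The expressions I tried include the following, and so on; all of them result in various permutations of the text nodes, with many duplicates: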

"//a/following::*//*[text()]"
"//a/following::*/*[normalize-space(text())]"
"//a/following::*/*[normalize-space(text())]/parent::*"

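The preferred result is to get only the top-level nodes whose descendants contain some text. In the above case, I currently get the following instead: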

[1] <div>\n<b>This is possibly an article heading</b><font style="FONT-SIZE: 10pt;"> This is the <i>body</i> of the article.</font><font style="FONT-SIZE: 10pt;"> It could have <i><b>interes ...
[2] <font style="FONT-SIZE: 10pt;"> This is the <i>body</i> of the article.</font>
[3] <i><b>interesting tags</b></i>
[4] <p id="SomeId"><b>This is another article heading</b><font style="FONT-SIZE: 10pt;"> This is the <i>body</i> of the second article.</font></p>\n
[5] <font style="FONT-SIZE: 10pt;"> This is the <i>body</i> of the second article.</font>
[6] <p><font style="FONT-SIZE: 10pt;"> It could have further <i><b><u>interesting tags</u></b></i> embedded in the text</font></p>
[7] <b><u>interesting tags</u></b>

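I know the XPath is only being used to find the text; what I actually want is the HTML nodes with their markup intact, because I want to do further processing on the top-level nodes (e.g. extract the heading). Many thanks.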

Answer 1

Option 1: use following-sibling::

//a/following-sibling::*[*[text()]]

Option 2: use following:: together with parent::

//a/following::*[*[text()]][parent::body]
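
As a rough illustration, the two options could be tried in R along these lines. This is only a sketch under a few assumptions: the expressions are read as `//a/following-sibling::*[*[text()]]` and `//a/following::*[*[text()]][parent::body]`, the HTML is a trimmed, well-formed version of the sample from the question, and `html_elements()` is available (rvest 1.0.0 or later; older versions use `html_nodes()` instead).

```r
library(rvest)

# Trimmed, well-formed version of the sample page (an assumption for
# illustration; the real pages vary in structure).
html <- '<html><body>
<a name="SomeMarker"><font><b>Sports article</b></font></a>
<div><b>This is possibly an article heading</b>
<font> This is the <i>body</i> of the article.</font></div>
<p id="SomeId"><b>This is another article heading</b>
<font> This is the <i>body</i> of the second article</font></p>
</body></html>'

doc <- read_html(html)

# Option 1: siblings following <a> that have a child element containing text.
opt1 <- html_elements(doc, xpath = "//a/following-sibling::*[*[text()]]")

# Option 2: any element following <a> that has such a child,
# restricted to direct children of <body>.
opt2 <- html_elements(doc, xpath = "//a/following::*[*[text()]][parent::body]")

# The selected nodes keep their markup, so further processing is possible,
# e.g. pulling the first <b> of each node as a candidate heading.
html_name(opt1)
html_name(opt2)
html_text(html_element(opt1, xpath = "./b"))
```

Both expressions rely on the top-level article containers being direct children of `<body>` and siblings of the `<a>` marker; pages that wrap the articles in additional containers would likely need an adjusted path.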

How to find top-level text nodes without duplicates