Question

I have an HTML file that looks like this (simplified):

<table class="main" frame="box" rules="all" border="0" cellspacing="0" cellpadding="0" style="table-layout:fixed; width:325.68pt; height:528.96pt;">
Here is some text.
<table class="main" frame="box" rules="all" border="0" cellspacing="0" cellpadding="0" style=" width:50.88pt; height:77.28pt;">
Here is another text which ends right here.
</table>
Here are also some words...
</table>

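What I want to extract is the content of table class="main"; in plain words, I want to extract exactly what is written above. Keep in mind that this example is simplified; around the table tags there are many other tags... I tried to extract the content with the following code: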

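import lxml.html

# Parse the document (placeholder location from the question) and get the root element.
root = lxml.html.parse('www.test.xyz').getroot()

# Remove empty <b> and <i> elements (those with no child nodes).
for empty in root.xpath('//*[self::b or self::i][not(node())]'):
    empty.getparent().remove(empty)

# Select every table with class "main", nested ones included.
tables = root.cssselect('table.main')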

The code above works. The problem is that I get one part twice. To show what I mean, the result of the code is:

<table class="main" frame="box" rules="all" border="0" cellspacing="0" cellpadding="0" style="table-layout:fixed; width:325.68pt; height:528.96pt;">
Here is some text.
<table class="main" frame="box" rules="all" border="0" cellspacing="0" cellpadding="0" style=" width:50.88pt; height:77.28pt;">
Here is another text which ends right here.
</table>
Here are also some words...
</table>
<table class="main" frame="box" rules="all" border="0" cellspacing="0" cellpadding="0" style=" width:50.88pt; height:77.28pt;">
Here is another text which ends right here.
</table>

So the problem is that the middle part appears once more at the end. Why does this happen, and how can I avoid it?

paul t., also a Stack Overflow user, told me to use root.xpath('//table[@class="main" and not(.//table[@class="main"])]'). That code prints only the part that I get twice.

I hope the description of the problem is clear enough... Thanks for any help and any suggestions :)

Answer 1

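You want to select all tables of class "main" that are not themselves descendants of another table of class "main", i.e. only the outermost ones.
This seems to work fine: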

root.xpath('//table[@class="main" and not(ancestor::table[@class="main"])]')
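
To see why the nested table shows up twice, and how the two XPath expressions differ, here is a minimal sketch. The sample markup, the id attributes and the helper function ids are assumptions made for illustration and are not part of the original file; the tr/td wrappers are added only so that the nesting is unambiguous.

```python
import lxml.html

# Hypothetical sample markup: an outer "main" table containing a nested one.
# The id attributes are added only so the matches are easy to tell apart.
SAMPLE = """
<div>
  <table class="main" id="outer"><tr><td>
    Here is some text.
    <table class="main" id="inner"><tr><td>Here is another text.</td></tr></table>
    Here are also some words...
  </td></tr></table>
</div>
"""

root = lxml.html.fromstring(SAMPLE)


def ids(xpath_expr):
    """Return the id attributes of the elements matched by xpath_expr."""
    return [el.get("id") for el in root.xpath(xpath_expr)]


# Every "main" table matches, so the nested one is a separate result.
all_main = ids('//table[@class="main"]')

# Tables with no "main" table below them: only the innermost one.
innermost = ids('//table[@class="main" and not(.//table[@class="main"])]')

# Tables with no "main" table above them: only the outermost one.
outermost = ids('//table[@class="main" and not(ancestor::table[@class="main"])]')

# Serializing the outer table already includes the nested table's markup.
outer_el = root.xpath('//table[@id="outer"]')[0]
outer_html = lxml.html.tostring(outer_el, encoding="unicode")

print(all_main)   # ['outer', 'inner']
print(innermost)  # ['inner']
print(outermost)  # ['outer']
print('id="inner"' in outer_html)  # True
```

Because serializing the outer table already contains the inner one, printing every match repeats the inner table. Note also that @class="main" only matches when the attribute value is exactly main, whereas the CSS selector table.main also matches elements that carry additional classes.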
