Question

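I have an HTML document with nested tables. I want to find the text that lies between the outer table and the inner tables. I thought this was a classic problem, but so far I haven't found an answer. What I came up with is tree.xpath('//p[not(ancestor-or-self::table)]'). But this doesn't work, because all of the text descends from the outer table. Also, using preceding::table alone isn't enough, because the text can surround the inner tables.

As a conceptual example, if a table looks like [...text1...[inside table No.1]...text2...[inside table No.2]...text3...], how can I get text1/2/3 without it being polluted by the text of inside tables No.1 and No.2? Maybe this is just my own idea, but is it possible to build a notion of table layers with XPath, so that I can tell lxml or another library "give me all the text between layer 0 and layer 1"?

Below is a simplified sample HTML file. In reality, the outer table may contain many nested tables, but I only want the text between the outermost table and the first level of nested tables. Thanks, everyone!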

<table>
    <tr><td>
        <p> text I want </p>
        <div> they can be in different types of nodes </div>
        <table>
            <tr><td><p> unwanted text </p></td></tr>
            <tr><td>
                <table>
                    <tr><td><u> unwanted text</u></td></tr> 
                </table>
            </td></tr>
        </table>
        <p> text I also want </p>
        <div> as long as they're inside the root table and outside the first-level inside tables </div>
    </td></tr>
    <tr><td>
        <u> they can be between the first-level inside tables </u>
        <table>
        </table>
    </td></tr>
</table>

It should return ["text I want", "they can be in different types of nodes", "text I also want", "as long as they're inside the root table and outside the first-level inside tables", "they can be between the first-level inside tables"].

Answer 1

One XPath that can do this, if the outermost table is the root element:

/table/descendant::table[1]/preceding::p

Here, you go to the first descendant table of the outermost table and then select all the p elements that precede it.
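To see what that expression selects, here is a minimal sketch using lxml. It assumes a shortened version of the sample from the question, stored in a hypothetical SAMPLE string and parsed as XML so that the outermost table really is the root element. On this input the expression only picks up p elements that come before the first nested table, so text in other element types, or text after a nested table, is not covered by it.

```python
from lxml import etree

# Shortened copy of the question's sample (SAMPLE is a name made up for this sketch).
SAMPLE = """<table>
  <tr><td>
    <p> text I want </p>
    <div> they can be in different types of nodes </div>
    <table>
      <tr><td><p> unwanted text </p></td></tr>
      <tr><td><table><tr><td><u> unwanted text</u></td></tr></table></td></tr>
    </table>
    <p> text I also want </p>
  </td></tr>
  <tr><td>
    <u> they can be between the first-level inside tables </u>
    <table></table>
  </td></tr>
</table>"""

# Parse as XML so the outermost <table> is the document root.
root = etree.fromstring(SAMPLE)

# The XPath from the answer: p elements preceding the first nested table.
before_first = root.xpath("/table/descendant::table[1]/preceding::p")
texts = [p.text.strip() for p in before_first]
print(texts)  # ['text I want']
```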

If not, you will have to take a different approach to reach the p elements between the tables, possibly using the generate-id() function.
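As one possible alternative (not the generate-id() route mentioned above, but a different technique swapped in for illustration), the "layer" idea from the question can be approximated by counting table ancestors: a text node with exactly one table ancestor sits inside the outermost table but outside every nested one. The sketch below assumes lxml, a shortened copy of the question's sample in a hypothetical SAMPLE string, and a helper name, texts_at_table_depth, invented for this example.

```python
from lxml import etree

# Shortened copy of the question's sample (SAMPLE is a name made up for this sketch).
SAMPLE = """<table>
  <tr><td>
    <p> text I want </p>
    <div> they can be in different types of nodes </div>
    <table>
      <tr><td><p> unwanted text </p></td></tr>
      <tr><td><table><tr><td><u> unwanted text</u></td></tr></table></td></tr>
    </table>
    <p> text I also want </p>
  </td></tr>
  <tr><td>
    <u> they can be between the first-level inside tables </u>
    <table></table>
  </td></tr>
</table>"""


def texts_at_table_depth(root, depth=1):
    """Return non-blank text nodes that have exactly `depth` table ancestors."""
    # $d is an XPath variable; lxml fills it from the keyword argument.
    nodes = root.xpath("//text()[count(ancestor::table) = $d]", d=depth)
    # Drop whitespace-only nodes (indentation between tags) and trim the rest.
    return [t.strip() for t in nodes if t.strip()]


root = etree.fromstring(SAMPLE)
wanted = texts_at_table_depth(root, depth=1)
print(wanted)
```

Because the predicate does not care about the element type, it covers p, div and u alike, and it does not require the outermost table to be the root element. Passing depth=2 would return the text that belongs to the first-level nested tables instead.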

lxml xpath: get the text between two nested tables
