Question

So, I have this kind of HTML structure:

<table id="proposal-details" class="details">

                        <tbody><tr>
                            <th>
                                Application type:
                            </th>
                            <td>
                                P
                            </td>
                        </tr>
                        <tr>
                            <th>
                                Proposed development
                            </th>
                            <td>
                                Prune 1 x Eucalyptus
                            </td>
                        </tr>
                        <tr>
                            <th>
                                Date received:
                            </th>
                            <td>
                                06 Feb 2015
                            </td>
                        </tr>
                        <tr>
                            <th>
                                Registration date:
                                <br>
                                (Statutory start date)
                            </th>
                            <td>
                                06 Feb 2015
                            </td>
                        </tr>

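I have an XPath that captures the text of every th. It works until the last one, "Registration date:", where I don't want to select the text that follows the br.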

I tried to work around this, but the problem is with this XPath:

len(response.xpath("//table//tr//th[not(.//br)]/text()").extract())

The entire th tag gets ignored. Any suggestions?

This is the output I get:

[u' Application type: ',
 u' Proposed development ',
 u' Date received: ']

What I actually need is "Registration date:" in the list, without "(Statutory start date)".

Answer 1

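As far as I understand your question, you want the text of all th elements, but without any text that comes after a <br>. If that is the case, use the following XPath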

//table//tr//th/text()[not(preceding-sibling::br)]

which, applied to your input, yields

Application type:
Proposed development
Date received:
Registration date:

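The XPath you were using skips every th that contains a br anywhere inside it, which is why the whole "Registration date:" header disappears: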

th[not(.//br)]

By contrast, th/text()[not(preceding-sibling::br)] selects every text node of a th that has no preceding sibling br.
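
To see the difference between the two expressions side by side, here is a minimal sketch. It assumes lxml is installed (Scrapy selectors are built on it) and uses a condensed copy of the table from the question; the names SNIPPET and clean are made up for this illustration. In a Scrapy callback the same XPath strings would go into response.xpath(...).extract() instead.

```python
from lxml import html  # assumption: lxml is available

# Condensed version of the table from the question (illustrative copy).
SNIPPET = """
<table id="proposal-details" class="details">
  <tr><th> Application type: </th><td> P </td></tr>
  <tr><th> Proposed development </th><td> Prune 1 x Eucalyptus </td></tr>
  <tr><th> Date received: </th><td> 06 Feb 2015 </td></tr>
  <tr><th> Registration date: <br> (Statutory start date) </th><td> 06 Feb 2015 </td></tr>
</table>
"""

tree = html.fromstring(SNIPPET)


def clean(nodes):
    """Strip surrounding whitespace and drop whitespace-only text nodes."""
    return [s.strip() for s in nodes if s.strip()]


# Original attempt: the predicate sits on th, so any th containing a br is dropped.
original = clean(tree.xpath("//table//tr//th[not(.//br)]/text()"))

# Suggested fix: the predicate sits on the text node, so only text after a br is dropped.
fixed = clean(tree.xpath("//table//tr//th/text()[not(preceding-sibling::br)]"))

print(original)  # ['Application type:', 'Proposed development', 'Date received:']
print(fixed)     # ['Application type:', 'Proposed development', 'Date received:', 'Registration date:']
```

The strip step is only there because the raw text nodes carry the indentation and newlines from the markup; the XPath itself does not trim them.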

XPath to get the text of all nodes except nodes with a specific tag
