Question

Actually, the situation is a bit more complicated than the title suggests.

I am trying to extract data from this sample HTML:

<li itemprop="itemListElement">
    <h4>
        <a href="/one" title="page one">one</a>
    </h4>
</li>

<li itemprop="itemListElement">
    <h4>
        <a href="/two" title="page two">two</a>
    </h4>
</li>

<li itemprop="itemListElement">
    <h4>
        <a href="/three" title="page three">three</a>
    </h4>
</li>

<li itemprop="itemListElement">
    <h4>
        <a href="/four" title="page four">four</a>
    </h4>
</li>

I am currently using Python 3 with urllib and lxml. For some reason, the following code does not work as expected (please read the comments):

import urllib.request

from lxml import html

scan = []

example_url = "path/to/html"
page = html.fromstring(urllib.request.urlopen(example_url).read())

# Extracting the li elements from the html
for item in page.xpath("//li[@itemprop='itemListElement']"):
    scan.append(item)

# At this point, the list 'scan' length is 4 (Nothing wrong)

for list_item in scan:
    # This is supposed to print '1' since there's only one match
    # Yet, this actually prints '4' (This is wrong)
    print(len(list_item.xpath("//h4/a")))

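As you can see, the first step extracts the 4 li elements and appends them to a list; each li element is then scanned for its a element. The problem is that every li element in scan appears to contain all four a elements.

...Or so I thought.

After some quick debugging, I found that the scan list correctly contains the four li elements, so I came to one possible conclusion: something is wrong with the for loop mentioned above.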

for list_item in scan:
    # This is supposed to print '1' since there's only one match
    # Yet, this actually prints '4' (This is wrong)
    print(len(list_item.xpath("//h4/a")))

    # Something is wrong here...

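The only real problem is that I can't pinpoint the bug. What is causing it?

PS: I know there is a simpler way to get the a elements from the list, but this is just a sample HTML; the real one contains many more... things.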

Answer 1

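In your example, when the XPath starts with //, it searches from the root of the document, even when it is called on an li element (which is why it matches all four anchor elements). If you want to search relative to the li element, omit the leading slashes: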

for item in page.xpath("//li[@itemprop='itemListElement']"):
    scan.append(item)

for list_item in scan:
    print(len(list_item.xpath("h4/a")))

Of course, you can also replace // with .// so that the search is relative as well:

for item in page.xpath("//li[@itemprop='itemListElement']"):
    scan.append(item)

for list_item in scan:
    print(len(list_item.xpath(".//h4/a")))
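
To see the three forms side by side without fetching a URL, here is a minimal sketch that parses an inline copy of the sample markup. The SAMPLE string and the wrapping ul element are assumptions added for illustration; they are not part of the original page.

```python
from lxml import html

# Inline copy of the sample markup. The <ul> wrapper is an assumption,
# added only so the four <li> elements share a single parent.
SAMPLE = """
<ul>
  <li itemprop="itemListElement"><h4><a href="/one" title="page one">one</a></h4></li>
  <li itemprop="itemListElement"><h4><a href="/two" title="page two">two</a></h4></li>
  <li itemprop="itemListElement"><h4><a href="/three" title="page three">three</a></h4></li>
  <li itemprop="itemListElement"><h4><a href="/four" title="page four">four</a></h4></li>
</ul>
"""

page = html.fromstring(SAMPLE)
items = page.xpath("//li[@itemprop='itemListElement']")

# "//h4/a" is absolute: it ignores the <li> it is called on.
absolute_counts = [len(li.xpath("//h4/a")) for li in items]

# ".//h4/a" searches the descendants of the current <li> only.
relative_counts = [len(li.xpath(".//h4/a")) for li in items]

# "h4/a" follows direct children of the current <li> only.
child_counts = [len(li.xpath("h4/a")) for li in items]

# Relative paths also make per-item extraction straightforward.
hrefs = [li.xpath("h4/a/@href")[0] for li in items]

print(absolute_counts)  # [4, 4, 4, 4]
print(relative_counts)  # [1, 1, 1, 1]
print(child_counts)     # [1, 1, 1, 1]
print(hrefs)            # ['/one', '/two', '/three', '/four']
```

Note that h4/a only matches when h4 is a direct child of the li, whereas .//h4/a also matches an h4 nested deeper, which may matter for the more complex real page.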

Here is the relevant quote from the specification:

2.5 Abbreviated Syntax

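// is short for /descendant-or-self::node()/. For example, //para is short for /descendant-or-self::node()/child::para and so will select any para element in the document (even a para element that is a document element will be selected by //para since the document element node is a child of the root node); div//para is short for div/descendant-or-self::node()/child::para and so will select all para descendants of div children.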
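
The equivalence described in the quote can be checked directly. This is a small sketch on a made-up XML document (the doc, div, section and para element names are assumptions chosen to mirror the wording of the spec), comparing each abbreviated path with its expanded form:

```python
from lxml import etree

# Hypothetical document, used only to exercise the two abbreviations.
doc = etree.fromstring(
    "<doc>"
    "<para>a</para>"
    "<div><section><para>b</para></section></div>"
    "<div><para>c</para></div>"
    "</doc>"
)


def texts(path):
    """Evaluate an XPath with <doc> as the context node and return the matched text."""
    return [element.text for element in doc.xpath(path)]


# Absolute form: starts at the document root, so every para is selected.
abbreviated_all = texts("//para")
expanded_all = texts("/descendant-or-self::node()/child::para")

# Relative form: starts at the div children of the context node.
abbreviated_div = texts("div//para")
expanded_div = texts("div/descendant-or-self::node()/child::para")

print(abbreviated_all, expanded_all)  # ['a', 'b', 'c'] ['a', 'b', 'c']
print(abbreviated_div, expanded_div)  # ['b', 'c'] ['b', 'c']
```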

Answer 2

print(len(list_item.xpath(".//h4/a")))

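// is short for /descendant-or-self::node()/. Because it starts with /, it searches from the root node of the document.

Use . to indicate that the current context node is list_item, not the whole document.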

XPath finds an incorrect number of results
