Question

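I'm almost certainly making some terrible mistake, and the cause of my problem is probably my own ignorance, but reading the Python docs and examples hasn't helped.

I'm scraping the web. The pages I'm scraping contain the following elements of interest: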

<div class='parent'>
   <span class='title'>
      <a>THIS IS THE TITLE</a>
   </span>
   <div class='copy'>
      <p>THIS IS THE COPY</p>
   </div>
</div>

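My goal is to extract the text nodes from 'title' and 'copy', grouped by their parent div. In the example above, I'd like to retrieve the tuple ('THIS IS THE TITLE', 'THIS IS THE COPY').

Here is my code: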

## 'tree' is the ElementTree of the document I've just pulled 
xpath = "//div[@class='parent']"
filtered_html = tree.xpath(xpath)

arr = []

for i in filtered_html:

   title_filter = "//span[@class='author']/a/text()"  # xpath for title text
   copy_filter = "//div[@class='copy']/p/text()"      # xpath for copy text

   title = i.getroottree().xpath(title_filter)
   copy = i.getroottree().xpath(copy_filter)
   arr.append((title, copy))

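I expect filtered_html to be a list of n elements (and it is). I then try to iterate over that list and, for each element, convert it to an ElementTree and retrieve the title and copy text with another XPath expression. So on each iteration I expect title to be a list of length 1 containing the title text for element i, and copy to be the corresponding list for the copy text.

What I end up with: on every iteration, title is a list of length n containing every element in the document that matches the title_filter XPath expression, and copy is the corresponding list of length n for the copy text.

I'm sure that by now anyone who knows what they're doing with XPath and etree can see that I'm doing something horrible, wrong, and stupid. If so, could they please tell me how I should be doing it instead?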

Answer 1

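Your core problem is that the getroottree call you make on each element throws away the context you just selected, so your XPath runs over the whole tree again. getroottree does exactly what it sounds like: it returns the root element tree of the element you call it on. Drop that call and query the element itself. Note also that an expression beginning with // is evaluated from the document root even when you call xpath() on an element, so start the per-element expressions with a dot (for example .//span[@class='title']/a/text()) to keep the search inside the current div. And check the class name: your title_filter matches 'author', while the sample markup uses 'title'.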
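To make the difference concrete, here is a minimal sketch of the corrected loop. It assumes lxml is installed; the two-div sample document and the names HTML, root, and all_titles are made up for illustration and are not from the question.

```python
from lxml import etree

# Hypothetical sample document with two parent divs, mirroring the question's markup.
HTML = """
<html><body>
  <div class='parent'>
    <span class='title'><a>FIRST TITLE</a></span>
    <div class='copy'><p>FIRST COPY</p></div>
  </div>
  <div class='parent'>
    <span class='title'><a>SECOND TITLE</a></span>
    <div class='copy'><p>SECOND COPY</p></div>
  </div>
</body></html>
"""

root = etree.fromstring(HTML)
parents = root.xpath("//div[@class='parent']")

arr = []
for parent in parents:
    # The leading "." anchors each expression at the current parent div.
    title = parent.xpath(".//span[@class='title']/a/text()")
    copy = parent.xpath(".//div[@class='copy']/p/text()")
    arr.append((title, copy))

# For comparison: without the ".", the search starts at the document root,
# even though xpath() is called on a single parent div.
all_titles = parents[0].xpath("//span[@class='title']/a/text()")

print(arr)
print(all_titles)
```

Each tuple in arr holds two one-item lists, which is the shape the question expected, while all_titles collects the titles of every parent div in the document.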

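Personally, I would use the iterfind method on the element tree for the main loop, and probably the findtext method on the resulting elements, to make sure I get exactly one title and one copy per div.

My (untested!) code would look something like this: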

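# A leading "." keeps each path relative to the element it is called on
parent_div_xpath = ".//div[@class='parent']"
title_filter = ".//span[@class='title']/a"
copy_filter = ".//div[@class='copy']/p"
# findtext returns the text of the first match, or None if nothing matches
arr = [(i.findtext(title_filter), i.findtext(copy_filter)) for i in tree.iterfind(parent_div_xpath)]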

Alternatively, you could skip the explicit iteration entirely:

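title_filter = "//div[@class='parent']/span[@class='title']/a/text()"
copy_filter = "//div[@class='parent']/div[@class='copy']/p/text()"
# xpath() understands text(); findall() does not
arr = list(zip(tree.xpath(title_filter), tree.xpath(copy_filter)))

text() is only understood by lxml's xpath() method, which is why the snippet above calls xpath(). findall() uses the more limited ElementPath syntax, which has no text() function. If you would rather use findall(), drop text() from the paths and read .text in a generator expression instead, something like:

title_path = ".//div[@class='parent']/span[@class='title']/a"
copy_path = ".//div[@class='parent']/div[@class='copy']/p"
arr = list(zip((title.text for title in tree.findall(title_path)),
               (copy.text for copy in tree.findall(copy_path))))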

If a parent div can contain more than one title/copy pair, you may need to adjust those XPath expressions.
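A related caveat with the zip approach: it pairs titles and copies purely by position, so it only lines up when every parent div has exactly one of each. The sketch below illustrates this under the assumption that lxml is installed; the three-div document, in which the second div has no copy block, and the names zipped and grouped are invented for the example.

```python
from lxml import etree

# Hypothetical document: the second parent div has no copy block.
HTML = """
<html><body>
  <div class='parent'>
    <span class='title'><a>A</a></span>
    <div class='copy'><p>copy A</p></div>
  </div>
  <div class='parent'>
    <span class='title'><a>B</a></span>
  </div>
  <div class='parent'>
    <span class='title'><a>C</a></span>
    <div class='copy'><p>copy C</p></div>
  </div>
</body></html>
"""

root = etree.fromstring(HTML)

# Document-wide lists: zip pairs by position, so "copy C" ends up next to "B"
# and title "C" is dropped because the copies list is shorter.
titles = root.xpath("//div[@class='parent']/span[@class='title']/a/text()")
copies = root.xpath("//div[@class='parent']/div[@class='copy']/p/text()")
zipped = list(zip(titles, copies))

# Per-parent lookup: findtext returns None when the element is absent,
# so every tuple stays tied to its own parent div.
grouped = [
    (div.findtext("span[@class='title']/a"), div.findtext("div[@class='copy']/p"))
    for div in root.iterfind(".//div[@class='parent']")
]

print(zipped)
print(grouped)
```

If the pages being scraped might omit a title or a copy, the per-parent version is probably the safer of the two, at the cost of one lookup per div.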

Extracting multiple values from a Python ElementTree with lxml and XPath
