Question

我有一些使用scrapy擦除bbcode论坛的Python代码，我需要一个Xpath表达式，它只提供帖子的文本，不包括引号中的文本。 HTML看起来像这样：

<td class="postbody">
   hi this is a response
   <div class="bbc-block">
      <blockquote>
         blah blah blah here's a quote
         <br>
      </blockquote>
   </div>
   <br>
   and now I'm responding to what I quoted
</td>
<td class="postbody">
   <div class="bbc-block">
      <blockquote>
         and now I'm responding to what I quoted
         <br>
      </blockquote>
   </div>
   <br>
   wow what a great response
</td>

对于每个帖子，每页多次发生这种情况。我最终想要的只是排除了blockquote的每个td节点的文本：

嗨，这是一个回复\ n，现在我回应我引用的内容
哇哇响应

我必须提取这些块的Python代码如下 - 首先我将它从scrapy的HtmlResponse转换为lxml的HtmlElement类，因为这是我可以想出使用的唯一方法lxml.html.text_content（）方法：

import lxml.html as ht

def posts_from_response(self, response):
    dom = ht.fromstring(response.body)
    posts = dom.xpath('//td[@class="postbody"]')
    posts_text = [p.text_content() for p in posts]
    return posts_text

我已经广泛搜索了几天的解决方案，并尝试了大约十几种

'//td[@class="postbody"][not(@class="bbc-block")]'

以各种方式附加到其中，但没有任何东西能让我完全符合我想要的分组。

是否有1.使用单个语句获取此方法的方法，或者2.在posts列表上执行第二个Xpath选择器以排除bbc-block节点的方法？

使用XPath和Scrapy / lxml排除特定子节点

0 个答案: