Question

I am trying to scrape user review data from a website. In the end I want a two-column dataset (rating and review).

Here is a sample XML file that simulates my scraping problem. I tried the queries and checked their output on https://www.freeformatter.com/xpath-tester.html#ad-output

<root>
  <div class="user-review">
    <div class="rating"> 5,0 </div>
    <p class="review-content"> Reiew text of item/movie.
      <span class="details">
          <span class="details-header">Detail: </span>
      <span class="details-content">Some details to emphasis</span>
      </span>
      Continue to review
    </p>
  </div>
  <div class="user-review">
    <div class="rating"> 4,0 </div>
    <p class="review-content">Reiew text of item/movie.
    </p>
  </div>
  <div class="user-review">
    <div class="rating"> 4,0 </div>
    <p class="review-content">Reiew text of item/movie.
    </p>
  </div>
</root>

I can get the 3 rating values with the following query.

/root/div/div[@class="rating"]/text()

Output:

Text=' 5,0 '
Text=' 4,0 '
Text=' 4,0 '

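When I try to get the review part, the text of the first review is split into 2 pieces. So I end up with two lists of different sizes (3 ratings and 4 review texts) and cannot match the reviews to the ratings.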

//p[@class="review-content"]/text()

Output:

Text='  Reiew text of item/movie.
        '
Text='
Continue to review
    '
Text='Reiew text of item/movie.
    '
Text='Reiew text of item/movie.

Can someone help me get one of my expected outputs?

Expected output 1:

Text='  Reiew text of item/movie.
    Continue to review
    '
Text='Reiew text of item/movie.
    '
Text='Reiew text of item/movie.

Expected output 2:

Text='  Reiew text of item/movie. Some details to emphasis
    Continue to review
    '
Text='Reiew text of item/movie.
    '
Text='Reiew text of item/movie.

Answer 1

Try this. Here sel is a selector; in your case it is probably response.

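# Select each review paragraph as a whole element, not its text nodes.
tags = sel.xpath('//p[@class="review-content"]')
reviews = []
for tag in tags:
    # './/text()' returns every descendant text node of this one <p>,
    # so joining them yields exactly one string per review.
    text = " ".join(tag.xpath('.//text()').extract())
    reviews.append(text)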
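
To show why iterating per element keeps ratings and reviews aligned, here is a minimal sketch that builds the two-column rows from the question. It is not Scrapy code: it uses the standard library's xml.etree.ElementTree as a stand-in so that it runs without Scrapy installed, and the names SAMPLE, collapse and extract_rows are made up for illustration. The sample is shortened to two reviews.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample document from the question (two reviews).
SAMPLE = """<root>
  <div class="user-review">
    <div class="rating"> 5,0 </div>
    <p class="review-content"> Reiew text of item/movie.
      <span class="details">
          <span class="details-header">Detail: </span>
      <span class="details-content">Some details to emphasis</span>
      </span>
      Continue to review
    </p>
  </div>
  <div class="user-review">
    <div class="rating"> 4,0 </div>
    <p class="review-content">Reiew text of item/movie.
    </p>
  </div>
</root>"""


def collapse(text):
    """Trim the text and squeeze every whitespace run into one space."""
    return " ".join(text.split())


def extract_rows(xml_text):
    """Return one (rating, review) tuple per user-review element."""
    root = ET.fromstring(xml_text)
    rows = []
    for review in root.findall("./div[@class='user-review']"):
        rating = review.find("./div[@class='rating']")
        content = review.find("./p[@class='review-content']")
        # itertext() walks all descendant text, like './/text()' on one tag.
        full_text = "".join(content.itertext())
        rows.append((collapse(rating.text or ""), collapse(full_text)))
    return rows


rows = extract_rows(SAMPLE)
for rating, review in rows:
    print(rating, "|", review)
```

Because the rating and the review are read from the same user-review element in the same loop iteration, the two columns cannot drift apart, however many text nodes a review contains.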

Answer 2

You have to loop over the div elements with class user-review and extract the review content from each one. If you want a one-liner, take a look at this:

import scrapy

text = """
<root>
  <div class="user-review">
    <div class="rating"> 5,0 </div>
    <p class="review-content"> Reiew text of item/movie.
      <span class="details">
          <span class="details-header">Detail: </span>
      <span class="details-content">Some details to emphasis</span>
      </span>
      Continue to review
    </p>
  </div>
  <div class="user-review">
    <div class="rating"> 4,0 </div>
    <p class="review-content">Reiew text of item/movie.
    </p>
  </div>
  <div class="user-review">
    <div class="rating"> 4,0 </div>
    <p class="review-content">Reiew text of item/movie.
    </p>
  </div>
</root>
"""

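selector = scrapy.Selector(text=text)
# normalize-space() takes the string value of each <p> (all descendant text)
# and collapses whitespace runs, so every review yields exactly one string.
review_content = [
    review.xpath('normalize-space(.//p[@class="review-content"])').extract_first()
    for review in selector.xpath('//div[@class="user-review"]')
]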
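
The one-liner above keeps the nested details text, which is closest to expected output 2 (it also includes the "Detail: " header). For expected output 1, where the nested span is left out, only the text sitting directly inside each p is wanted. The sketch below illustrates that difference with the standard library's xml.etree.ElementTree rather than Scrapy, so it is an analogue, not the answer's method; FRAGMENT, own_text and all_text are hypothetical names.

```python
import xml.etree.ElementTree as ET

# The first review paragraph from the question, reindented for brevity.
FRAGMENT = """<p class="review-content"> Reiew text of item/movie.
  <span class="details">
    <span class="details-header">Detail: </span>
    <span class="details-content">Some details to emphasis</span>
  </span>
  Continue to review
</p>"""


def own_text(element):
    """Join only the text directly inside element, skipping child elements."""
    # element.text is the text before the first child; each child's tail
    # is the text that follows that child inside the parent.
    parts = [element.text or ""]
    parts.extend(child.tail or "" for child in element)
    return " ".join("".join(parts).split())


def all_text(element):
    """Join the text of element and all of its descendants."""
    return " ".join("".join(element.itertext()).split())


p = ET.fromstring(FRAGMENT)
direct_only = own_text(p)    # nested span left out (expected output 1)
with_details = all_text(p)   # nested span included (close to expected output 2)
print(direct_only)
print(with_details)
```

In Scrapy itself, the same effect should be reachable by joining the result of './text()' on each p inside the loop, instead of './/text()'.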

How to get all the text data of a node with XPath in Scrapy
