Question

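I'm trying to scrape review text from Amazon using Scrapy. The problem is that when a review contains several paragraphs, the text inside the span element is separated by <br> tags. So when I want to grab the first review, I use this line of code: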

response.css('span.a-size-base.review-text::text').extract_first()

This doesn't give me the whole review text; it only returns the text between the opening <span> tag and the first <br> tag.

I know that when I replace "extract_first()" with "extract()", I get all of the text. However, that also gives me the text of the other reviews.

Basically, the extract() method returns an array whose elements are split at the <br> tags. I need it to be split at the <span> tags instead.

Is there a way to scrape all of the text between the opening <span> tag and the closing </span> tag?

Example HTML:

<span data-hook="review-body" class="a-size-base review-text">
    "I like this product, the reasons why are explained below"
    <br>
    <br>
    "1. It looks nice"
    <br>
    "2. I love it"
</span>

What it looks like on the website:

I like this product, the reasons why are explained below

1. It looks nice
2. I love it

Output I get with extract_first():

"I like this product, the reasons why are explained below"

Output I get with extract() (note that it contains three elements):

"I like this product, the reasons why are explained below", "1. It looks nice", "2. I love it"

Output I want (a single element, the review itself):

"I like this product, the reasons why are explained below 1. It looks nice 2. I love it"

Answer 1

Use extract() and join the resulting list.

>>> text=["I like this product, the reasons why are explained below", "1. It looks nice", "2. I love it"]
>>> " ".join(text)
'I like this product, the reasons why are explained below 1. It looks nice 2. I love it'
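Joining the output of a page-wide extract() would still mix the reviews together, which is the problem the question describes. One way around that, sketched below, is to select each review span first and join only the text nodes inside it. The helper names join_text_nodes and extract_reviews are hypothetical, and the CSS selector is simply the one from the question; it is assumed that `response` is a Scrapy response object.

```python
def join_text_nodes(nodes):
    """Strip each text node, drop the empty ones, and join with spaces."""
    parts = [node.strip() for node in nodes]
    return " ".join(part for part in parts if part)


def extract_reviews(response):
    """Return one string per review span (hypothetical helper).

    `response` is assumed to be a Scrapy response; the selector is the
    one used in the question.
    """
    reviews = []
    for span in response.css("span.a-size-base.review-text"):
        # .//text() is evaluated relative to this span, so only the text
        # nodes of this one review are collected (the <br> tags are skipped).
        nodes = span.xpath(".//text()").extract()
        reviews.append(join_text_nodes(nodes))
    return reviews
```

With this, extract_reviews(response)[0] should be the full text of the first review as a single string, and each later element a separate review.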
