Question

I'm using Scrapy to collect data from a cinema web page.

I'm using XPath selectors. If I use a selector with the extract() method, like this:

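def parse_with_extract(self, response):
    div = response.xpath("//div[@class='col-sm-7 col-md-9']/p[@class='movie__option']")
    storage = []
    for i in div:
        # extract() returns every direct text node of this <p> as a list
        data = i.xpath("text()").extract()
        storage.append(data)
    return storage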

It returns:

If I use a selector with the extract_first() method:

def parse_with_extract_first(self, response):
    div = response.xpath("//div[@class='col-sm-7 col-md-9']/p[@class='movie__option']")
    storage = []
    for i in div:
        data = i.xpath("text()").extract_first()
        storage.append(data)
    return storage

It returns:

Why does the extract() method return all the characters, including "\xa0", while the extract_first() method returns an empty string instead?

Answer 1

If you look closely at the response, you'll see that the @class=movie__option element looks like this:

'<p class="movie__option" style="color: #000;">\n                                    <strong>Thursday 3rd of May 2018:</strong>\n                                    11:20am\xa0 \xa0  \n                                </p>'

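If you extract text() from this element, you get two strings: one before the strong tag and one after it (text() only picks up first-level text nodes, not the text inside child elements):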

['\n                                    ',
 '\n                                    11:20am\xa0 \xa0  \n                                ']

All extract_first() does is take the first of these two strings, which here contains only whitespace:

'\n                                    '

Answer 2

Well, judging by your output, it looks like this:

['\n                                    ',
 '\n                                    11:20am\xa0 \xa0  \n                                ']

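It contains two strings.

To anyone who gets this kind of data back (newlines and whitespace), I suggest using Python's built-in strip() method. It works on strings, so you can apply it like this: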

data = response.xpath("//path/to/your/data").get().strip()

This will make your output look like this:

'11:20am'
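When a selector yields several text nodes, as in the question, the same strip() idea can be applied to each of them. This is a small sketch using only the standard library; `clean_text_nodes` is a hypothetical helper name and `raw` is a shortened stand-in for the list shown above.

```python
# Text nodes as returned by text() on the example <p> element (indentation shortened).
raw = ['\n    ', '\n    11:20am\xa0 \xa0  \n    ']


def clean_text_nodes(nodes):
    """Strip each string and drop the ones that end up empty."""
    stripped = (node.strip() for node in nodes)
    return [s for s in stripped if s]


print(clean_text_nodes(raw))  # ['11:20am']
```

Python's str.strip() treats the non-breaking space "\xa0" as whitespace, so it is removed along with the newlines and ordinary spaces.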

Also, take a look at the difference between extract() and extract_first().

```
extract()
```

This method returns a list. It is the older method in Scrapy; the one used now is getall(), which behaves the same as extract().

extract() -- updated to --> getall()

Now let's look at the extract_first() method.

```
extract_first()
```

This method returns a str rather than a list. It is also an older Scrapy method; the one used now is get().

extract_first() -- updated to --> get()

The extract_first() and extract() methods on a Scrapy selector do not return the same value
