Question

I want to use Scrapy to scrape review data from a website. My code is below.

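The problem is that every time the program moves on to the next page, it starts over from the top (because of the callback) and resets records[]. The array is empty again, and every review saved in records[] is lost. As a result, when I open my CSV file I only get the reviews from the last page.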

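What I want is for all the data to be stored in my CSV file, so records[] should not be reset each time the next page is requested. I can't put the line records = [] before the parse method, because then the array is not defined inside it.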

Here is my code:

def parse(self, response):
    records = []

    for r in response.xpath('//div[contains(@class, "a-section review")]'):
        rtext = r.xpath('.//div[contains(@class, "a-row review-data")]').extract_first()                
        rating = r.xpath('.//span[contains(@class, "a-icon-alt")]/text()').extract_first()
        votes = r.xpath('normalize-space(.//span[contains(@class, "review-votes")]/text())').extract_first()

        if not votes:
            votes = "none"

        records.append((rating, votes, rtext))
        print(records)

    nextPage = response.xpath('//li[contains(@class, "a-last")]/a/@href').extract_first()
    if nextPage:
        nextPage = response.urljoin(nextPage)
        yield scrapy.Request(url = nextPage)    

    import pandas as pd
    df = pd.DataFrame(records, columns=['rating' , 'votes', 'rtext'])
    df.to_csv('ama.csv', sep = '|', index =False, encoding='utf-8')

Answer 1

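Moving the records declaration into the method signature relies on a common Python gotcha, described here in the Python docs. In this case, however, the odd behavior of instantiating a list in the method declaration works in your favor.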

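Python's default arguments are evaluated once, when the function is defined, not each time the function is called (as they are in, say, Ruby). This means that if you use a mutable default argument and mutate it, you have mutated that same object for all future calls to the function as well.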
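To see this behavior in isolation, here is a minimal sketch that does not involve Scrapy at all. The function name `append_page` and the sample strings are made up for illustration; only the default-argument mechanism is the point.

```python
def append_page(items, records=[]):
    # `records` is created once, when the `def` statement runs,
    # and the same list object is reused on every call that omits it.
    records.extend(items)
    return records


first = append_page(["review 1", "review 2"])
second = append_page(["review 3"])

print(second)           # ['review 1', 'review 2', 'review 3']
print(first is second)  # True: both names refer to the one shared list
```

The second call still sees the items added by the first one, which is exactly what keeps records alive between callbacks.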

def parse(self, response, records=[]):
    # records is created once, at definition time, and shared by every call

    for r in response.xpath('//div[contains(@class, "a-section review")]'):
        rtext = r.xpath('.//div[contains(@class, "a-row review-data")]').extract_first()                
        rating = r.xpath('.//span[contains(@class, "a-icon-alt")]/text()').extract_first()
        votes = r.xpath('normalize-space(.//span[contains(@class, "review-votes")]/text())').extract_first()

        if not votes:
            votes = "none"

        records.append((rating, votes, rtext))
        print(records)

    nextPage = response.xpath('//li[contains(@class, "a-last")]/a/@href').extract_first()
    if nextPage:
        nextPage = response.urljoin(nextPage)
        yield scrapy.Request(url = nextPage)    

    import pandas as pd
    df = pd.DataFrame(records, columns=['rating' , 'votes', 'rtext'])
    df.to_csv('ama.csv', sep = '|', index =False, encoding='utf-8')

The approach above is a bit odd. A more general solution is simply to use a global variable. Here is a post going over how to use globals.

Answer 2

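Here parse is a callback that is called once for every page. Try defining records globally, or write an appender function and call it to append the values.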
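A rough sketch of that idea, written without Scrapy so it can be run on its own. `RECORDS`, `append_record` and `parse_page` are hypothetical names; `parse_page` only stands in for the spider's parse callback, and the tuples mimic the reviews found on one page.

```python
# Module-level list: created once at import time, so it survives
# across repeated callback invocations.
RECORDS = []


def append_record(rating, votes, rtext):
    """Hypothetical helper that stores one review in the shared list."""
    RECORDS.append((rating, votes or "none", rtext))


def parse_page(reviews):
    """Stand-in for the parse callback; `reviews` mimics one page."""
    for rating, votes, rtext in reviews:
        append_record(rating, votes, rtext)


# Two simulated pages: the second call does not wipe the first page's data.
parse_page([("5.0", "3 people found this helpful", "Great")])
parse_page([("1.0", "", "Bad")])
print(len(RECORDS))  # 2
```

In a real spider it would probably also make sense to write the CSV only once, for example in the spider's `closed(reason)` method, instead of rewriting the file at the end of every parse call.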

Also, Scrapy can generate CSV by itself. Here is my small experiment: https://gist.github.com/lisitsky/c4aac52edcb7abfd5975be067face1bb

So you can export the data to CSV, and pandas can then read it.

How do I avoid an array being reset after a callback?
