Question

This involves pretty much the same code I asked a different question about this morning, so if it looks familiar, that's because it is.

class LbcSubtopicSpider(scrapy.Spider):

...irrelevant/sensitive code...

    rawTranscripts = []
    rawTranslations = []

    def parse(self, response):
        rawTitles = []
        rawVideos = []
        for sel in response.xpath('//ul[1]'): #only scrape the first list

            ...irrelevant code...

            index = 0
            for sub in sel.xpath('li/ul/li/a'): #scrape the sublist items
                index += 1
                if index%2!=0: #odd numbered entries are the transcripts
                    transcriptLink = sub.xpath('@href').extract()
                    #url = response.urljoin(transcriptLink[0])
                    #yield scrapy.Request(url, callback=self.parse_transcript)
                else: #even numbered entries are the translations
                    translationLink = sub.xpath('@href').extract()
                    url = response.urljoin(translationLink[0])
                    yield scrapy.Request(url, callback=self.parse_translation)

        print(rawTitles)
        print(rawVideos)
        print("translations:")
        print(self.rawTranslations)

    def parse_translation(self, response):
        for sel in response.xpath('//p[not(@class)]'):
            rawTranslation = sel.xpath('text()').extract()
            rawTranslation = ''.join(rawTranslation)
            #print rawTranslation
            self.rawTranslations.append(rawTranslation)
            #print self.rawTranslations

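My issue is that the line that prints self.rawTranslations in the parse(...) method prints nothing more than "[]". This could mean one of two things: either the list is being reset right before printing, or it is being printed before the parse_translation(...) calls, which populate the list from the links that parse(...) follows, have finished. I'm inclined to suspect the latter, since I can't see any code that would reset the list, unless "rawTranslations = []" in the class body is run multiple times.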

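It's worth noting that if I uncomment the same line in parse_translation(...), it prints the desired output, which means the text is being extracted correctly and the problem seems to be confined to the main parse(...) method.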

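My attempts to resolve what I believe is a synchronization problem were pretty aimless: I just tried using an RLock object based on as many Google tutorials as I could find, and I'm 99% sure I misused it anyway, since the result was identical.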

Answer 1

The problem here is that you aren't understanding how Scrapy really works.

Scrapy is a crawling framework for building website spiders, not just a tool for making requests; that is what the requests module is for.

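Scrapy's requests work asynchronously. When you call yield Request(...), you are adding that request to a stack of requests that will be executed at some point (you have no control over when). This means you can't expect code placed after yield Request(...) to run only once that request has been handled, so it can't rely on the callback's results. In fact, your callback methods should always end up yielding either a Request or an Item.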
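To make the ordering concrete, here is a toy simulation in plain Python, with no Scrapy involved. The Request class, ToySpider and crawl() below are hypothetical stand-ins written for this illustration, and the scheduler is deliberately simplistic (it drains parse() completely before running any callback, which the real engine does not promise). It only shows why code that follows the yield can see an empty list.

```python
from collections import deque


class Request:
    """Hypothetical stand-in for scrapy.Request: just a URL plus a callback."""

    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


class ToySpider:
    def __init__(self):
        self.raw_translations = []
        self.seen_in_parse = None  # what parse() observes after its loop

    def parse(self, response):
        for url in ("page/1", "page/2"):
            yield Request(url, callback=self.parse_translation)
        # Runs when the generator is exhausted, before any callback has fired.
        self.seen_in_parse = list(self.raw_translations)

    def parse_translation(self, response):
        self.raw_translations.append("text of " + response)
        return []  # no follow-up requests


def crawl(spider):
    """Toy scheduler: queue everything parse() yields, then run the callbacks."""
    queue = deque(spider.parse("start page"))
    while queue:
        request = queue.popleft()
        # The "download" is faked: the URL doubles as the response body.
        queue.extend(request.callback(request.url) or [])


spider = ToySpider()
crawl(spider)
print(spider.seen_in_parse)     # []
print(spider.raw_translations)  # ['text of page/1', 'text of page/2']
```

The list is only complete after the scheduler has worked through the queue, which is why anything that needs the collected data belongs in (or after) the callbacks rather than at the bottom of parse().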

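Now, from what I can see, and as in most cases of confusion with Scrapy, you want to keep populating an item that you created in one method, but the information you need is in a different request.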

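In these cases, the communication is usually done through the meta parameter of Request: the dict you pass as meta when yielding the request is available again as response.meta in the callback.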

Answer 2

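So this seems like kind of a hacky solution, especially since I just learned about Scrapy's request priority functionality, but here's my new code that gives the desired result: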

class LbcVideosSpider(scrapy.Spider):

    ...code omitted...

    done = 0 #variable to keep track of subtopic iterations
    rawTranscripts = []
    rawTranslations = []

    def parse(self, response):
        #initialize containers for each field
        rawTitles = []
        rawVideos = []

        ...code omitted...

            index = 0
            query = sel.xpath('li/ul/li/a')
            for sub in query: #scrape the sublist items
                index += 1
                if index%2!=0: #odd numbered entries are the transcripts
                    transcriptLink = sub.xpath('@href').extract()
                    #url = response.urljoin(transcriptLink[0])
                    #yield scrapy.Request(url, callback=self.parse_transcript)
                else: #even numbered entries are the translations
                    translationLink = sub.xpath('@href').extract()
                    url = response.urljoin(translationLink[0])
                    yield scrapy.Request(url, callback=self.parse_translation,
                        meta={'index': index // 2, 'maxIndex': len(query) // 2})

        print(rawTitles)
        print(rawVideos)

    def parse_translation(self, response):
        #grab meta variables
        i = response.meta['index']
        maxIndex = response.meta['maxIndex']

        #interested in p nodes without class
        query = response.xpath('//p[not(@class)]')
        for sel in query:
            rawTranslation = sel.xpath('text()').extract()
            rawTranslation = ''.join(rawTranslation) #collapse each line
            self.rawTranslations.append(rawTranslation)

        #increment number of translations done, check if finished
        #(counted once per response, not once per paragraph)
        self.done += 1
        print(self.done)
        if self.done == maxIndex:
            print(self.rawTranslations)

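Basically, I just keep track of how many requests have finished and make some code conditional on the final request. This prints the fully populated list.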

Populating a list in Scrapy returns before it is actually populated
