Scrapy callback function returns the same result multiple times

Time: 2016-03-09 17:24:08

Tags: for-loop asynchronous xpath scrapy scrapy-spider

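I'm new to Scrapy and can't get the callback function to work properly. I manage to collect all the URLs and to follow them in the callback function, but when I look at the results, some of them appear multiple times and many are missing. What seems to be the problem?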

import scrapy

from kexcrawler.items import KexcrawlerItem

class KexSpider(scrapy.Spider):
    name = 'kex'
    allowed_domains = ["kth.diva-portal.org"]
    start_urls = ['http://kth.diva-portal.org/smash/resultList.jsf?dswid=-855&language=en&searchType=RESEARCH&query=&af=%5B%5D&aq=%5B%5B%5D%5D&aq2=%5B%5B%7B%22dateIssued%22%3A%7B%22from%22%3A%222015%22%2C%22to%22%3A%222015%22%7D%7D%2C%7B%22organisationId%22%3A%225956%22%2C%22organisationId-Xtra%22%3Atrue%7D%2C%7B%22publicationTypeCode%22%3A%5B%22article%22%5D%7D%2C%7B%22contentTypeCode%22%3A%5B%22refereed%22%5D%7D%5D%5D&aqe=%5B%5D&noOfRows=250&sortOrder=author_sort_asc&onlyFullText=false&sf=all']

    def parse(self, response):
        for href in response.xpath('//li[@class="ui-datalist-item"]/div[@class="searchItem borderColor"]/a/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        item = KexcrawlerItem()
        item['report'] = response.xpath('//div[@class="toSplitVertically"]/div[@id="innerEastCenter"]/span[@class="displayFields"]/span[@class="subTitle"]/text()').extract()
            yield item

Here are the first lines of the output:

{"report": ["On Multiple Reconnection ", "-Lines and Tripolar Perturbations of Strong Guide Magnetic Fields"]},
{"report": ["Four-Component Relativistic Calculations in Solution with the Polarizable Continuum Model of Solvation: Theory, Implementation, and Application to the Group 16 Dihydrides H2X (X = O, S, Se, Te, Po)"]},
{"report": ["On Multiple Reconnection ", "-Lines and Tripolar Perturbations of Strong Guide Magnetic Fields"]},
{"report": ["Comparing Vocal Fold Contact Criteria Derived From Audio and Electroglottographic Signals"]},
{"report": ["Four-Component Relativistic Calculations in Solution with the Polarizable Continuum Model of Solvation: Theory, Implementation, and Application to the Group 16 Dihydrides H2X (X = O, S, Se, Te, Po)"]},
{"report": ["Four-Component Relativistic Calculations in Solution with the Polarizable Continuum Model of Solvation: Theory, Implementation, and Application to the Group 16 Dihydrides H2X (X = O, S, Se, Te, Po)"]},
{"report": ["On Multiple Reconnection ", "-Lines and Tripolar Perturbations of Strong Guide Magnetic Fields"]},
{"report": ["Four-Component Relativistic Calculations in Solution with the Polarizable Continuum Model of Solvation: Theory, Implementation, and Application to the Group 16 Dihydrides H2X (X = O, S, Se, Te, Po)"]},
{"report": ["Dynamic message-passing approach for kinetic spin models with reversible dynamics"]},
{"report": ["RNA editing of non-coding RNA and its role in gene regulation"]},
{"report": ["Security monitor inlining and certification for multithreaded Java"]},
{"report": ["Security monitor inlining and certification for multithreaded Java"]},
{"report": ["On the electron dynamics during island coalescence in asymmetric magnetic reconnection"]},
{"report": ["On the electron dynamics during island coalescence in asymmetric magnetic reconnection"]},
{"report": ["On the electron dynamics during island coalescence in asymmetric magnetic reconnection"]},
{"report": ["On the electron dynamics during island coalescence in asymmetric magnetic reconnection"]},

1 answer:

Answer 0 (score: 0)

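I tried to reproduce your error and couldn't. All of the URLs are distinct. I logged each item at INFO level, de-duplicated everything with a set (the same way as in the URL check in this answer), and found that every report is unique as well. I did have to un-indent your yield call, since it raised an error as posted, and I defined your item class with a single field.

If you copied and pasted the output straight from the terminal, then I think the duplicates are an artifact of printing rather than logging, which makes me suspect that you have several print calls firing at different times. Try writing the items to a file and check whether there really are duplicates.

To test whether the URLs were unique, I extracted the elements matched by the xpath into a list called elem:

print(len(elem))   # number of extracted hrefs
b = set()
for e in elem:
    b.add(e)
print(len(b))      # number of distinct hrefs

You could also keep a global list of items and add a spider_closed handler that runs the same check on that list when the spider closes. Note that the handler has to be connected to the spider_closed signal; the spider method that Scrapy calls automatically on shutdown is named closed(reason). A set only contains unique elements, so if the two counts differ, you really are creating duplicates.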
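The same set-style check can be run on the scraped items themselves once they are in a list, for example after loading the feed written by `scrapy crawl kex -o items.json`. What follows is only a minimal sketch under a few assumptions: the helper name `find_duplicates` is made up for illustration, the sample data is a shortened copy of the output in the question, and each `report` value is assumed to be a list of text fragments that can be joined into one title.

```python
from collections import Counter


def find_duplicates(items, field="report"):
    """Return {title: count} for every title that occurs more than once.

    `find_duplicates` is a hypothetical helper, not part of Scrapy.
    """
    # Each field value is a list of text fragments (as in the question's
    # output), so join the fragments to get one hashable string per item.
    titles = ["".join(item[field]) for item in items]
    counts = Counter(titles)
    return {title: n for title, n in counts.items() if n > 1}


# Shortened sample shaped like the output shown in the question.
sample = [
    {"report": ["Security monitor inlining and certification for multithreaded Java"]},
    {"report": ["RNA editing of non-coding RNA and its role in gene regulation"]},
    {"report": ["Security monitor inlining and certification for multithreaded Java"]},
]

duplicates = find_duplicates(sample)
for title, n in duplicates.items():
    print("%d x %s" % (n, title))
```

An empty result means every item is unique, which would point to the printing rather than the crawl. A non-empty result shows which titles were actually emitted more than once.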