Avoiding overlapping responses in Python Scrapy

Posted: 2015-06-04 08:42:14

Tags: python web-scraping scrapy screen-scraping scrapy-spider

I am trying to scrape the family information of the members of India's Rajya Sabha, found here: http://164.100.47.5/Newmembers/memberlist.aspx. Being new to Scrapy, I followed this and this example code and produced the following.

def parse(self,response):

    print "Inside parse"
    requests = []
    target_base_prefix = 'ctl00$ContentPlaceHolder1$GridView2$ctl'
    target_base_suffix = '$lkb'

    for i in range(2,5):
        if i < 10:
            target_id = "0"+str(i)
        else:
            target_id = str(i)

        evTarget = target_base_prefix+target_id+target_base_suffix

        form_data = {'__EVENTTARGET':evTarget,'__EVENTARGUMENT':''}

        requests.append(scrapy.http.FormRequest.from_response(response, formdata = form_data,dont_filter=True,method = 'POST', callback = self.parse_politician))

    for r in requests:
        print "before yield "+str(r)
        yield r
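As a side note, the zero-padding if/else in `parse` can be collapsed with string formatting. A standalone sketch (no Scrapy needed) of the ASP.NET `__EVENTTARGET` control names the loop produces:

```python
# The __EVENTTARGET control names generated by the loop above, with
# "{:02d}" zero-padding single-digit row ids in place of the if/else.
target_base_prefix = 'ctl00$ContentPlaceHolder1$GridView2$ctl'
target_base_suffix = '$lkb'

targets = [target_base_prefix + '{:02d}'.format(i) + target_base_suffix
           for i in range(2, 5)]
# targets[0] is 'ctl00$ContentPlaceHolder1$GridView2$ctl02$lkb', and so on.
```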


def parse_pol_bio(self,response):

    print "[parse_pol_bio]- response url - "+response.url

    name_xp = '//span[@id=\"ctl00_ContentPlaceHolder1_GridView1_ctl02_Label3\"]/font/text()'
    base_xp_prefix = '//*[@id=\"ctl00_ContentPlaceHolder1_TabContainer1_TabPanel2_ctl00_DetailsView2_Label'
    base_xp_suffix='\"]/text()'
    father_id = '12'
    mother_id = '13'
    married_id = '1'
    spouse_id = '3'

    name = response.xpath(name_xp).extract()[0].strip()
    name = re.sub(' +', ' ',name)

    father = response.xpath(base_xp_prefix+father_id+base_xp_suffix).extract()[0].strip()
    mother = response.xpath(base_xp_prefix+mother_id+base_xp_suffix).extract()[0].strip()
    married = response.xpath(base_xp_prefix+married_id+base_xp_suffix).extract()[0].strip().split(' ')[0]

    if married == "Married":
        spouse = response.xpath(base_xp_prefix+spouse_id+base_xp_suffix).extract()[0].strip()
    else:
        spouse = ''

    print 'name     marital_stat    father_name     mother_name     spouse'
    print name,married,father,mother,spouse

    item = RsItem()
    item['name'] = name
    item['spouse'] = spouse
    item['mother'] = mother
    item['father'] = father

    return item



def parse_politician(self,response):

    evTarget = 'ctl00$ContentPlaceHolder1$TabContainer1'
    evArg =  'activeTabChanged:1'
    formdata = {'__EVENTTARGET':evTarget,'__EVENTARGUMENT':evArg}

    print "[parse_politician]-response url - "+response.url

    # formdata must be passed as a keyword argument: the second positional
    # parameter of FormRequest.from_response is formname, not formdata.
    return scrapy.FormRequest.from_response(response, formdata=formdata, method='POST', callback=self.parse_pol_bio)

Explanation
The parse method loops over the target ids for the different politicians and sends the requests. parse_politician is used to switch the tab.
parse_pol_bio does the actual scraping of the names.

Problem
The problem is that this produces duplicate responses in parse_pol_bio, i.e. information about the same person turns up more than once. Which responses get duplicated is random and changes on every run, i.e. a different politician's data may be duplicated each time. I have checked whether any request is yielded more than once, and none is. I also tried sleeping after each yielded request to see whether that helps. I suspect the Scrapy request scheduler is involved here.

Is there anything else wrong with the code? What can be done to avoid this?

Edit
To clarify a couple of things: I know what dont_filter=True does and kept it deliberately.

The problem is that some response data is being replaced. For example, when I yield 3 requests, for target_id = 1, 2 and 3 respectively, the response for target_id = 1 gets replaced by the response for target_id = 2 [so I end up with one response for target_id 3 and two for target_id 2].
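One way to see which request each response actually belongs to is to tag every request with the row id it was built for, since Scrapy echoes a request's `meta` dict back on its response. A sketch (the `target_id` meta key is a name chosen here for illustration; it would be passed as `meta=...` to `FormRequest.from_response`):

```python
# Build the per-row form data and tag each planned request with the row id
# it was generated for, so the callback can check response.meta['target_id']
# instead of relying on the order in which responses arrive.
requests_spec = []
for i in range(2, 5):
    target_id = '{:02d}'.format(i)
    ev_target = 'ctl00$ContentPlaceHolder1$GridView2$ctl' + target_id + '$lkb'
    requests_spec.append({
        'formdata': {'__EVENTTARGET': ev_target, '__EVENTARGUMENT': ''},
        # pass this dict as meta= when constructing the FormRequest
        'meta': {'target_id': target_id},
    })
```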

Expected output (csv)

politician name , spouse name , father name , mother name
pol1 , spouse1, father1, mother1
pol2 , spouse2, father2, mother2
pol3 , spouse3, father3, mother3

Actual output (csv)

politician name , spouse name , father name , mother name
pol1 , spouse1, father1, mother1
pol1 , spouse1, father1, mother1
pol3 , spouse3, father3, mother3

1 answer:

Answer 0: (score: 1)

Finally fixed it (phew!). By default, Scrapy sends 16 requests at a time (concurrent requests). Putting CONCURRENT_REQUESTS = 1 in the settings.py file makes the requests sequential and resolves the problem.
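For reference, a minimal settings.py fragment (the per-domain alternative is my assumption as a gentler option, not something verified against this site):

```python
# settings.py -- limit Scrapy to one in-flight request at a time
# (the Scrapy default is CONCURRENT_REQUESTS = 16)
CONCURRENT_REQUESTS = 1

# A softer alternative would be to throttle only this site:
# CONCURRENT_REQUESTS_PER_DOMAIN = 1
```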

The requests I was yielding were similar (see above), and their response data overlapped with one another, which is what produced duplicates of only one kind of response.

No idea how exactly this happens, but the fix of making the requests sequential confirms it. Does anyone have a better explanation?