Avoiding overlapping responses in Python Scrapy

Posted: 2015-06-04 08:42:14

Tags: python web-scraping scrapy screen-scraping scrapy-spider

I am trying to scrape the family information of the members of India's Rajya Sabha, found here: http://164.100.47.5/Newmembers/memberlist.aspx. Being new to Scrapy, I followed this and this example code and produced the following.

def parse(self,response):

    print "Inside parse"
    requests = []
    target_base_prefix = 'ctl00$ContentPlaceHolder1$GridView2$ctl'
    target_base_suffix = '$lkb'

    for i in range(2,5):
        if i < 10:
            target_id = "0"+str(i)
        else:
            target_id = str(i)

        evTarget = target_base_prefix+target_id+target_base_suffix

        form_data = {'__EVENTTARGET':evTarget,'__EVENTARGUMENT':''}

        requests.append(scrapy.http.FormRequest.from_response(response, formdata = form_data,dont_filter=True,method = 'POST', callback = self.parse_politician))

    for r in requests:
        print "before yield "+str(r)
        yield r
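As a side note, the zero-padding if/else in `parse` can be collapsed with string formatting. A standalone sketch (no Scrapy needed) of the ASP.NET `__EVENTTARGET` control names the loop produces:

```python
# The __EVENTTARGET control names generated by the loop above, with
# "{:02d}" zero-padding single-digit row ids in place of the if/else.
target_base_prefix = 'ctl00$ContentPlaceHolder1$GridView2$ctl'
target_base_suffix = '$lkb'

targets = [target_base_prefix + '{:02d}'.format(i) + target_base_suffix
           for i in range(2, 5)]
# targets[0] is 'ctl00$ContentPlaceHolder1$GridView2$ctl02$lkb', and so on.
```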


def parse_pol_bio(self,response):

    print "[parse_pol_bio]- response url - "+response.url

    name_xp = '//span[@id=\"ctl00_ContentPlaceHolder1_GridView1_ctl02_Label3\"]/font/text()'
    base_xp_prefix = '//*[@id=\"ctl00_ContentPlaceHolder1_TabContainer1_TabPanel2_ctl00_DetailsView2_Label'
    base_xp_suffix='\"]/text()'
    father_id = '12'
    mother_id = '13'
    married_id = '1'
    spouse_id = '3'

    name = response.xpath(name_xp).extract()[0].strip()
    name = re.sub(' +', ' ',name)

    father = response.xpath(base_xp_prefix+father_id+base_xp_suffix).extract()[0].strip()
    mother = response.xpath(base_xp_prefix+mother_id+base_xp_suffix).extract()[0].strip()
    married = response.xpath(base_xp_prefix+married_id+base_xp_suffix).extract()[0].strip().split(' ')[0]

    if married == "Married":
        spouse = response.xpath(base_xp_prefix+spouse_id+base_xp_suffix).extract()[0].strip()
    else:
        spouse = ''

    print 'name     marital_stat    father_name     mother_name     spouse'
    print name,married,father,mother,spouse

    item = RsItem()
    item['name'] = name
    item['spouse'] = spouse
    item['mother'] = mother
    item['father'] = father

    return item



def parse_politician(self,response):

    evTarget = 'ctl00$ContentPlaceHolder1$TabContainer1'
    evArg =  'activeTabChanged:1'
    formdata = {'__EVENTTARGET':evTarget,'__EVENTARGUMENT':evArg}

    print "[parse_politician]-response url - "+response.url

    # formdata must be passed as a keyword argument: the second positional
    # parameter of FormRequest.from_response is formname, not formdata.
    return scrapy.FormRequest.from_response(response, formdata=formdata, method='POST', callback=self.parse_pol_bio)

Explanation
The parse method loops over the target ids for the different politicians and sends the requests. parse_politician is used to switch the tab.
parse_pol_bio does the actual scraping of the names.

Problem
The problem is that this produces duplicate responses in parse_pol_bio, i.e. information about the same person turns up more than once. Which responses get duplicated is random and changes on every run, i.e. a different politician's data may be duplicated each time. I have checked whether any request is yielded more than once, and none is. I also tried sleeping after each yielded request to see whether that helps. I suspect the Scrapy request scheduler is involved here.

Is there anything else wrong with the code? What can be done to avoid this?

Edit
To clarify a couple of things: I know what dont_filter=True does and kept it deliberately.

The problem is that some response data is being replaced. For example, when I yield 3 requests, for target_id = 1, 2 and 3 respectively, the response for target_id = 1 gets replaced by the response for target_id = 2 [so I end up with one response for target_id 3 and two for target_id 2].
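One way to see which request each response actually belongs to is to tag every request with the row id it was built for, since Scrapy echoes a request's `meta` dict back on its response. A sketch (the `target_id` meta key is a name chosen here for illustration; it would be passed as `meta=...` to `FormRequest.from_response`):

```python
# Build the per-row form data and tag each planned request with the row id
# it was generated for, so the callback can check response.meta['target_id']
# instead of relying on the order in which responses arrive.
requests_spec = []
for i in range(2, 5):
    target_id = '{:02d}'.format(i)
    ev_target = 'ctl00$ContentPlaceHolder1$GridView2$ctl' + target_id + '$lkb'
    requests_spec.append({
        'formdata': {'__EVENTTARGET': ev_target, '__EVENTARGUMENT': ''},
        # pass this dict as meta= when constructing the FormRequest
        'meta': {'target_id': target_id},
    })
```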

Expected output (csv)

politician name , spouse name , father name , mother name
pol1 , spouse1, father1, mother1
pol2 , spouse2, father2, mother2
pol3 , spouse3, father3, mother3

Actual output (csv)

politician name , spouse name , father name , mother name
pol1 , spouse1, father1, mother1
pol1 , spouse1, father1, mother1
pol3 , spouse3, father3, mother3

1 answer:

Answer 0: (score: 1)

Finally fixed it (phew!). By default, Scrapy sends 16 requests at a time (concurrent requests). Putting CONCURRENT_REQUESTS = 1 in the settings.py file makes the requests sequential and resolves the problem.
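For reference, a minimal settings.py fragment (the per-domain alternative is my assumption as a gentler option, not something verified against this site):

```python
# settings.py -- limit Scrapy to one in-flight request at a time
# (the Scrapy default is CONCURRENT_REQUESTS = 16)
CONCURRENT_REQUESTS = 1

# A softer alternative would be to throttle only this site:
# CONCURRENT_REQUESTS_PER_DOMAIN = 1
```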

The requests I was yielding were similar (see above), and their response data overlapped with one another, which is what produced duplicates of only one kind of response.

No idea how exactly this happens, but the fix of making the requests sequential confirms it. Does anyone have a better explanation?