I am trying to scrape the family information of India's Rajya Sabha members found at http://164.100.47.5/Newmembers/memberlist.aspx. Being new to Scrapy, I followed this and this example code and produced the following.
```python
def parse(self, response):
    print "Inside parse"
    requests = []
    target_base_prefix = 'ctl00$ContentPlaceHolder1$GridView2$ctl'
    target_base_suffix = '$lkb'
    for i in range(2, 5):
        if i < 10:
            target_id = "0" + str(i)
        else:
            target_id = str(i)
        evTarget = target_base_prefix + target_id + target_base_suffix
        form_data = {'__EVENTTARGET': evTarget, '__EVENTARGUMENT': ''}
        requests.append(scrapy.http.FormRequest.from_response(
            response, formdata=form_data, dont_filter=True,
            method='POST', callback=self.parse_politician))
    for r in requests:
        print "before yield" + str(r)
        yield r

def parse_pol_bio(self, response):
    print "[parse_pol_bio]- response url - " + response.url
    name_xp = '//span[@id="ctl00_ContentPlaceHolder1_GridView1_ctl02_Label3"]/font/text()'
    base_xp_prefix = '//*[@id="ctl00_ContentPlaceHolder1_TabContainer1_TabPanel2_ctl00_DetailsView2_Label'
    base_xp_suffix = '"]/text()'
    father_id = '12'
    mother_id = '13'
    married_id = '1'
    spouse_id = '3'
    name = response.xpath(name_xp).extract()[0].strip()
    name = re.sub(' +', ' ', name)
    father = response.xpath(base_xp_prefix + father_id + base_xp_suffix).extract()[0].strip()
    mother = response.xpath(base_xp_prefix + mother_id + base_xp_suffix).extract()[0].strip()
    married = response.xpath(base_xp_prefix + married_id + base_xp_suffix).extract()[0].strip().split(' ')[0]
    if married == "Married":
        spouse = response.xpath(base_xp_prefix + spouse_id + base_xp_suffix).extract()[0].strip()
    else:
        spouse = ''
    print 'name marital_stat father_name mother_name spouse'
    print name, married, father, mother, spouse
    item = RsItem()
    item['name'] = name
    item['spouse'] = spouse
    item['mother'] = mother
    item['father'] = father
    return item

def parse_politician(self, response):
    evTarget = 'ctl00$ContentPlaceHolder1$TabContainer1'
    evArg = 'activeTabChanged:1'
    formdata = {'__EVENTTARGET': evTarget, '__EVENTARGUMENT': evArg}
    print "[parse_politician]-response url - " + response.url
    return scrapy.FormRequest.from_response(
        response, formdata=formdata, method='POST', callback=self.parse_pol_bio)
```
Explanation

parse
loops over the target IDs for the different politicians and sends the requests.

parse_politician
is used to switch the active tab.

parse_pol_bio
does the scraping of the name-dependent fields.
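The target-id construction that `parse` loops over can be sketched on its own (a hypothetical simplification: `str(i).zfill(2)` replaces the original `if i < 10` zero-padding branch):

```python
# Build the zero-padded ASP.NET postback targets that parse() iterates over.
# str(i).zfill(2) is equivalent to the original "0" + str(i) branch for i < 10.
def build_event_target(i):
    return 'ctl00$ContentPlaceHolder1$GridView2$ctl' + str(i).zfill(2) + '$lkb'

targets = [build_event_target(i) for i in range(2, 5)]
print(targets[0])  # ctl00$ContentPlaceHolder1$GridView2$ctl02$lkb
```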
Problem

The problem is that this produces duplicate responses in parse_pol_bio, i.e. information about the same person appears more than once. Which politician's data is duplicated is random from run to run: a different one may be duplicated each time. I have checked whether any request is yielded more than once, and none is. I also tried adding a sleep after each yielded request to see whether it helps. I suspect the Scrapy request scheduler here. Is there any other problem in the code? What can be done to avoid this?
Edit

To clarify a few things: I know what dont_filter=True does and have kept it deliberately. The problem is that some response data is being replaced. For example, when I generate 3 requests with target_id = 1, 2, 3 respectively, the response for target_id = 1 is replaced by the response for target_id = 2 [so I get one response for target_id = 3 and two for target_id = 2].
Expected output (csv)
politician name , spouse name , father name , mother name
pol1 , spouse1, father1, mother1
pol2 , spouse2, father2, mother2
pol3 , spouse3, father3, mother3
Actual output (csv)
politician name , spouse name , father name , mother name
pol1 , spouse1, father1, mother1
pol1 , spouse1, father1, mother1
pol3 , spouse3, father3, mother3
Answer (score: 1)
Finally fixed it (phew!)

By default, Scrapy sends 16 requests at a time (concurrent requests). Putting CONCURRENT_REQUESTS = 1 in the settings.py file makes the requests sequential and solved the problem.

The requests I was yielding were similar (see above), and their response data was overlapping with each other, giving only one kind of duplicated response. I have no idea how that happens, but the fix of making the requests sequential confirms it. Is there a better explanation?
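For reference, a minimal sketch of the fix described above (settings.py is Scrapy's standard project settings module; the default of 16 is Scrapy's documented CONCURRENT_REQUESTS default):

```python
# settings.py -- project-wide Scrapy settings.
# Force sequential requests so responses cannot overlap;
# Scrapy's default is CONCURRENT_REQUESTS = 16.
CONCURRENT_REQUESTS = 1
```

Alternatively, the same key can be set for just one spider via the spider class's `custom_settings` dict, leaving the rest of the project at the default concurrency.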