How to scrape a URL with Scrapy when the data it returns is controlled by another URL

Date: 2018-04-14 14:57:58

Tags: python scrapy web-crawler

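I have recently been learning web scraping. I picked dataurl as a target and can easily fetch its data with Scrapy, but it always responds with English data.

While trying to get the Chinese data, I found that the dataurl response depends on urlControllLangContentByServerSide, whose &plang=1 parameter maps to a country/language. I even appended &plang=3 to dataurl, or sent {plang:3} as formdata, but that doesn't work.

In short, urlControllLangContentByServerSide has to be visited first if I want the Chinese data from dataurl. Many tests in Postman have confirmed this, but I don't know how to handle it in code.

Thank you for taking the time to read and think about this.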

 # Methods of a scrapy.Spider subclass; needs: import random, time, scrapy
 def start_requests(self):
     urlControllLangContentByServerSide = 'http://messefrankfurt.kenti-creative.com/index.php?moduleId=129&pageName=list2&pId=14&plang=3'
     dataurl = 'http://messefrankfurt.kenti-creative.com/modules/exhibitor/ajax/more2.php?moduleId=129&pageName=list2&pId=14&yId=0&hId=0&uId=-2&cId=undefined&aId=-1&fId=0&plang=3'
     # I even appended &plang=3 to dataurl, but that doesn't work.
     for s in range(5):
         time.sleep(.5)  # I'm trying to visit this URL several times to tell the server which
         # language should be used! Maybe the server uses a session to control the language.
         yield scrapy.Request(urlControllLangContentByServerSide, callback=self.parse_m, method='POST')

     for i in range(5):
         form_data = {"page": "%s" % i}
         self.current_index = i
         yield scrapy.FormRequest(dataurl, callback=self.parse,
                                  method='POST', formdata=form_data)
     print(self.wrongs)

 def parse_m(self, response):
     with open('mother%s.html' % random.randint(3, 90), 'wb') as f:
         f.write(response.body)

1 answer:

Answer 0 (score: 1)

If you need to visit urlControllLangContentByServerSide first and only then dataurl, in that order, you can yield the dataurl requests from the parse_m method, like this:

 # Methods of a scrapy.Spider subclass; needs: import random, time, scrapy
 def start_requests(self):
     urlControllLangContentByServerSide = 'http://messefrankfurt.kenti-creative.com/index.php?moduleId=129&pageName=list2&pId=14&plang=3'
     # I even appended &plang=3 to dataurl, but that doesn't work.
     for s in range(5):
         time.sleep(.5)  # I'm trying to visit this URL several times to tell the server which
         # language should be used! Maybe the server uses a session to control the language.
         yield scrapy.Request(urlControllLangContentByServerSide, callback=self.parse_m, method='POST')

     print(self.wrongs)

 def parse_m(self, response):
     dataurl = 'http://messefrankfurt.kenti-creative.com/modules/exhibitor/ajax/more2.php?moduleId=129&pageName=list2&pId=14&yId=0&hId=0&uId=-2&cId=undefined&aId=-1&fId=0&plang=3'
     with open('mother%s.html' % random.randint(3, 90), 'wb') as f:
         f.write(response.body)
     # The language page has been fetched, so the dataurl pages can be requested now.
     for i in range(5):
         form_data = {"page": "%s" % i}
         self.current_index = i
         yield scrapy.FormRequest(dataurl, callback=self.parse,
                                  method='POST', formdata=form_data)

 def parse(self, response):
     pass  # Parse the response of dataurl
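
Why the order might matter can be shown with a small standard-library sketch. It assumes, as the question guesses but never confirms, that the server remembers the language in a session cookie. The toy server, its `/index` and `/data` paths, the `lang` cookie name and the `fetch_data` helper are all hypothetical and only stand in for the real site.

```python
import http.cookiejar
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class LangHandler(BaseHTTPRequestHandler):
    """Toy server: /index sets a language cookie, any other path reads it."""

    def do_GET(self):
        self.send_response(200)
        if self.path.startswith("/index"):
            # Assumed behaviour: the language choice is stored in a cookie.
            self.send_header("Set-Cookie", "lang=zh; Path=/")
            body = b"language set"
        else:
            # Without the cookie the toy server falls back to English.
            cookie = self.headers.get("Cookie", "")
            body = b"zh data" if "lang=zh" in cookie else b"en data"
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch_data(visit_index_first):
    """Fetch /data, optionally after visiting /index with the same cookie jar."""
    server = HTTPServer(("127.0.0.1", 0), LangHandler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = "http://127.0.0.1:%d" % server.server_port
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    try:
        if visit_index_first:
            opener.open(base + "/index?plang=3").read()
        return opener.open(base + "/data").read().decode()
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(fetch_data(visit_index_first=False))
    print(fetch_data(visit_index_first=True))
```

Scrapy's cookie middleware is enabled by default and plays the role of the cookie jar here, so a request yielded from parse_m is sent with whatever cookies the first response set. If the real site tracks the language some other way, this sketch does not apply.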

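If you need to pass data between requests, you can use the meta attribute. For more information, see this tutorial, which also has an example of how to use the meta attribute.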