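I've recently been learning web scraping. I picked a dataurl that I can easily fetch with Scrapy, but it always responds with English data.
To get the Chinese data, I found that the response of dataurl depends on the country/language that urlControllLangContentByServerSide was requested with, which is set by a parameter such as &plang=1. I even appended &plang=3 to dataurl and added {plang:3} to its formdata, but that doesn't work.
In short, urlControllLangContentByServerSide has to be visited first if I want the Chinese data from dataurl. Many tests in Postman have confirmed this, but I don't know how to handle it in code.
Thanks for taking the time to read and think about this.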
def start_requests(self):
    urlControllLangContentByServerSide = 'http://messefrankfurt.kenti-creative.com/index.php?moduleId=129&pageName=list2&pId=14&plang=3'
    dataurl = 'http://messefrankfurt.kenti-creative.com/modules/exhibitor/ajax/more2.php?moduleId=129&pageName=list2&pId=14&yId=0&hId=0&uId=-2&cId=undefined&aId=-1&fId=0&plang=3'
    # I even appended &plang=3 to dataurl, but that doesn't work.
    for s in range(5):
        # I'm trying to visit this URL several times to tell the server which
        # language should be used. Maybe the server uses the session to control the language.
        time.sleep(.5)
        yield scrapy.Request(urlControllLangContentByServerSide, callback=self.parse_m, method='POST')
    for i in range(5):
        form_data = {"page": "%s" % i}
        self.current_index = i
        yield scrapy.FormRequest(dataurl, callback=self.parse,
                                 method='POST', formdata=form_data)
    print(self.wrongs)

def parse_m(self, response):
    with open('mother%s.html' % random.randint(3, 90), 'wb') as f:
        f.write(response.body)
Answer 0 (score: 1)
If you need to visit urlControllLangContentByServerSide first and only then visit dataurl, you can yield the dataurl requests from the parse_m method, like this:
def start_requests(self):
    urlControllLangContentByServerSide = 'http://messefrankfurt.kenti-creative.com/index.php?moduleId=129&pageName=list2&pId=14&plang=3'
    for s in range(5):
        # Visit this URL first so the server knows which language to use;
        # the server may be using the session to control the language.
        time.sleep(.5)
        yield scrapy.Request(urlControllLangContentByServerSide, callback=self.parse_m, method='POST')
    print(self.wrongs)

def parse_m(self, response):
    dataurl = 'http://messefrankfurt.kenti-creative.com/modules/exhibitor/ajax/more2.php?moduleId=129&pageName=list2&pId=14&yId=0&hId=0&uId=-2&cId=undefined&aId=-1&fId=0&plang=3'
    with open('mother%s.html' % random.randint(3, 90), 'wb') as f:
        f.write(response.body)
    # The language page has been fetched, so the data requests can go out now.
    for i in range(5):
        form_data = {"page": "%s" % i}
        self.current_index = i
        yield scrapy.FormRequest(dataurl, callback=self.parse,
                                 method='POST', formdata=form_data)

def parse(self, response):
    pass  # Parse the response of dataurl
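Ordering the requests this way only helps if the server remembers the language between requests, for example in a session cookie, which is what the question suspects. Scrapy keeps cookies between requests by default, so the chained requests share one session. The sketch below illustrates that mechanism with the standard library only. It uses a toy local server, not the real site: the cookie name plang, the "chinese"/"english" bodies, and the helpers LangHandler and fetch_data are all assumptions made up for the illustration.

```python
import threading
import urllib.request
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse


class LangHandler(BaseHTTPRequestHandler):
    """Toy server (hypothetical): /index.php stores plang in a cookie,
    any other path answers according to that cookie."""

    def do_GET(self):
        parsed = urlparse(self.path)
        if parsed.path == "/index.php":
            plang = parse_qs(parsed.query).get("plang", ["1"])[0]
            self._reply(b"language set", cookie="plang=%s" % plang)
        else:
            cookie = self.headers.get("Cookie", "")
            self._reply(b"chinese" if "plang=3" in cookie else b"english")

    def _reply(self, body, cookie=None):
        self.send_response(200)
        if cookie:
            self.send_header("Set-Cookie", cookie)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch_data(visit_language_page_first):
    """Fetch the data page, optionally visiting the language page first."""
    server = HTTPServer(("127.0.0.1", 0), LangHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = "http://127.0.0.1:%d" % server.server_port
    # One opener with one cookie jar plays the role of one session.
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({}),  # ignore any proxy settings
        urllib.request.HTTPCookieProcessor(CookieJar()),
    )
    try:
        if visit_language_page_first:
            opener.open(base + "/index.php?plang=3").read()
        return opener.open(base + "/more2.php").read().decode()
    finally:
        server.shutdown()
        server.server_close()
```

Calling fetch_data(False) goes straight to the data page, while fetch_data(True) visits the language page first with the same cookie jar. If the real site keys the language on something other than a cookie, this sketch does not apply.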
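If you need to pass data between requests, you can use the meta attribute. For more information, see this tutorial, which also includes an example of how to use the meta attribute.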