Here is my simple spider code (just getting started):
def start_requests(self):
    urls = [
        'http://www.liputan6.com/search?q=bubarkan+hti&type=all',
    ]
    for url in urls:
        yield scrapy.Request(url=url, callback=self.parse)

def parse(self, response):
    page = response.url.split("/")[-2]
    filename = 'quotes-%s.html' % page
    with open(filename, 'wb') as f:
        f.write(response.body)
    self.log('Saved file %s' % filename)
With a browser I can open 'http://www.liputan6.com/search?q=bubarkan+hti&type=all' normally. So why does Scrapy get a 302 response, and why can't it crawl the page?
Could someone please tell me how to fix this?
Answer 0 (score: 0)
It seems the page requires certain cookies, and when it doesn't find them it redirects to the index page.
I got it working by adding these cookies: js_enabled=true; is_cookie_active=true;
You can reproduce this in the Scrapy shell:
$ scrapy shell "http://www.liputan6.com/search?q=bubarkan+hti&type=all"
# the redirect happens
In [1]: response.url
Out[1]: 'http://www.liputan6.com'
# add the cookie to the request:
In [2]: request.headers['Cookie'] = 'js_enabled=true; is_cookie_active=true;'
In [3]: fetch(request)
# the redirect no longer happens
In [4]: response.url
Out[4]: 'http://www.liputan6.com/search?q=bubarkan+hti&type=all'
Edit: for your code, try this:
def start_requests(self):
    urls = [
        'http://www.liputan6.com/search?q=bubarkan+hti&type=all',
    ]
    for url in urls:
        req = scrapy.Request(url=url, callback=self.parse)
        req.headers['Cookie'] = 'js_enabled=true; is_cookie_active=true;'
        yield req

def parse(self, response):
    # you should get a 200 response here
    ...
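As an aside, scrapy.Request also accepts a cookies= dict argument, which lets Scrapy build and manage the header for you instead of setting request.headers['Cookie'] by hand. If you do build the header string manually, the format is just name=value pairs joined by '; '. A minimal stdlib-only sketch of that formatting (the cookie_header helper name is my own, not part of Scrapy):

```python
# Hypothetical helper: format a dict of cookies into the Cookie header
# string used in the answer above ('name=value' pairs joined by '; ').
def cookie_header(cookies):
    return '; '.join('%s=%s' % (name, value) for name, value in cookies.items())

hdr = cookie_header({'js_enabled': 'true', 'is_cookie_active': 'true'})
print(hdr)  # → js_enabled=true; is_cookie_active=true
```

With the cookies= route, the equivalent would be scrapy.Request(url, cookies={'js_enabled': 'true', 'is_cookie_active': 'true'}, callback=self.parse), which also lets Scrapy's cookie middleware persist any cookies the site sets back.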