This is the code I wrote to scrape the "blablacar" website:
# -*- coding: utf-8 -*-
import scrapy


class BlablaSpider(scrapy.Spider):
    name = 'blabla'
    allowed_domains = ['blablacar.in']
    start_urls = ['http://www.blablacar.in/ride-sharing/new-delhi/chandigarh']

    def parse(self, response):
        print(response.text)
When I run the above, I get this error:
2018-06-11 00:07:05 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-06-11 00:07:06 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://www.blablacar.in/robots.txt> (referer: None)
2018-06-11 00:07:06 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://www.blablacar.in/ride-sharing/new-delhi/chandigarh> (referer: None)
2018-06-11 00:07:06 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://www.blablacar.in/ride-sharing/new-delhi/chandigarh>: HTTP status code is not handled or not allowed
2018-06-11 00:07:06 [scrapy.core.engine] INFO: Closing spider (finished)
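Answer 0 (score: 0)
In HTTP, a 403 status code usually means you are not allowed to access the page.
Try your spider on a different website. If the same error does not appear there, the problem is most likely caused by how this particular site responds.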
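Answer 1 (score: 0)
You need to configure a user agent. I ran your code on my side with a user agent configured and got status code 200.
1. Create a new file named utils.py next to settings.py:
import random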
user_agent_list = [
    # Chrome
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
    'Mozilla/5.0 (Windows NT 5.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
    # The trailing comma below matters: without it, Python would silently
    # concatenate this string with the next one.
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36',
    # Internet Explorer and Firefox
    'Mozilla/4.0 (compatible; MSIE 9.0; Windows NT 6.1)',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)',
    'Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (Windows NT 6.2; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0)',
    'Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)',
    'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)',
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)',
    'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:59.0) Gecko/20100101 Firefox/59.0',
]


def get_random_agent():
    # Pick one user agent string at random from the list above.
    return random.choice(user_agent_list)
2. Add this to your settings.py file:
from <SCRAPY_PROJECT>.utils import get_random_agent
USER_AGENT = get_random_agent()
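Note that settings.py is evaluated once when the crawl starts, so the setup above picks a single random user agent per run, not per request. If a different agent per request is wanted, one option is a small downloader middleware. The following is only a sketch: the class name RandomUserAgentMiddleware and the shortened USER_AGENTS list are hypothetical, and in a real project the list would be user_agent_list from utils.py.

```python
import random

# Hypothetical, shortened pool for illustration; in the project above this
# would be user_agent_list imported from utils.py.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
    'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:59.0) Gecko/20100101 Firefox/59.0',
]


class RandomUserAgentMiddleware:
    """Downloader middleware that sets a random User-Agent on each request."""

    def process_request(self, request, spider):
        # Overwrite whatever User-Agent is currently set on this request.
        request.headers['User-Agent'] = random.choice(USER_AGENTS)
        # Returning None tells Scrapy to keep processing the request normally.
        return None
```

To try it, the class would be registered in settings.py through the DOWNLOADER_MIDDLEWARES dict, for example {'<SCRAPY_PROJECT>.middlewares.RandomUserAgentMiddleware': 543}. The module path and the priority value 543 are assumptions here and may need adjusting for your project.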
Answer 2 (score: 0)
According to the Scrapy documentation, you can use the handle_httpstatus_list spider attribute.
In your case:
class BlablaSpider(scrapy.Spider):
    name = 'blabla'
    allowed_domains = ['blablacar.in']
    start_urls = ['http://www.blablacar.in/ride-sharing/new-delhi/chandigarh']
    handle_httpstatus_list = [403]
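Keep in mind that handle_httpstatus_list only lets the 403 response reach your callback; it does not make the site return the real page. The callback therefore has to check the status itself. A minimal sketch of such a check follows. The helper name summarize_response is hypothetical, and the SimpleNamespace object is just a stand-in for a Scrapy response so the snippet can be run outside a crawl.

```python
from types import SimpleNamespace


def summarize_response(response):
    """Build a small item describing the response, flagging 403 as blocked."""
    return {
        'url': response.url,
        'status': response.status,
        'blocked': response.status == 403,
    }


# Inside BlablaSpider, parse() could delegate to the helper like this:
#     def parse(self, response):
#         yield summarize_response(response)

# Stand-in object for a quick check outside Scrapy (not a real Response).
fake = SimpleNamespace(
    url='http://www.blablacar.in/ride-sharing/new-delhi/chandigarh',
    status=403,
)
print(summarize_response(fake))
```

With this in place, a blocked page shows up in the scraped items instead of being silently dropped by the HttpError spider middleware, which makes it easier to see whether a user agent change actually helped.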