This is the code I wrote to scrape the justdial website.
import scrapy
from scrapy.http.request import Request

class JustdialSpider(scrapy.Spider):
    name = 'justdial'
    # handle_httpstatus_list = [400]
    # headers={'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}
    # handle_httpstatus_list = [403, 404]
    allowed_domains = ['justdial.com']
    start_urls = ['https://www.justdial.com/Delhi-NCR/Chemists/page-1']
    # def start_requests(self):
    #     # hdef start_requests(self):
    #     headers= {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'}
    #     for url in self.start_urls:
    #         self.log("I just visited :---------------------------------- "+url)
    #         yield Request(url, headers=headers)
    def parse(self,response):
        self.log("I just visited the site:---------------------------------------------- "+response.url)
        urls = response.xpath('//a/@href').extract()
        self.log("Urls-------: "+str(urls))
This is the error shown in the terminal:
2017-08-18 18:32:25 [scrapy.core.engine] INFO: Spider opened
2017-08-18 18:32:25 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-08-18 18:32:25 [scrapy.extensions.httpcache] DEBUG: Using filesystem cache storage in D:\scrapy\justdial\.scrapy\httpcache
2017-08-18 18:32:25 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-08-18 18:32:25 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.justdial.com/robots.txt> (referer: None) ['cached']
2017-08-18 18:32:25 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.justdial.com/Delhi-NCR/Chemists/page-1> (referer: None) ['cached']
2017-08-18 18:32:25 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.justdial.com/Delhi-NCR/Chemists/page-1>: HTTP status code is not handled or not allowed
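I have read similar questions on Stack Overflow and tried everything suggested there. As the commented-out lines in the code show, I have:
changed the User-Agent
set handle_httpstatus_list = [400]
Note: this site (https://www.justdial.com/Delhi-NCR/Chemists/page-1) is not even blocked on my system. It opens normally when I load it in Chrome or Mozilla. I get the same error for the site https://www.practo.com/bangalore#doctor-search.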
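Answer 0 (score: 2)
It starts working when the user agent is set through the user_agent spider attribute. Setting the request headers is probably not enough, because the default user agent string overrides them. So set the spider attribute
user_agent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"
(the same way you set start_urls) and try again.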
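Answer 1 (score: 0)
Your investigation suggests the problem lies with the HTTP client (scrapy) rather than the network (firewall, IP ban).
Read the scrapy documentation on how to turn on debug logging. You want to see the contents of the HTTP request that scrapy sends. It may include a cookie that the website set while the user agent was still scrapy.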
Answer 2 (score: 0)
As Tomáš Linhart mentioned, we have to add a USER_AGENT setting in settings.py, for example
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1'