Scrapy crawl returns 405

Time: 2019-03-22 05:55:52

Tags: scrapy http-status-code-405

start_urls = ['https://www.qichacha.com/search?key=北京证大向上']

def parse(self, response):
    # start_urls points at a list page; company_url is a detail-page URL
    # extracted from that list page
    yield scrapy.Request(url=company_url, meta={"infos": info},
                         callback=self.parse_basic_info, dont_filter=True)

When I request company_url, the response is 405. However, if I use

response = requests.get(company_url, headers=headers)
print(response.status_code)
print(response.text)

then it returns 200 and the HTML page can be parsed. Likewise, with

start_urls = [company_url]

def parse(self, response):
    print(response.status)
    print(response.text)

the response is also 200. I don't understand why the first case gets a 405. When it responds with 405, the request I print looks like this:

{'_encoding': 'utf-8', 'method': 'GET', '_url': 'https://www.qichacha.com/firm_b18bf42ee07d7961e91a0edaf1649287.html', '_body': b'', 'priority': 0, 'callback': None, 'errback': None, 'cookies': {}, 'headers': {b'User-Agent': [b'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20']}, 'dont_filter': False, '_meta': {'depth': 1}, 'flags': []}

What is going wrong?

1 answer:

Answer 0: (score: 0)

The page seems to block Scrapy based on its default user-agent string. Running the spider like this worked for me:

scrapy runspider -s USER_AGENT="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.80 Safari/537.36" spider.py

Alternatively, you can set USER_AGENT in your project's settings.py. Or use something like scrapy-fake-useragent to handle it automatically.
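If you only want the override for this one spider rather than the whole project, Scrapy also supports a `custom_settings` dict on the spider class. A minimal sketch, assuming a hypothetical spider name (`qichacha`) and reusing the start URL from the question; the user-agent string is just the example from the command above:

```python
import scrapy


class QichachaSpider(scrapy.Spider):
    # Hypothetical spider name, for illustration only.
    name = "qichacha"
    start_urls = ['https://www.qichacha.com/search?key=北京证大向上']

    # Per-spider settings take precedence over the project settings.py.
    # A browser-like User-Agent avoids the 405 triggered by Scrapy's default.
    custom_settings = {
        'USER_AGENT': ('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                       '(KHTML, like Gecko) Chrome/71.0.3578.80 Safari/537.36'),
    }

    def parse(self, response):
        self.logger.info("status: %s", response.status)
```

The project-wide equivalent is a single line in settings.py: `USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) ..."`, which then applies to every spider in the project.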