Scrapy 503 Service Unavailable on start_url

Time: 2019-01-07 06:49:28

Tags: python scrapy web-crawler scrapy-spider

I modified this spider, but it gives this error:

Gave up retrying <GET https://lib.maplelegends.com/robots.txt> (failed 3 times): 503 Service Unavailable 
2019-01-06 23:43:56 [scrapy.core.engine] DEBUG: Crawled (503) <GET https://lib.maplelegends.com/robots.txt> (referer: None)
2019-01-06 23:43:56 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://lib.maplelegends.com/?p=etc&id=4004003> (failed 1 times): 503 Service Unavailable
2019-01-06 23:43:56 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://lib.maplelegends.com/?p=etc&id=4004003> (failed 2 times): 503 Service Unavailable
2019-01-06 23:43:56 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://lib.maplelegends.com/?p=etc&id=4004003> (failed 3 times): 503 Service Unavailable
2019-01-06 23:43:56 [scrapy.core.engine] DEBUG: Crawled (503) <GET https://lib.maplelegends.com/?p=etc&id=4004003> (referer: None)
2019-01-06 23:43:56 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <503 https://lib.maplelegends.com/?p=etc&id=4004003>: HTTP status code is not handled or not allowed

Crawler code:

    #!/usr/bin/env python3
    import scrapy
    import time

    start_url = 'https://lib.maplelegends.com/?p=etc&id=4004003'


    class MySpider(scrapy.Spider):
        name = 'MySpider'
        start_urls = [start_url]

        def parse(self, response):
            # print('url:', response.url)
            # each item row lives in the responsive table body
            products = response.xpath('.//div[@class="table-responsive"]/table/tbody')
            for product in products:
                item = {
                    # 'name': product.xpath('./tr/td/b[1]/a/text()').extract(),
                    'link': product.xpath('./tr/td/b[1]/a/@href').extract(),
                }
                # url = response.urljoin(item['link'])
                # yield scrapy.Request(url=url, callback=self.parse_product, meta={'item': item})
                yield response.follow(item['link'], callback=self.parse_product, meta={'item': item})

            time.sleep(5)

            # re-queue the start URL with low priority
            yield scrapy.Request(start_url, dont_filter=True, priority=-1)

        def parse_product(self, response):
            # print('url:', response.url)
            name = response.xpath('(//strong)[1]/text()').re(r'(\w+)')
            hp = response.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "image", " " ))] | //img').re(r':(\d+)')
            scrolls = response.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "image", " " ))] | //strong+//a//img/@title').re(r'\bScroll\b')
            for name, hp, scroll in zip(name, hp, scrolls):
                yield {'name': name.strip(), 'hp': hp.strip(), 'scroll': scroll.strip()}

It runs without a project and saves the output in output.csv.
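(Presumably it is launched with something like the command below; the filename myspider.py is just a placeholder for wherever the spider is saved.)

    scrapy runspider myspider.py -o output.csv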

1 Answer:

Answer 0 (score: 2):

Robots.txt

Your crawler is trying to check the robots.txt file, but the website doesn't have one.

To avoid this you can set the ROBOTSTXT_OBEY setting to False in your settings.py file.
By default it is False, but new Scrapy projects generated with the scrapy startproject command have ROBOTSTXT_OBEY = True in the generated template.
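Since the spider in the question runs without a project (so there is no settings.py), a minimal sketch of the same fix is to override the setting on the spider class itself via custom_settings:

    import scrapy

    start_url = 'https://lib.maplelegends.com/?p=etc&id=4004003'

    class MySpider(scrapy.Spider):
        name = 'MySpider'
        start_urls = [start_url]

        # per-spider settings override: skip downloading/obeying robots.txt
        custom_settings = {
            'ROBOTSTXT_OBEY': False,
        }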

503 responses

Further, the website seems to respond with a 503 to every first request. The website is using some sort of bot protection:

The first request returns a 503, then some JavaScript is executed to make an AJAX request that generates a __shovlshield cookie.


It seems like the https://shovl.io/ DDoS protection is being used.
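If you want to inspect the 503 challenge page yourself (the log above shows the HttpError middleware ignoring it), a small sketch of a probe spider that lets the 503 response reach the callback could look like this; the spider name and output filename are arbitrary:

    import scrapy

    class ShieldProbeSpider(scrapy.Spider):
        name = 'shield_probe'  # arbitrary name for this throwaway spider
        start_urls = ['https://lib.maplelegends.com/?p=etc&id=4004003']

        # let 503 responses reach parse() instead of being dropped,
        # and don't waste time on retries or robots.txt
        handle_httpstatus_list = [503]
        custom_settings = {'ROBOTSTXT_OBEY': False, 'RETRY_ENABLED': False}

        def parse(self, response):
            self.logger.info('status: %s', response.status)
            # save the challenge HTML/JavaScript for offline inspection
            with open('challenge.html', 'wb') as f:
                f.write(response.body)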

To solve this you need to reverse engineer how the JavaScript generates the cookie, or employ JavaScript rendering techniques/services such as Selenium or Splash.
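A minimal sketch of the Splash route is below. It assumes the scrapy-splash package and a local Splash instance on http://localhost:8050, and the 5 second wait is only a guess at how long the challenge JavaScript needs; whether this alone defeats the shield depends on the challenge itself.

    import scrapy
    from scrapy_splash import SplashRequest

    start_url = 'https://lib.maplelegends.com/?p=etc&id=4004003'

    class MySpider(scrapy.Spider):
        name = 'MySpider'

        custom_settings = {
            'ROBOTSTXT_OBEY': False,
            # standard scrapy-splash wiring, pointing at a local Splash instance
            'SPLASH_URL': 'http://localhost:8050',
            'DOWNLOADER_MIDDLEWARES': {
                'scrapy_splash.SplashCookiesMiddleware': 723,
                'scrapy_splash.SplashMiddleware': 725,
                'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
            },
            'SPIDER_MIDDLEWARES': {
                'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
            },
            'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
        }

        def start_requests(self):
            # render the page in Splash so the challenge JavaScript can run
            yield SplashRequest(start_url, callback=self.parse, args={'wait': 5})

        def parse(self, response):
            # the rendered HTML can now be parsed as in the original spider
            self.logger.info('got %d bytes from %s', len(response.body), response.url)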