Scrapy ignores robots.txt but does not parse the page

Date: 2017-01-04 15:30:33

Tags: python web-scraping scrapy web-crawler

I'm trying to scrape ACRIS in a way similar to one of the answers (see its updated version). I slightly modified that code to remove deprecated pieces.

At first I ran into the problem of robots.txt blocking me, and I found that I could set ROBOTSTXT_OBEY = False in settings.py. That does seem to make Scrapy ignore robots.txt, but for some reason the spider still never enters the parse method.
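For reference, the change described above is a single line in the project's settings.py (in projects generated by scrapy startproject, ROBOTSTXT_OBEY defaults to True):

```python
# settings.py -- disable robots.txt compliance for this project
ROBOTSTXT_OBEY = False
```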

Here is my spider:

from scrapy.http import Request, FormRequest
from scrapy.item import Item, Field
from scrapy.spiders import Spider
import logging

class AcrisItem(Item):
    borough = Field()
    block = Field()


class AcrisSpider(Spider):
    name = "acris"
    allowed_domains = ["a836-acris.nyc.gov"]
    start_urls = ['https://a836-acris.nyc.gov/DS/DocumentSearch/PartyName']

    def start_requests(self):
        return [Request(url, meta={'dont_redirect': True}, callback=self.parse)
                for url in self.start_urls]

    def parse(self, response):
        form_token = response.selector.xpath('//input[@name="__RequestVerificationToken"]/@value').extract_first()

        logging.debug('THE FORM TOKEN IS: %s\n\n' % form_token)

        formdata = {
            "__RequestVerificationToken": form_token,
            "hid_last": "SMITH",
            "hid_first": "JOHN",
            "hid_ml": "",
            "hid_suffix": "",
            "hid_business": "",
            "hid_selectdate": "To Current Date",
            "hid_datefromm": "",
            "hid_datefromd": "",
            "hid_datefromy": "",
            "hid_datetom": "",
            "hid_datetod": "",
            "hid_datetoy": "",
            "hid_partype": "",
            "hid_borough": "All Boroughs/Counties",
            "hid_doctype": "All Document Classes",
            "hid_max_rows": "10",
            "hid_page": "1",
            "hid_partype_name": "All Parties",
            "hid_doctype_name": "All Document Classes",
            "hid_borough_name": "All Boroughs/Counties",
            "hid_ReqID": "",
            "hid_SearchType": "PARTYNAME",
            "hid_ISIntranet": "N",
            "hid_sort": ""
        }

        if form_token:
            yield FormRequest(url="https://a836-acris.nyc.gov/DS/DocumentSearch/PartyNameResult",
                              method="POST",
                              formdata=formdata,
                              meta={'dont_redirect': True},
                              callback=self.parse_page)

    def parse_page(self, response):
        rows = response.selector.xpath('//form[@name="DATA"]/table/tbody/tr[2]/td/table/tbody/tr')

        for row in rows:
            item = AcrisItem()

            borough = row.xpath('.//td[3]/div/font/text()').extract_first()
            block = row.xpath('.//td[4]/div/font/text()').extract_first()

            if borough and block:
                item['borough'] = borough
                item['block'] = block

                yield item

Here is the output (minus the init messages):

2017-01-04 17:06:12 [scrapy.core.engine] INFO: Spider opened
2017-01-04 17:06:12 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-01-04 17:06:12 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-01-04 17:06:41 [scrapy.core.engine] DEBUG: Crawled (301) <GET https://a836-acris.nyc.gov/DS/DocumentSearch/PartyName> (referer: None)
2017-01-04 17:06:41 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <301 https://a836-acris.nyc.gov/DS/DocumentSearch/PartyName>: HTTP status code is not handled or not allowed
2017-01-04 17:06:41 [scrapy.core.engine] INFO: Closing spider (finished)
2017-01-04 17:06:41 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 243,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 414,
 'downloader/response_count': 1,
 'downloader/response_status_count/301': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 1, 4, 15, 6, 41, 791647),
 'log_count/DEBUG': 2,
 'log_count/INFO': 8,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 1, 4, 15, 6, 12, 467659)}
2017-01-04 17:06:41 [scrapy.core.engine] INFO: Spider closed (finished)

1 Answer:

Answer 0 (score: 0)

You disable redirects on the requests made in start_requests, so the 301 response should be passed on toward your parse method. Unfortunately, you have not allowed that response status code through, so it is dropped before it reaches the spider.
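This is the interaction of two pieces of middleware. The sketch below paraphrases their decision logic in plain Python as an illustration (the names and simplified rules are mine, not Scrapy's actual implementation): RedirectMiddleware normally follows 3xx responses itself, but dont_redirect=True makes it hand the response downstream; HttpErrorMiddleware then filters out non-2xx responses unless they are explicitly whitelisted, which is why parse is never called.

```python
def redirect_middleware(status, meta):
    """Return 'follow' if Scrapy would issue a new request for the
    redirect target, or 'pass' if the response continues downstream."""
    if meta.get('dont_redirect', False):
        return 'pass'   # dont_redirect=True: hand the 3xx response onward
    if status in (301, 302, 303, 307, 308):
        return 'follow'  # normally redirects are followed transparently
    return 'pass'

def httperror_middleware(status, handle_httpstatus_list=()):
    """Return True if the response is delivered to the spider callback."""
    if 200 <= status < 300:
        return True      # success codes always go through
    return status in handle_httpstatus_list  # others must be whitelisted

# The question's setup: dont_redirect=True but no handle_httpstatus_list,
# so the 301 is passed downstream and then filtered out before parse().
assert redirect_middleware(301, {'dont_redirect': True}) == 'pass'
assert httperror_middleware(301) is False

# With the fix below, the 301 reaches parse():
assert httperror_middleware(301, handle_httpstatus_list=[301, 302]) is True
```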

Allow it with the spider attribute:

class AcrisSpider(Spider):
    ...
    handle_httpstatus_list = [301, 302]
    ...

Or pass the handle_httpstatus_all=True meta key on the request, like:

return [Request(url,
                meta={'dont_redirect': True, 'handle_httpstatus_all': True},
                callback=self.parse)
        for url in self.start_urls]
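A third option, not mentioned in the answer, is to whitelist the status codes project-wide in settings.py via the HTTPERROR_ALLOWED_CODES setting, assuming you want every spider in the project to receive these responses:

```python
# settings.py -- let 301/302 responses reach spider callbacks project-wide
HTTPERROR_ALLOWED_CODES = [301, 302]
```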