Scrapy Crawled (302) status: how to handle it?

Date: 2017-10-05 15:26:05

Tags: python web-scraping scrapy

import scrapy


class Pttscrapper2Spider(scrapy.Spider):
    name = 'PTTscrapper2'
    allowed_domains = ['https://www.ptt.cc']
    start_urls = ['https://www.ptt.cc/bbs/HatePolitics/index.html/']
    handle_httpstatus_list = [400, 302]

    def parse(self, response):
        urls = response.css('div.r-ent > div.title > a::attr(href)').extract()
        for thread_url in urls:
            url = response.urljoin(thread_url)
            yield scrapy.Request(url=url, callback=self.parse_details)

        next_page_url = response.css('a.wide:nth-child(2)::attr(href)').extract_first()
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse)            


    def parse_details(self, response):
        yield {
                'title' : response.xpath('//head/title/text()').extract(),
                'stance' : response.xpath('//*[@id="main-content"]/div[@class="push"]/span[1]/text()').extract(),
                'userid' : response.xpath('//*[@id="main-content"]/div[@class="push"]/span[2]/text()').extract(),
                'comment' : response.xpath('//*[@id="main-content"]/div[@class="push"]/span[3]/text()').extract(),
                'time_of_post' : response.xpath('//*[@id="main-content"]/div[@class="push"]/span[4]/text()').extract(),
        }

I have been using the spider above to try to crawl a website, but when I run it I get these messages:

> 2017-10-05 23:14:27 [scrapy.core.engine] INFO: Spider opened
> 2017-10-05 23:14:27 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
> 2017-10-05 23:14:27 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
> 2017-10-05 23:14:28 [scrapy.downloadermiddlewares.cookies] DEBUG: Received cookies from: <302 https://www.ptt.cc/bbs/HatePolitics/index.html/>
> Set-Cookie: __cfduid=d3ca57dcab04acfaf256438a57c547e4a1507216462; expires=Fri, 05-Oct-18 15:14:22 GMT; path=/; domain=.ptt.cc; HttpOnly
> 2017-10-05 23:14:28 [scrapy.core.engine] DEBUG: Crawled (302) <GET https://www.ptt.cc/bbs/HatePolitics/index.html/> (referer: None)
> 2017-10-05 23:14:28 [scrapy.core.engine] INFO: Closing spider (finished)

My suspicion is that my spider cannot reach the threads listed on the index page. I have tested that the selectors point to the right elements and that response.urljoin builds the correct absolute URLs, but the spider still never visits the thread pages. It would be great if someone could tell me why the spider cannot follow the links!
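For reference, this is the kind of quick check that can be done in Scrapy's interactive shell (a diagnostic sketch only; the shell runs with Scrapy's default settings, which follow redirects):

$ scrapy shell 'https://www.ptt.cc/bbs/HatePolitics/index.html/'
>>> response.status   # status of the response the shell ends up with
>>> response.url      # final URL after any redirects were followed
>>> response.css('div.r-ent > div.title > a::attr(href)').extract()[:3]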

1 Answer:

Answer 0 (score: 0):

There are two problems with your scraper. First, in start_urls you added a trailing slash after index.html, which is wrong. Second, allowed_domains should contain domain names, not URLs.

Change the beginning of your spider to the code below and it works fine:

class Pttscrapper2Spider(scrapy.Spider):
    name = 'PTTscrapper2'
    allowed_domains = ['www.ptt.cc']                                  # domain only, no scheme
    start_urls = ['https://www.ptt.cc/bbs/HatePolitics/index.html']   # no trailing slash

Run log:

2017-10-06 13:16:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.ptt.cc/bbs/HatePolitics/M.1507268600.A.57C.html> (referer: https://www.ptt.cc/bbs/HatePolitics/index.html)
2017-10-06 13:16:15 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.ptt.cc/bbs/HatePolitics/M.1507268600.A.57C.html>
{'title': ['[黑特] 先刪文,洪慈庸和高潞那個到底撤案了沒? - 看板 HatePolitics - 批踢踢實業坊'], 'stance': ['推 ', '→ ', '噓 ', '→ ', '→ '], 'userid': ['ABA0525', 'gerund', 'AGODFATHER', 'laman45', 'victoryman'], 'comment': [': 垃圾不分藍綠黃', ': 垃圾靠弟傭 中華民國內最沒資格當立委的爛貨', ': 說什麼東西你個板啊', ': 有確定再說', ': 看起來應該是撤了'], 'time_of_post': ['10/06 13:43\n', '10/06 13:50\n', '10/06 13:57\n', '10/06 13:59\n', ' 10/06 15:27\n']}
2017-10-06 13:16:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.ptt.cc/bbs/HatePolitics/M.1507275599.A.657.html> (referer: https://www.ptt.cc/bbs/HatePolitics/index.html)
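
One more note, separate from the fix above: listing 302 in handle_httpstatus_list tells Scrapy's RedirectMiddleware not to follow the redirect and to hand the raw 302 response straight to the callback, where there is no page body worth parsing, which is most likely why the original crawl closed after a single request. If you do not actually need to inspect redirects yourself, a sketch like this (keeping only 400, if you even want that) lets redirects be followed automatically:

import scrapy


class Pttscrapper2Spider(scrapy.Spider):
    name = 'PTTscrapper2'
    allowed_domains = ['www.ptt.cc']
    start_urls = ['https://www.ptt.cc/bbs/HatePolitics/index.html']
    # Leave 302 out of handle_httpstatus_list so RedirectMiddleware can
    # follow redirects instead of handing the empty 302 response to parse();
    # keep 400 only if you really want to handle those error pages yourself.
    handle_httpstatus_list = [400]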