Scrapy stops after crawling 2 pages

Time: 2017-10-02 10:47:24

Tags: python scrapy web-crawler

I am using Scrapy to crawl content from a forum, but it crawls only 2 pages and then stops. I am not using CrawlSpider; I follow the next page with a plain Request and a callback.

My code:

    # -*- coding: utf-8 -*-
    import scrapy
    import datetime

    from scrapy.selector import Selector
    from scrapy.utils.response import get_base_url
    from ttv.items import *


    class Scrape_TTV(scrapy.Spider):
        name = "ttv"
        allowed_domains = ['tangthuvien.vn', 'www.tangthuvien.vn']
        start_urls = [
            'http://www.tangthuvien.vn/forum/forumdisplay.php?f=142',
        ]

        def parse(self, response):
            # Scrape every thread listed on the current forum page.
            for site in response.css('div.threadinfo'):
                yield {
                    'title': site.css('h3.threadtitle > a.title::text').extract_first(),
                    'url': site.css('h3.threadtitle > a::attr(href)').extract_first(),
                }
            # Follow the pagination link and parse the next page recursively.
            page = response.css('span.prev_next a::attr(href)').extract_first()
            if page:
                yield scrapy.Request(response.urljoin(page), callback=self.parse)

Here is the debug output. I can see there is a filtered duplicate request around f=142&page=2, but I can't understand why. Could someone help me with this?

        2017-10-02 17:08:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.tangthuvien.vn/forum/forumdisplay.php?f=142&page=2>
        {'url': u'showthread.php?t=74829', 'title': u'[T\xecnh c\u1ea3m] Y\xeau v\u1eabn n\u01a1i \u0111\xe2y - T\xecnh Kh\xf4ng Lam H\u1ec1 - HO\xc0N TH\xc0NH'}
        2017-10-02 17:08:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.tangthuvien.vn/forum/forumdisplay.php?f=142&page=2>
        {'url': u'showthread.php?t=75425', 'title': u'[Ki\u1ebfm hi\u1ec7p] Ki\u1ebfm Kh\xed Tr\u01b0\u1eddng Giang [Th\u1ea7n Ch\xe2u K\u1ef3 Hi\u1ec7p]- \xd4n Th\u1ee5y An -Ho\xe0n th\xe0nh'}
        2017-10-02 17:08:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.tangthuvien.vn/forum/forumdisplay.php?f=142&page=2>
        {'url': u'showthread.php?t=74276', 'title': u'[Truy\u1ec7n ng\u1eafn] Tho\xe1ng nh\u01b0 h\xf4m qua - Ki\u1ec3u Ki\u1ec3u'}
        2017-10-02 17:08:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.tangthuvien.vn/forum/forumdisplay.php?f=142&page=2>
        {'url': u'showthread.php?t=29663', 'title': u'V\xf4 h\u1ea1n kh\u1ee7ng b\u1ed1 - zhttty (Ho\xe0n th\xe0nh)'}
        2017-10-02 17:08:19 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET http://www.tangthuvien.vn/forum/forumdisplay.php?f=142> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
        2017-10-02 17:08:19 [scrapy.core.engine] INFO: Closing spider (finished)
        2017-10-02 17:08:19 [scrapy.extensions.feedexport] INFO: Stored json feed (80 items) in: test2.json
        2017-10-02 17:08:19 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 1899,
         'downloader/request_count': 5,
         'downloader/request_method_count/GET': 5,
         'downloader/response_bytes': 440894,
         'downloader/response_count': 5,
         'downloader/response_status_count/200': 5,
         'dupefilter/filtered': 1,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2017, 10, 2, 10, 8, 19, 242000),
         'item_scraped_count': 80,
         'log_count/DEBUG': 87,
         'log_count/INFO': 8,
         'request_depth_max': 4,
         'response_received_count': 5,
         'scheduler/dequeued': 4,
         'scheduler/dequeued/memory': 4,
         'scheduler/enqueued': 4,
         'scheduler/enqueued/memory': 4,
         'start_time': datetime.datetime(2017, 10, 2, 10, 8, 17, 586000)}
        2017-10-02 17:08:19 [scrapy.core.engine] INFO: Spider closed (finished)

1 Answer:

Answer 0 (score: 0)

Try this -

Python 2.7.3

    import urlparse

    base_url = 'http://www.tangthuvien.vn/forum/'
    page = response.xpath('//a[@rel="next"]/@href').extract_first()
    # page = response.css('span.prev_next a::attr(href)').extract_first()
    if page:
        yield scrapy.Request(urlparse.urljoin(base_url, page), callback=self.parse)
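One caveat: `urlparse` is a Python 2 standard-library module (the answer targets Python 2.7.3); on Python 3 the same helper lives in `urllib.parse`. A small portability sketch, added here as an editorial note rather than part of the original answer:

    try:
        from urlparse import urljoin        # Python 2
    except ImportError:
        from urllib.parse import urljoin    # Python 3

    # ...then inside parse():
    # yield scrapy.Request(urljoin(base_url, page), callback=self.parse)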

Update - because I am more comfortable with XPath.
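For context on why the XPath likely matters: on page 2 the forum's pagination markup appears to render both a "prev" and a "next" arrow inside `span.prev_next`, and `extract_first()` returns the first match - the "prev" link back to `f=142`, which the dupefilter drops (exactly the filtered duplicate in the log above), so no page 3 request is ever scheduled. A minimal sketch of the corrected callback under that assumption, using the answer's `rel="next"` XPath plus `response.urljoin` so no hard-coded `base_url` is needed:

    def parse(self, response):
        for site in response.css('div.threadinfo'):
            yield {
                'title': site.css('h3.threadtitle > a.title::text').extract_first(),
                'url': site.css('h3.threadtitle > a::attr(href)').extract_first(),
            }
        # Select only the anchor explicitly marked as the "next" page, so the
        # "prev" arrow (a duplicate of an already-crawled page) is never followed.
        page = response.xpath('//a[@rel="next"]/@href').extract_first()
        if page:
            yield scrapy.Request(response.urljoin(page), callback=self.parse)

With this change the spider should keep following page=3, page=4, and so on until no `rel="next"` link is present on the page.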