I'm using Scrapy to crawl content from a forum, but it only crawls 2 pages and then stops. I'm not using CrawlSpider; I follow the next page with a callback instead.
My code:
# -*- coding: utf-8 -*-
import scrapy
import datetime
from scrapy.selector import Selector
from scrapy.utils.response import get_base_url
from ttv.items import *


class Scrape_TTV(scrapy.Spider):
    name = "ttv"
    allowed_domains = ['tangthuvien.vn', 'www.tangthuvien.vn']
    start_urls = [
        'http://www.tangthuvien.vn/forum/forumdisplay.php?f=142',
    ]

    def parse(self, response):
        for site in response.css('div.threadinfo'):
            yield {
                'title': site.css('h3.threadtitle > a.title::text').extract_first(),
                'url': site.css('h3.threadtitle > a::attr(href)').extract_first(),
            }
        page = response.css('span.prev_next a::attr(href)').extract_first()
        if page:
            yield scrapy.Request(response.urljoin(page), callback=self.parse)
Here is the debug log. I can see a duplicate request for f=142 being filtered while page=2 is parsed, but I don't understand why. Can you help me with this?
2017-10-02 17:08:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.tangthuvien.vn/forum/forumdisplay.php?f=142&page=2>
{'url': u'showthread.php?t=74829', 'title': u'[T\xecnh c\u1ea3m] Y\xeau v\u1eabn n\u01a1i \u0111\xe2y - T\xecnh Kh\xf4ng Lam H\u1ec1 - HO\xc0N TH\xc0NH'}
2017-10-02 17:08:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.tangthuvien.vn/forum/forumdisplay.php?f=142&page=2>
{'url': u'showthread.php?t=75425', 'title': u'[Ki\u1ebfm hi\u1ec7p] Ki\u1ebfm Kh\xed Tr\u01b0\u1eddng Giang [Th\u1ea7n Ch\xe2u K\u1ef3 Hi\u1ec7p]- \xd4n Th\u1ee5y An -Ho\xe0n th\xe0nh'}
2017-10-02 17:08:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.tangthuvien.vn/forum/forumdisplay.php?f=142&page=2>
{'url': u'showthread.php?t=74276', 'title': u'[Truy\u1ec7n ng\u1eafn] Tho\xe1ng nh\u01b0 h\xf4m qua - Ki\u1ec3u Ki\u1ec3u'}
2017-10-02 17:08:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.tangthuvien.vn/forum/forumdisplay.php?f=142&page=2>
{'url': u'showthread.php?t=29663', 'title': u'V\xf4 h\u1ea1n kh\u1ee7ng b\u1ed1 - zhttty (Ho\xe0n th\xe0nh)'}
2017-10-02 17:08:19 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET http://www.tangthuvien.vn/forum/forumdisplay.php?f=142> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2017-10-02 17:08:19 [scrapy.core.engine] INFO: Closing spider (finished)
2017-10-02 17:08:19 [scrapy.extensions.feedexport] INFO: Stored json feed (80 items) in: test2.json
2017-10-02 17:08:19 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1899,
'downloader/request_count': 5,
'downloader/request_method_count/GET': 5,
'downloader/response_bytes': 440894,
'downloader/response_count': 5,
'downloader/response_status_count/200': 5,
'dupefilter/filtered': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 10, 2, 10, 8, 19, 242000),
'item_scraped_count': 80,
'log_count/DEBUG': 87,
'log_count/INFO': 8,
'request_depth_max': 4,
'response_received_count': 5,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'start_time': datetime.datetime(2017, 10, 2, 10, 8, 17, 586000)}
2017-10-02 17:08:19 [scrapy.core.engine] INFO: Spider closed (finished)
Answer 0 (score: 0):
Try this (Python 2.7.3):
import urlparse

base_url = 'http://www.tangthuvien.vn/forum/'

page = response.xpath('//a[@rel="next"]/@href').extract_first()
# page = response.css('span.prev_next a::attr(href)').extract_first()
if page:
    yield scrapy.Request(urlparse.urljoin(base_url, page), callback=self.parse)
Update - I used XPath because I'm more comfortable with it.
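For reference, here is a minimal sketch of how the whole spider might look with that change folded in, assuming the forum pages really expose an anchor with rel="next" as the XPath above suggests. Using response.urljoin instead of a hard-coded base_url resolves the relative href against the page that was actually fetched. A likely reason for the dupefilter message in the original run is that span.prev_next a also matches the "prev" link, whose href points back to the already-crawled f=142 page; selecting only the rel="next" link sidesteps that.

# Sketch only, not the poster's exact code; assumes the forum markup has
# <a rel="next" href="..."> on each listing page.
import scrapy


class Scrape_TTV(scrapy.Spider):
    name = "ttv"
    allowed_domains = ['tangthuvien.vn', 'www.tangthuvien.vn']
    start_urls = [
        'http://www.tangthuvien.vn/forum/forumdisplay.php?f=142',
    ]

    def parse(self, response):
        # Extract the thread title and relative URL from each listing row.
        for site in response.css('div.threadinfo'):
            yield {
                'title': site.css('h3.threadtitle > a.title::text').extract_first(),
                'url': site.css('h3.threadtitle > a::attr(href)').extract_first(),
            }

        # Follow only the "next" link; the "prev" link (which points back to
        # an already-visited page and would be dropped by the dupefilter) is
        # never selected, so pagination continues past page 2.
        page = response.xpath('//a[@rel="next"]/@href').extract_first()
        if page:
            yield scrapy.Request(response.urljoin(page), callback=self.parse)

response.urljoin works on both Python 2 and 3, so the separate urlparse import is not needed if you take this route.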