scrapy: why is the parse_item function not called?

Time: 2016-08-12 07:19:07

Tags: scrapy scrapy-spider

Here is my spider:

import scrapy
import urlparse
from scrapy.http import Request

class BasicSpider(scrapy.Spider):
    name = "basic2"
    allowed_domains = ["cnblogs"]
    start_urls = (
        'http://www.cnblogs.com/kylinlin/',
    )

    def parse(self, response):
        # Follow the "next page" link to continue paginating
        next_site = response.xpath(".//*[@id='nav_next_page']/a/@href")
        for url in next_site.extract():
            yield Request(urlparse.urljoin(response.url, url))

        # Follow each post title link, handled by parse_item
        item_selector = response.xpath(".//*[@class='postTitle']/a/@href")
        for url in item_selector.extract():
            yield Request(url=urlparse.urljoin(response.url, url),
                          callback=self.parse_item)

    def parse_item(self, response):
        print "+=====================>>test"

Here is the output:

2016-08-12 14:46:20 [scrapy] INFO: Spider opened
2016-08-12 14:46:20 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-08-12 14:46:20 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-08-12 14:46:20 [scrapy] DEBUG: Crawled (200) <GET http://www.cnblogs.com/robots.txt> (referer: None)
2016-08-12 14:46:20 [scrapy] DEBUG: Crawled (200) <GET http://www.cnblogs.com/kylinlin/> (referer: None)
2016-08-12 14:46:20 [scrapy] DEBUG: Filtered offsite request to 'www.cnblogs.com': <GET http://www.cnblogs.com/kylinlin/default.html?page=2>
2016-08-12 14:46:20 [scrapy] INFO: Closing spider (finished)
2016-08-12 14:46:20 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 445,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 5113,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 8, 12, 6, 46, 20, 420000),
 'log_count/DEBUG': 4,
 'log_count/INFO': 7,
 'offsite/domains': 1,
 'offsite/filtered': 11,
 'request_depth_max': 1,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2016, 8, 12, 6, 46, 20, 131000)}
2016-08-12 14:46:20 [scrapy] INFO: Spider closed (finished)

Why are 0 pages crawled? I can't understand why there is no output like "+=====================>>test". Can someone help me?

1 Answer:

Answer 0 (score: 1)

2016-08-12 14:46:20 [scrapy] DEBUG: Filtered offsite request to 'www.cnblogs.com': <GET http://www.cnblogs.com/kylinlin/default.html?page=2>

and your setting is:

allowed_domains = ["cnblogs"]

which is not even a domain name. It should be:

allowed_domains = ["cnblogs.com"]