Here is my spider:
import scrapy
import urlparse
from scrapy.http import Request

class BasicSpider(scrapy.Spider):
    name = "basic2"
    allowed_domains = ["cnblogs"]
    start_urls = (
        'http://www.cnblogs.com/kylinlin/',
    )

    def parse(self, response):
        next_site = response.xpath(".//*[@id='nav_next_page']/a/@href")
        for url in next_site.extract():
            yield Request(urlparse.urljoin(response.url, url))

        item_selector = response.xpath(".//*[@class='postTitle']/a/@href")
        for url in item_selector.extract():
            yield Request(url=urlparse.urljoin(response.url, url),
                          callback=self.parse_item)

    def parse_item(self, response):
        print "+=====================>>test"
Here is the output:
2016-08-12 14:46:20 [scrapy] INFO: Spider opened
2016-08-12 14:46:20 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-08-12 14:46:20 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-08-12 14:46:20 [scrapy] DEBUG: Crawled (200) <GET http://www.cnblogs.com/robots.txt> (referer: None)
2016-08-12 14:46:20 [scrapy] DEBUG: Crawled (200) <GET http://www.cnblogs.com/kylinlin/> (referer: None)
2016-08-12 14:46:20 [scrapy] DEBUG: Filtered offsite request to 'www.cnblogs.com': <GET http://www.cnblogs.com/kylinlin/default.html?page=2>
2016-08-12 14:46:20 [scrapy] INFO: Closing spider (finished)
2016-08-12 14:46:20 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 445,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 5113,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 8, 12, 6, 46, 20, 420000),
 'log_count/DEBUG': 4,
 'log_count/INFO': 7,
 'offsite/domains': 1,
 'offsite/filtered': 11,
 'request_depth_max': 1,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2016, 8, 12, 6, 46, 20, 131000)}
2016-08-12 14:46:20 [scrapy] INFO: Spider closed (finished)
Why is the number of crawled pages 0? I can't understand why there is no output like "+=====================>>test". Can anyone help me?
Answer 0 (score: 1)
Your log shows:

2016-08-12 14:46:20 [scrapy] DEBUG: Filtered offsite request to 'www.cnblogs.com': <GET http://www.cnblogs.com/kylinlin/default.html?page=2>

and your setting is:

allowed_domains = ["cnblogs"]

which is not even a domain name. It should be:

allowed_domains = ["cnblogs.com"]