我正在尝试制作蜘蛛来捕捉图像。我已经让蜘蛛工作了,它只是......不起作用而且不会出错。
蜘蛛:
from urlparse import urljoin
from scrapy.selector import XmlXPathSelector
from scrapy.spider import BaseSpider
from nasa.items import NasaItem
class NasaImagesSpider(BaseSpider):
name = "nasa.gov"
start_urls = ('http://www.nasa.gov/multimedia/imagegallery/iotdxml.xml',)
def parse(self, response):
xxs = XmlXPathSelector(response)
item = NasaItem()
baseLink = xxs.select('//link/text()').extract()[0]
imageLink = xxs.select('//tn/text()').extract()
imgList = []
for img in imageLink:
imgList.append(urljoin(baseLink, img))
item['image_urls'] = imgList
return item
它贯穿整个页面,并正确捕获网址。我将它传递给管道,但是......没有图片。
设置文件:
BOT_NAME = 'nasa.gov'
BOT_VERSION = '1.0'
ITEM_PIPELINES = ['scrapy.contrib.pipeline.images.ImagesPipeline']
IMAGE_STORE = '/home/usr1/Scrapy/spiders/nasa/images'
LOG_LEVEL = "DEBUG"
SPIDER_MODULES = ['nasa.spiders']
NEWSPIDER_MODULE = 'nasa.spiders'
USER_AGENT = '%s/%s' % (BOT_NAME, BOT_VERSION)
和项目文件:
from scrapy.item import Item, Field
class NasaItem(Item):
image_urls = Field()
images = Field()
和输出日志:
2012-11-12 07:47:28-0500 [scrapy] INFO: Scrapy 0.14.4 started (bot: nasa)
2012-11-12 07:47:29-0500 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
2012-11-12 07:47:29-0500 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-11-12 07:47:29-0500 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-11-12 07:47:29-0500 [scrapy] DEBUG: Enabled item pipelines:
2012-11-12 07:47:29-0500 [nasa.gov] INFO: Spider opened
2012-11-12 07:47:29-0500 [nasa.gov] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-11-12 07:47:29-0500 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2012-11-12 07:47:29-0500 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2012-11-12 07:47:29-0500 [nasa.gov] DEBUG: Crawled (200) <GET http://www.nasa.gov/multimedia/imagegallery/iotdxml.xml> (referer: None)
2012-11-12 07:47:29-0500 [nasa.gov] DEBUG: Scraped from <200 http://www.nasa.gov/multimedia/imagegallery/iotdxml.xml>
#removed output of every jpg link
2012-11-12 07:47:29-0500 [nasa.gov] INFO: Closing spider (finished)
2012-11-12 07:47:29-0500 [nasa.gov] INFO: Dumping spider stats:
{'downloader/request_bytes': 227,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 2526,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2012, 11, 12, 12, 47, 29, 802477),
'item_scraped_count': 1,
'scheduler/memory_enqueued': 1,
'start_time': datetime.datetime(2012, 11, 12, 12, 47, 29, 682005)}
2012-11-12 07:47:29-0500 [nasa.gov] INFO: Spider closed (finished)
2012-11-12 07:47:29-0500 [scrapy] INFO: Dumping global stats:
{'memusage/max': 104132608, 'memusage/startup': 104132608}
我被困住了。关于我做错了什么的任何建议?
[已编辑]添加了输出日志,更改了设置机器人名称。
答案 0 :(得分:1)
#pipeline file
from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy.http import Request
class PaulsmithPipeline(ImagesPipeline):
def process_item(self, item, spider):
return item
def get_media_requests(self,item,info):
for image_url in item['image_urls']:
yield Request(image_url)
def item_completed(self,results,item,info):
image_paths=[x['path'] for ok,x in results if ok]
if not image_paths:
raise DropItem("Item contains no images")
item["image_paths"]=image_paths
return item