This is my first attempt at using the scrapy CrawlSpider subclass. I've based the following spider closely on the documentation example at https://doc.scrapy.org/en/latest/topics/spiders.html#crawlspider-example:
class Test_Spider(CrawlSpider):

    name = "test"
    allowed_domains = ['http://www.dragonflieswellness.com']
    start_urls = ['http://www.dragonflieswellness.com/wp-content/uploads/2015/09/']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        # Rule(LinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow='.jpg'), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        print(response.url)
I'm trying to get the spider to start at the preset directory and then extract all the '.jpg' links within that directory, but all I see is:
2016-09-29 13:07:35 [scrapy] INFO: Spider opened
2016-09-29 13:07:35 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-09-29 13:07:35 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-09-29 13:07:36 [scrapy] DEBUG: Crawled (200) <GET http://www.dragonflieswellness.com/wp-content/uploads/2015/09/> (referer: None)
2016-09-29 13:07:36 [scrapy] INFO: Closing spider (finished)
How can I make this work?
Answer 0 (score: 1)
First of all, the purpose of the rules is not only to extract links but, most importantly, to follow them. If you just want to extract the links (and, say, save them for later), you don't need to specify spider rules. If, on the other hand, you want to download the images, use a pipeline.
That said, the reason the spider does not follow the links is hidden in the LinkExtractor implementation:
# common file extensions that are not followed if they occur in links
IGNORED_EXTENSIONS = [
    # images
    'mng', 'pct', 'bmp', 'gif', 'jpg', 'jpeg', 'png', 'pst', 'psp', 'tif',
    'tiff', 'ai', 'drw', 'dxf', 'eps', 'ps', 'svg',
    # audio
    'mp3', 'wma', 'ogg', 'wav', 'ra', 'aac', 'mid', 'au', 'aiff',
    # video
    '3gp', 'asf', 'asx', 'avi', 'mov', 'mp4', 'mpg', 'qt', 'rm', 'swf', 'wmv',
    'm4a',
    # office suites
    'xls', 'xlsx', 'ppt', 'pptx', 'pps', 'doc', 'docx', 'odt', 'ods', 'odg',
    'odp',
    # other
    'css', 'pdf', 'exe', 'bin', 'rss', 'zip', 'rar',
]
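Because 'jpg' is on that list, the extractor silently drops every .jpg link before your rule ever fires. If all you want is for the spider to visit the image URLs and log them, a minimal sketch along these lines should work; it uses LinkExtractor's deny_extensions argument to switch off that default filter (and note that allowed_domains should hold domains only, not URLs with a scheme):

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class Test_Spider(CrawlSpider):
        name = "test"
        # domains only, no 'http://' prefix
        allowed_domains = ['www.dragonflieswellness.com']
        start_urls = ['http://www.dragonflieswellness.com/wp-content/uploads/2015/09/']

        rules = (
            # deny_extensions=[] disables the IGNORED_EXTENSIONS filter,
            # so links ending in .jpg are no longer skipped
            Rule(LinkExtractor(allow=(r'\.jpg$',), deny_extensions=[]),
                 callback='parse_item'),
        )

        def parse_item(self, response):
            self.logger.info('Hi, this is an item page! %s', response.url)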
EDIT:
To download the images with the ImagesPipeline in this example:
Add this to your settings:
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = '/home/user/some_directory' # use a correct path
Create a new item:
from scrapy import Item, Field

class MyImageItem(Item):
    images = Field()
    image_urls = Field()
Modify your spider (add a parse method):
from scrapy.loader import ItemLoader

def parse(self, response):
    loader = ItemLoader(item=MyImageItem(), response=response)
    img_paths = response.xpath('//a[substring(@href, string-length(@href)-3)=".jpg"]/@href').extract()
    loader.add_value('image_urls', [self.start_urls[0] + img_path for img_path in img_paths])
    return loader.load_item()
The xpath searches for all hrefs ending in ".jpg", and the extract() method creates a list of them.
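A quick way to sanity-check that expression on its own, using scrapy's Selector with some made-up sample HTML:

    from scrapy import Selector

    sample = Selector(text='<a href="photo1.jpg">one</a> <a href="notes.txt">two</a>')
    print(sample.xpath('//a[substring(@href, string-length(@href)-3)=".jpg"]/@href').extract())
    # ['photo1.jpg'] -- only the href whose last four characters are ".jpg"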
The loader is just a convenience that simplifies constructing the item; you could do without it, as sketched below.
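For example, a roughly equivalent parse method without the ItemLoader might look like this (a sketch assuming the same MyImageItem and start_urls as above):

    def parse(self, response):
        # build the item directly instead of going through an ItemLoader
        img_paths = response.xpath('//a[substring(@href, string-length(@href)-3)=".jpg"]/@href').extract()
        item = MyImageItem()
        item['image_urls'] = [self.start_urls[0] + img_path for img_path in img_paths]
        return item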
Please note that I'm not an expert and there may well be a better, more elegant solution, but this one works fine.