Question

I'm running a Scrapy spider in Python to scrape images from a website. After trying a few other approaches, I'm attempting to implement an ImagesPipeline to do this.

items.py

class NHTSAItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()

settings.py

ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = 'C:\Users\me\Desktop'

myspider.py

def parse_photo_page(self, response):
    item = NHTSAItem()
    for sel in response.xpath('//table[@id="tblData"]/tr'):
        url = sel.xpath('td/font/a/@href').extract()
        table_fields = sel.xpath('td/font/text()').extract()
        if url:
            base_url_photo = "http://www-nrd.nhtsa.dot.gov"
            full_url = base_url_photo + url[0]
            if not item:
                item['image_urls'] = [full_url]
            else: 
                item['image_urls'].append(full_url)
    return item

No errors appear, but the images are not downloaded. The debugger even says "Scraped". Here is the log:

DEBUG: Scraped from <200 http://www-nrd.nhtsa.dot.gov/database/VSR/veh/../SearchMedia.aspx?database=v&tstno=4000&mediatype=p&p_tstno=4000>
{'image_urls': [u'http://www-nrd.nhtsa.dot.gov/database/MEDIA/GetMedia.aspx?tstno=4000&index=1&database=V&type=P',
            u'http://www-nrd.nhtsa.dot.gov/database/MEDIA/GetMedia.aspx?tstno=4000&index=2&database=V&type=P',
            u'http://www-nrd.nhtsa.dot.gov/database/MEDIA/GetMedia.aspx?tstno=4000&index=3&database=V&type=P',
            u'http://www-nrd.nhtsa.dot.gov/database/MEDIA/GetMedia.aspx?tstno=4000&index=4&database=V&type=P',
            u'http://www-nrd.nhtsa.dot.gov/database/MEDIA/GetMedia.aspx?tstno=4000&index=5&database=V&type=P']}

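I'm not interested in extending the pipeline (writing a custom one); the default ImagesPipeline is fine. The images are nowhere to be found. Any ideas what I'm doing wrong?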

Answer 1

Here is the solution I found in this parallel question: Scrapy: Error 10054 after retrying image download (thanks @neverlastn)

I simply added this snippet to my actual spider.py file.

custom_settings = { "ITEM_PIPELINES": {'scrapy.pipelines.images.ImagesPipeline': 1}, "IMAGES_STORE": saveLocation }

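I think the spider wasn't picking up my settings.py file correctly, so the images pipeline was never activated. I'm not sure exactly how to reference my settings file, but this solution is good enough for me!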
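For reference, here is a minimal sketch of building the custom_settings dictionary above with an absolute storage path. The helper name image_pipeline_settings and the folder name downloaded_images are assumptions for illustration only; they are not part of the original spider, where the path came from the saveLocation variable.

```python
import os


def image_pipeline_settings(save_location):
    """Build a per-spider settings dict that enables the stock ImagesPipeline.

    `save_location` is a hypothetical directory name. It is made absolute so
    the result does not depend on the directory the crawl is started from.
    """
    return {
        "ITEM_PIPELINES": {"scrapy.pipelines.images.ImagesPipeline": 1},
        "IMAGES_STORE": os.path.abspath(save_location),
    }


settings = image_pipeline_settings("downloaded_images")
print(settings["ITEM_PIPELINES"])
print(os.path.isabs(settings["IMAGES_STORE"]))
```

In the spider, the result would be assigned in the class body, for example custom_settings = image_pipeline_settings(saveLocation). It has to be a class attribute because Scrapy reads it before the spider is instantiated.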

Answer 2

Try replacing the IMAGES_STORE line in settings.py with:

IMAGES_STORE = 'C:\Users\me\Desktop'

If that works, then the format of the absolute path is the problem. In that case, either of these should work:

import os
IMAGES_STORE = os.getcwd()

or

IMAGES_STORE = 'C:\\Users\\me\\Desktop'

P.S. There is also the forward-slash form, IMAGES_STORE = 'C:/Users/me/Desktop'. The relative XPath issue from the other question/answer also applies here.
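The path spellings discussed in this answer can be compared with a short, standard-library-only sketch. It assumes Python 3 and uses ntpath (the Windows flavour of os.path, importable on any platform); the helper names same_windows_path and literal_compiles are made up for illustration.

```python
import ntpath  # Windows path rules, usable on any operating system

# Three spellings of the same Windows directory.
escaped = 'C:\\Users\\me\\Desktop'   # backslashes doubled
raw = r'C:\Users\me\Desktop'         # raw string, backslashes kept literally
forward = 'C:/Users/me/Desktop'      # forward slashes


def same_windows_path(a, b):
    """Compare two paths after normalising separators the Windows way."""
    return ntpath.normpath(a) == ntpath.normpath(b)


def literal_compiles(source):
    """Return True if the Python 3 compiler accepts `source`."""
    try:
        compile(source, "<settings>", "exec")
        return True
    except SyntaxError:
        return False


print(same_windows_path(escaped, raw))       # True
print(same_windows_path(escaped, forward))   # True
# The unescaped spelling: in Python 3, "\U" starts a Unicode escape,
# so this settings line is rejected before Scrapy ever sees it.
print(literal_compiles(r"IMAGES_STORE = 'C:\Users\me\Desktop'"))  # False
```

Under Python 2, which the u'' prefixes in the question's log suggest, the unescaped spelling is an ordinary byte string and is accepted, so the escaped, raw, or forward-slash forms are mainly a safeguard when the same settings file is run under Python 3.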

Scrapy ImagesPipeline not downloading images
