Question

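I'm fairly comfortable with web scraping, and I'm currently trying to apply Scrapy to a Tensorflow project I'm working on, but for some reason Scrapy is not giving me any results. I believe I'm doing something wrong when extracting the actual link of the image or the title itself. I stumbled upon an example that extracts images from imgur, which is the one I'm currently using.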

items.py

import scrapy

class ImgurItem(scrapy.Item):

    title = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()

settings.py

BOT_NAME = 'imgur'
SPIDER_MODULES = ['imgur.spiders']
NEWSPIDER_MODULE = 'imgur.spiders'
ITEM_PIPELINES = {'imgur.pipelines.ImgurPipeline': 1}
IMAGES_STORE = r'I:\ScrapySpiders\imgur\imgur\Images'  # raw string so the backslashes are not treated as escapes
ROBOTSTXT_OBEY = False

imgur_spider.py

import scrapy

# scrapy.contrib.* is deprecated since Scrapy 1.0; these are the current module paths
from scrapy.spiders import Rule, CrawlSpider
from scrapy.linkextractors import LinkExtractor
from imgur.items import ImgurItem

class ImgurSpider(CrawlSpider):
    name = 'imgur'
    allowed_domains = ['imgur.com']
    start_urls = ['http://www.imgur.com']
    rules = [Rule(LinkExtractor(allow=['/gallery/.*']), 'parse_imgur')]

    def parse_imgur(self, response):
        image = ImgurItem()
        image['title'] = response.xpath("//h1[@class='post-title']/text()").extract()
        rel = response.xpath("//img/@src").extract()
        image['image_urls'] = ['http:'+rel[0]]
        return image

pipelines.py

import scrapy
from scrapy.pipelines.images import ImagesPipeline  # scrapy.contrib.pipeline is deprecated

class ImgurPipeline(ImagesPipeline):

    def set_filename(self, response):
        #add a regex here to check the title is valid for a filename.
        return 'full/{0}.jpg'.format(response.meta['title'][0])

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url, meta={'title': item['title']})

    def get_images(self, response, request, info):
        for key, image, buf in super(ImgurPipeline, self).get_images(response, request, info):
            key = self.set_filename(response)
            yield key, image, buf

Updated error log:

Traceback (most recent call last):
  File "c:\users\tomas\appdata\local\programs\python\python35\lib\site-packages\scrapy\pipelines\files.py", line 356, in media_downloaded
    checksum = self.file_downloaded(response, request, info)
  File "c:\users\tomas\appdata\local\programs\python\python35\lib\site-packages\scrapy\pipelines\images.py", line 98, in file_downloaded
    return self.image_downloaded(response, request, info)
  File "c:\users\tomas\appdata\local\programs\python\python35\lib\site-packages\scrapy\pipelines\images.py", line 102, in image_downloaded
    for path, image, buf in self.get_images(response, request, info):
  File "I:\ScrapySpiders\imgur\imgur\pipelines.py", line 24, in get_images
    key = self.set_filename(response)
  File "I:\ScrapySpiders\imgur\imgur\pipelines.py", line 16, in set_filename
    return 'full/{0}.jpg'.format(response.meta['title'][0])
IndexError: list index out of range
2017-11-19 22:11:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://imgur.com/gallery/pKsYl>
{'image_urls': ['http://i.imgur.com/YEQb03D.jpg'], 'images': [], 'title': []}
2017-11-19 22:11:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://imgur.com/gallery/R6eQD> (referer: None)
2017-11-19 22:11:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://imgur.com/gallery/QrKeE>
{'image_urls': ['http://i.imgur.com/OpDDRWr.png'], 'images': [], 'title': []}
2017-11-19 22:11:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://imgur.com/gallery/JKz3U>
{'image_urls': ['http://i.imgur.com/VChqgP9r.jpg'], 'images': [], 'title': []}
{'image_urls': ['http://i.imgur.com/m9Cq6B1.png'], 'title': []}
2017-11-19 22:11:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://i.imgur.com/m9Cq6B1.png> (referer: None)
2017-11-19 22:11:27 [scrapy.pipelines.files] DEBUG: File (downloaded): Downloaded file from <GET http://i.imgur.com/m9Cq6B1.png> referred in <None>
2017-11-19 22:11:27 [PIL.PngImagePlugin] DEBUG: STREAM b'IHDR' 16 13
2017-11-19 22:11:27 [PIL.PngImagePlugin] DEBUG: STREAM b'IDAT' 41 8192
2017-11-19 22:11:28 [scrapy.pipelines.files] ERROR: File (unknown-error): Error processing file from <GET http://i.imgur.com/m9Cq6B1.png> referred in <None>
Traceback (most recent call last):
  File "c:\users\tomas\appdata\local\programs\python\python35\lib\site-packages\twisted\internet\defer.py", line 1386, in _inlineCallbacks
    result = g.send(result)
  File "c:\users\tomas\appdata\local\programs\python\python35\lib\site-packages\scrapy\core\downloader\middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
  File "c:\users\tomas\appdata\local\programs\python\python35\lib\site-packages\twisted\internet\defer.py", line 1363, in returnValue
    raise _DefGen_Return(val)
twisted.internet.defer._DefGen_Return: <200 http://i.imgur.com/m9Cq6B1.png>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "c:\users\tomas\appdata\local\programs\python\python35\lib\site-packages\scrapy\pipelines\files.py", line 356, in media_downloaded
    checksum = self.file_downloaded(response, request, info)
  File "c:\users\tomas\appdata\local\programs\python\python35\lib\site-packages\scrapy\pipelines\images.py", line 98, in file_downloaded
    return self.image_downloaded(response, request, info)
  File "c:\users\tomas\appdata\local\programs\python\python35\lib\site-packages\scrapy\pipelines\images.py", line 102, in image_downloaded
    for path, image, buf in self.get_images(response, request, info):
  File "I:\ScrapySpiders\imgur\imgur\pipelines.py", line 24, in get_images
    key = self.set_filename(response)
  File "I:\ScrapySpiders\imgur\imgur\pipelines.py", line 16, in set_filename
    return 'full/{0}.jpg'.format(response.meta['title'][0])
IndexError: list index out of range
2017-11-19 22:11:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://imgur.com/gallery/R6eQD>
{'image_urls': ['http://i.imgur.com/m9Cq6B1.png'], 'images': [], 'title': []}
2017-11-19 22:11:28 [scrapy.core.engine] INFO: Closing spider (finished)
2017-11-19 22:11:28 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
 'downloader/exception_type_count/builtins.ValueError': 1,
 'downloader/request_bytes': 29607,
 'downloader/request_count': 122,
 'downloader/request_method_count/GET': 122,
 'downloader/response_bytes': 14490175,
 'downloader/response_count': 121,
 'downloader/response_status_count/200': 115,
 'downloader/response_status_count/301': 4,
 'downloader/response_status_count/302': 2,
 'file_count': 45,
 'file_status_count/downloaded': 45,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 11, 19, 20, 11, 28, 247434),
 'item_scraped_count': 68,
 'log_count/DEBUG': 274,
 'log_count/ERROR': 46,
 'log_count/INFO': 7,
 'log_count/WARNING': 3,
 'request_depth_max': 1,
 'response_received_count': 115,
 'scheduler/dequeued': 76,
 'scheduler/dequeued/memory': 76,
 'scheduler/enqueued': 76,
 'scheduler/enqueued/memory': 76,
 'spider_exceptions/IndexError': 1,
 'start_time': datetime.datetime(2017, 11, 19, 20, 11, 21, 643056)}
2017-11-19 22:11:28 [scrapy.core.engine] INFO: Spider closed (finished)

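I know there are similar threads about problems with this exact code, but none of them helped me solve the issue I'm running into. Apparently Imgur changed its page markup, and I can't figure out how to extract those links.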

Answer 1

This has nothing to do with web scraping or imgur. You are getting a Python syntax error at the start of this line:

rel = response.xpath("//img[@src='//i.imgur.com/*.*'])".extract()

That is because the previous line has two opening parens but only one closing paren:

#                              v
image['title'] = response.xpath(\
    "//h1[@class='post-title']/text()".extract()
#                                             ^^

The opening paren in response.xpath( is never closed.
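One way to convince yourself that this is a parse-time problem and not a scraping problem is to feed both versions of the lines to Python's built-in compile(), which parses the source without running it. This is only a sketch: BROKEN, FIXED and has_syntax_error are illustrative names, not part of the original code.

```python
# Two versions of the same statements, held as plain strings.
# Nothing here is executed; compile() only parses the source.
BROKEN = (
    "image['title'] = response.xpath(\n"
    "    \"//h1[@class='post-title']/text()\".extract()\n"
    "rel = response.xpath(\"//img/@src\").extract()\n"
)

FIXED = (
    "image['title'] = response.xpath(\n"
    "    \"//h1[@class='post-title']/text()\").extract()\n"
    "rel = response.xpath(\"//img/@src\").extract()\n"
)


def has_syntax_error(source):
    """Return True if Python cannot even parse the given source."""
    try:
        compile(source, "<snippet>", "exec")
    except SyntaxError:
        return True
    return False


print(has_syntax_error(BROKEN))  # True
print(has_syntax_error(FIXED))   # False
```

Because the broken version is rejected by the parser, the spider module cannot even be imported, so no request is ever sent.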

Answer 2

Just move the quotation mark to the correct side of the closing parenthesis and it will work for you:

rel = response.xpath("//img[@src='//i.imgur.com/*.*']").extract()

Answer 3

Adding a new answer to clean things up. This should work:

Change the parse_imgur function to:

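def parse_imgur(self, response):
    image = ImgurItem()
    # extract_first() returns a single string, or None when nothing matches
    image['title'] = response.xpath("//h1[contains(@class, 'post-title')]/text()").extract_first()
    rel = response.xpath("//img/@src").extract_first()
    if rel is None:
        # no <img> on the page: skip it instead of failing on 'http:' + None
        return None
    image['image_urls'] = ['http:' + rel]
    return image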
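Prefixing 'http:' only works when the src value is protocol-relative, such as //i.imgur.com/.... A minimal sketch of a more general approach uses urljoin from the standard library. The helper name absolute_image_url is an assumption made for illustration; inside a spider, Scrapy's response.urljoin(rel) resolves against response.url in the same way.

```python
from urllib.parse import urljoin


def absolute_image_url(page_url, src):
    """Resolve an <img src> value against the page it came from.

    Returns None when no src was found, so the caller can skip the item.
    """
    if not src:
        return None
    return urljoin(page_url, src)


page = "https://imgur.com/gallery/pKsYl"
print(absolute_image_url(page, "//i.imgur.com/YEQb03D.jpg"))  # https://i.imgur.com/YEQb03D.jpg
print(absolute_image_url(page, None))                         # None
```

The resolved URL inherits the scheme of the page, so an https page yields https image links.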

Note that the h1 class name has a trailing space. You can either use @class="post-title " or, as I prefer, contains(@class, 'post-title').
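Keep in mind that contains(@class, 'post-title') is a substring test. If you need to match the class as a whole token, a common XPath 1.0 idiom is to pad the normalized attribute with spaces and search for the padded name. The helper below only builds that expression as a string; its name class_token_xpath is hypothetical, and the result would be passed to response.xpath().

```python
def class_token_xpath(tag, class_name):
    """Build an XPath that matches `tag` elements carrying `class_name`
    as a whole, space-separated class token."""
    return (
        "//{tag}[contains(concat(' ', normalize-space(@class), ' '), "
        "' {cls} ')]"
    ).format(tag=tag, cls=class_name)


title_xpath = class_token_xpath("h1", "post-title") + "/text()"
print(title_xpath)
```

normalize-space() also removes the trailing space in the attribute, so class="post-title " still matches.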

Since I use .extract_first() for the image title, you should also change the following:

def set_filename(self, response):
    return 'full/{0}.jpg'.format(response.meta['title'][0])

to:

def set_filename(self, response):
    return 'full/{0}.jpg'.format(response.meta['title'])
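The reason for dropping [0]: .extract() returns a list, which is empty when the XPath matches nothing, so [0] raises the IndexError seen in the log. .extract_first() returns a string, or None when there is no match. Here is a sketch of a tolerant variant; image_path and the 'untitled' fallback are assumptions for illustration and not part of Scrapy.

```python
def image_path(title, fallback="untitled"):
    """Build the storage path from a title that may be a string,
    a list of strings (from .extract()), or None (no match)."""
    if isinstance(title, (list, tuple)):
        # .extract() result: take the first match if there is one
        title = title[0] if title else None
    if title is None or not title.strip():
        title = fallback
    return "full/{0}.jpg".format(title.strip())


print(image_path("Some post"))  # full/Some post.jpg
print(image_path([]))           # full/untitled.jpg
print(image_path(None))         # full/untitled.jpg
```

With a fixed fallback, every untitled image ends up at the same path, so in practice you would probably want to mix in something unique such as the image ID from the URL.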

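Other possible improvements are sanitizing the post title before using it as a file name, and tightening the class selector (for example, it would also match classes named long-post-title and post-title-again).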
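For the file name clean-up, one possible sketch using only the standard library follows. The function name, the set of replaced characters (the ones Windows rejects in file names, plus control characters) and the length limit are all assumptions you would adjust to your needs.

```python
import re

# Characters not allowed in Windows file names, plus ASCII control characters.
_INVALID = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def sanitize_filename(title, max_length=100):
    """Turn a post title into something safe to use as a file name."""
    cleaned = _INVALID.sub("_", title)
    # Collapse runs of whitespace, then trim spaces and dots from both ends.
    cleaned = re.sub(r"\s+", " ", cleaned).strip(" .")
    return cleaned[:max_length] or "untitled"


print(sanitize_filename('What? A "title": with/slashes'))  # What_ A _title__ with_slashes
```

In set_filename this would wrap the title before it is formatted into the path, e.g. 'full/{0}.jpg'.format(sanitize_filename(title)).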

Scrapy error: list index out of range
