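I'm quite comfortable with web scraping, and I'm currently trying to apply Scrapy to a TensorFlow project I'm working on, but for some reason Scrapy isn't giving me any results. I believe I'm doing something wrong when extracting the actual image links or the title itself. I came across an example that scrapes images from imgur, and that is what I'm currently using.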
items.py
import scrapy

class ImgurItem(scrapy.Item):
    title = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
settings.py
BOT_NAME = 'imgur'
SPIDER_MODULES = ['imgur.spiders']
NEWSPIDER_MODULE = 'imgur.spiders'
ITEM_PIPELINES = {'imgur.pipelines.ImgurPipeline': 1}
# Raw string so the backslashes in the Windows path are never read as escapes
IMAGES_STORE = r'I:\ScrapySpiders\imgur\imgur\Images'
ROBOTSTXT_OBEY = False
imgur_spider.py
import scrapy
from scrapy.contrib.spiders import Rule, CrawlSpider
from scrapy.contrib.linkextractors import LinkExtractor
from imgur.items import ImgurItem

class ImgurSpider(CrawlSpider):
    name = 'imgur'
    allowed_domains = ['imgur.com']
    start_urls = ['http://www.imgur.com']
    rules = [Rule(LinkExtractor(allow=['/gallery/.*']), 'parse_imgur')]

    def parse_imgur(self, response):
        image = ImgurItem()
        image['title'] = response.xpath("//h1[@class='post-title']/text()").extract()
        rel = response.xpath("//img/@src").extract()
        image['image_urls'] = ['http:'+rel[0]]
        return image
pipelines.py
import scrapy
from scrapy.contrib.pipeline.images import ImagesPipeline

class ImgurPipeline(ImagesPipeline):
    def set_filename(self, response):
        # add a regex here to check the title is valid for a filename.
        return 'full/{0}.jpg'.format(response.meta['title'][0])

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url, meta={'title': item['title']})

    def get_images(self, response, request, info):
        for key, image, buf in super(ImgurPipeline, self).get_images(response, request, info):
            key = self.set_filename(response)
            yield key, image, buf
Updated error log:
Traceback (most recent call last):
  File "c:\users\tomas\appdata\local\programs\python\python35\lib\site-packages\scrapy\pipelines\files.py", line 356, in media_downloaded
    checksum = self.file_downloaded(response, request, info)
  File "c:\users\tomas\appdata\local\programs\python\python35\lib\site-packages\scrapy\pipelines\images.py", line 98, in file_downloaded
    return self.image_downloaded(response, request, info)
  File "c:\users\tomas\appdata\local\programs\python\python35\lib\site-packages\scrapy\pipelines\images.py", line 102, in image_downloaded
    for path, image, buf in self.get_images(response, request, info):
  File "I:\ScrapySpiders\imgur\imgur\pipelines.py", line 24, in get_images
    key = self.set_filename(response)
  File "I:\ScrapySpiders\imgur\imgur\pipelines.py", line 16, in set_filename
    return 'full/{0}.jpg'.format(response.meta['title'][0])
IndexError: list index out of range
2017-11-19 22:11:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://imgur.com/gallery/pKsYl>
{'image_urls': ['http://i.imgur.com/YEQb03D.jpg'], 'images': [], 'title': []}
2017-11-19 22:11:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://imgur.com/gallery/R6eQD> (referer: None)
2017-11-19 22:11:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://imgur.com/gallery/QrKeE>
{'image_urls': ['http://i.imgur.com/OpDDRWr.png'], 'images': [], 'title': []}
2017-11-19 22:11:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://imgur.com/gallery/JKz3U>
{'image_urls': ['http://i.imgur.com/VChqgP9r.jpg'], 'images': [], 'title': []}
{'image_urls': ['http://i.imgur.com/m9Cq6B1.png'], 'title': []}
2017-11-19 22:11:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://i.imgur.com/m9Cq6B1.png> (referer: None)
2017-11-19 22:11:27 [scrapy.pipelines.files] DEBUG: File (downloaded): Downloaded file from <GET http://i.imgur.com/m9Cq6B1.png> referred in <None>
2017-11-19 22:11:27 [PIL.PngImagePlugin] DEBUG: STREAM b'IHDR' 16 13
2017-11-19 22:11:27 [PIL.PngImagePlugin] DEBUG: STREAM b'IDAT' 41 8192
2017-11-19 22:11:28 [scrapy.pipelines.files] ERROR: File (unknown-error): Error processing file from <GET http://i.imgur.com/m9Cq6B1.png> referred in <None>
Traceback (most recent call last):
  File "c:\users\tomas\appdata\local\programs\python\python35\lib\site-packages\twisted\internet\defer.py", line 1386, in _inlineCallbacks
    result = g.send(result)
  File "c:\users\tomas\appdata\local\programs\python\python35\lib\site-packages\scrapy\core\downloader\middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
  File "c:\users\tomas\appdata\local\programs\python\python35\lib\site-packages\twisted\internet\defer.py", line 1363, in returnValue
    raise _DefGen_Return(val)
twisted.internet.defer._DefGen_Return: <200 http://i.imgur.com/m9Cq6B1.png>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "c:\users\tomas\appdata\local\programs\python\python35\lib\site-packages\scrapy\pipelines\files.py", line 356, in media_downloaded
    checksum = self.file_downloaded(response, request, info)
  File "c:\users\tomas\appdata\local\programs\python\python35\lib\site-packages\scrapy\pipelines\images.py", line 98, in file_downloaded
    return self.image_downloaded(response, request, info)
  File "c:\users\tomas\appdata\local\programs\python\python35\lib\site-packages\scrapy\pipelines\images.py", line 102, in image_downloaded
    for path, image, buf in self.get_images(response, request, info):
  File "I:\ScrapySpiders\imgur\imgur\pipelines.py", line 24, in get_images
    key = self.set_filename(response)
  File "I:\ScrapySpiders\imgur\imgur\pipelines.py", line 16, in set_filename
    return 'full/{0}.jpg'.format(response.meta['title'][0])
IndexError: list index out of range
2017-11-19 22:11:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://imgur.com/gallery/R6eQD>
{'image_urls': ['http://i.imgur.com/m9Cq6B1.png'], 'images': [], 'title': []}
2017-11-19 22:11:28 [scrapy.core.engine] INFO: Closing spider (finished)
2017-11-19 22:11:28 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
'downloader/exception_type_count/builtins.ValueError': 1,
'downloader/request_bytes': 29607,
'downloader/request_count': 122,
'downloader/request_method_count/GET': 122,
'downloader/response_bytes': 14490175,
'downloader/response_count': 121,
'downloader/response_status_count/200': 115,
'downloader/response_status_count/301': 4,
'downloader/response_status_count/302': 2,
'file_count': 45,
'file_status_count/downloaded': 45,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 11, 19, 20, 11, 28, 247434),
'item_scraped_count': 68,
'log_count/DEBUG': 274,
'log_count/ERROR': 46,
'log_count/INFO': 7,
'log_count/WARNING': 3,
'request_depth_max': 1,
'response_received_count': 115,
'scheduler/dequeued': 76,
'scheduler/dequeued/memory': 76,
'scheduler/enqueued': 76,
'scheduler/enqueued/memory': 76,
'spider_exceptions/IndexError': 1,
'start_time': datetime.datetime(2017, 11, 19, 20, 11, 21, 643056)}
2017-11-19 22:11:28 [scrapy.core.engine] INFO: Spider closed (finished)
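I know there are similar threads about problems with this exact code, but none of them helped me solve the issue I'm having. Apparently Imgur changed its page markup, and I can't figure out how to extract these links.
Answer 0 (score: 1)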
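This has nothing to do with web scraping or imgur. You have a Python syntax error that is reported at the start of this line:
rel = response.xpath("//img[@src='//i.imgur.com/*.*'])".extract()
That is because the previous line has two opening parens but only one closing paren:
#                              v
image['title'] = response.xpath(\
    "//h1[@class='post-title']/text()".extract()
#                                             ^^
The opening paren in response.xpath( is never closed.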
Answer 1 (score: 0)
Just move the quote to the correct side of the closing paren and it will work for you:
rel = response.xpath("//img[@src='//i.imgur.com/*.*']").extract()
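A quick way to confirm a fix like this without running the spider is to ask Python whether the line parses at all. The sketch below is only an illustration: the helper name compiles is made up here, and compile() checks syntax without evaluating response, so no Scrapy objects are needed.

```python
def compiles(source):
    """Return True if `source` parses as Python; names are never evaluated."""
    try:
        compile(source, "<snippet>", "exec")
        return True
    except SyntaxError:
        return False

# Quote placed after the closing paren: the paren becomes part of the string,
# so the xpath( call is never closed.
broken = """rel = response.xpath("//img[@src='//i.imgur.com/*.*'])".extract()"""

# Quote moved before the closing paren: the paren now closes the xpath( call.
fixed = """rel = response.xpath("//img[@src='//i.imgur.com/*.*']").extract()"""

print(compiles(broken))  # False
print(compiles(fixed))   # True
```

This only proves the line is syntactically valid; whether the XPath expression matches anything on the page is a separate question.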
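Answer 2 (score: 0)
Adding a new answer to clean things up. This should work:
Change the parse_imgur function to:
def parse_imgur(self, response):
    image = ImgurItem()
    image['title'] = response.xpath("//h1[contains(@class, 'post-title')]/text()").extract_first()
    rel = response.xpath("//img/@src").extract_first()
    try:
        image['image_urls'] = ['http:'+rel]
        return image
    except TypeError:
        # rel is None when the page has no <img> element; skip this page
        pass
Note that the h1 class name has a trailing space. You can either match it exactly with @class="post-title " or, as I did, use contains(@class, 'post-title').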
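To make the difference between these selectors concrete, here is a rough plain-Python analogue of each XPath predicate applied to a class attribute value. The helper names are invented for this sketch, and the third XPath expression (a whitespace-token match) is a common stricter idiom added here for comparison; it is not part of the answer above.

```python
# The first two expressions are the ones discussed above; the third is an
# added, stricter variant that matches "post-title" only as a whole class token.
EXACT_XPATH = "//h1[@class='post-title']/text()"
CONTAINS_XPATH = "//h1[contains(@class, 'post-title')]/text()"
TOKEN_XPATH = ("//h1[contains(concat(' ', normalize-space(@class), ' '), "
               "' post-title ')]/text()")

def exact_match(class_attr, name):
    # Analogue of @class='name': the whole attribute must be identical.
    return class_attr == name

def contains_match(class_attr, name):
    # Analogue of contains(@class, 'name'): plain substring test.
    return name in class_attr

def token_match(class_attr, name):
    # Analogue of the concat/normalize-space idiom: whole-token test.
    return name in class_attr.split()

for value in ["post-title ", "long-post-title"]:
    print(repr(value),
          exact_match(value, "post-title"),
          contains_match(value, "post-title"),
          token_match(value, "post-title"))
```

The exact comparison fails on the trailing space, the substring test accepts both values, and the token test accepts only the first.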
Since I used .extract_first() for the image title, the title is now a single string rather than a list, so you should also change the following:
def set_filename(self, response):
    return 'full/{0}.jpg'.format(response.meta['title'][0])
to:
def set_filename(self, response):
    return 'full/{0}.jpg'.format(response.meta['title'])
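One thing to watch for: .extract_first() returns None when nothing matches, which would name every such file full/None.jpg. A minimal sketch of a fallback, assuming you are happy to use the last path segment of the image URL instead; filename_for is a hypothetical helper, not part of Scrapy:

```python
import posixpath
from urllib.parse import urlparse

def filename_for(title, image_url):
    """Hypothetical helper: prefer the title, fall back to the URL's base name."""
    name = (title or "").strip()
    if not name:
        # e.g. http://i.imgur.com/m9Cq6B1.png -> m9Cq6B1
        base = posixpath.basename(urlparse(image_url).path)
        name = posixpath.splitext(base)[0]
    return 'full/{0}.jpg'.format(name)

print(filename_for(None, 'http://i.imgur.com/m9Cq6B1.png'))
print(filename_for('A cat', 'http://i.imgur.com/m9Cq6B1.png'))
```

Inside the pipeline, set_filename could call it as filename_for(response.meta['title'], response.url).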
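Other possible improvements are sanitizing the post title before using it as a filename, and tightening the class selector (as written, it also selects classes named long-post-title and post-title-again).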
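For the filename part, a minimal sketch could look like the following. The function name safe_filename, the 100-character limit, and the 'untitled' fallback are all assumptions made for illustration; the character set is the usual list of characters Windows rejects in file names.

```python
import re

# Characters that are not allowed in Windows file names, plus control characters.
_UNSAFE = re.compile(r'[<>:"/\\|?*\x00-\x1f]')

def safe_filename(title, max_length=100):
    """Hypothetical helper: make a post title usable as a file name."""
    cleaned = _UNSAFE.sub('_', title)
    # Truncate first, then drop leading/trailing spaces and dots.
    cleaned = cleaned[:max_length].strip(' .')
    return cleaned or 'untitled'

print(safe_filename('What? A "cat": 1/2'))  # What_ A _cat__ 1_2
```

It could then be applied in set_filename, for example 'full/{0}.jpg'.format(safe_filename(response.meta['title'])), once the title is known not to be None.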