无法使用Scrapy下载图像 - 有时可以正常工作

时间:2015-06-12 12:28:43

标签: python scrapy

到目前为止,我的蜘蛛代码一直运行良好,但现在当我尝试运行一批这些蜘蛛时,一切都有效,除了对于一些蜘蛛,scrapy下载图像,其余没有。除了start_urls之外,所有的蜘蛛都是一样的。任何帮助表示赞赏!

这是我的 pipelines.py

from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy.http import Request

class DmozPipeline(object):
    def process_item(self, item, spider):
    return item

class MyImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
       for image_url in item['image_urls']:
        yield Request(image_url)

        for nlabel in item['nlabel']:
        yield Request(nlabel)

        print item['image_urls']


def item_completed(self, results, item, info):
    image_paths = [x['path'] for ok, x in results if ok]
    if not image_paths:
        raise DropItem("Item contains no images")
    item['image_paths'] = image_paths
    return item

settings.py:

BOT_NAME = 'dmoz2'
BOT_VERSION = '1.0'

SPIDER_MODULES = ['dmoz2.spiders']
NEWSPIDER_MODULE = 'dmoz2.spiders'
DEFAULT_ITEM_CLASS = 'dmoz2.items.DmozItem'
ITEM_PIPELINES = ['dmoz2.pipelines.MyImagesPipeline']
IMAGES_STORE = '/ps/dmoz2/images'
IMAGES_THUMBS = {
#letting height be variable
#'small': ('', 120),
'small': (120, ''),
#'big': ('', 240),
'big': (300, ''),
}


USER_AGENT = '%s/%s' % (BOT_NAME, BOT_VERSION)

items.py:

from scrapy.item import Item, Field
from scrapy.utils.python import unicode_to_str

def u_to_str(text):
   unicode_to_str(text,'latin-1','ignore')


class DmozItem(Item):
   category_ids = Field()
   ....
   image_urls = Field()
   image_paths = Field()

   pass

myspider.py:

from scrapy.spider import BaseSpider
from scrapy.spider import Spider
from scrapy.selector import HtmlXPathSelector
from scrapy import Selector
from scrapy.utils.url import urljoin_rfc
from scrapy.utils.response import get_base_url
from dmoz2.items import DmozItem

class DmozSpider(Spider):
   name = "fritos_jun2015"
   allowed_domains = ["walmart.com"]
   start_urls = [

    "http://www.walmart.com/ip/Fritos-Bar-B-Q-Flavored-Corn-Chips-9.75- oz/36915853",
    "http://www.walmart.com/ip/Fritos-Corn-Chips-1-oz-6-count/10900088",

]


def parse(self, response):
    hxs = Selector(response)
    sites = hxs.xpath('/html/body/div[1]/section/section[4]/div[2]')
    items = []
    for site in sites:
        item = DmozItem()
        item['category_ids'] = ''
        .....
        item['image_urls'] = site.xpath('div[1]/div[3]/div[1]/div/div/div[2]/div/div/div[1]/div/div/img[2]/@src').extract()
        items.append(item)
    return items

真的想知道为什么这个蜘蛛有时会抓取图像,而有时则不然。除了来自同一allowed_domain的start_urls之外,所有蜘蛛都是相同的。此外,图像都是绝对路径,路径也是正确的。

提前致谢。 -TM

1 个答案:

答案 0 :(得分:0)

当屏幕抓取一个常见问题时,服务器会切断连接,因为您经常尝试访问它(以防止屏幕抓取工具无意中使用他们的网站并防止成本因为有人打他们的网站每毫秒等)。

尝试添加

sleep()
每个请求到walmart页面之间的

方法。这样您就不会被阻止访问服务器。