Scrapy scraper does not scrape images correctly

Asked: 2015-07-02 09:46:25

Tags: javascript python ajax web-scraping scrapy

I am trying to use Scrapy to scrape this website.

First, here is my code:

import scrapy
from scrapy.http import Request
from scrapy.crawler import CrawlerProcess
from scrapy.settings import Settings

#query = raw_input("Enter a product to search for= ")
query = 'table'
query1 = query.replace(" ", "+")


class DmozItem(scrapy.Item):

    productname = scrapy.Field()
    product_link = scrapy.Field()
    current_price = scrapy.Field()
    mrp = scrapy.Field()
    offer = scrapy.Field()
    imageurl = scrapy.Field()
    outofstock_status = scrapy.Field()

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    # allowed_domains expects bare domain names, not full URLs
    allowed_domains = ["pepperfry.com"]

    def start_requests(self):
        # Build one request per results page of the site's XHR search endpoint
        task_urls = []
        for i in range(1, 11):
            task_urls.append(
                "http://www.pepperfry.com/site_product/search?is_search=true&p=" + str(i) + "&q=" + query1
            )

        start_urls = task_urls
        return [Request(url=start_url) for start_url in start_urls]


    def parse(self, response):
        print(response)
        items = []
        for sel in response.xpath('//html/body/div[2]/div[2]/div[2]/div[4]/div'):
            item = DmozItem()
            # str(...)[3:-2] strips the surrounding [u'...'] from the extracted list
            item['productname'] = str(sel.xpath('div[1]/a/img/@alt').extract())[3:-2]
            item['product_link'] = str(sel.xpath('div[2]/a/@href').extract())[3:-2]
            item['current_price'] = str(sel.xpath('div[3]/div/span[2]/span/text()').extract())[3:-2]

            try:
                temp1 = sel.xpath('div[3]/div/span[1]/p/span')
                item['mrp'] = str(temp1.xpath('text()').extract())[3:-2]
            except:
                # fall back to the current price when no separate MRP is listed
                item['mrp'] = item['current_price']

            item['offer'] = 'No additional offer available'
            item['imageurl'] = str(sel.xpath('div[1]/a//img/@src').extract())[3:-2]
            item['outofstock_status'] = 'In Stock'
            items.append(item)

        print(items)

settings = Settings()
settings.set("PROJECT", {"dmoz"})
settings.set("CONCURRENT_REQUESTS", 100)
settings.set("DEPTH_PRIORITY", 1)
settings.set("SCHEDULER_DISK_QUEUE", "scrapy.squeues.PickleFifoDiskQueue")
settings.set("SCHEDULER_MEMORY_QUEUE", "scrapy.squeues.FifoMemoryQueue")

crawler = CrawlerProcess(settings)
crawler.crawl(DmozSpider)  # pass the spider class; CrawlerProcess instantiates it
crawler.start()

The site loads its products via XHR, and I have figured that part out correctly (you can see the XHR URLs built into the start_urls list in my code), and it is working. The next problem is that the site also loads its images via AJAX/JavaScript (I am not sure which of the two this site uses). So if you actually run my script (my code), you will see that a placeholder "loading" image gets scraped instead of the actual image.
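For reference, one quick way to check what those XHR responses actually return for the image tags (a minimal sketch, assuming the Scrapy interactive shell and the p=1 page from the loop above) is to dump the raw img elements and look at all of their attributes:

# Run:  scrapy shell "http://www.pepperfry.com/site_product/search?is_search=true&p=1&q=table"
# Then, inside the shell, print the full <img> tags so every attribute is visible:
for img in response.xpath('//a/img')[:5]:
    print img.extract()   # the real image URL may sit in an attribute other than src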

How can I send a request that loads the images on the page (since the images are not loaded via XHR) before the scraping starts, so that I can scrape all of the images?

Please give me a working code solution, specific to my code. Thanks! :)

1 Answer:

Answer 0 (score: 2):

If I look at the page source under one of your task_urls, say the one where str(i) evaluates to 2, I can see the images in the source, but the image URL itself is not in the src attribute of the img tag; it is in the data-src attribute.

If I run a simple spider that looks for that attribute, I get the image URLs:

for i in response.xpath("//a/img[1]"):
    print i.xpath("./@data-src").extract()

So try changing your XPath expression from src to data-src and give it a go. Changing this one line gives the correct (perfect) solution:

item['imageurl'] = str(sel.xpath('div[1]/a//img/@data-src').extract())[3:-2]
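As an aside (a sketch built on top of the answer, not part of the original post, and reusing the same selectors from the parse loop above): if some thumbnails are already rendered with a real src while others are lazy-loaded through data-src, you could prefer data-src and fall back to src inside the same loop:

# Sketch: prefer the lazy-load attribute and fall back to src when it is missing.
image_urls = sel.xpath('div[1]/a//img/@data-src').extract()
if not image_urls:
    image_urls = sel.xpath('div[1]/a//img/@src').extract()
item['imageurl'] = image_urls[0] if image_urls else ''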