My Scrapy spider does not work

Date: 2016-07-15 10:25:50

Tags: web-scraping scrapy web-crawler scrapy-spider

I am trying to scrape products from an e-commerce site, visiting each product page to extract the title, description, images, and variants where available.

But my spider does not work.

import smtplib
import urlparse

from scrapy import signals
from scrapy.http import Request
from scrapy.loader import ItemLoader
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from w3lib.html import replace_escape_chars, remove_tags
from scrapy.loader.processors import Compose, MapCompose

from emmiScraper.items import EmmiscraperItem


class EmmiSpider(CrawlSpider):
    name = 'emi'
    allowed_domains = ['adns-grossiste.fr']
    start_urls = ['http://adns-grossiste.fr/95-joyetech']

    rules = (
        Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=()), callback="parse", follow=True),
    )

    def parse(self, response):
        """Yields a url for every item currently available on the site, and
        passes every product name to the parse_item method.

        @url http://emmi.rs/konfigurator/proizvodi.10.html?advanced_search=1&productTitle=&x=0&y=0
        @scrapes urls products
        """
        urls = response.xpath('//*[@id="center_column"]/ul/li/div/div[2]/h5/a/@href').extract()
        products = response.xpath('//*[@id="center_column"]/ul/li/div/div[2]/h5/a/text()').extract()

        for url, product in zip(urls, products):
            yield Request(urlparse.urljoin(response.url, url),
                          callback=self.parse_item,
                          meta={'product': product})

    def parse_item(self, response):
        """Returns fields: url_of_item, product, img_url, description, and price."""
        l = ItemLoader(item=EmmiscraperItem(), response=response)
        l.default_output_processor = MapCompose(lambda v: v.strip(), replace_escape_chars)

        l.add_value('url_of_item', response.url)
        l.add_value('product', response.meta['product'])
        l.add_xpath('img_url', '//*[@id="bigpic"]/@src')
        l.add_xpath('description', '//*[@id="short_description_content"]/p[1]/span/text()')
        l.add_xpath('price', '//*[@id="our_price_display"]/text()')

        return l.load_item()

1 Answer:

Answer 0 (score: 0)

This looks like a Scrapy FAQ:

There is a warning box in the CrawlSpider documentation. It says:

    When writing crawl spider rules, avoid using parse as the callback, since
    CrawlSpider uses the parse method itself to implement its logic. So if you
    override the parse method, the crawl spider will no longer work.

Your code probably does not work as expected because you do use parse as the callback.