Extracting URLs from a simple website with Scrapy

Date: 2017-09-17 14:00:13

Tags: scrapy

I am trying to extract basic data from a basic website: vapedonia.com. It is a simple e-commerce site, and I could easily "reinvent the wheel" (working mostly on one big HTML string), but as soon as I have to fit the work into the Scrapy mold, it just doesn't work.

I first analyzed the HTML and built my XPath expressions with a browser plugin. In the plugin everything works fine, but when I put the same expressions into my code (or even into the scrapy shell), it doesn't work.

Here is the code:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class MySpider(BaseSpider):
    name = "vapedonia"
    allowed_domains = ["vapedonia.com"]
    start_urls = ["https://www.vapedonia.com/23-e-liquidos"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        products = hxs.select("//div[@class='product-container clearfix']")
        for products in products:
            image = products.select("div[@class='center_block']/a/img/@src").extract()
            name = products.select("div[@class='center_block']/a/@title").extract()
            link = products.select("div[@class='right_block']/p[@class='s_title_block']/a/@href").extract()
            price = products.select("div[@class='right_block']/div[@class='content_price']/span[@class='price']").extract()
        print image, name, link, price

Here is the error:

C:\Users\eric\Documents\Web Scraping\0 - Projets\Scrapy-\projects\craigslist_sample>scrapy crawl vapedonia
C:\Users\eric\Documents\Web Scraping\0 - Projets\Scrapy-\projects\craigslist_sample\craigslist_sample\spiders\test.py:1: ScrapyDeprecationWarning: Module `scrapy.spider` is deprecated, use `scrapy.spiders` instead
  from scrapy.spider import BaseSpider
C:\Users\eric\Documents\Web Scraping\0 - Projets\Scrapy-\projects\craigslist_sample\craigslist_sample\spiders\test.py:6: ScrapyDeprecationWarning: craigslist_sample.spiders.test.MySpider inherits from deprecated class scrapy.spiders.BaseSpider, please inherit from scrapy.spiders.Spider. (warning only on first subclass, there may be others)
  class MySpider(BaseSpider):
C:\Users\eric\Documents\Web Scraping\0 - Projets\Scrapy-\projects\craigslist_sample\craigslist_sample\spiders\test2.py:1: ScrapyDeprecationWarning: Module `scrapy.contrib.spiders` is deprecated, use `scrapy.spiders` instead
  from scrapy.contrib.spiders import CrawlSpider, Rule
C:\Users\eric\Documents\Web Scraping\0 - Projets\Scrapy-\projects\craigslist_sample\craigslist_sample\spiders\test2.py:2: ScrapyDeprecationWarning: Module `scrapy.contrib.linkextractors` is deprecated, use `scrapy.linkextractors` instead
  from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
C:\Users\eric\Documents\Web Scraping\0 - Projets\Scrapy-\projects\craigslist_sample\craigslist_sample\spiders\test2.py:2: ScrapyDeprecationWarning: Module `scrapy.contrib.linkextractors.sgml` is deprecated, use `scrapy.linkextractors.sgml` instead
  from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
C:\Users\eric\Documents\Web Scraping\0 - Projets\Scrapy-\projects\craigslist_sample\craigslist_sample\spiders\test2.py:13: ScrapyDeprecationWarning: SgmlLinkExtractor is deprecated and will be removed in future releases. Please use scrapy.linkextractors.LinkExtractor
  Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=('//a[@class="button next"]',)), callback="parse_items", follow= True),
C:\Users\eric\Documents\Web Scraping\0 - Projets\Scrapy-\projects\craigslist_sample\craigslist_sample\spiders\test4.py:15: ScrapyDeprecationWarning: SgmlLinkExtractor is deprecated and will be removed in future releases. Please use scrapy.linkextractors.LinkExtractor
  Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=('//a[@class="button next"]',)), callback="parse_items", follow= True),
Traceback (most recent call last):
  File "C:\Users\eric\Miniconda2\Scripts\scrapy-script.py", line 5, in <module>
    sys.exit(scrapy.cmdline.execute())
  File "C:\Users\eric\Miniconda2\lib\site-packages\scrapy\cmdline.py", line 148, in execute
    cmd.crawler_process = CrawlerProcess(settings)
  File "C:\Users\eric\Miniconda2\lib\site-packages\scrapy\crawler.py", line 243, in __init__
    super(CrawlerProcess, self).__init__(settings)
  File "C:\Users\eric\Miniconda2\lib\site-packages\scrapy\crawler.py", line 134, in __init__
    self.spider_loader = _get_spider_loader(settings)
  File "C:\Users\eric\Miniconda2\lib\site-packages\scrapy\crawler.py", line 330, in _get_spider_loader
    return loader_cls.from_settings(settings.frozencopy())
  File "C:\Users\eric\Miniconda2\lib\site-packages\scrapy\spiderloader.py", line 61, in from_settings
    return cls(settings)
  File "C:\Users\eric\Miniconda2\lib\site-packages\scrapy\spiderloader.py", line 25, in __init__
    self._load_all_spiders()
  File "C:\Users\eric\Miniconda2\lib\site-packages\scrapy\spiderloader.py", line 47, in _load_all_spiders
    for module in walk_modules(name):
  File "C:\Users\eric\Miniconda2\lib\site-packages\scrapy\utils\misc.py", line 71, in walk_modules
    submod = import_module(fullpath)
  File "C:\Users\eric\Miniconda2\lib\importlib\__init__.py", line 37, in import_module
    __import__(name)
  File "C:\Users\eric\Documents\Web Scraping\0 - Projets\Scrapy-\projects\craigslist_sample\craigslist_sample\spiders\test5.py", line 17
    link = products.select("div[@class='right_block']/p[@class='s_title_block']/a/@href").extract()
    ^
IndentationError: unexpected indent

C:\Users\eric\Documents\Web Scraping\0 - Projets\Scrapy-\projects\craigslist_sample>

I don't know what the problem is, but I have several coded spiders in the spiders directory/folder. It may be some kind of code mix-up between the spiders.

Thanks.

1 Answer:

Answer 0 (score: 0):

When Scrapy starts, it imports every scraper present in the project in order to resolve their names and run the one you specified. So if any scraper in the project has a syntax error, none of them will run.
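This is easy to reproduce without Scrapy: the spider loader boils down to importing every module in the spiders package, and a file with bad indentation fails at compile/import time. A minimal sketch (the file contents below are invented for the demo):

```python
import os
import py_compile
import tempfile

# A function body whose second line is indented deeper than the first,
# mimicking the "unexpected indent" in test5.py from the traceback.
bad_source = "def parse(self, response):\n        x = 1\n            y = 2\n"

with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write(bad_source)
    path = f.name

try:
    # This is essentially what happens when Scrapy imports a spider module.
    py_compile.compile(path, doraise=True)
    result = "compiled"
except py_compile.PyCompileError as e:
    result = type(e.exc_value).__name__

os.unlink(path)
print(result)  # IndentationError
```

Scrapy never even reaches your working spider: the crawl dies while importing its neighbor, exactly as in the traceback above.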

  File "C:\Users\eric\Documents\Web Scraping\0 - Projets\Scrapy-\projects\craigslist_sample\craigslist_sample\spiders\test5.py", line 17
    link = products.select("div[@class='right_block']/p[@class='s_title_block']/a/@href").extract()

As you can see in the exception, the error is in test5.py. Fix the indentation in that file, or comment it out if you don't need it. That should let you run your spider.
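Separately from the test5.py problem, the parse method in the question reuses the name products as its own loop variable and prints outside the for loop, so only the last product's fields survive. The fixed loop shape, demonstrated here on an invented HTML fragment using only the standard library (not Scrapy selectors):

```python
import xml.etree.ElementTree as ET

# Invented, well-formed stand-in for one page of product listings;
# the class names mirror the XPaths used in the question.
html = """
<html><body>
  <div class="product-container clearfix">
    <div class="center_block"><a title="Liquid A"><img src="/img/a.jpg"/></a></div>
    <div class="right_block"><p class="s_title_block"><a href="/a">A</a></p></div>
  </div>
  <div class="product-container clearfix">
    <div class="center_block"><a title="Liquid B"><img src="/img/b.jpg"/></a></div>
    <div class="right_block"><p class="s_title_block"><a href="/b">B</a></p></div>
  </div>
</body></html>
"""

root = ET.fromstring(html)
results = []
# One product per iteration, a distinct loop variable, extraction *inside* the loop.
for product in root.iterfind(".//div[@class='product-container clearfix']"):
    image = product.find("div[@class='center_block']/a/img").get("src")
    name = product.find("div[@class='center_block']/a").get("title")
    link = product.find("div[@class='right_block']/p[@class='s_title_block']/a").get("href")
    results.append((image, name, link))

print(results)
```

The same per-item shape applies inside parse: extract and print (or yield) each product within the loop body instead of once after it.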

EDIT 1: mix of tabs and spaces

Python relies on indentation, and two lines that look identically indented on screen can differ in the actual characters: one may use tabs and the other spaces, or a mix of both, which causes this error. So be sure to configure your editor to display tab and space characters, and convert all tabs to spaces.
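To hunt these down mechanically, Python 2 offers python -tt file.py, which turns inconsistent tab usage into a hard error; a tiny scanner in the same spirit (the sample source below is invented):

```python
def mixed_indent_lines(source):
    """Return 1-based line numbers whose leading whitespace mixes tabs and spaces."""
    bad = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        # Slice off the leading run of tabs/spaces and inspect it.
        indent = line[: len(line) - len(line.lstrip(" \t"))]
        if " " in indent and "\t" in indent:
            bad.append(lineno)
    return bad

# Line 3 starts with a tab followed by spaces -- invisible in most editors.
sample = "def f():\n    x = 1\n\t    y = 2\n    return x\n"
print(mixed_indent_lines(sample))  # [3]
```

Running something like this over the files in the spiders folder points straight at the lines that need retyping.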