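I'm trying to extract basic data from a basic website: vapedonia.com. It's a simple e-commerce site, and I can do it easily by "reinventing the wheel" (mostly by working on one big HTML string), but when I have to work within a framework like Scrapy, it just doesn't work.
I started by analyzing the HTML and building my XPath expressions with a browser plugin. In the plugin everything works fine, but when I put the expressions in my code (or even when I use the Scrapy shell), they don't work.
Here is the code: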
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class MySpider(BaseSpider):
    name = "vapedonia"
    allowed_domains = ["vapedonia.com"]
    start_urls = ["https://www.vapedonia.com/23-e-liquidos"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        products = hxs.select("//div[@class='product-container clearfix']")
        for products in products:
            image = products.select("div[@class='center_block']/a/img/@src").extract()
            name = products.select("div[@class='center_block']/a/@title").extract()
            link = products.select("div[@class='right_block']/p[@class='s_title_block']/a/@href").extract()
            price = products.select("div[@class='right_block']/div[@class='content_price']/span[@class='price']").extract()
            print image, name, link, price
Here is the error:
C:\Users\eric\Documents\Web Scraping\0 - Projets\Scrapy-\projects\craigslist_sample>scrapy crawl vapedonia
C:\Users\eric\Documents\Web Scraping\0 - Projets\Scrapy-\projects\craigslist_sample\craigslist_sample\spiders\test.py:1: ScrapyDeprecationWarning: Module `scrapy.spider` is deprecated, use `scrapy.spiders` instead
from scrapy.spider import BaseSpider
C:\Users\eric\Documents\Web Scraping\0 - Projets\Scrapy-\projects\craigslist_sample\craigslist_sample\spiders\test.py:6: ScrapyDeprecationWarning: craigslist_sample.spiders.test.MySpider inherits from deprecated class scrapy.spiders.BaseSpider, please inherit from scrapy.spiders.Spider. (warning only on first subclass, there may be others)
class MySpider(BaseSpider):
C:\Users\eric\Documents\Web Scraping\0 - Projets\Scrapy-\projects\craigslist_sample\craigslist_sample\spiders\test2.py:1: ScrapyDeprecationWarning: Module `scrapy.contrib.spiders` is deprecated, use `scrapy.spiders` instead
from scrapy.contrib.spiders import CrawlSpider, Rule
C:\Users\eric\Documents\Web Scraping\0 - Projets\Scrapy-\projects\craigslist_sample\craigslist_sample\spiders\test2.py:2: ScrapyDeprecationWarning: Module `scrapy.contrib.linkextractors` is deprecated, use `scrapy.linkextractors` instead
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
C:\Users\eric\Documents\Web Scraping\0 - Projets\Scrapy-\projects\craigslist_sample\craigslist_sample\spiders\test2.py:2: ScrapyDeprecationWarning: Module `scrapy.contrib.linkextractors.sgml` is deprecated, use `scrapy.linkextractors.sgml` instead
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
C:\Users\eric\Documents\Web Scraping\0 - Projets\Scrapy-\projects\craigslist_sample\craigslist_sample\spiders\test2.py:13: ScrapyDeprecationWarning: SgmlLinkExtractor is deprecated and will be removed in future releases. Please use scrapy.linkextractors.LinkExtractor
Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=('//a[@class="button next"]',)), callback="parse_items", follow= True),
C:\Users\eric\Documents\Web Scraping\0 - Projets\Scrapy-\projects\craigslist_sample\craigslist_sample\spiders\test4.py:15: ScrapyDeprecationWarning: SgmlLinkExtractor is deprecated and will be removed in future releases. Please use scrapy.linkextractors.LinkExtractor
Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=('//a[@class="button next"]',)), callback="parse_items", follow= True),
Traceback (most recent call last):
File "C:\Users\eric\Miniconda2\Scripts\scrapy-script.py", line 5, in <module>
sys.exit(scrapy.cmdline.execute())
File "C:\Users\eric\Miniconda2\lib\site-packages\scrapy\cmdline.py", line 148, in execute
cmd.crawler_process = CrawlerProcess(settings)
File "C:\Users\eric\Miniconda2\lib\site-packages\scrapy\crawler.py", line 243, in __init__
super(CrawlerProcess, self).__init__(settings)
File "C:\Users\eric\Miniconda2\lib\site-packages\scrapy\crawler.py", line 134, in __init__
self.spider_loader = _get_spider_loader(settings)
File "C:\Users\eric\Miniconda2\lib\site-packages\scrapy\crawler.py", line 330, in _get_spider_loader
return loader_cls.from_settings(settings.frozencopy())
File "C:\Users\eric\Miniconda2\lib\site-packages\scrapy\spiderloader.py", line 61, in from_settings
return cls(settings)
File "C:\Users\eric\Miniconda2\lib\site-packages\scrapy\spiderloader.py", line 25, in __init__
self._load_all_spiders()
File "C:\Users\eric\Miniconda2\lib\site-packages\scrapy\spiderloader.py", line 47, in _load_all_spiders
for module in walk_modules(name):
File "C:\Users\eric\Miniconda2\lib\site-packages\scrapy\utils\misc.py", line 71, in walk_modules
submod = import_module(fullpath)
File "C:\Users\eric\Miniconda2\lib\importlib\__init__.py", line 37, in import_module
__import__(name)
File "C:\Users\eric\Documents\Web Scraping\0 - Projets\Scrapy-\projects\craigslist_sample\craigslist_sample\spiders\test5.py", line 17
link = products.select("div[@class='right_block']/p[@class='s_title_block']/a/@href").extract()
^
IndentationError: unexpected indent
C:\Users\eric\Documents\Web Scraping\0 - Projets\Scrapy-\projects\craigslist_sample>
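I don't know what the problem is, but I have several spiders in the spiders directory. Maybe the code of the different spiders is getting mixed up somehow.
Thanks.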
Answer 0 (score: 0)
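When Scrapy runs, it scans every spider in the project to find their names, then runs the one you specified. To do that it has to import every spider module, so a syntax error in any one of them stops the whole command, even if the spider you asked for is fine: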
File "C:\Users\eric\Documents\Web Scraping\0 - Projets\Scrapy-\projects\craigslist_sample\craigslist_sample\spiders\test5.py", line 17
link = products.select("div[@class='right_block']/p[@class='s_title_block']/a/@href").extract()
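As you can see in the exception, the error is in test5.py, not in the spider you are trying to run. Fix the indentation in that file, or comment the file out if you don't need it. That should let you run the spider.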
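To find every spider file that would break the scan before you run `scrapy crawl`, a small check like the one below may help. It is only a sketch, not part of Scrapy: the function name `find_broken_modules` is made up for illustration, and it only catches errors the compiler reports (`IndentationError` is a subclass of `SyntaxError`). Run it with the same Python interpreter that Scrapy uses, because Python 2 code such as `print image, name` does not compile under Python 3.

```python
import os


def find_broken_modules(directory):
    """Return (path, line number, message) for each .py file that fails to compile.

    Hypothetical helper for illustration; it is not a Scrapy API.
    """
    broken = []
    for root, _dirs, files in os.walk(directory):
        for filename in sorted(files):
            if not filename.endswith(".py"):
                continue
            path = os.path.join(root, filename)
            with open(path, "rb") as handle:
                source = handle.read()
            try:
                # Compile only; the module is never executed.
                compile(source, path, "exec")
            except SyntaxError as exc:  # also covers IndentationError
                broken.append((path, exc.lineno, exc.msg))
    return broken
```

Calling `find_broken_modules("craigslist_sample/spiders")` from the project root (the path is an assumption based on the traceback) should list test5.py with its line number if the file is still broken.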
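Edit 1: Mixed tabs and spaces
Python depends on indentation, and two lines that look identically indented on screen can still differ in the file. The code may use a mix of tabs and spaces on different lines, which causes this error. So set your editor to show tab and space characters, and convert all tabs to spaces.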
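If your editor cannot show whitespace, the check can be scripted. The sketch below is an illustration under stated assumptions: both function names are made up, and the tab width of 4 is a guess you should match to your editor's setting. It reports which lines are indented with tabs and rewrites only the leading whitespace. The standard library's `tabnanny` module (`python -m tabnanny yourfile.py`) performs a related check.

```python
def lines_with_tab_indent(source):
    """Return 1-based numbers of lines whose leading whitespace contains a tab."""
    flagged = []
    for number, line in enumerate(source.splitlines(), start=1):
        indent = line[: len(line) - len(line.lstrip(" \t"))]
        if "\t" in indent:
            flagged.append(number)
    return flagged


def tabs_to_spaces(source, tab_size=4):
    """Expand tabs in leading whitespace only; tab_size=4 is an assumption."""
    converted = []
    for line in source.splitlines(True):  # keep the line endings
        body = line.lstrip(" \t")
        indent = line[: len(line) - len(body)]
        converted.append(indent.expandtabs(tab_size) + body)
    return "".join(converted)
```

Tabs inside string literals or after the first non-blank character are left alone, so only the indentation changes. Review the result before saving it over the original file, since the right tab width depends on how the file was written.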