Can't I have two spiders in the same project?

Asked: 2014-02-27 03:09:41

Tags: web-crawler scrapy

I am able to generate the first spider OK:

Thu Feb 27 - 01:59 PM > scrapy genspider confluenceChildPages confluence
Created spider 'confluenceChildPages' using template 'crawl' in module:
  dirbot.spiders.confluenceChildPages

But when I try to generate another spider, I get this:

Thu Feb 27 - 01:59 PM > scrapy genspider xxx confluence
Traceback (most recent call last):
  File "/usr/bin/scrapy", line 5, in <module>
    pkg_resources.run_script('Scrapy==0.22.2', 'scrapy')
  File "/usr/lib/python2.7/site-packages/pkg_resources.py", line 505, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python2.7/site-packages/pkg_resources.py", line 1245, in run_script
    execfile(script_filename, namespace, namespace)
  File "/usr/lib/python2.7/site-packages/Scrapy-0.22.2-py2.7.egg/EGG-INFO/scripts/scrapy", line 4, in <module>
    execute()
  File "/usr/lib/python2.7/site-packages/Scrapy-0.22.2-py2.7.egg/scrapy/cmdline.py", line 143, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/usr/lib/python2.7/site-packages/Scrapy-0.22.2-py2.7.egg/scrapy/cmdline.py", line 89, in _run_print_help
    func(*a, **kw)
  File "/usr/lib/python2.7/site-packages/Scrapy-0.22.2-py2.7.egg/scrapy/cmdline.py", line 150, in _run_command
    cmd.run(args, opts)
  File "/usr/lib/python2.7/site-packages/Scrapy-0.22.2-py2.7.egg/scrapy/commands/genspider.py", line 68, in run
    crawler = self.crawler_process.create_crawler()
  File "/usr/lib/python2.7/site-packages/Scrapy-0.22.2-py2.7.egg/scrapy/crawler.py", line 87, in create_crawler
    self.crawlers[name] = Crawler(self.settings)
  File "/usr/lib/python2.7/site-packages/Scrapy-0.22.2-py2.7.egg/scrapy/crawler.py", line 25, in __init__
    self.spiders = spman_cls.from_crawler(self)
  File "/usr/lib/python2.7/site-packages/Scrapy-0.22.2-py2.7.egg/scrapy/spidermanager.py", line 35, in from_crawler
    sm = cls.from_settings(crawler.settings)
  File "/usr/lib/python2.7/site-packages/Scrapy-0.22.2-py2.7.egg/scrapy/spidermanager.py", line 31, in from_settings
    return cls(settings.getlist('SPIDER_MODULES'))
  File "/usr/lib/python2.7/site-packages/Scrapy-0.22.2-py2.7.egg/scrapy/spidermanager.py", line 22, in __init__
    for module in walk_modules(name):
  File "/usr/lib/python2.7/site-packages/Scrapy-0.22.2-py2.7.egg/scrapy/utils/misc.py", line 68, in walk_modules
    submod = import_module(fullpath)
  File "/usr/lib/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
  File "/d/Work/TollOnline/Notes/Issues/JIRA/TOL-821_Review_Toll_Online_Confluence_Pages/dirbot-master/dirbot/spiders/confluenceChildPages.py", line 4, in <module>
    from scrapybot.items import ScrapybotItem
ImportError: No module named scrapybot.items

Update: Thu Feb 27, 07:35:24 PM - adding the information requested by @omair_77.

I am using dirbot from https://github.com/scrapy/dirbot.

The initial directory structure is:

.
./.gitignore
./dirbot
./dirbot/items.py
./dirbot/pipelines.py
./dirbot/settings.py
./dirbot/spiders
./dirbot/spiders/dmoz.py
./dirbot/spiders/__init__.py
./dirbot/__init__.py
./README.rst
./scrapy.cfg
./setup.py

Then I try to create two spiders:

scrapy genspider confluenceChildPagesWithTags confluence
scrapy genspider confluenceChildPages confluence

I get the error on the second genspider command.


Update: Wed Mar 05, 02:16:07 PM - adding information related to @Darian's answer, showing that scrapybot only pops up after the first genspider command.

Wed Mar 05 - 02:12 PM > find .
.
./.gitignore
./dirbot
./dirbot/items.py
./dirbot/pipelines.py
./dirbot/settings.py
./dirbot/spiders
./dirbot/spiders/dmoz.py
./dirbot/spiders/__init__.py
./dirbot/__init__.py
./README.rst
./scrapy.cfg
./setup.py
Wed Mar 05 - 02:13 PM > find . -type f -print0 | xargs -0 grep -i scrapybot
Wed Mar 05 - 02:14 PM > scrapy genspider confluenceChildPages confluence
Created spider 'confluenceChildPages' using template 'crawl' in module:
  dirbot.spiders.confluenceChildPages
Wed Mar 05 - 02:14 PM > find .
.
./.gitignore
./dirbot
./dirbot/items.py
./dirbot/items.pyc
./dirbot/pipelines.py
./dirbot/settings.py
./dirbot/settings.pyc
./dirbot/spiders
./dirbot/spiders/confluenceChildPages.py
./dirbot/spiders/dmoz.py
./dirbot/spiders/dmoz.pyc
./dirbot/spiders/__init__.py
./dirbot/spiders/__init__.pyc
./dirbot/__init__.py
./dirbot/__init__.pyc
./README.rst
./scrapy.cfg
./setup.py
Wed Mar 05 - 02:17 PM > find . -type f -print0 | xargs -0 grep -i scrapybot
./dirbot/spiders/confluenceChildPages.py:from scrapybot.items import ScrapybotItem
./dirbot/spiders/confluenceChildPages.py:        i = ScrapybotItem()

And the newly generated confluenceChildPages.py is:

from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapybot.items import ScrapybotItem

class ConfluencechildpagesSpider(CrawlSpider):
    name = 'confluenceChildPages'
    allowed_domains = ['confluence']
    start_urls = ['http://www.confluence/']

    rules = (
        Rule(SgmlLinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        sel = Selector(response)
        i = ScrapybotItem()
        #i['domain_id'] = sel.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = sel.xpath('//div[@id="name"]').extract()
        #i['description'] = sel.xpath('//div[@id="description"]').extract()
        return i

So I can see it referencing scrapybot, but I'm not sure how to fix it... still very much a n00b.

2 Answers:

Answer 0 (score: 1)

Post your directory hierarchy for a better answer. This problem mostly happens when your spider module is named the same as your Scrapy project module, so Python tries to import the items relative to the spider. So make sure your project module and spider module do not have the same name.
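
For illustration (hypothetical names, not taken from the question): under Python 2's implicit relative imports, a layout like the one below breaks, because "from myproject.items import ..." inside the spiders package resolves to the sibling spider module myproject.py instead of the top-level package:

myproject/
    items.py
    settings.py
    spiders/
        __init__.py
        myproject.py    # spider module named like the project -- shadows it,
                        # so "from myproject.items import ..." raises ImportError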

Answer 1 (score: 1)

You'll see these lines at the end of the traceback:

 File "/d/Work/TollOnline/Notes/Issues/JIRA/TOL-821_Review_Toll_Online_Confluence_Pages/dirbot-master/dirbot/spiders/confluenceChildPages.py", line 4, in <module>
from scrapybot.items import ScrapybotItem

This tells me that the first spider you generated, 'confluenceChildPages', thinks it needs to import items from a module called scrapybot, which doesn't exist. If you look inside confluenceChildPages.py you will see the line that is causing the error.
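
For example, hand-patching the generated spider so it imports from the project's real items module should clear the ImportError. A sketch, assuming the stock dirbot items.py (which, as far as I know, defines a Website item with name, description and url fields):

from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from dirbot.items import Website  # was: from scrapybot.items import ScrapybotItem

class ConfluencechildpagesSpider(CrawlSpider):
    name = 'confluenceChildPages'
    allowed_domains = ['confluence']
    start_urls = ['http://www.confluence/']

    rules = (
        Rule(SgmlLinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        sel = Selector(response)
        i = Website()  # was: i = ScrapybotItem()
        i['name'] = sel.xpath('//div[@id="name"]').extract()
        return i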

Rather than patching it by hand every time, though: I'm not actually sure which setting it uses to generate that import, but if you look for (grep) scrapybot in the project you should find where it is getting it from, and then be able to change it to dirbot, which looks like the module you want.
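
The grep above coming up empty before the first genspider run is consistent with it falling back to a built-in default: Scrapy's own default settings define BOT_NAME = 'scrapybot', and genspider appears to use BOT_NAME as the project name when filling in the spider template. Assuming that's the cause here, pinning BOT_NAME in the project settings should make future genspider runs emit the right import:

# dirbot/settings.py -- a sketch; assumes genspider derives the
# "from <project>.items import ..." line from BOT_NAME, which
# defaults to 'scrapybot' when a project doesn't set it.
BOT_NAME = 'dirbot'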

You will then need to delete the first spider it generated and generate it again. It errors the second time you create one because it loads the first spider you generated as part of the project, and since that one has the import error, you get the traceback.
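
Something along these lines should do it (paths taken from your listing above; the .pyc cleanup is just in case stale bytecode keeps the bad import alive):

rm dirbot/spiders/confluenceChildPages.py
rm -f dirbot/spiders/*.pyc
scrapy genspider confluenceChildPages confluence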

Cheers.