How to run Scrapy from a Python script

Time: 2014-12-31 15:15:32

Tags: python scrapy

My file listing:

.
|-- lf
|   |-- __init__.py
|   |-- __init__.pyc
|   |-- items.py
|   |-- items.pyc
|   |-- pipelines.py
|   |-- settings.py
|   |-- settings.pyc
|   `-- spiders
|       |-- bbc.py
|       |-- bbc.pyc
|       |-- __init__.py
|       |-- __init__.pyc
|       |-- lwifi.py
|       `-- lwifi.pyc
|-- scrapy.cfg
`-- script.py

items.py:

from scrapy.item import Item, Field
class LfItem(Item):
    topic = Field();

script.py:

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from lf.spiders.lwifi import LwifiSpider
from scrapy.utils.project import get_project_settings

spider = LwifiSpider(domain='Lifehacker.co.in')
settings = get_project_settings()
crawler = Crawler(settings)
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run()

lwifi.py:

from scrapy.spider import Spider
from scrapy.selector import Selector
class LwifiSpider(Spider):
    name = "lwifi"  
    def __init__(self, **kw):
       super(LwifiSpider, self).__init__(**kw)
       url = kw.get('url') or kw.get('domain') or 'lifehacker.co.in/others/Dont-Use-      Personal-Information-in-Your-Wi-Fi-Network-Name/articleshow/45407704.cms'
       if not url.startswith('http://') and not url.startswith('https://'):
           url = 'http://%s/' % url
       self.url = url
       self.allowed_domains = ["lifehacker.co.in/others/Dont-Use-Personal-Information-in-Your-Wi-Fi-Network-Name/articleshow/45407704.cms"]

    def start_requests(self):
        return [Request(self.url, callback=self.parse)]

    def parse(self, response):
        topic = response.xpath("//h1/text()").extract();
        print topic

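I am new to Python and Scrapy. As a start, I wrote a simple Scrapy spider to run from a Python script (without using scrapinghub). My goal is to scrape the h1 from the page http://lifehacker.co.in/others/Dont-Use-Personal-Information-in-Your-Wi-Fi-Network-Name/articleshow/45407704.cms. The error is: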

Traceback (most recent call last):
  File "script.py", line 4, in <module>
    from lf.spiders.lwifi import LwifiSpider
  File "/home/ajay/pythonpr/error/lf/lf/spiders/lwifi.py", line 7, in <module>
    class LwifiSpider(Spider):
  File "/home/ajay/pythonpr/error/lf/lf/spiders/lwifi.py", line 11, in LwifiSpider
    url = kw.get('url') or kw.get('domain') or 'lifehacker.co.in/others/Dont-Use-Personal-   Information-in-Your-Wi-Fi-Network-Name/articleshow/45407704.cms'
NameError: name 'kw' is not defined

Please help.

2 answers:

Answer 0: (score: 0)

If you look closely at the traceback, you will see that the error occurs in the body of the LwifiSpider class itself:

    File "/home/.../lwifi.py", line 11, in LwifiSpider

If the error had occurred inside the class's __init__ method, you would see a line like this instead:

    File "/home/.../lwifi.py", line 11, in __init__

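So there seems to be some kind of indentation error that places the offending line outside the __init__ method, in the class body, where the kw argument is not visible.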
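
A minimal sketch of this effect, assuming Python 3 and using a hypothetical Demo class in place of the real spider (no Scrapy required): a line that has slipped out of __init__ runs while the class is being defined, so kw does not exist yet and the traceback frame is named after the class rather than __init__.

```python
import traceback


def define_broken_class():
    # Hypothetical stand-in for LwifiSpider; not the asker's actual file.
    class Demo(object):
        def __init__(self, **kw):
            self.kw = kw

        # Dedented by mistake: this line now belongs to the class body and is
        # executed at class-definition time, where no `kw` is in scope.
        url = kw.get('url')

    return Demo


try:
    define_broken_class()
except NameError as exc:
    # The innermost frame is the one in which the lookup of `kw` failed.
    frames = traceback.extract_tb(exc.__traceback__)
    failing_scope = frames[-1].name
    error_message = str(exc)
    print("failed in scope:", failing_scope)
    print("message:", error_message)
```

Here the failing scope is reported as the class name, which corresponds to the "in LwifiSpider" line in the question's traceback.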

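Try re-indenting the whole __init__ function, and make sure you are not mixing tabs and spaces anywhere (any decent text editor should let you make all whitespace visible).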
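
If your editor cannot show whitespace, a small script can flag suspicious lines. The following is only a sketch: find_mixed_indentation is a hypothetical helper, and the sample source is made up for illustration. It reports lines whose leading whitespace contains both a tab and a space; it does not catch a file in which some lines use only tabs and others only spaces. The standard library's tabnanny module (python -m tabnanny yourfile.py) performs a more thorough check.

```python
def find_mixed_indentation(source):
    """Return 1-based numbers of lines whose indent mixes tabs and spaces."""
    suspicious = []
    for number, line in enumerate(source.splitlines(), start=1):
        # Leading whitespace is whatever lstrip() would remove.
        indent = line[:len(line) - len(line.lstrip(" \t"))]
        if " " in indent and "\t" in indent:
            suspicious.append(number)
    return suspicious


# Made-up sample: line 3 is indented with a tab followed by spaces.
sample = (
    "class LwifiSpider(object):\n"
    "    def __init__(self, **kw):\n"
    "\t    url = kw.get('url')\n"
    "        self.url = url\n"
)

mixed_lines = find_mixed_indentation(sample)
print(mixed_lines)
```

To check a real file, read its contents with open(path).read() and pass the resulting string to the function.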

Answer 1: (score: 0)