Run Scrapy from a Python script

Date: 2014-05-09 20:42:46

Tags: python web-scraping scrapy scrapy-spider

I have been trying to run Scrapy from a Python script file, because I need to fetch the data and save it into my database. When I run it with the scrapy command

scrapy crawl argos

the spider runs fine, but when I try to run it from a script, following this link:

http://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script

I get this error:

$ python pricewatch/pricewatch.py update
Traceback (most recent call last):
  File "pricewatch/pricewatch.py", line 39, in <module>
    main()
  File "pricewatch/pricewatch.py", line 31, in main
    update()
  File "pricewatch/pricewatch.py", line 24, in update
    setup_crawler("argos.co.uk")
  File "pricewatch/pricewatch.py", line 13, in setup_crawler
    settings = get_project_settings()
  File "/Library/Python/2.7/site-packages/Scrapy-0.22.2-py2.7.egg/scrapy/utils/project.py", line 58, in get_project_settings
    settings_module = import_module(settings_module_path)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
ImportError: No module named settings

I can't understand why it fails to find the settings through get_project_settings() here, when running with the scrapy command in the terminal works fine.

Here is a screenshot of my project:


Here is the pricewatch.py code:

import commands
import sys
from database import DBInstance
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log
from spiders.argosspider import ArgosSpider
from scrapy.utils.project import get_project_settings
import settings

def setup_crawler(domain):
    spider = ArgosSpider(domain=domain)
    settings = get_project_settings()
    crawler = Crawler(settings)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()

def update():
    #print "Enter a product to update:"
    #product = raw_input()
    #print product
    #db = DBInstance()
    setup_crawler("argos.co.uk")
    log.start()
    reactor.run()

def main():
    try:
        if sys.argv[1] == "update":
            update()
        elif sys.argv[1] == "database":
            pass  # db = DBInstance()
    except IndexError:
        print "You must select a command from Update, Search, History"


if __name__ == '__main__':
    main()

2 Answers:

Answer 0 (score: 1)

I fixed it. I just needed to put pricewatch.py in the top-level directory of the project and run it from there; that solved it.
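
For context, get_project_settings() looks for the SCRAPY_SETTINGS_MODULE environment variable and otherwise falls back on the scrapy.cfg found near the current working directory, which is why the script only works from the project's top-level folder. A minimal sketch of an alternative, assuming the settings module is pricewatch.settings (inferred from the project name; the pricewatch package must be importable from where the script runs):

import os

# Tell Scrapy where the settings live before get_project_settings() is called.
# 'pricewatch.settings' is assumed from the project layout in the question.
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'pricewatch.settings')

from scrapy.utils.project import get_project_settings

settings = get_project_settings()  # should now import without the ImportError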

Answer 1 (score: -1)

This answer is heavily copied from this answer, which I believe answers your question and additionally provides a decent example.

Consider a project with the following structure:

my_project/
    main.py                 # Where we are running scrapy from
    scraper/
        run_scraper.py               # Call from main goes here
        scrapy.cfg                   # deploy configuration file
        scraper/                     # project's Python module, you'll import your code from here
            __init__.py
            items.py                 # project items definition file
            pipelines.py             # project pipelines file
            settings.py              # project settings file
            spiders/                 # a directory where you'll later put your spiders
                __init__.py
                quotes_spider.py     # Contains the QuotesSpider class

Basically, the command scrapy startproject scraper was executed in the my_project folder; I then added the run_scraper.py file to the outer scraper folder, added the main.py file to my root folder, and added quotes_spider.py to the spiders folder.

My main.py file:

from scraper.run_scraper import Scraper
scraper = Scraper()
scraper.run_spiders()

My run_scraper.py file:

from scraper.scraper.spiders.quotes_spider import QuotesSpider
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
import os


class Scraper:
    def __init__(self):
        settings_file_path = 'scraper.scraper.settings' # The path seen from root, ie. from main.py
        os.environ.setdefault('SCRAPY_SETTINGS_MODULE', settings_file_path)
        self.process = CrawlerProcess(get_project_settings())
        self.spider = QuotesSpider # The spider you want to crawl

    def run_spiders(self):
        self.process.crawl(self.spider)
        self.process.start()  # the script will block here until the crawling is finished
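
As a design note, CrawlerProcess starts and stops the Twisted reactor on its own, so unlike the reactor-based code in the question there is no explicit reactor.run() call, and several spiders can be queued before starting. A hypothetical variation of run_spiders (AnotherSpider is an assumed second spider, not part of the project above):

    def run_spiders(self):
        # queue every spider first; start() blocks until all crawls finish
        self.process.crawl(QuotesSpider)
        self.process.crawl(AnotherSpider)  # hypothetical second spider
        self.process.start()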

Also note that the settings may need looking over, since the paths in them need to be relative to the root folder (my_project, not scraper). So, in my case:

SPIDER_MODULES = ['scraper.scraper.spiders']
NEWSPIDER_MODULE = 'scraper.scraper.spiders'

etc...
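
The same prefix adjustment would apply to any other dotted path in settings.py. A hypothetical illustration (ScraperPipeline is an assumed pipeline class, not shown in this answer):

ITEM_PIPELINES = {
    'scraper.scraper.pipelines.ScraperPipeline': 300,
}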