I have been trying to run scrapy from a python script file, because I need to fetch the data and save it into my database. But when I run it with the scrapy command

scrapy crawl argos

the script runs fine. But when I try to run it from a script, following this link

http://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script

I get this error:
$ python pricewatch/pricewatch.py update
Traceback (most recent call last):
  File "pricewatch/pricewatch.py", line 39, in <module>
    main()
  File "pricewatch/pricewatch.py", line 31, in main
    update()
  File "pricewatch/pricewatch.py", line 24, in update
    setup_crawler("argos.co.uk")
  File "pricewatch/pricewatch.py", line 13, in setup_crawler
    settings = get_project_settings()
  File "/Library/Python/2.7/site-packages/Scrapy-0.22.2-py2.7.egg/scrapy/utils/project.py", line 58, in get_project_settings
    settings_module = import_module(settings_module_path)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
ImportError: No module named settings
I can't understand why it doesn't find get_project_settings(), yet it runs fine with the scrapy command on the terminal.

Here is a screenshot of my project.

Here is the pricewatch.py code:
import commands
import sys
from database import DBInstance
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log
from spiders.argosspider import ArgosSpider
from scrapy.utils.project import get_project_settings
import settings


def setup_crawler(domain):
    spider = ArgosSpider(domain=domain)
    settings = get_project_settings()
    crawler = Crawler(settings)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()


def update():
    #print "Enter a product to update:"
    #product = raw_input()
    #print product
    #db = DBInstance()
    setup_crawler("argos.co.uk")
    log.start()
    reactor.run()


def main():
    try:
        if sys.argv[1] == "update":
            update()
        elif sys.argv[1] == "database":
            #db = DBInstance()
            pass  # added so the otherwise-empty elif branch parses
    except IndexError:
        print "You must select a command from Update, Search, History"


if __name__ == '__main__':
    main()
Answer 0 (score: 1)

Answer 1 (score: -1)
This answer is largely copied from this answer, which I believe answers your question, and additionally provides a decent example.

Consider a project with the following structure.
my_project/
    main.py              # Where we are running scrapy from
    scraper/
        run_scraper.py   # Call from main goes here
        scrapy.cfg       # deploy configuration file
        scraper/         # project's Python module, you'll import your code from here
            __init__.py
            items.py     # project items definition file
            pipelines.py # project pipelines file
            settings.py  # project settings file
            spiders/     # a directory where you'll later put your spiders
                __init__.py
                quotes_spider.py  # Contains the QuotesSpider class
Basically, the command

scrapy startproject scraper

was executed in the my_project folder. I have added a run_scraper.py file to the outer scraper folder, a main.py file to my root folder, and quotes_spider.py to the spiders folder.
My main.py file:
from scraper.run_scraper import Scraper
scraper = Scraper()
scraper.run_spiders()
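Running python main.py from the my_project folder then starts the whole crawl; main.py is, as the tree above notes, where we run scrapy from.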
My run_scraper.py file:
from scraper.scraper.spiders.quotes_spider import QuotesSpider
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
import os


class Scraper:
    def __init__(self):
        settings_file_path = 'scraper.scraper.settings'  # The path seen from root, ie. from main.py
        os.environ.setdefault('SCRAPY_SETTINGS_MODULE', settings_file_path)
        self.process = CrawlerProcess(get_project_settings())
        self.spider = QuotesSpider  # The spider you want to crawl

    def run_spiders(self):
        self.process.crawl(self.spider)
        self.process.start()  # the script will block here until the crawling is finished
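A note of my own rather than part of the copied answer: CrawlerProcess.crawl() forwards any extra keyword arguments to the spider's constructor, so a parameterized spider like the asker's ArgosSpider(domain=...) fits the same pattern. A sketch of a drop-in replacement for run_spiders in the Scraper class above (the spider_kwargs parameter is my addition, not from the original answer):

    def run_spiders(self, **spider_kwargs):
        # crawl() passes extra kwargs through to the spider's __init__,
        # e.g. run_spiders(domain="argos.co.uk") for a spider that
        # accepts a domain argument
        self.process.crawl(self.spider, **spider_kwargs)
        self.process.start()  # blocks until the crawl is finished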
Also, note that the settings may need a look-over, since the paths need to be relative to the root folder (my_project, not scraper). So in my case:
SPIDER_MODULES = ['scraper.scraper.spiders']
NEWSPIDER_MODULE = 'scraper.scraper.spiders'
etc...
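For completeness, the answer never shows quotes_spider.py itself. A minimal sketch of what it might contain, assuming the standard Scrapy tutorial spider against quotes.toscrape.com (the target site and the CSS selectors are my assumptions, not part of the answer):

# scraper/scraper/spiders/quotes_spider.py -- minimal sketch; target site
# and selectors are assumed from Scrapy's tutorial, not the original answer
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # Yield a plain dict per quote block on the page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }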