I want to use Scrapy to gather data from several different websites and then run some analysis on that data. Since the crawlers and the analysis code belong to the same project, I'd like to keep everything in the same Git repository. I created a minimal reproducible example on Github.
The project structure looks like this:
./crawlers
./crawlers/__init__.py
./crawlers/myproject
./crawlers/myproject/__init__.py
./crawlers/myproject/myproject
./crawlers/myproject/myproject/__init__.py
./crawlers/myproject/myproject/items.py
./crawlers/myproject/myproject/pipelines.py
./crawlers/myproject/myproject/settings.py
./crawlers/myproject/myproject/spiders
./crawlers/myproject/myproject/spiders/__init__.py
./crawlers/myproject/myproject/spiders/example.py
./crawlers/myproject/scrapy.cfg
./scrapyScript.py
From the ./crawlers/myproject folder, I can run the spider with:
scrapy crawl example
The spider uses some downloader middleware, in particular alecxe's excellent scrapy-fake-useragent. From settings.py:
DOWNLOADER_MIDDLEWARES = {
'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
}
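For reference (this check is not from the original post), one way to confirm that these settings are being picked up is Scrapy's built-in settings command, run from inside the project directory:

$ cd crawlers/myproject
$ scrapy settings --get DOWNLOADER_MIDDLEWARES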
When executed with scrapy crawl ..., the user agent looks like a real browser. Here is a sample record from the web server:
24.8.42.44 - - [16/Jun/2015:05:07:59 +0000] "GET / HTTP/1.1" 200 27161 "-" "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36"
Looking at the documentation, it should be possible to run the equivalent of scrapy crawl ... from a script. The scrapyScript.py file, based on the documentation, looks like this:
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from scrapy.utils.project import get_project_settings

from crawlers.myproject.myproject.spiders.example import ExampleSpider

spider = ExampleSpider()
settings = get_project_settings()
crawler = Crawler(settings)
# Stop the reactor once the spider finishes
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run()
When I execute the script, I can see the spider making page requests. Unfortunately, it ignores DOWNLOADER_MIDDLEWARES. For example, the user agent is no longer spoofed:
24.8.42.44 - - [16/Jun/2015:05:32:04 +0000] "GET / HTTP/1.1" 200 27161 "-" "Scrapy/0.24.6 (+http://scrapy.org)"
Somehow, the settings from settings.py seem to be ignored when the spider is run from a script.
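One quick way to see this (a hypothetical diagnostic, not part of the original script) is to print the loaded setting right after calling get_project_settings(); when settings.py is not found, the project-level middleware dict comes back empty:

from scrapy.utils.project import get_project_settings

settings = get_project_settings()
# Prints {} when Scrapy cannot locate the project's settings.py
print(settings.get('DOWNLOADER_MIDDLEWARES'))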
Can you see what I'm doing wrong?
Answer (score: 1)
For get_project_settings() to find the desired settings.py, set the SCRAPY_SETTINGS_MODULE environment variable:
import os
import sys
# ...
# Make the inner project importable, then point Scrapy at its settings;
# the environment variable must be set before get_project_settings() is called.
sys.path.append(os.path.join(os.path.curdir, "crawlers/myproject"))
os.environ['SCRAPY_SETTINGS_MODULE'] = 'myproject.settings'
settings = get_project_settings()
Note that, because of where your runner script lives, you need to add myproject to sys.path. Alternatively, move myproject into the same directory as scrapyScript.py.
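Putting it together, here is a minimal sketch of a revised scrapyScript.py (assuming the Scrapy 0.24 API used in the question, and importing the spider through the now-importable myproject package rather than the original crawlers.myproject.myproject path):

import os
import sys

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from scrapy.utils.project import get_project_settings

# Make the inner Scrapy project importable and point Scrapy at its settings.
# Both lines must run before get_project_settings() is called.
sys.path.append(os.path.join(os.path.curdir, "crawlers/myproject"))
os.environ['SCRAPY_SETTINGS_MODULE'] = 'myproject.settings'

from myproject.spiders.example import ExampleSpider

spider = ExampleSpider()
settings = get_project_settings()
crawler = Crawler(settings)
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run()

With the environment variable set, get_project_settings() loads myproject/settings.py, so DOWNLOADER_MIDDLEWARES (and the spoofed user agent) apply again.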