I've been working on two spiders. They share the same file system, and the first spider was working before I started doing heavy work on the second. Now that the second is finished, I want to give each spider a test run before trying to stitch them together. When I try to run the first, it attempts to execute the second, which fails because the second depends on a file the first generates. It's worth noting that I've been passing this project around on Google Drive so I can work on it from multiple machines.
Edit:
I got it working, but maybe someone can help me understand why. Here is my first spider:
stockHighs.py:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.exporter import CsvItemExporter
from stockscrape.items import StockscrapeItem

class highScrape(BaseSpider):
    name = "stockhighs"
    allowed_domains = ["barchart.com"]
    start_urls = ["http://www.barchart.com/stocks/high.php?_dtp1=0"]

    def parse(self, response):
        f = open("test.txt", "w")
        sel = HtmlXPathSelector(response)
        sites = sel.select("//tbody/tr")
        for site in sites:
            item = StockscrapeItem()
            item['symbol'] = site.select("td[contains(@class, 'ds_symbol')]/a/text()").extract()
            strItem = str(item)
            newItem = strItem.decode('string_escape').replace("{'symbol': [u'", "").replace("']}", "")
            f.write("%s\n" % newItem)
        f.close()
And here is my second spider:
epsRating.py:
# coding: utf-8
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.exporter import CsvItemExporter
import re
import csv
import urlparse
from stockscrape.items import EPSItem
from itertools import izip

class epsScrape(BaseSpider):
    name = "eps"
    allowed_domains = ["investors.com"]
    ifile = open('test.txt', "r")
    reader = csv.reader(ifile)
    start_urls = []
    for row in ifile:
        url = row.replace("\n", "")
        if url == "symbol":
            continue
        else:
            start_urls.append("http://research.investors.com/quotes/nyse-" + url + ".htm")
    ifile.close()

    def parse(self, response):
        tempSymbol = ""
        tempEps = 10
        f = open("eps.txt", "a+")
        sel = HtmlXPathSelector(response)
        sites = sel.select("//div")
        for site in sites:
            item = EPSItem()
            item['symbol'] = site.select("h2/span[contains(@id, 'qteSymb')]/text()").extract()
            item['eps'] = site.select("table/tbody/tr/td[contains(@class, 'rating')]/span/text()").extract()
            strSymb = str(item['symbol'])
            newSymb = strSymb.replace("[]", "").replace("[u'", "").replace("']", "")
            strEps = str(item['eps'])
            newEps = strEps.replace("[]", "").replace(" ", "").replace("[u'\\r\\n", "").replace("']", "")
            if not newSymb == "":
                tempSymbol = newSymb
            if not newEps == "":
                tempEps = int(newEps)
            if not tempEps < 85:
                f.write("%s\t%s\n" % (tempSymbol, str(tempEps)))
        f.close()
And here is the error I get:
$ scrapy crawl stockhighs
Traceback (most recent call last):
  File "/usr/bin/scrapy", line 4, in <module>
    execute()
  File "/usr/lib/pymodules/python2.7/scrapy/cmdline.py", line 142, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/usr/lib/pymodules/python2.7/scrapy/cmdline.py", line 88, in _run_print_help
    func(*a, **kw)
  File "/usr/lib/pymodules/python2.7/scrapy/cmdline.py", line 149, in _run_command
    cmd.run(args, opts)
  File "/usr/lib/pymodules/python2.7/scrapy/commands/crawl.py", line 47, in run
    crawler = self.crawler_process.create_crawler()
  File "/usr/lib/pymodules/python2.7/scrapy/crawler.py", line 88, in create_crawler
    self.crawlers[name] = Crawler(self.settings)
  File "/usr/lib/pymodules/python2.7/scrapy/crawler.py", line 26, in __init__
    self.spiders = spman_cls.from_crawler(self)
  File "/usr/lib/pymodules/python2.7/scrapy/spidermanager.py", line 35, in from_crawler
    sm = cls.from_settings(crawler.settings)
  File "/usr/lib/pymodules/python2.7/scrapy/spidermanager.py", line 31, in from_settings
    return cls(settings.getlist('SPIDER_MODULES'))
  File "/usr/lib/pymodules/python2.7/scrapy/spidermanager.py", line 22, in __init__
    for module in walk_modules(name):
  File "/usr/lib/pymodules/python2.7/scrapy/utils/misc.py", line 66, in walk_modules
    submod = __import__(fullpath, {}, {}, [''])
  File "/home/bwisdom/scrapy/stockscrape/spiders/epsRating.py", line 11, in <module>
    class epsScrape(BaseSpider):
  File "/home/bwisdom/scrapy/stockscrape/spiders/epsRating.py", line 14, in epsScrape
    ifile = open('test.txt', "r")
IOError: [Errno 2] No such file or directory: 'test.txt'
The way I fixed it was to create a blank test.txt file by hand. Now, I know I open the file with "w" in the first spider, which should create it, but even using "w+" or "a+" there didn't work until I created the empty test.txt myself. Once it existed, the first spider ran as it should, and so did the second.
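My best guess at the mechanics: the ifile = open('test.txt', "r") line sits directly in the class body of epsScrape, and Python executes class-body statements as soon as the module is imported, long before any spider actually crawls. That would also explain why the mode I use in the first spider's open() makes no difference: the failure happens at import time, before the first spider ever gets a chance to write anything. A minimal sketch of that behavior (a hypothetical demo.py, nothing to do with Scrapy):

# demo.py -- hypothetical file illustrating import-time execution
print "module level: runs on `import demo`"

class Demo(object):
    # class-body statements also run on `import demo`,
    # before any instance exists or any method is called
    print "class body: also runs on `import demo`"

    def parse(self):
        print "method body: runs only when parse() is called"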
I guess what I'm still confused about is why the second spider gets pulled in at all when I try to run the first.
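From the traceback, it looks like scrapy crawl imports every module listed in SPIDER_MODULES (the walk_modules call in spidermanager.py) so it can find all spiders by name, which imports epsRating.py even when the command is scrapy crawl stockhighs. If that's right, one way to avoid needing test.txt at import time (a sketch I haven't tested against my real project) would be to defer the read into start_requests, which only runs when that particular spider crawls:

# epsRating.py (sketch) -- read test.txt only when this spider actually runs
from scrapy.spider import BaseSpider
from scrapy.http import Request

class epsScrape(BaseSpider):
    name = "eps"
    allowed_domains = ["investors.com"]

    def start_requests(self):
        # executed when `scrapy crawl eps` starts, not when the module is imported
        with open("test.txt") as ifile:
            for row in ifile:
                url = row.strip()
                if not url or url == "symbol":
                    continue
                yield Request("http://research.investors.com/quotes/nyse-" + url + ".htm")

    def parse(self, response):
        pass  # same parsing logic as in the version above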