This is my web scraper, which generates items containing a title, a URL, and a name:
import scrapy
from ..items import ContentsPageSFBItem


class BasicSpider(scrapy.Spider):
    name = "contentspage_sfb"
    #allowed_domains = ["web"]
    start_urls = [
        'https://www.safaribooksonline.com/library/view/shell-programming-in/9780134496696/',
        'https://www.safaribooksonline.com/library/view/cisa-certified-information/9780134677453/'
    ]

    def parse(self, response):
        item = ContentsPageSFBItem()
        #from scrapy.shell import inspect_response
        #inspect_response(response, self)
        content_items = response.xpath('//ol[@class="detail-toc"]//a/text()').extract()
        for content_item in content_items:
            item['content_item'] = content_item
            item["full_url"] = response.url
            item['title'] = response.xpath('//title[1]/text()').extract()
            yield item
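The code works perfectly. However, because of the nature of the crawl, it generates a lot of data. My intention is to split the results by parsed URL and store the results for each URL in its own CSV file. I am using the following pipeline code: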
from scrapy import signals
from scrapy.contrib.exporter import CsvItemExporter


class ContentspageSfbPipeline(object):
    def __init__(self):
        self.files = {}

    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, contentspage_sfb):
        file = open('results/%s.csv' % contentspage_sfb.url, 'w+b')
        self.files[contentspage_sfb] = file
        self.exporter = CsvItemExporter(file)
        self.exporter.fields_to_export = ['item']
        self.exporter.start_exporting()

    def spider_closed(self, contentspage_sfb):
        self.exporter.finish_exporting()
        file = self.files.pop(contentspage_sfb)
        file.close()

    def process_item(self, item, contentspage_sfb):
        self.exporter.export_item(item)
        return item
However, I get this error:
TypeError: unbound method from_crawler() must be called with ContentspageSfbPipeline instance as first argument (got Crawler instance instead)
As suggested, I added the decorator before the from_crawler function. However, now I get an AttributeError:
Traceback (most recent call last):
  File "/home/eadaradhiraj/program_files/venv/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/eadaradhiraj/program_files/pycharm_projects/javascriptlibraries/javascriptlibraries/pipelines.py", line 39, in process_item
    self.exporter.export_item(item)
AttributeError: 'ContentspageSfbPipeline' object has no attribute 'exporter'
Answer 0 (score: 2)
You are missing the @classmethod decorator on the from_crawler() method.
See the related question "Meaning of @classmethod and @staticmethod for beginner?" to learn what classmethods are.
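To make the difference concrete, here is a minimal sketch that runs without Scrapy. FakeCrawler and the OUT_DIR setting are hypothetical stand-ins invented for this demo, not part of Scrapy; the only point is how from_crawler behaves with and without the decorator when it is called on the class, which is how Scrapy calls it.

```python
class FakeCrawler(object):
    """Hypothetical stand-in for Scrapy's crawler object (demo only)."""
    settings = {"OUT_DIR": "results"}  # assumed setting name, for illustration


class WithoutDecorator(object):
    # Plain function: when called on the class, nothing is bound to `cls`,
    # so the crawler lands in the wrong parameter and the call fails.
    def from_crawler(cls, crawler):
        return cls()


class WithDecorator(object):
    def __init__(self, out_dir):
        self.out_dir = out_dir

    @classmethod
    def from_crawler(cls, crawler):
        # `cls` is the class itself, so cls(...) builds a pipeline instance.
        return cls(crawler.settings["OUT_DIR"])


crawler = FakeCrawler()

try:
    WithoutDecorator.from_crawler(crawler)
    failed = False
except TypeError:
    # Python 2 reports the "unbound method" error from the question;
    # Python 3 reports a missing positional argument instead.
    failed = True

pipeline = WithDecorator.from_crawler(crawler)
```

With the decorator in place, the class-level call returns a ready pipeline instance, which is what Scrapy expects from from_crawler().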
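Also, you don't need to connect any signals in the pipeline. According to the Scrapy item pipeline documentation, a pipeline can define open_spider and close_spider methods, which Scrapy calls when the spider is opened and closed.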
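As a rough sketch of that approach (not the asker's final code), the pipeline below uses open_spider and close_spider instead of signals and writes one CSV file per full_url. Several things here are assumptions: it targets Python 3 and uses the standard csv module rather than CsvItemExporter, the class name PerUrlCsvPipeline and the out_dir parameter are made up for illustration, and the file name is derived from the item's URL because a spider object has no url attribute by default, which may be why the exporter in the question was never created.

```python
import csv
import os
import re


class PerUrlCsvPipeline(object):
    """Sketch: route each item to a CSV file chosen by its full_url."""

    fields = ["title", "full_url", "content_item"]

    def __init__(self, out_dir="results"):
        self.out_dir = out_dir
        self.files = {}    # full_url -> open file handle
        self.writers = {}  # full_url -> csv.DictWriter

    def open_spider(self, spider):
        # Called once when the spider starts; make sure the folder exists.
        os.makedirs(self.out_dir, exist_ok=True)

    def close_spider(self, spider):
        # Called once when the spider finishes; close every per-URL file.
        for f in self.files.values():
            f.close()
        self.files.clear()
        self.writers.clear()

    def _writer_for(self, url):
        if url not in self.writers:
            # URLs contain '/' and ':', so reduce them to a safe file name.
            name = re.sub(r"[^A-Za-z0-9]+", "_", url).strip("_") + ".csv"
            path = os.path.join(self.out_dir, name)
            f = open(path, "w", newline="", encoding="utf-8")
            writer = csv.DictWriter(f, fieldnames=self.fields,
                                    extrasaction="ignore")
            writer.writeheader()
            self.files[url] = f
            self.writers[url] = writer
        return self.writers[url]

    def process_item(self, item, spider):
        row = dict(item)
        title = row.get("title")
        if isinstance(title, (list, tuple)):
            # .extract() returns a list; flatten it for the CSV cell.
            row["title"] = " ".join(title)
        self._writer_for(row["full_url"]).writerow(row)
        return item
```

To try it, the class would be registered under ITEM_PIPELINES in settings.py, for example as "javascriptlibraries.pipelines.PerUrlCsvPipeline" with a priority such as 300; that module path is only inferred from the traceback in the question.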