I'm learning Scrapy. I have everything working except that pipelines.process_item() is never called. pipelines.open_spider() and pipelines.close_spider() are called fine.
I think this is because the spider isn't emitting any of the item signals (not item_passed, item_dropped, or item_scraped).
I added some code to try to catch those signals, and when I try to catch any of the three item signals above I get nothing.
The code does catch other signals (engine_started, spider_closed, etc.).
If I try to set an item['doesnotexist'] field it raises an error, so the spider does seem to be using the items file and my user-defined item class AuctionDOTcomItems.
I'm really at a loss. I'd greatly appreciate any help with either:
A) getting pipelines.process_item() to work, OR
B) being able to catch the signal for an item having been set manually, so I can hand control to my own version of pipelines.process_item().
Reactor:
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from scrapy.utils.project import get_project_settings

class SpiderRun:
    def __init__(self, spider):
        settings = get_project_settings()
        mySettings = {'ITEM_PIPELINES': {'estatescraper.pipelines.EstatescraperXLSwriter': 300}}
        settings.overrides.update(mySettings)

        crawler = Crawler(settings)
        crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
        crawler.configure()
        crawler.crawl(spider)
        crawler.start()
        # log.start()
        reactor.run()  # the script will block here until the spider_closed signal is sent
        self.cleanup()

    def cleanup(self):
        print "SpiderRun done"  #333
        pass

if __name__ == "__main__":
    from estatescraper import AuctionDOTcom
    spider = AuctionDOTcom()
    r = SpiderRun(spider)
Spider:
from scrapy.xlib.pydispatch import dispatcher
from scrapy.http import Request
from scrapy.selector import Selector
from scrapy import signals
from scrapy.spider import Spider

from auctiondotcomurls import AuctionDOTcomURLs
from auctiondotcomitems import AuctionDOTcomItems
from auctiondotcomgetitems import AuctionDOTcomGetItems

import urlparse
import time
import sys

class AuctionDOTcom(Spider):
    def __init__(self,
                 limit=50,
                 miles=250,
                 zip=None,
                 asset_types="",
                 auction_types="",
                 property_types=""):
        self.name = "auction.com"
        self.allowed_domains = ["auction.com"]
        self.start_urls = AuctionDOTcomURLs(limit, miles, zip, asset_types,
                                            auction_types, property_types)
        dispatcher.connect(self.testsignal, signals.item_scraped)

    # def _item_passed(self, item):
    #     print "item = ", item  #333

    def testsignal(self):
        print "in csvwrite"  #333

    def parse(self, response):
        sel = Selector(response)
        listings = sel.xpath('//div[@class="contentDetail searchResult"]')
        for listing in listings:
            item = AuctionDOTcomItems()
            item['propertyID'] = ''.join(set(listing.xpath('./@property-id').extract()))
            print "item['propertyID'] = ", item['propertyID']  #333

            # item = AuctionDOTcomGetItems(listing)

            # ################
            # # DEMONSTRATION ONLY
            # print "######################################"
            # for i in item:
            #     print i + ": " + str(item[i])

        next = set(sel.xpath('//a[contains(text(),"Next")]//@href').extract())
        for i in next:
            yield Request("http://%s/%s" % (urlparse.urlparse(response.url).hostname, i),
                          callback=self.parse)

if __name__ == "__main__":
    from estatescraper import SpiderRun
    from estatescraper import AuctionDOTcom
    spider = AuctionDOTcom()
    r = SpiderRun(spider)
Pipeline:
import csv
from csv import DictWriter

# class TutorialPipeline(object):
#     def process_item(self, item, spider):
#         return item

class EstatescraperXLSwriter(object):
    def __init__(self):
        print "Ive started the __init__ in the pipeline"  #333
        self.brandCategoryCsv = csv.writer(open('test.csv', 'wb'),
                                           delimiter=',',
                                           quoting=csv.QUOTE_MINIMAL)
        self.brandCategoryCsv.writerow(['Property ID', 'Asset Type'])

    def open_spider(self, spider):
        print "Hit open_spider in EstatescraperXLSwriter"  #333

    def process_item(self, item, spider):
        print "attempting to run process_item"  #333
        self.brandCategoryCsv.writerow([item['propertyID'],
                                        item['assetType']])
        return item

    def close_spider(self, spider):
        print "Hit close_spider in EstatescraperXLSwriter"  #333
        pass

if __name__ == "__main__":
    o = EstatescraperXLSwriter()
Items:
from scrapy.item import Item, Field

class AuctionDOTcomItems(Item):
    """"""
    propertyID = Field()  # <uniqueID>ABCD1234</uniqueID>
Output:
Ive started the __init__ in the pipeline
Hit open_spider in EstatescraperXLSwriter
item['propertyID'] = 1590613
item['propertyID'] = 1466738
(...)
item['propertyID'] = 1639764
Hit close_spider in EstatescraperXLSwriter
SpiderRun done
Logged output:
Ive started the __init__ in the pipeline
Hit open_spider in EstatescraperXLSwriter
2014-02-27 17:44:12+0100 [auction.com] INFO: Closing spider (finished)
2014-02-27 17:44:12+0100 [auction.com] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 240,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 40640,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 2, 27, 16, 44, 12, 238000),
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2014, 2, 27, 16, 44, 9, 203000)}
2014-02-27 17:44:12+0100 [auction.com] INFO: Spider closed (finished)
Answer 0 (score: 0)
I don't see you yielding any items in def parse, only Request objects. At some point try "yield item" in your "for listing in listings:" loop. – paul t. Feb 27 at 17:42
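The comment above pinpoints the problem: parse() builds each item inside the listings loop but never yields it, so the engine never receives it, no item signals fire, and the pipeline's process_item() is never invoked. A minimal plain-Python sketch of the difference (no Scrapy needed; the tuple tags and listing IDs here are made up purely for illustration):

```python
def parse_broken(listings):
    """Mimics the original parse: items are built but never yielded."""
    for listing in listings:
        item = {'propertyID': listing}  # created, then silently discarded
    yield ('request', 'next-page')      # only request-like objects escape

def parse_fixed(listings):
    """Same loop, but each item is yielded back to the caller."""
    for listing in listings:
        yield ('item', {'propertyID': listing})
    yield ('request', 'next-page')

broken = list(parse_broken(['1590613', '1466738']))
fixed = list(parse_fixed(['1590613', '1466738']))

print(len([x for x in broken if x[0] == 'item']))  # 0 -> pipeline never fires
print(len([x for x in fixed if x[0] == 'item']))   # 2 -> process_item runs once per item
```

In the real spider the fix is simply adding "yield item" as the last line of the "for listing in listings:" loop in parse(), after the propertyID field is set.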