I have a very simple piece of code, shown below. The scraping itself works: I can see all of the print statements producing the correct data. In the Pipeline, initialization works fine. However, the process_item function is never called, because the print statement at the start of that function never executes.
Spider: comosham.py
import scrapy
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from activityadvisor.items import ComoShamLocation
from activityadvisor.items import ComoShamActivity
from activityadvisor.items import ComoShamRates
import re

class ComoSham(Spider):
    name = "comosham"
    allowed_domains = ["www.comoshambhala.com"]
    start_urls = [
        "http://www.comoshambhala.com/singapore/classes/schedules",
        "http://www.comoshambhala.com/singapore/about/location-contact",
        "http://www.comoshambhala.com/singapore/rates-and-offers/rates-classes",
        "http://www.comoshambhala.com/singapore/rates-and-offers/rates-classes/rates-private-classes"
    ]

    def parse(self, response):
        category = (response.url)[39:44]
        print 'in parse'
        if category == 'class':
            pass
            """self.gen_req_class(response)"""
        elif category == 'about':
            print 'about to call parse_location'
            self.parse_location(response)
        elif category == 'rates':
            pass
            """self.parse_rates(response)"""
        else:
            print 'Cant find appropriate category! check check check!! Am raising Level 5 ALARM - You are a MORON :D'

    def parse_location(self, response):
        print 'in parse_location'
        item = ComoShamLocation()
        item['category'] = 'location'
        loc = Selector(response).xpath('((//div[@id = "node-2266"]/div/div/div)[1]/div/div/p//text())').extract()
        item['address'] = loc[2] + loc[3] + loc[4] + (loc[5])[1:11]
        item['pin'] = (loc[5])[11:18]
        item['phone'] = (loc[9])[6:20]
        item['fax'] = (loc[10])[6:20]
        item['email'] = loc[12]
        print item['address'], item['pin'], item['phone'], item['fax'], item['email']
        return item
Item file:
import scrapy
from scrapy.item import Item, Field

class ComoShamLocation(Item):
    address = Field()
    pin = Field()
    phone = Field()
    fax = Field()
    email = Field()
    category = Field()
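(The pipeline file itself is not shown in the question. A minimal sketch of what it presumably looks like, with the module and class names assumed, matching the behaviour described above:)

    # pipelines.py -- ComoShamPipeline is an assumed name; the asker's
    # actual pipeline code was not posted
    class ComoShamPipeline(object):
        def __init__(self):
            # per the question, initialization works fine
            print 'pipeline initialized'

        def process_item(self, item, spider):
            # per the question, this print never executes
            print 'in process_item'
            return item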
Answer 0 (score: 10)
Your problem is that you never actually yield the item. parse_location returns the item to parse, but parse never yields it onward.
The solution is to replace:
    self.parse_location(response)
with:
    yield self.parse_location(response)
More specifically, process_item is never called if no items are yielded.
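Put in the context of the spider above, the relevant branch of parse becomes a generator (a minimal sketch of the fix):

    def parse(self, response):
        category = (response.url)[39:44]
        if category == 'about':
            # yield hands the item returned by parse_location to the
            # Scrapy engine, which then runs it through the item pipelines
            yield self.parse_location(response)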
Answer 1 (score: 1)
Use ITEM_PIPELINES in settings.py:
    ITEM_PIPELINES = ['project_name.pipelines.pipeline_class']
Answer 2 (score: 0)
Adding to the answers above:
1. Remember to add the following line to settings.py! (a concrete sketch follows this list)
    ITEM_PIPELINES = {'[YOUR_PROJECT_NAME].pipelines.[YOUR_PIPELINE_CLASS]': 300}
2. Yield the item when your spider runs!
    yield my_item
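For the project in the question, that settings.py entry would presumably look like the following; the pipeline class name ComoShamPipeline is an assumption, since the pipeline code was not posted:

    # settings.py -- ComoShamPipeline is an assumed class name
    ITEM_PIPELINES = {'activityadvisor.pipelines.ComoShamPipeline': 300}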
Answer 3 (score: 0)
This solved my problem: I was dropping all items before this pipeline was reached, so its process_item() was never called even though open_spider and close_spider were. My solution was simply to reorder the pipelines so that this one runs before the other pipeline that drops the items.
Scrapy Pipeline Documentation.
Remember that Scrapy only calls Pipeline.process_item() when there is an item to process!
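Concretely, that order is controlled by the priority numbers in ITEM_PIPELINES: lower numbers run earlier. A sketch with assumed pipeline names:

    # settings.py -- both class names here are assumptions; lower number = runs earlier
    ITEM_PIPELINES = {
        'activityadvisor.pipelines.ComoShamPipeline': 100,    # runs first, sees every item
        'activityadvisor.pipelines.DropItemsPipeline': 200,   # the item-dropping pipeline runs after
    }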