Python, Scrapy, Pipeline: the function "process_item" is not being called

Date: 2015-07-10 02:18:34

Tags: python scrapy pipeline

I have a very simple piece of code, shown below. The scraping itself works fine and I can see all of the print statements producing the correct data. In the pipeline, initialization works fine. However, the process_item function is never called, because the print statement at the start of that function never executes.

Spider: comosham.py

import scrapy
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from activityadvisor.items import ComoShamLocation
from activityadvisor.items import ComoShamActivity
from activityadvisor.items import ComoShamRates
import re


class ComoSham(Spider):
    name = "comosham"
    allowed_domains = ["www.comoshambhala.com"]
    start_urls = [
        "http://www.comoshambhala.com/singapore/classes/schedules",
        "http://www.comoshambhala.com/singapore/about/location-contact",
        "http://www.comoshambhala.com/singapore/rates-and-offers/rates-classes",
        "http://www.comoshambhala.com/singapore/rates-and-offers/rates-classes/rates-private-classes"
    ]

    def parse(self, response):  
        category = (response.url)[39:44]
        print 'in parse'
        if category == 'class':
            pass
            """self.gen_req_class(response)"""
        elif category == 'about':
            print 'about to call parse_location'
            self.parse_location(response)
        elif category == 'rates':
            pass
            """self.parse_rates(response)"""
        else:
            print 'Cant find appropriate category! check check check!! Am raising Level 5 ALARM - You are a MORON :D'


    def parse_location(self, response):
        print 'in parse_location'       
        item = ComoShamLocation()
        item['category'] = 'location'
        loc = Selector(response).xpath('((//div[@id = "node-2266"]/div/div/div)[1]/div/div/p//text())').extract()
        item['address'] = loc[2]+loc[3]+loc[4]+(loc[5])[1:11]
        item['pin'] = (loc[5])[11:18]
        item['phone'] = (loc[9])[6:20]
        item['fax'] = (loc[10])[6:20]
        item['email'] = loc[12]
        print item['address'],item['pin'],item['phone'],item['fax'],item['email']
        return item

Item file:

import scrapy
from scrapy.item import Item, Field

class ComoShamLocation(Item):
    address = Field()
    pin = Field()
    phone = Field()
    fax = Field()
    email = Field()
    category = Field()

4 answers:

Answer 0 (score: 10):

Your problem is that you never actually yield the item. parse_location returns the item to parse, but parse never yields it.

The solution is to replace:

self.parse_location(response)

with:

yield self.parse_location(response)

More specifically, process_item is never called if no item is yielded.
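
To make that concrete, here is a minimal sketch of how the fix fits into the question's parse method (only the relevant branch is shown; everything else stays as the poster wrote it):

    def parse(self, response):
        category = response.url[39:44]
        if category == 'about':
            # parse_location() builds and returns the item; yielding it hands it
            # to the Scrapy engine, which then routes it through the enabled item
            # pipelines, so process_item() finally runs.
            yield self.parse_location(response)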

Answer 1 (score: 1):

Make sure ITEM_PIPELINES is set in settings.py:

ITEM_PIPELINES = ['project_name.pipelines.pipeline_class']

Answer 2 (score: 0):

Adding to the answers above:

1. Remember to add the following line to settings.py: ITEM_PIPELINES = {'[YOUR_PROJECT_NAME].pipelines.[YOUR_PIPELINE_CLASS]': 300}
2. Yield the item once your spider is done building it: yield my_item
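
Since the poster's pipeline code is not shown above, a minimal pipeline that such a setting could point at might look like the sketch below. The class name ComoShamPipeline is a placeholder, not taken from the question; only the activityadvisor project name comes from the question's imports, and the print follows the question's Python 2 style:

    # pipelines.py -- minimal sketch; ComoShamPipeline is a hypothetical name
    class ComoShamPipeline(object):

        def process_item(self, item, spider):
            # Called once for every item a spider callback yields; it never
            # runs if the spider yields no items.
            print 'in process_item:', dict(item)
            return item

    # settings.py -- enables the sketch above
    ITEM_PIPELINES = {'activityadvisor.pipelines.ComoShamPipeline': 300}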

Answer 3 (score: 0):

This solved my problem: all the items were being dropped before this pipeline was reached, so process_item() was never called, although open_spider and close_spider were. My solution was simply to change the order so that this pipeline runs before the other pipeline that drops the items.
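
That ordering is controlled by the numbers in ITEM_PIPELINES: lower numbers run first. A sketch of the reordering, with hypothetical pipeline names (only the activityadvisor project name comes from the question):

    # settings.py -- hypothetical pipeline names; lower numbers run earlier
    ITEM_PIPELINES = {
        'activityadvisor.pipelines.ComoShamPipeline': 300,       # sees every item first
        'activityadvisor.pipelines.DropUnwantedPipeline': 800,   # may raise DropItem afterwards
    }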

See the Scrapy Item Pipeline documentation.

Remember, Scrapy only calls Pipeline.process_item() when there is an item to process!