How to prevent duplicate data from being inserted into the database with Scrapy

Posted: 2015-04-03 21:48:05

Tags: python mysql scrapy

Can anyone help me with this? I'm a bit of a newbie with Scrapy/Python. I can't seem to stop duplicate data from being inserted into the database. For example, say my database already has a Mazda priced at $4000. If that 'car' already exists, or the same car and price combination exists, I don't want the spider to insert the crawled data again.

price | car
-------------
$4000 | Mazda   <----
$3000 | Mazda 3 <----
$4000 | BMW
$4000 | Mazda 3 <---- I also don't want two results like this
$4000 | Mazda   <---- Any help will be greatly appreciated. Thanks!


pipeline.py
-------------------
from scrapy.exceptions import DropItem
from twisted.enterprise import adbapi
import MySQLdb
import MySQLdb.cursors

----------------------------------
When I add this piece of code, the crawled data does not get saved; when I remove it, the data does save to the database.



class DuplicatesPipeline(object):

    def __init__(self):
         self.car_seen = set()

    def process_item(self, item, spider):
        if item['car'] in self.car_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.car_seen.add(item['car'])
            return item  
--------------------------------------
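The dedup logic itself is just a set-membership check, so it can be exercised outside Scrapy. A minimal standalone sketch (with `DropItem` stubbed as a plain exception, since no crawler is running here):

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem in this standalone sketch."""

class DuplicatesPipeline:
    def __init__(self):
        self.car_seen = set()

    def process_item(self, item, spider=None):
        # Drop any item whose 'car' value has already been seen this run.
        if item['car'] in self.car_seen:
            raise DropItem("Duplicate item found: %s" % item)
        self.car_seen.add(item['car'])
        return item

pipeline = DuplicatesPipeline()
pipeline.process_item({'car': 'Mazda', 'price': '$4000'})    # passes through
pipeline.process_item({'car': 'Mazda 3', 'price': '$3000'})  # passes through
try:
    pipeline.process_item({'car': 'Mazda', 'price': '$4000'})
except DropItem:
    pass  # second Mazda never reaches the database pipeline
```

Note that keying the set on `item['car']` alone means a Mazda 3 at a different price would also be dropped; keying on `(item['car'], item['price'])` would dedupe exact pairs instead.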

class MySQLStorePipeline(object):  

    def __init__(self):  
        self.dbpool = adbapi.ConnectionPool('MySQLdb',  
            db = 'test',  
            user = 'root',  
            passwd = 'test',  
            cursorclass = MySQLdb.cursors.DictCursor,  
            charset = 'utf8',  
            use_unicode = False  
        )  

    def _conditional_insert(self, tx, item):
        if item.get('price'):
            tx.execute(
                "insert into data (price, car) values (%s, %s)",
                (item['price'], item['car'])
            )

    def process_item(self, item, spider):                
        query = self.dbpool.runInteraction(self._conditional_insert, item)   
        return item



settings.py
------------
SPIDER_MODULES = ['car.spiders']
NEWSPIDER_MODULE = 'car.spiders'
ITEM_PIPELINES = ['car.pipelines.MySQLStorePipeline'] 

1 Answer:

Answer 0 (score: 2)

Found the problem. Make sure DuplicatesPipeline runs first.

settings.py
ITEM_PIPELINES = {
    'car.pipelines.DuplicatesPipeline': 100,
    'car.pipelines.MySQLStorePipeline': 200,
}
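One caveat worth adding: `car_seen` lives only in memory, so duplicates crawled on a *later* run will still be inserted. A common complement is to enforce uniqueness in the database itself, e.g. a UNIQUE key on `car` plus `INSERT IGNORE` in MySQL. The sketch below illustrates the same idea with Python's built-in sqlite3 (used here only so the example is self-contained; the MySQL syntax differs as noted in the comments):

```python
import sqlite3

# In MySQL this would be: ALTER TABLE data ADD UNIQUE KEY (car);
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (price TEXT, car TEXT UNIQUE)")

rows_to_insert = [
    ("$4000", "Mazda"),
    ("$3000", "Mazda 3"),
    ("$4000", "Mazda"),  # duplicate: silently skipped by the database
]
for price, car in rows_to_insert:
    # MySQL equivalent: INSERT IGNORE INTO data (price, car) VALUES (%s, %s)
    conn.execute("INSERT OR IGNORE INTO data (price, car) VALUES (?, ?)",
                 (price, car))

rows = conn.execute("SELECT car FROM data").fetchall()
```

With the constraint in place, duplicates are rejected even across separate crawl runs, and the pipeline-level set becomes an optimization rather than the only line of defense.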