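Scrapy pipeline with SQLAlchemy: check whether an item exists before inserting it into the DB?

Time: 2019-02-02 15:12:02

Tags: python sqlalchemy scrapy scrapy-pipeline

I'm writing a Scrapy spider that crawls YouTube videos and captures the channel name, subscriber count, link, and so on. I copied this SQLAlchemy code from a tutorial and got it working, but every time I run the scraper I get duplicate entries in the database.

How can I check whether the scraped data already exists in the database and, if it does, skip inserting it?

Here is my pipeline.py code: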

from sqlalchemy.orm import sessionmaker
from models import Channels, db_connect, create_channel_table

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class YtscraperPipeline(object):
    """YTscraper pipeline for storing scraped items in the database"""

    def __init__(self):
        # Initializes the database connection and sessionmaker.
        # Creates the channels table.
        engine = db_connect()
        create_channel_table(engine)
        self.Session = sessionmaker(bind=engine)

    def process_item(self, item, spider):
        """Save a YouTube channel in the database.

        This method is called for every item pipeline component.
        """
        session = self.Session()
        channel = Channels(**item)

        try:
            session.add(channel)
            session.commit()
        except:
            session.rollback()
            raise
        finally:
            session.close()

        return item

Here is my models.py:

from sqlalchemy import create_engine, Column, Integer, String, DateTime
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.engine.url import URL

import settings


DeclarativeBase = declarative_base()


def db_connect():
    """
    Performs database connection using database settings from settings.py.
    Returns sqlalchemy engine instance
    """
    return create_engine(URL(**settings.DATABASE))


def create_channel_table(engine):
    """Create every table declared on DeclarativeBase (here: ytchannels) if it does not exist yet."""
    DeclarativeBase.metadata.create_all(engine)


class Channels(DeclarativeBase):
    """SQLAlchemy model for a scraped YouTube channel"""
    __tablename__ = "ytchannels"

    id = Column(Integer, primary_key=True)
    ctitle = Column('title', String)
    clink = Column('link', String, nullable=True)
    csubs = Column('subs', String, nullable=True)

    date = Column('date', DateTime, nullable=True)

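I don't want duplicates added to the database. What should I do?

This is what I get when I dump the table after each run; basically the same information is added over and over.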

 id |        title         |                           link                           |  subs   |            date            
----+----------------------+----------------------------------------------------------+---------+----------------------------
  1 | Ivan on Tech         | https://www.youtube.com/user/LiljeqvistIvan              | 195,249 | 2019-02-02 15:09:48.236281
  2 | DataDash             | https://www.youtube.com/channel/UCCatR7nWbYrkVXdxXb4cGXw | 315,691 | 2019-02-02 15:09:49.517085
  3 | Tone Vays            | https://www.youtube.com/channel/UCbiWJYRg8luWHnmNkJRZEnw | 82,588  | 2019-02-02 15:09:52.502221
  4 | Crypt0               | https://www.youtube.com/user/obham001                    | 119,046 | 2019-02-02 15:09:52.895278
  5 | The Modern Investor  | https://www.youtube.com/channel/UC-5HLi3buMzdxjdTdic3Aig | 122,228 | 2019-02-02 15:09:52.990033
  6 | Decentralized TV     | https://www.youtube.com/channel/UCueLJ4vLHTwMpYILmdBjRlg | 79,211  | 2019-02-02 15:09:53.108132
  7 | Crypto Daily         | https://www.youtube.com/channel/UC67AEEecqFEc92nVvcqKdhA | 121,341 | 2019-02-02 15:09:53.138157
  8 | RoadtoRoota          | https://www.youtube.com/user/RoadtoRoota                 | 54,954  | 2019-02-02 15:09:54.386956
  9 | Altcoin Buzz         | https://www.youtube.com/channel/UCGyqEtcGQQtXyUwvcy7Gmyg | 210,547 | 2019-02-02 15:09:54.412399
 10 | TheChartGuys         | https://www.youtube.com/channel/UCnqZ2hx679DqRi6khRUNw2g | 113,431 | 2019-02-02 15:09:55.36888
 11 | Ivan on Tech         | https://www.youtube.com/user/LiljeqvistIvan              | 195,249 | 2019-02-02 15:09:55.563061
 12 | Altcoin Daily        | https://www.youtube.com/channel/UCbLhGKVY-bJPcawebgtNfbw | 62,543  | 2019-02-02 15:09:56.327525
 13 | The Moon             | https://www.youtube.com/channel/UCc4Rz_T9Sb1w5rqqo9pL1Og | 37,291  | 2019-02-02 15:09:56.376596
 14 | Alessio Rastani      | https://www.youtube.com/user/alessiorastani              | 176,025 | 2019-02-02 15:09:56.439162
 15 | CryptosRUs           | https://www.youtube.com/channel/UCI7M65p3A-D3P4v5qW8POxQ | 51,387  | 2019-02-02 15:09:56.482699
 16 | Crypto Zombie        | https://www.youtube.com/channel/UCiUnrCUGCJTCC7KjuW493Ww | 46,715  | 2019-02-02 15:09:56.582438
 17 | Crypto Love          | https://www.youtube.com/channel/UCu7Sre5A1NMV8J3s2FhluCw | 93,999  | 2019-02-02 15:09:56.792019
 18 | Crypto Kirby Trading | https://www.youtube.com/channel/UCOaew10hdmtfa0MinTjOBqg | 31,333  | 2019-02-02 15:09:58.092356
 19 | sunny decree         | https://www.youtube.com/user/d3cr33                      | 80,294  | 2019-02-02 15:09:58.127674
 20 | Crypto Jebb          | https://www.youtube.com/channel/UCviqt5aaucA1jP3qFmorZLQ | 17,531  | 2019-02-02 15:09:58.396679
 21 | Chico Crypto         | https://www.youtube.com/channel/UCHop-jpf-huVT1IYw79ymPw | 29,144  | 2019-02-02 15:09:58.467988
 22 | Ivan on Tech         | https://www.youtube.com/user/LiljeqvistIvan              | 195,249 | 2019-02-02 15:44:46.905164
 23 | DataDash             | https://www.youtube.com/channel/UCCatR7nWbYrkVXdxXb4cGXw | 315,688 | 2019-02-02 15:44:49.13279
 24 | Crypto Daily         | https://www.youtube.com/channel/UC67AEEecqFEc92nVvcqKdhA | 121,342 | 2019-02-02 15:44:50.450665
 25 | The Modern Investor  | https://www.youtube.com/channel/UC-5HLi3buMzdxjdTdic3Aig | 122,226 | 2019-02-02 15:44:50.513322
 26 | Tone Vays            | https://www.youtube.com/channel/UCbiWJYRg8luWHnmNkJRZEnw | 82,589  | 2019-02-02 15:44:50.546499
 27 | Crypt0               | https://www.youtube.com/user/obham001                    | 119,040 | 2019-02-02 15:44:50.642958
 28 | Ivan on Tech         | https://www.youtube.com/user/LiljeqvistIvan              | 195,249 | 2019-02-02 15:44:50.951154
 29 | Decentralized TV     | https://www.youtube.com/channel/UCueLJ4vLHTwMpYILmdBjRlg | 79,211  | 2019-02-02 15:44:51.191991
 30 | Altcoin Buzz         | https://www.youtube.com/channel/UCGyqEtcGQQtXyUwvcy7Gmyg | 210,546 | 2019-02-02 15:44:51.266842
 31 | Alessio Rastani      | https://www.youtube.com/user/alessiorastani              | 176,027 | 2019-02-02 15:44:51.420558
 32 | The Moon             | https://www.youtube.com/channel/UCc4Rz_T9Sb1w5rqqo9pL1Og | 37,294  | 2019-02-02 15:44:52.020989
 33 | RoadtoRoota          | https://www.youtube.com/user/RoadtoRoota                 | 54,954  | 2019-02-02 15:44:52.177793
 34 | TheChartGuys         | https://www.youtube.com/channel/UCnqZ2hx679DqRi6khRUNw2g | 113,437 | 2019-02-02 15:44:52.245701
 35 | Altcoin Daily        | https://www.youtube.com/channel/UCbLhGKVY-bJPcawebgtNfbw | 62,538  | 2019-02-02 15:44:52.864349
 36 | Crypto Zombie        | https://www.youtube.com/channel/UCiUnrCUGCJTCC7KjuW493Ww | 46,716  | 2019-02-02 15:44:53.042814
 37 | CryptosRUs           | https://www.youtube.com/channel/UCI7M65p3A-D3P4v5qW8POxQ | 51,388  | 2019-02-02 15:44:53.246394
 38 | Crypto Kirby Trading | https://www.youtube.com/channel/UCOaew10hdmtfa0MinTjOBqg | 31,333  | 2019-02-02 15:44:53.54117
 39 | sunny decree         | https://www.youtube.com/user/d3cr33                      | 80,294  | 2019-02-02 15:44:54.288063
 40 | Crypto Love          | https://www.youtube.com/channel/UCu7Sre5A1NMV8J3s2FhluCw | 93,998  | 2019-02-02 15:44:54.591665
 41 | Crypto Jebb          | https://www.youtube.com/channel/UCviqt5aaucA1jP3qFmorZLQ | 17,531  | 2019-02-02 15:44:54.769744
 42 | Chico Crypto         | https://www.youtube.com/channel/UCHop-jpf-huVT1IYw79ymPw | 29,148  | 2019-02-02 15:44:55.791358

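1 Answer:

Answer 0: (score: 2)

If I understand you correctly, you need some unique identifier to check whether a scraped result already exists in your database. For example, you could use the 'title' column. With that approach, you can modify the process_item method like this: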

class YtscraperPipeline(object):

    def __init__(self):
        # Initializes the database connection and opens one session
        # that is reused for the whole crawl.
        engine = db_connect()
        create_channel_table(engine)
        Session = sessionmaker(bind=engine)
        self.session = Session()

    def process_item(self, item, spider):
        # Check whether a row with this title already exists in the DB.
        # The Channels model exposes the 'title' column as the attribute
        # ctitle, and Channels(**item) implies the item uses the same
        # field names (ctitle, csubs, date).
        item_exists = self.session.query(Channels).filter_by(ctitle=item['ctitle']).first()
        # If the row exists, just update its 'date' and 'subs' columns.
        if item_exists:
            item_exists.date = item['date']
            item_exists.csubs = item['csubs']
            print('Item {} updated.'.format(item['ctitle']))
        # Otherwise insert a new row.
        else:
            new_item = Channels(**item)
            self.session.add(new_item)
            print('New item {} added to DB.'.format(item['ctitle']))
        return item

    def close_spider(self, spider):
        # Commit everything once the spider has finished scraping.
        try:
            self.session.commit()
        except:
            self.session.rollback()
            raise
        finally:
            self.session.close()
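
To try the check-then-insert-or-update idea from the pipeline above in isolation, here is a minimal, self-contained sketch. It is an illustration under stated assumptions, not the code from the question: it uses the standard-library sqlite3 module with an in-memory database instead of SQLAlchemy so that it runs on its own, it keys on the link column rather than the title (on the assumption that channel URLs are less likely to collide than names), and the helper names create_table and save_channel as well as the UNIQUE constraint are hypothetical additions. The sample values are the two DataDash rows from the table dump in the question.

```python
import sqlite3


def create_table(conn):
    # Same columns as the ytchannels table from the question.
    # Assumption: link is declared UNIQUE; the original model has no such constraint.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS ytchannels (
            id    INTEGER PRIMARY KEY,
            title TEXT,
            link  TEXT UNIQUE,
            subs  TEXT,
            date  TEXT
        )
        """
    )


def save_channel(conn, item):
    """Insert the channel if its link is new, otherwise update subs and date.

    Returns "inserted" or "updated" so the caller can log what happened.
    """
    row = conn.execute(
        "SELECT id FROM ytchannels WHERE link = ?", (item["link"],)
    ).fetchone()
    if row is None:
        conn.execute(
            "INSERT INTO ytchannels (title, link, subs, date) VALUES (?, ?, ?, ?)",
            (item["title"], item["link"], item["subs"], item["date"]),
        )
        return "inserted"
    conn.execute(
        "UPDATE ytchannels SET subs = ?, date = ? WHERE id = ?",
        (item["subs"], item["date"], row[0]),
    )
    return "updated"


conn = sqlite3.connect(":memory:")
create_table(conn)

# The same channel as scraped in two separate runs.
first_run = {
    "title": "DataDash",
    "link": "https://www.youtube.com/channel/UCCatR7nWbYrkVXdxXb4cGXw",
    "subs": "315,691",
    "date": "2019-02-02 15:09:49",
}
second_run = dict(first_run, subs="315,688", date="2019-02-02 15:44:49")

results = [save_channel(conn, first_run), save_channel(conn, second_run)]
conn.commit()

rows = conn.execute("SELECT title, subs FROM ytchannels").fetchall()
print(results)
print(rows)
```

Running it prints ['inserted', 'updated'] followed by a single DataDash row carrying the newer subscriber count. A UNIQUE constraint like the one assumed here also lets the database itself reject a duplicate link if some code path skips the existence check.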