我正在写一个抓抓的蜘蛛来抓取youtube vid和捕获,名称,子用户数,链接等。我从教程中复制了此SQLalchemy代码并使其正常工作,但是每次我运行抓取器时,我都会在数据库。
如何检查刮取的数据是否已存在于数据库中,如果已存在,请不要进入数据库。...
这是我的pipeline.py代码
from sqlalchemy.orm import sessionmaker
from models import Channels, db_connect, create_channel_table
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
class YtscraperPipeline(object):
"""YTscraper pipeline for storing scraped items in the database"""
def __init__(self):
#Initializes database connection and sessionmaker.
#Creates deals table.
engine = db_connect()
create_channel_table(engine)
self.Session = sessionmaker(bind=engine)
def process_item(self, item, spider):
"""Save youtube channel in the database.
This method is called for every item pipeline component.
"""
session = self.Session()
channel = Channels(**item)
try:
session.add(channel)
session.commit()
except:
session.rollback()
raise
finally:
session.close()
return item
这是我的模特。py
from sqlalchemy import create_engine, Column, Integer, String, DateTime
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.engine.url import URL
import settings
DeclarativeBase = declarative_base()
def db_connect():
"""
Performs database connection using database settings from settings.py.
Returns sqlalchemy engine instance
"""
return create_engine(URL(**settings.DATABASE))
def create_channel_table(engine):
""""""
DeclarativeBase.metadata.create_all(engine)
class Channels(DeclarativeBase):
"""Sqlalchemy deals model"""
__tablename__ = "ytchannels"
id = Column(Integer, primary_key=True)
ctitle = Column('title', String)
clink = Column('link', String, nullable=True)
csubs = Column('subs', String, nullable=True)
date = Column('date', DateTime, nullable=True)
我不想重复添加到数据库中。我该怎么办?
这是我每次运行转储表时得到的,基本上是一遍又一遍地添加相同的信息。
id | title | link | subs | date
----+----------------------+----------------------------------------------------------+---------+----------------------------
1 | Ivan on Tech | https://www.youtube.com/user/LiljeqvistIvan | 195,249 | 2019-02-02 15:09:48.236281
2 | DataDash | https://www.youtube.com/channel/UCCatR7nWbYrkVXdxXb4cGXw | 315,691 | 2019-02-02 15:09:49.517085
3 | Tone Vays | https://www.youtube.com/channel/UCbiWJYRg8luWHnmNkJRZEnw | 82,588 | 2019-02-02 15:09:52.502221
4 | Crypt0 | https://www.youtube.com/user/obham001 | 119,046 | 2019-02-02 15:09:52.895278
5 | The Modern Investor | https://www.youtube.com/channel/UC-5HLi3buMzdxjdTdic3Aig | 122,228 | 2019-02-02 15:09:52.990033
6 | Decentralized TV | https://www.youtube.com/channel/UCueLJ4vLHTwMpYILmdBjRlg | 79,211 | 2019-02-02 15:09:53.108132
7 | Crypto Daily | https://www.youtube.com/channel/UC67AEEecqFEc92nVvcqKdhA | 121,341 | 2019-02-02 15:09:53.138157
8 | RoadtoRoota | https://www.youtube.com/user/RoadtoRoota | 54,954 | 2019-02-02 15:09:54.386956
9 | Altcoin Buzz | https://www.youtube.com/channel/UCGyqEtcGQQtXyUwvcy7Gmyg | 210,547 | 2019-02-02 15:09:54.412399
10 | TheChartGuys | https://www.youtube.com/channel/UCnqZ2hx679DqRi6khRUNw2g | 113,431 | 2019-02-02 15:09:55.36888
11 | Ivan on Tech | https://www.youtube.com/user/LiljeqvistIvan | 195,249 | 2019-02-02 15:09:55.563061
12 | Altcoin Daily | https://www.youtube.com/channel/UCbLhGKVY-bJPcawebgtNfbw | 62,543 | 2019-02-02 15:09:56.327525
13 | The Moon | https://www.youtube.com/channel/UCc4Rz_T9Sb1w5rqqo9pL1Og | 37,291 | 2019-02-02 15:09:56.376596
14 | Alessio Rastani | https://www.youtube.com/user/alessiorastani | 176,025 | 2019-02-02 15:09:56.439162
15 | CryptosRUs | https://www.youtube.com/channel/UCI7M65p3A-D3P4v5qW8POxQ | 51,387 | 2019-02-02 15:09:56.482699
16 | Crypto Zombie | https://www.youtube.com/channel/UCiUnrCUGCJTCC7KjuW493Ww | 46,715 | 2019-02-02 15:09:56.582438
17 | Crypto Love | https://www.youtube.com/channel/UCu7Sre5A1NMV8J3s2FhluCw | 93,999 | 2019-02-02 15:09:56.792019
18 | Crypto Kirby Trading | https://www.youtube.com/channel/UCOaew10hdmtfa0MinTjOBqg | 31,333 | 2019-02-02 15:09:58.092356
19 | sunny decree | https://www.youtube.com/user/d3cr33 | 80,294 | 2019-02-02 15:09:58.127674
20 | Crypto Jebb | https://www.youtube.com/channel/UCviqt5aaucA1jP3qFmorZLQ | 17,531 | 2019-02-02 15:09:58.396679
21 | Chico Crypto | https://www.youtube.com/channel/UCHop-jpf-huVT1IYw79ymPw | 29,144 | 2019-02-02 15:09:58.467988
22 | Ivan on Tech | https://www.youtube.com/user/LiljeqvistIvan | 195,249 | 2019-02-02 15:44:46.905164
23 | DataDash | https://www.youtube.com/channel/UCCatR7nWbYrkVXdxXb4cGXw | 315,688 | 2019-02-02 15:44:49.13279
24 | Crypto Daily | https://www.youtube.com/channel/UC67AEEecqFEc92nVvcqKdhA | 121,342 | 2019-02-02 15:44:50.450665
25 | The Modern Investor | https://www.youtube.com/channel/UC-5HLi3buMzdxjdTdic3Aig | 122,226 | 2019-02-02 15:44:50.513322
26 | Tone Vays | https://www.youtube.com/channel/UCbiWJYRg8luWHnmNkJRZEnw | 82,589 | 2019-02-02 15:44:50.546499
27 | Crypt0 | https://www.youtube.com/user/obham001 | 119,040 | 2019-02-02 15:44:50.642958
28 | Ivan on Tech | https://www.youtube.com/user/LiljeqvistIvan | 195,249 | 2019-02-02 15:44:50.951154
29 | Decentralized TV | https://www.youtube.com/channel/UCueLJ4vLHTwMpYILmdBjRlg | 79,211 | 2019-02-02 15:44:51.191991
30 | Altcoin Buzz | https://www.youtube.com/channel/UCGyqEtcGQQtXyUwvcy7Gmyg | 210,546 | 2019-02-02 15:44:51.266842
31 | Alessio Rastani | https://www.youtube.com/user/alessiorastani | 176,027 | 2019-02-02 15:44:51.420558
32 | The Moon | https://www.youtube.com/channel/UCc4Rz_T9Sb1w5rqqo9pL1Og | 37,294 | 2019-02-02 15:44:52.020989
33 | RoadtoRoota | https://www.youtube.com/user/RoadtoRoota | 54,954 | 2019-02-02 15:44:52.177793
34 | TheChartGuys | https://www.youtube.com/channel/UCnqZ2hx679DqRi6khRUNw2g | 113,437 | 2019-02-02 15:44:52.245701
35 | Altcoin Daily | https://www.youtube.com/channel/UCbLhGKVY-bJPcawebgtNfbw | 62,538 | 2019-02-02 15:44:52.864349
36 | Crypto Zombie | https://www.youtube.com/channel/UCiUnrCUGCJTCC7KjuW493Ww | 46,716 | 2019-02-02 15:44:53.042814
37 | CryptosRUs | https://www.youtube.com/channel/UCI7M65p3A-D3P4v5qW8POxQ | 51,388 | 2019-02-02 15:44:53.246394
38 | Crypto Kirby Trading | https://www.youtube.com/channel/UCOaew10hdmtfa0MinTjOBqg | 31,333 | 2019-02-02 15:44:53.54117
39 | sunny decree | https://www.youtube.com/user/d3cr33 | 80,294 | 2019-02-02 15:44:54.288063
40 | Crypto Love | https://www.youtube.com/channel/UCu7Sre5A1NMV8J3s2FhluCw | 93,998 | 2019-02-02 15:44:54.591665
41 | Crypto Jebb | https://www.youtube.com/channel/UCviqt5aaucA1jP3qFmorZLQ | 17,531 | 2019-02-02 15:44:54.769744
42 | Chico Crypto | https://www.youtube.com/channel/UCHop-jpf-huVT1IYw79ymPw | 29,148 | 2019-02-02 15:44:55.791358
答案 0 :(得分:2)
如果我对您的理解是正确的-您需要使用一些唯一的标识符来检查您的数据库中是否存在抓取的结果。
例如,您可以使用“标题”列。
使用这种方法,您可以像这样修改process_item
方法:
class YtscraperPipeline(object):
def __init__(self):
#Initializes database connection and sessionmaker.
engine = db_connect()
create_channel_table(engine)
Session = sessionmaker(bind=engine)
self.session = Session()
def process_item(self, item, spider):
# check if item with this title exists in DB
item_exists = self.session.query(Channels).filter_by(title=item['title']).first()
# if item exists in DB - we just update 'date' and 'subs' columns.
if item_exists:
item_exists.date = item['date']
item_exists.subs = item['subs']
print('Item {} updated.'.format(item['title']))
# if not - we insert new item to DB
else:
new_item = Channels(**item)
self.session.add(new_item)
print('New item {} added to DB.'.format(item['title']))
return item
def close_spider(self, spider):
# We commit and save all items to DB when spider finished scraping.
try:
self.session.commit()
except:
self.session.rollback()
raise
finally:
self.session.close()