Saving spider results to a database

Time: 2015-02-03 14:08:58

Tags: python python-3.x sqlalchemy web-scraping lxml

I am currently looking for a good way to save my scraped data to a database.

Application flow:

  1. Run the spider (data scraper); the file lives in spiders/.
  2. Once the data has been collected successfully, save the data/items (title, link, pubDate) to the database using the class in pipeline.py.

I would appreciate help with how to get the data (title, link, pubDate) from spider.py into the database via pipeline.py. At the moment these files are not connected to each other; once the data has been scraped successfully, the pipeline needs to be notified so it can receive the data and save it.

I really appreciate any help.


    Spider.py

    import urllib.request
    import lxml.etree as ET   
    
    opener = urllib.request.build_opener()
    tree = ET.parse(opener.open('https://nordfront.se/feed'))
    
    
    items = [{'title': item.find('title').text,
              'link': item.find('link').text,
              'pubdate': item.find('pubDate').text}
             for item in tree.xpath("/rss/channel/item")]
    
    for item in items:
        print(item['title'], item['link'], item['pubdate'])
    


    Models.py

    from sqlalchemy import create_engine, Column, Integer, String, DateTime
    from sqlalchemy.ext.declarative import declarative_base
    from sqlalchemy.engine.url import URL
    from sqlalchemy import UniqueConstraint
    import datetime
    
    import settings
    
    
    def db_connect():
        """
        Performs database connection using database settings from settings.py.
        Returns sqlalchemy engine instance
        """
        return create_engine(URL(**settings.DATABASE))
    
    
    DeclarativeBase = declarative_base()
    
    # <--snip-->
    
    def create_presstv_table(engine):
    
        DeclarativeBase.metadata.create_all(engine)
    
    def create_nordfront_table(engine):
    
        DeclarativeBase.metadata.create_all(engine)
    
    def _get_date():
        return datetime.datetime.now()
    
    
    class Nordfront(DeclarativeBase):
        """Sqlalchemy deals model"""
        __tablename__ = "nordfront"
    
        id = Column(Integer, primary_key=True)
        title = Column('title', String)
        description = Column('description', String, nullable=True)
        link = Column('link', String, unique=True)
        date = Column('date', String, nullable=True)
        created_at = Column('created_at', DateTime, default=_get_date)
    


    Pipeline.py

    from sqlalchemy.orm import sessionmaker
    from models import Nordfront, db_connect, create_nordfront_table
    
    class NordfrontPipeline(object):
        """Pipeline for storing scraped items in the database"""

        def __init__(self):
            """
            Initializes database connection and sessionmaker.
            Creates deals table.
            """
            engine = db_connect()
            create_nordfront_table(engine)
            self.Session = sessionmaker(bind=engine)

        def process_item(self, item, spider):
            """Save data in the database.

            This method is called for every item pipeline component.
            """
            session = self.Session()
            deal = Nordfront(**item)

            if session.query(Nordfront).filter_by(link=item['link']).first() is None:
                try:
                    session.add(deal)
                    session.commit()
                except:
                    session.rollback()
                    raise
                finally:
                    session.close()

                return item
    


    Settings.py

    DATABASE = {'drivername': 'postgres',
                'host': 'localhost',
                'port': '5432',
                'username': 'toothfairy',
                'password': 'password123',
                'database': 'news'}
    

1 Answer:

Answer (score: 2)

As far as I understand, this is a Scrapy-specific question. If so, you just need to activate your pipeline in settings.py:

ITEM_PIPELINES = {
    'myproj.pipeline.NordfrontPipeline': 100
}

This lets the engine know to send crawled items through the pipeline (see the control flow):

(Scrapy architecture / data-flow diagram)
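
Note that the spider.py shown in the question is a plain urllib/lxml script, so Scrapy has nothing to hand to the pipeline. For the ITEM_PIPELINES route to work, the scraping logic would need to live in an actual Scrapy spider that yields items. A minimal sketch, assuming a reasonably recent Scrapy version; the spider/class names are made up, and the dict keys are chosen to match the Nordfront model's columns because the pipeline instantiates the model with Nordfront(**item):

import scrapy


class NordfrontSpider(scrapy.Spider):
    name = "nordfront"
    start_urls = ["https://nordfront.se/feed"]

    def parse(self, response):
        # Walk the RSS <item> elements and yield one dict per entry.
        # Each yielded dict is handed to NordfrontPipeline.process_item().
        for node in response.xpath("//channel/item"):
            yield {
                "title": node.xpath("title/text()").extract_first(),
                "link": node.xpath("link/text()").extract_first(),
                "date": node.xpath("pubDate/text()").extract_first(),
            }

With the pipeline activated as above, running scrapy crawl nordfront would then push every yielded item through process_item().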


If we are not talking about Scrapy, then call process_item() directly from your spider:

from pipeline import NordfrontPipeline

...

pipeline = NordfrontPipeline()
for item in items:
    pipeline.process_item(item, None)

You may also want to remove the spider argument from the process_item() pipeline method, since it is not being used.
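
A sketch of that trimmed-down, non-Scrapy variant (same logic as the Pipeline.py above and reusing the question's models.py helpers; only the method signature changes):

from sqlalchemy.orm import sessionmaker

from models import Nordfront, db_connect, create_nordfront_table


class NordfrontPipeline(object):
    """Same pipeline as above, with the unused `spider` parameter dropped."""

    def __init__(self):
        engine = db_connect()
        create_nordfront_table(engine)
        self.Session = sessionmaker(bind=engine)

    def process_item(self, item):
        # Identical to the original process_item(), minus the `spider` argument.
        session = self.Session()
        if session.query(Nordfront).filter_by(link=item['link']).first() is None:
            try:
                session.add(Nordfront(**item))
                session.commit()
            except:
                session.rollback()
                raise
            finally:
                session.close()
            return item

The call in the spider loop then simply becomes pipeline.process_item(item).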