Scrapy connects to MySQL only for certain websites

Date: 2018-01-04 20:51:51

Tags: python mysql scrapy

When I scrape "http://www.example.com", I can connect to MySQL and insert values into the database.

However, when I try to scrape another website, e.g. "https://www.nytimes.com", the connection fails. I can't figure out why:

The error when trying to crawl https://www.nytimes.com:

2018-01-04 14:38:01 [scrapy.middleware] INFO: Enabled item pipelines:
['properties.pipelines.MysqlWriter']

2018-01-04 14:38:02 [basic] ERROR: Can't connect to MySQL: mysql://root:password@localhost:3306/cat

My pipeline:

import traceback

import dj_database_url
import MySQLdb

from twisted.internet import defer
from twisted.enterprise import adbapi
from scrapy.exceptions import NotConfigured


class MysqlWriter(object):
    """
    A spider that writes to MySQL databases
    """

    @classmethod
    def from_crawler(cls, crawler):
        """Retrieves scrapy crawler and accesses pipeline's settings"""

        # Get MySQL URL from settings
        mysql_url = crawler.settings.get('MYSQL_PIPELINE_URL', None)

        # If it doesn't exist, disable the pipeline
        if not mysql_url:
            raise NotConfigured

        # Create the class
        return cls(mysql_url)

    def __init__(self, mysql_url):
        """Opens a MySQL connection pool"""

        # Store the url for future reference
        self.mysql_url = mysql_url
        # Report connection error only once
        self.report_connection_error = True

        # Parse MySQL URL and try to initialize a connection
        conn_kwargs = MysqlWriter.parse_mysql_url(mysql_url)
        self.dbpool = adbapi.ConnectionPool('MySQLdb',
                                            charset='utf8',
                                            use_unicode=True,
                                            connect_timeout=5,
                                            **conn_kwargs)

    def close_spider(self, spider):
        """Discard the database pool on spider close"""
        self.dbpool.close()

    @defer.inlineCallbacks
    def process_item(self, item, spider):
        """Processes the item. Does insert into MySQL"""

        logger = spider.logger

        try:
            yield self.dbpool.runInteraction(self.do_replace, item)
        except MySQLdb.OperationalError:
            if self.report_connection_error:
                logger.error("Can't connect to MySQL: %s" % self.mysql_url)
                self.report_connection_error = False
        except:
            print traceback.format_exc()

        # Return the item for the next stage
        defer.returnValue(item)

    @staticmethod
    def do_replace(tx, item):
        """Does the actual REPLACE INTO"""

        sql = """REPLACE INTO text2 (url, text)
        VALUES (%s,%s)"""

        args = (
            item["url"],
            item["words"],
        )

        tx.execute(sql, args)

    @staticmethod
    def parse_mysql_url(mysql_url):
        """
        Parses mysql url and prepares arguments for
        adbapi.ConnectionPool()
        """

        params = dj_database_url.parse(mysql_url)

        conn_kwargs = {}
        conn_kwargs['host'] = params['HOST']
        conn_kwargs['user'] = params['USER']
        conn_kwargs['passwd'] = params['PASSWORD']
        conn_kwargs['db'] = params['NAME']
        conn_kwargs['port'] = params['PORT']

        # Remove items with empty values
        conn_kwargs = dict((k, v) for k, v in conn_kwargs.iteritems() if v)

        return conn_kwargs
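
For reference, this pipeline is driven by two project settings; a minimal settings.py sketch, with the pipeline path and URL taken from the log output above (the priority number 800 is an arbitrary choice):

# settings.py -- minimal sketch; adjust credentials to your own setup
ITEM_PIPELINES = {
    'properties.pipelines.MysqlWriter': 800,  # arbitrary priority
}
MYSQL_PIPELINE_URL = 'mysql://root:password@localhost:3306/cat'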

My spider:

from scrapy.loader.processors import MapCompose, Join
from scrapy.loader import ItemLoader
from properties.items import PropertiesItem
import datetime
import urlparse
import socket
import scrapy


class BasicSpider(scrapy.Spider):
    name = "basic"
    allowed_domains = ["web"]

    # Start on a property page
    #start_urls = [i.strip() for i in open('urls.txt').readlines()]
    start_urls = ('http://www.nytimes.com',)

    def parse(self, response):
        """ This function parses a property page.
        @url http://web:9312/properties/property_000000.html
        @returns items 1
        @scrapes title price description address image_urls
        @scrapes url project spider server date
        """
        # Create the loader using the response
        l = ItemLoader(item=PropertiesItem(), response=response)

        # Load fields using XPath expressions
        l.add_xpath('words', '//p/text()',
                    MapCompose(unicode.strip, unicode.title))

        # Housekeeping fields
        l.add_value('url', response.url)
        # l.add_value('project', self.settings.get('BOT_NAME'))
        #l.add_value('spider', self.name)
        #l.add_value('server', socket.gethostname())
        #l.add_value('date', datetime.datetime.now())

        return l.load_item()

1 Answer:

Answer 0 (score: 0)

I figured out what was wrong: I was trying to insert a list into a single MySQL column. My old code:

sql = """REPLACE INTO text2 (url, text)
    VALUES (%s,%s)"""

    args = (
        item["url"],
        item["words"],

My new code:

sql = """REPLACE INTO text2 (url, text)
    VALUES (%s,%s)"""

    args = (
        item["url"],
        str(item["words"]),
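
One caveat with this fix: item["words"] is a list here (the ItemLoader collects every matched //p/text() node), so str() stores the Python representation of the whole list in the text column, e.g. "[u'First Paragraph', u'Second Paragraph']". If you want plain text instead, joining the extracted strings is a common alternative; a minimal sketch:

args = (
    item["url"],
    u" ".join(item["words"]),  # flatten the list of <p> texts into one string
)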