Question

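In an ETL process, I want to periodically query database "A" (for example, all rows with a timestamp later than the program's last run) and move that data into database "B" for further processing. Both are PostgreSQL databases. I'd like to do this data transfer in a Python script, using SQLAlchemy to connect to both databases. What is the least messy, least fragile way to do this?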

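I know that Postgres's COPY TO and COPY FROM commands allow table rows and query results to be transferred from one database server to another via an intermediate file (see here). From the Unix command line, you can even pipe the output of database A as input to database B without a potentially large intermediate file (see excellent instructions here). What I'd like to know is how to do this last trick in a Python script using two SQLAlchemy connections, rather than using subprocess to run a shell command.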

import sqlalchemy
dbA = sqlalchemy.create_engine(connection_string_A)
dbB = sqlalchemy.create_engine(connection_string_B)

# how do I do this part?
dbA.execute('SELECT (column) FROM widgets...') # somehow pipe output into...
dbB.execute('INSERT INTO widgets (column) ...') # without holding lots of data in memory or on disk

For the record, I'm not using any of SQLAlchemy's ORM features at this point, just bare SQL queries.

Answer 1

You're asking two different things in your question. One is how to pipe CSV from a COPY TO into a COPY FROM; the other is how to pipe rows from a SELECT query into an INSERT.

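Piping rows from a SELECT query into an INSERT is something of a lie: while you can stream rows from a SELECT query, you can't stream rows into an INSERT, so you'll have to execute multiple INSERTs in batches. This approach has high overhead because of the INSERTs, but it has fewer problems with data loss than a round trip through CSV. I'll focus on why piping the CSV from COPY TO into COPY FROM is tricky, and how you can accomplish it.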
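The batched INSERT approach mentioned above is not the focus of this answer, but a minimal sketch might look like the following. It assumes plain DB-API cursors (for example, ones obtained from `engine.raw_connection()`); the helper name `copy_in_batches` and its parameters are hypothetical. With psycopg2, the source cursor would probably need to be a named (server-side) cursor, created with `conn.cursor(name=...)`, so that the full result set is not pulled into client memory at once.

```python
def copy_in_batches(src_cursor, dst_cursor, select_sql, insert_sql,
                    batch_size=1000):
    """Copy rows from one DB-API cursor to another in fixed-size batches.

    Hypothetical helper: insert_sql must use the destination driver's
    placeholder style (for example %s for psycopg2).
    Returns the number of rows copied.
    """
    src_cursor.execute(select_sql)
    total = 0
    while True:
        # Hold at most batch_size rows in memory at any time.
        rows = src_cursor.fetchmany(batch_size)
        if not rows:
            break
        # One executemany() call per batch instead of one INSERT per row.
        dst_cursor.executemany(insert_sql, rows)
        total += len(rows)
    return total
```

The caller is still responsible for committing on the destination connection once the copy has finished.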

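psycopg2 lets you run COPY commands through the (synchronous) copy_expert function. It requires that you pass in a readable file object for COPY FROM and a writable file object for COPY TO. To accomplish what you described, you need two separate threads, one running each of the two commands; a file object whose write() method blocks when the COPY FROM command can't keep up; and a file object whose read() method blocks when the COPY TO command can't keep up. This is a classic producer-consumer problem, which can be tricky to get right.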

Here is a pipe implementation that I wrote quickly (Python 3). It's probably full of bugs. Let me know if you find a deadlock (edits welcome).

from threading import Lock, Condition, Thread


class Output(object):
    def __init__(self, pipe):
        self.pipe = pipe

    def read(self, count):
        with self.pipe.lock:
            # wait until the pipe is closed or the buffer is not empty
            while not self.pipe.closed and len(self.pipe.buffer) == 0:
                self.pipe.empty_cond.wait()

            if len(self.pipe.buffer) == 0:
                return b""  # EOF; bytes, to match the bytes buffer

            # never return more than the caller asked for
            count = min(count, len(self.pipe.buffer))
            res, self.pipe.buffer = \
                self.pipe.buffer[:count], self.pipe.buffer[count:]
            self.pipe.full_cond.notify()
        return res

    def close(self):
        with self.pipe.lock:
            self.pipe.closed = True
            self.pipe.full_cond.notify()


class Input(object):
    def __init__(self, pipe):
        self.pipe = pipe

    def write(self, s):
        with self.pipe.lock:
            # wait until pipe is closed or buffer is not full
            while not self.pipe.closed \
                    and len(self.pipe.buffer) > self.pipe.bufsize:
                self.pipe.full_cond.wait()

            if self.pipe.closed:
                raise Exception("pipe closed")

            self.pipe.buffer += s
            self.pipe.empty_cond.notify()

    def close(self):
        with self.pipe.lock:
            self.pipe.closed = True
            self.pipe.empty_cond.notify()


class FilePipe(object):
    def __init__(self, bufsize=4096):
        self.buffer = b""
        self.bufsize = bufsize
        self.input = Input(self)
        self.output = Output(self)
        self.lock = Lock()
        self.full_cond = Condition(self.lock)
        self.empty_cond = Condition(self.lock)
        self.closed = False
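
A usage sketch for the pipe above, under some assumptions: `copy_through_pipe` is a hypothetical helper name, the two connections are raw psycopg2 connections (for example from `engine.raw_connection()`), and `pipe` is any object exposing `input` and `output` file-like ends, such as a `FilePipe` instance. One thread runs COPY TO into the writable end while the calling thread runs COPY FROM out of the readable end.

```python
from threading import Thread


def copy_through_pipe(src_conn, dst_conn, pipe, copy_to_sql, copy_from_sql):
    """Run COPY TO on src_conn and COPY FROM on dst_conn concurrently.

    Hypothetical helper: `pipe` needs a writable `pipe.input` and a
    readable `pipe.output`, both with a close() method.
    """
    def producer():
        try:
            # COPY ... TO STDOUT writes its rows into the pipe.
            src_conn.cursor().copy_expert(copy_to_sql, pipe.input)
        finally:
            # Closing the writable end signals EOF to the reader.
            pipe.input.close()

    thread = Thread(target=producer)
    thread.start()
    try:
        # COPY ... FROM STDIN reads rows from the pipe until EOF.
        dst_conn.cursor().copy_expert(copy_from_sql, pipe.output)
        dst_conn.commit()
    finally:
        # Unblock the producer if the consumer stopped early, then wait.
        pipe.output.close()
        thread.join()
```

A call might look like `copy_through_pipe(dbA.raw_connection(), dbB.raw_connection(), FilePipe(), "COPY (SELECT ...) TO STDOUT", "COPY widgets FROM STDIN")`. Note that an exception raised inside the producer thread is not propagated to the caller in this sketch, so a failed COPY TO could leave a partial copy committed on the destination.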

Answer 2

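If the data is not very large (it can be held in the main memory of a single host), you can try my open-source ETL toolkit based on pandas / python3 / sqlalchemy, bailaohe/parade. You can use pandas to transform the data and return the resulting dataframe directly. With a little configuration, the pandas dataframe can be dumped to different target connections.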

For your problem, you can use parade to generate a simple SQL-type task like the following:

# -*- coding:utf-8 -*-
from parade.core.task import SqlETLTask
from parade.type import stdtypes


class CopyPostgres(SqlETLTask):

    @property
    def target_conn(self):
        """
        the target connection to write the result
        :return:
        """
        return 'target_postgres'

    @property
    def source_conn(self):
        """
        the source connection to read the data from
        :return:
        """
        return 'source_postgres'

    @property
    def etl_sql(self):
        """
        the single sql statement to process etl
        :return:
        """
        return """SELECT (column) FROM widgets"""

You can even compose a DAG workflow from multiple tasks and schedule the workflow directly with parade. Hope this helps.

How can I use SQLAlchemy to transfer data directly from one PostgreSQL database to another?