Reading a large dataset from a MySQL database with SQLAlchemy and inserting it into a Postgres database without running into memory problems

Date: 2018-01-19 12:08:53

Tags: python postgresql sqlalchemy

I have a table with 10 million rows in a MySQL database that I need to read, run some validation checks on on my client machine, and load into a table in a Postgres database. I can successfully get the data onto my machine, but I run into out-of-memory problems when trying to process the data and load it into the Postgres database.

Is there a way to process the data in memory with an iterator and insert it into Postgres in chunks?

This is the code I have at the moment:

from sqlalchemy import create_engine, MetaData, Table

# MySQL database connection
source_engine = create_engine('mysql+pymysql://user:pwd@serveraddress:3306/dbname')
source_connection = source_engine.connect()

# Read the entire data
data = source_connection.execute('SELECT * FROM table')

# close the MySQL connection
source_connection.close()

# function to transform data
def transform(data):
    def process_row(row):
        """do data validation on the row"""
        return row

    # process and return the incoming dataset as a list of dicts
    processed_data = [dict(zip(data.keys(), process_row(d))) for d in data]
    return processed_data

transformed_data = transform(data)

# Postgres database connection
dest_engine = create_engine('postgresql://user:pwd@serveraddress:5432/dbname')
dest_connection = dest_engine.connect()
dest_meta = MetaData(bind=dest_connection, reflect=True, schema='test')

table = Table('table_name', dest_meta, autoload=True)
dest_connection.execute(table.insert().values(transformed_data))

dest_connection.close()

Can anyone suggest a simple way to do this?

1 Answer:

Answer 0 (score: 2)

You are on the right track! I ran into the same problem with code I was working on a couple of weeks ago.

One way to achieve what you need and avoid memory issues is to do the reading inside a function that loops over the query and ends with a yield. That keeps memory usage low and lets the job run in chunks. The downside is that it takes longer to execute, but you will definitely save a lot of computer horsepower. I don't have much information about your data, but the code would look something like this:

from sqlalchemy import create_engine, MetaData, Table

# MySQL database connection
source_engine = create_engine('mysql+pymysql://user:pwd@serveraddress:3306/dbname')
source_connection = source_engine.connect()

# Read the data in batches
def read_data():
    '''reads all the data and yields it batch by batch to save memory'''
    data = source_connection.execute('SELECT * FROM table')
    batch_counter = 0
    batch_of_rows = []
    for row in data:
        batch_of_rows.append(row)
        batch_counter = batch_counter + 1
        if batch_counter == 5000:  # set this to the batch size that optimizes your code for memory and time of execution
            yield batch_of_rows
            batch_counter = 0
            batch_of_rows = []
    if batch_of_rows:
        # yield the last, partially filled batch as well
        yield batch_of_rows

# function to transform data
def transform(data):
    def process_row(row):
        """do data validation on the row"""
        return row

    # process and return the incoming batch as a list of dicts
    processed_data = [dict(zip(row.keys(), process_row(row))) for row in data]
    return processed_data


# Postgres database connection
dest_engine = create_engine('postgresql://user:pwd@serveraddress:5432/dbname')
dest_connection = dest_engine.connect()
dest_meta = MetaData(bind=dest_connection, reflect=True, schema='test')

table = Table('table_name', dest_meta, autoload=True)
for data_batch in read_data():
    transformed_data = transform(data_batch)
    dest_connection.execute(table.insert().values(transformed_data))

# close the connections once all batches have been processed
source_connection.close()
dest_connection.close()

I think this will solve your memory problem.

Note: if you need some extra clarification about yield, have a look at this stackoverflow question
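
As a quick illustration of the yield pattern used above, here is a minimal, self-contained sketch (the function name, batch size and sample data are made up for illustration) showing that a generator hands over one batch at a time instead of building the whole list in memory:

def batches(rows, batch_size=3):
    '''Collect items from any iterable and yield them in fixed-size batches.'''
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch      # hand one batch to the caller, then resume here
            batch = []
    if batch:                # don't forget the last, partially filled batch
        yield batch

# prints [0, 1, 2], [3, 4, 5], [6, 7, 8], [9] one batch at a time
for batch in batches(range(10)):
    print(batch)

The caller only ever holds one batch in memory, which is exactly what the read_data() loop above relies on.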