Question

I have a Spark job that processes data very quickly, but it is slow when it tries to write the results to a PostgreSQL database. Here is most of the relevant code:

import psycopg2
import psycopg2.extras  # needed for psycopg2.extras.DictCursor below

def save_df_to_db(records):
    # each item in records is a dictionary with 'url', 'tag', 'value' as keys
    db_conn = psycopg2.connect(connect_string)
    db_conn.autocommit = True
    cur = db_conn.cursor(cursor_factory=psycopg2.extras.DictCursor)
    upsert_query = """INSERT INTO mytable (url, tag, value)
                      VALUES (%(url)s, %(tag)s, %(value)s) ON CONFLICT (url, tag) DO UPDATE SET value = %(value)s"""

    try:
        cur.executemany(upsert_query, records)
    except Exception as e:
        print("Error in executing save_df_to_db: ", e)

data = [...] # initial data
rdd = sc.parallelize(data)
rdd = ... # Some simple RDD transforms...
rdd.foreachPartition(save_df_to_db)

The table also has a unique constraint on url + tag. I am looking for a way to speed up this code. Any suggestions or recommendations are welcome.

Answer 1

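I believe the main bottleneck is the combination of cursor.executemany and connection.autocommit. As explained in the official documentation of executemany:

"In its current implementation this method is not faster than executing execute() in a loop."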

Since you combine it with connection.autocommit, you effectively commit after every single insert.

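Psycopg provides fast execution helpers (execute_batch and execute_values in psycopg2.extras), which can be used to perform batched operations. It also makes more sense to handle commits manually.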
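As a rough illustration of the batched approach with a manual commit, here is a minimal sketch using execute_values, which is available from psycopg2 2.7 onwards. The connection string, the function names and the page size are assumptions made for the example, not part of the question. The sketch also removes duplicate (url, tag) keys first, because PostgreSQL rejects a single INSERT ... ON CONFLICT DO UPDATE statement that touches the same row twice.

```python
# Hypothetical DSN: the question's connect_string is not shown.
CONNECT_STRING = "dbname=mydb user=me"

# execute_values expands the single %s into a multi-row VALUES list.
UPSERT_SQL = """
    INSERT INTO mytable (url, tag, value) VALUES %s
    ON CONFLICT (url, tag) DO UPDATE SET value = EXCLUDED.value
"""


def dedupe_last(records):
    """Keep only the last record seen for each (url, tag) key."""
    latest = {}
    for rec in records:
        latest[(rec["url"], rec["tag"])] = rec
    return list(latest.values())


def save_partition(records, page_size=1000):
    """Upsert one Spark partition in batches, with a single commit."""
    # Imported here so the imports happen on the executor that runs the partition.
    import psycopg2
    from psycopg2.extras import execute_values

    rows = dedupe_last(records)  # note: materializes the partition in memory
    if not rows:
        return
    conn = psycopg2.connect(CONNECT_STRING)
    try:
        with conn:  # commits once on success, rolls back on an exception
            with conn.cursor() as cur:
                execute_values(
                    cur,
                    UPSERT_SQL,
                    rows,
                    template="(%(url)s, %(tag)s, %(value)s)",
                    page_size=page_size,
                )
    finally:
        conn.close()
```

It would be called as rdd.foreachPartition(save_partition). Whether this is fast enough depends on the server, so the page size is worth measuring rather than taking as given.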

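It is also possible that you are being throttled by the database server because of a large number of concurrent writes and index updates. Normally I would recommend writing to disk and performing a bulk import with COPY, but that is not guaranteed to help here.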
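COPY cannot upsert on its own, so one common way to combine it with the unique constraint is to COPY into a temporary staging table and then upsert from there. The following is only a sketch of that idea: the staging table name, the connection string and the function names are hypothetical, and it assumes that any other columns of mytable are nullable or have defaults.

```python
import csv
import io

# Hypothetical DSN: the question's connect_string is not shown.
CONNECT_STRING = "dbname=mydb user=me"


def records_to_csv(records):
    """Serialize records into an in-memory CSV buffer for COPY.

    None is written as an empty field, which COPY's csv format reads as NULL.
    Empty strings are also written as empty fields, so they become NULL too.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for rec in records:
        writer.writerow([rec["url"], rec["tag"], rec["value"]])
    buf.seek(0)
    return buf


def copy_upsert_partition(records):
    """COPY one partition into a temp table, then upsert into mytable."""
    import psycopg2  # imported on the executor that runs the partition

    conn = psycopg2.connect(CONNECT_STRING)
    try:
        with conn:  # one transaction; the temp table is dropped at commit
            with conn.cursor() as cur:
                cur.execute(
                    "CREATE TEMP TABLE staging "
                    "(LIKE mytable INCLUDING DEFAULTS) ON COMMIT DROP"
                )
                cur.copy_expert(
                    "COPY staging (url, tag, value) FROM STDIN WITH (FORMAT csv)",
                    records_to_csv(records),
                )
                # DISTINCT ON keeps an arbitrary row per key, so the upsert
                # never touches the same target row twice.
                cur.execute(
                    """
                    INSERT INTO mytable (url, tag, value)
                    SELECT DISTINCT ON (url, tag) url, tag, value FROM staging
                    ON CONFLICT (url, tag) DO UPDATE SET value = EXCLUDED.value
                    """
                )
    finally:
        conn.close()
```

As the answer says, this is not guaranteed to help: the index on (url, tag) still has to be maintained for every upserted row.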

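Since you use mutable records without timestamps, you cannot simply drop the index and recreate it after the import as another way to boost performance.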

Answer 2

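Thanks for the responses. Since the psycopg2 version I am using does not support batch execution, I had to rely on a slightly different approach using the copy command. I wrote a small function that helped bring the save time down from 20 minutes to about 30 seconds. Here is the function. It takes a pandas dataframe as input and writes it to a table (via the cursor):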

import io
import pandas as pd

def write_dataframe_to_table(cursor, table, dataframe, batch_size=100, null='None'):
    """
    Write a pandas dataframe into a postgres table.
    It only works if the table columns have the same name as the dataframe columns.
    :param cursor: the psycopg2 cursor object
    :param table: the table name
    :param dataframe: the dataframe
    :param batch_size: batch size
    :param null: textual representation of NULL in the file. The default is the string None.
    """
    for i in range(0, len(dataframe), batch_size):
        chunk_df = dataframe[i: batch_size + i]
        # One tab-separated line per row; str() turns None into 'None', hence the null default.
        content = "\n".join(chunk_df.apply(lambda x: "\t".join(map(str, x)), axis=1))
        # copy_from streams the chunk to the server with COPY ... FROM STDIN.
        cursor.copy_from(io.StringIO(content), table, columns=list(chunk_df.columns), null=null)
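One caveat with joining raw str() values: in COPY's text format, a tab, newline or backslash inside a value would be read as a delimiter or an escape. A possible way to guard against that is sketched below. The helper names are hypothetical and not part of the answer, and NULL is written as \N, which is copy_from's default null marker, so the null argument would no longer need to be 'None'.

```python
def copy_escape(value, null="\\N"):
    """Render one value for PostgreSQL's COPY text format."""
    if value is None:
        return null
    text = str(value)
    # Escape backslashes first so the escapes added below are not doubled.
    text = text.replace("\\", "\\\\")
    return text.replace("\t", "\\t").replace("\n", "\\n").replace("\r", "\\r")


def rows_to_copy_text(rows):
    """Join rows (iterables of values) into a tab-separated COPY payload."""
    return "\n".join("\t".join(copy_escape(v) for v in row) for row in rows)
```

The payload could then be wrapped in io.StringIO and passed to cursor.copy_from with its default null setting.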

Writing results to the database with pyspark + psycopg2 is slow
