We have a PySpark script that reads data from a Hive table and writes it to Postgres; the core part is pasted below. It works fine most of the time, but occasionally we end up with duplicate records in the target Postgres table.
Any ideas what might be causing the duplicates?
import csv
from io import StringIO
import psycopg2

def write_data(partition):
    # Buffer the partition's rows as tab-delimited CSV in memory
    output = StringIO()
    writer = csv.writer(output, delimiter='\t')
    writer.writerows(partition)
    output.seek(0)

    # Bulk-load the buffer into Postgres via COPY
    conn = psycopg2.connect(connection_url)
    cur = conn.cursor()
    query = "COPY {} FROM STDIN WITH (FORMAT CSV, DELIMITER E'\t')".format(table_name)
    cur.copy_expert(query, output)
    conn.commit()
    cur.close()
    conn.close()

df = read_hive_table(table_name)
df.foreachPartition(write_data)
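One theory we are looking into: Spark may re-run write_data for a partition on task failure or speculative execution, and since each partition commits its own transaction, a retry after a successful commit would insert the rows a second time. To guard against that we have been sketching an idempotent write: COPY into a temporary staging table, then INSERT ... ON CONFLICT DO NOTHING into the real table. This is only a sketch, assuming the target table has a unique key (the column name "id", the staging-table name, and the helper function below are all hypothetical):

```python
def idempotent_copy_statements(table, pk="id"):
    # Hypothetical helper: builds the three statements for an
    # idempotent partition write. "pk" is the unique/primary-key
    # column that deduplicates retried rows.
    staging = "staging_{}".format(table)
    return [
        # 1. Per-session staging table, dropped automatically on commit
        "CREATE TEMP TABLE {} (LIKE {} INCLUDING DEFAULTS) ON COMMIT DROP".format(staging, table),
        # 2. Bulk-load the partition into the staging table
        "COPY {} FROM STDIN WITH (FORMAT CSV, DELIMITER E'\\t')".format(staging),
        # 3. Move rows over, silently skipping any key already present
        "INSERT INTO {} SELECT * FROM {} ON CONFLICT ({}) DO NOTHING".format(table, staging, pk),
    ]
```

The idea would be to run statement 1, feed the CSV buffer to statement 2 via cur.copy_expert, run statement 3, and only then commit, so a retried task either skips existing keys or rolls back entirely. Does this reasoning about task retries sound right, or is there another likely source of duplicates?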