Question

My current ETL architecture is as follows:

s3 --> staging table --> python dataframe --> destination table

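Records in S3 are loaded into the staging table
A Python script connects to the staging table
The Python script runs every hour to perform some complex transformations
The resulting Python DataFrame is uploaded to the destination table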

However, I am running into duplicate records in the destination table:

| Time | New records (S3) | Redshift staging table (Postgres) | Python DataFrame | Redshift destination table (Postgres) | Duplicate records |
|------|------------------|-----------------------------------|------------------|---------------------------------------|-------------------|
| 9am  | 3 records        | 3 records                         | 3 records        | 3 records                             | 0 (3-3)           |
| 10am | 2 records        | 5 (3+2) records                   | 5 records        | 8 (3+5) records                       | 3 (8-5)           |
| 11am | 4 records        | 9 (5+4) records                   | 9 records        | 17 (9+8) records                      | 8 (17-9)          |

So at 11am the staging table has 9 records, but the destination table has 17 (8 of the destination rows at 11am are duplicates).
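The counts in the table follow from a job that never clears the staging table and appends the whole DataFrame on every run. A minimal simulation of that arithmetic, assuming this is what the script does (the variable names are illustrative and not taken from the actual script):

```python
# Simulate the hourly job: staging accumulates, and each run appends
# the entire staging contents to the destination.
new_per_hour = {"9am": 3, "10am": 2, "11am": 4}  # new S3 records, from the table

staging = 0
destination = 0
history = []
for hour, new_records in new_per_hour.items():
    staging += new_records       # S3 -> staging (staging is never cleared)
    dataframe = staging          # the script reads every staging row
    destination += dataframe     # the full DataFrame is appended to destination
    duplicates = destination - staging
    history.append((hour, staging, destination, duplicates))
    print(hour, staging, destination, duplicates)
```

The last run ends with 9 staging rows, 17 destination rows, and 8 duplicates, matching the 11am row of the table.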

How can I make sure the total number of records in the destination table matches the number in the staging table?

(I cannot eliminate the staging table. For now, I am filtering the destination table to select only unique records. Is there a better way?)

Answer 1

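Your staging and destination tables are both in Postgres, so you can simply write logic that compares the data in the staging table with the dest table and deletes from staging every record that already exists in dest, so that only new records are left to transform and load.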

DELETE FROM staging
WHERE EXISTS(SELECT 1 FROM dest WHERE dest.id = staging.id);

https://www.postgresql.org/docs/8.1/static/functions-subquery.html
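To see how this fits into the hourly job, here is a small self-contained sketch. It makes several assumptions: an in-memory SQLite database stands in for the Postgres/Redshift connection, `id` uniquely identifies a record, the `payload` column and the `run_hourly_job` helper are hypothetical, and a plain `INSERT ... SELECT` stands in for the DataFrame transformation and upload.

```python
import sqlite3

# In-memory SQLite as a stand-in for the real Postgres/Redshift connection.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE staging (id INTEGER, payload TEXT)")
cur.execute("CREATE TABLE dest (id INTEGER, payload TEXT)")


def run_hourly_job(new_rows):
    """Load new rows into staging, drop those dest already has, then append."""
    cur.executemany("INSERT INTO staging VALUES (?, ?)", new_rows)
    # The statement from the answer: remove staging rows already present in dest.
    cur.execute(
        "DELETE FROM staging "
        "WHERE EXISTS (SELECT 1 FROM dest WHERE dest.id = staging.id)"
    )
    # Placeholder for the DataFrame transformation and upload step.
    cur.execute("INSERT INTO dest SELECT id, payload FROM staging")
    conn.commit()
    return cur.execute("SELECT COUNT(*) FROM dest").fetchone()[0]


# Replay the three hourly batches from the question (3, 2, and 4 new records).
counts = [
    run_hourly_job([(1, "a"), (2, "b"), (3, "c")]),
    run_hourly_job([(4, "d"), (5, "e")]),
    run_hourly_job([(6, "f"), (7, "g"), (8, "h"), (9, "i")]),
]
print(counts)
```

With these batches the destination grows to 3, 5, and then 9 rows, with no duplicates. Note that under this approach the staging table keeps only the most recently loaded batch between runs, so it no longer holds the full history.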
