Question

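My table has over 65 million rows and 140 columns. The data comes from several sources and is submitted at least once a month.

I am looking for a fast way to fetch specific fields from this data, but only where they are unique. I want to process all the information to link which invoice was sent with which identification numbers, and by whom. The problem is that I don't want to iterate over 65 million records. If I can get the distinct values, I will only have to process about 5 million records instead of 65 million. See below for a description of the data and an SQL Fiddle with a sample.

Say a client submits an invoice_number linked to passport_number_1, national_identity_number_1 and driving_license_1 every month. I want only one row for this, i.e. the 4 fields have to be unique.

If they submit the above for 30 months, and then in the 31st month they send the invoice_number linked to passport_number_1, national_identity_number_2 and driving_license_1, I want to pick this row as well, because the national_identity field is new and so the whole row is unique.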

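By linked to I mean that they appear in the same row.
Any of the fields can be NULL at some point.
The 'pivot/composite' columns are invoice_number and submitted_by. If either of them is missing, drop the row.
I also need to include the database_id in the data above, i.e. the primary_id that is auto-generated by the PostgreSQL database.
The only fields that do not need to be returned are other_column and yet_another_column. Remember that the table has 140 columns, so those are not needed.
With the results, create a new table that will hold these unique records.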

See this SQL Fiddle for an attempt to recreate the scenario.

From that fiddle, I expect a result like this:

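Rows 1, 2 and 11: only one of them should be kept, because they are exactly the same. Preferably the row with the smallest id.
Rows 4 and 9: one of them will be dropped, because they are exactly the same.
Rows 5, 7 and 8: these will be dropped, because they are missing either the invoice_number or submitted_by.
The result will then have rows (1, 2 or 11), 3, (4 or 9), 6 and 10.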

Answer 1

To get one representative row (with additional fields) from each group that shares the same four distinct fields:

SELECT 
distinct on (
  invoice_number
  , passport_number
  , national_id_number
  , driving_license_number
)
  * -- specify the columns you want here
FROM my_table
where invoice_number is not null
and submitted_by is not null
;

Note that unless you specify an ordering, it is not predictable exactly which row of each group is returned (see the documentation on distinct).
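As a minimal sketch of what specifying an ordering could look like: PostgreSQL requires the ORDER BY to begin with the same expressions as DISTINCT ON, and any further sort key decides which row of each group is kept. Appending id should therefore keep the row with the smallest id per group. The explicit column list is an assumption based on the names used in the query above; adjust it to the columns you actually need.

```sql
-- Sketch: keep the row with the smallest id in each group.
-- The ORDER BY must start with the DISTINCT ON expressions;
-- the trailing "id" picks the first (smallest) id per group.
SELECT DISTINCT ON (
         invoice_number
       , passport_number
       , national_id_number
       , driving_license_number
       )
       id
     , invoice_number
     , passport_number
     , national_id_number
     , driving_license_number
     , submitted_by
FROM my_table
WHERE invoice_number IS NOT NULL
  AND submitted_by IS NOT NULL
ORDER BY invoice_number
       , passport_number
       , national_id_number
       , driving_license_number
       , id;
```

The rows come back sorted by the four grouping columns first, not by id.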

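Edit

To order this result by id, you cannot simply add order by id to the end (with distinct on, the order by has to start with the distinct on expressions), but it can be done either by using a CTE: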

with distinct_rows as (
    SELECT
    distinct on (
      invoice_number
      , passport_number
      , national_id_number
      , driving_license_number
      -- ...
    )
      * -- specify the columns you want here
    FROM my_table
    where invoice_number is not null
    and submitted_by is not null
)
select *
from distinct_rows
order by id;

or by putting the original query in a subquery:

select *
from (
    SELECT
    distinct on (
      invoice_number
      , passport_number
      , national_id_number
      , driving_license_number
      -- ...
    )
      * -- specify the columns you want here
    FROM my_table
    where invoice_number is not null
    and submitted_by is not null
) t
order by id;
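The question also asks for the unique records to be stored in a new table. One possible way to do that, sketched here under assumptions, is CREATE TABLE ... AS around the same distinct on query. The table name unique_invoice_links is hypothetical, and the column list is only a guess at what is needed out of the 140 columns.

```sql
-- Sketch: materialize one row per distinct combination into a new table.
-- "unique_invoice_links" is a made-up name; pick your own.
CREATE TABLE unique_invoice_links AS
SELECT DISTINCT ON (
         invoice_number
       , passport_number
       , national_id_number
       , driving_license_number
       )
       id
     , invoice_number
     , passport_number
     , national_id_number
     , driving_license_number
     , submitted_by
FROM my_table
WHERE invoice_number IS NOT NULL
  AND submitted_by IS NOT NULL
ORDER BY invoice_number
       , passport_number
       , national_id_number
       , driving_license_number
       , id;  -- keeps the smallest id of each group
```

The new table carries no indexes or constraints from my_table, so a primary key on id would have to be added separately if it is wanted.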

Answer 2

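a fast way to get specific fields from this data

I don't think so. I think you mean that you want to select distinct rows from a table in which the rows are not unique.

As far as I understand, you just need: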

SELECT distinct invoice_number, passport_number, 
                driving_license_number, national_id_number
FROM my_table
where invoice_number is not null
and submitted_by is not null;

On your SQLFiddle example, it produces 5 rows.
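If the id is also needed, as the question requests, one option that stays close to this query is to group by the same four columns and take min(id). This is only a sketch and not part of the query above; the alias smallest_id is made up for illustration. GROUP BY treats NULLs as equal in the same way that distinct does, so it should produce the same groups.

```sql
-- Sketch: same groups as the SELECT DISTINCT above,
-- plus the smallest id found in each group.
SELECT min(id) AS smallest_id
     , invoice_number
     , passport_number
     , driving_license_number
     , national_id_number
FROM my_table
WHERE invoice_number IS NOT NULL
  AND submitted_by IS NOT NULL
GROUP BY invoice_number
       , passport_number
       , driving_license_number
       , national_id_number
ORDER BY smallest_id;
```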

Get distinct information across many fields, some of which are NULL
