Question

What I have:

data_source_1 table
data_source_2 table
data_sources_view view

About the tables:

`data_source_1`

No duplicates:

db=# select count(*) from (select distinct * from data_source_1);
count 
--------
543243
(1 row)

db=# select count(*) from (select * from data_source_1);
count 
--------
543243
(1 row)

`data_source_2`

No duplicates:

db=# select count(*) from (select * from data_source_2);
count 
-------
5304
(1 row)

db=# select count(*) from (select distinct * from data_source_2);
count 
-------
5304
(1 row)

`data_sources_view`

Has duplicates:

db=# select count(*) from (select distinct * from data_sources_view);
count 
--------
538714
(1 row)

db=# select count(*) from (select * from data_sources_view);
count 
--------
548547
(1 row)

The view is simple:

CREATE VIEW data_sources_view
AS SELECT * 
FROM (
      (
       SELECT a, b, 'data_source_1' as source
       FROM data_source_1
      )
      UNION ALL 
      ( 
       SELECT a, b, 'data_source_2' as source
       FROM data_source_2
      )
);

What I want to know:

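How can the view contain duplicates when the source tables have none and the 'data_source_x' as source column rules out overlapping data between the two tables?
How do I identify the duplicates?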

What I've tried:

db=# create table t1 as select * from data_sources_view;
SELECT
db=# 
db=# create table t2 as select distinct * from data_sources_view;
SELECT
db=# create table t3 as select * from t1 minus select * from t2;
SELECT
db=# select 't1' as table_name, count(*) from t1 UNION ALL
db-# select 't2' as table_name, count(*) from t2 UNION ALL
db-# select 't3' as table_name, count(*) from t3;
table_name | count 
------------+--------
t1 | 548547
t3 | 0
t2 | 538714
(3 rows)

Database:

Redshift (PostgreSQL)

Answer 1

The reason is that your data sources have more than two columns. If you run:

select count(*) from (select distinct a, b from data_source_1);

and

select count(*) from (select distinct a, b from data_source_2);

you should find that the results differ from the count(*) on the same tables.
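To make this concrete, here is a minimal sketch that reproduces the effect and then lists the duplicated rows with GROUP BY ... HAVING. It runs against an in-memory SQLite database as a stand-in for Redshift, and the third column `c` and the sample rows are assumptions made up for illustration; the real tables' extra columns are not shown in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical extra column c: whole rows are unique, but (a, b) repeats.
    CREATE TABLE data_source_1 (a INTEGER, b INTEGER, c TEXT);
    CREATE TABLE data_source_2 (a INTEGER, b INTEGER, c TEXT);
    INSERT INTO data_source_1 VALUES (1, 10, 'x'), (1, 10, 'y'), (2, 20, 'z');
    INSERT INTO data_source_2 VALUES (1, 10, 'x');

    -- Same shape as the view in the question: only a and b are projected.
    CREATE VIEW data_sources_view AS
        SELECT a, b, 'data_source_1' AS source FROM data_source_1
        UNION ALL
        SELECT a, b, 'data_source_2' AS source FROM data_source_2;
""")

def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]

total = scalar("SELECT count(*) FROM data_sources_view")
distinct_total = scalar(
    "SELECT count(*) FROM (SELECT DISTINCT * FROM data_sources_view) AS v"
)

# Rows that occur more than once in the view, with their multiplicity.
duplicates = conn.execute("""
    SELECT a, b, source, count(*) AS n
    FROM data_sources_view
    GROUP BY a, b, source
    HAVING count(*) > 1
""").fetchall()

print(total, distinct_total)  # 4 3
print(duplicates)             # [(1, 10, 'data_source_1', 2)]
```

The same GROUP BY a, b, source HAVING count(*) > 1 query should work unchanged against the real view, since it uses only standard SQL.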

Answer 2

UNION vs UNION ALL

UNION - if a row already exists in the top query, it is suppressed in the bottom query.

Output

FOO

UNION ALL - the row is duplicated because it exists in both tables (both records are shown).

Output

FOO
FOO
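The difference can be checked with a small sketch. It uses an in-memory SQLite database as a stand-in for Redshift (an assumption for the sake of a runnable example), and the literal 'FOO' rows simply mirror the outputs above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# UNION removes duplicate rows from the combined result.
union_rows = conn.execute(
    "SELECT 'FOO' AS val UNION SELECT 'FOO'"
).fetchall()

# UNION ALL keeps every row from both queries.
union_all_rows = conn.execute(
    "SELECT 'FOO' AS val UNION ALL SELECT 'FOO'"
).fetchall()

print(union_rows)      # [('FOO',)]
print(union_all_rows)  # [('FOO',), ('FOO',)]
```

Note that UNION de-duplicates the whole result, so it would also collapse rows that repeat within a single branch, not only rows shared between the two queries.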

Need help identifying duplicates in a table
