Although my post is similar to this one, I still believe it is a different question.
I have 2 CSV files.
File A
+-----------+---------+
| Name      | Country |
+-----------+---------+
| Ferrari   | Italy   |
| Mercedes  | Germany |
| Jaguar    | British |
| Nissan    | Japan   |
| Chevrolet | USA     |
+-----------+---------+
File B
+-----------+---------+
| Name      | Country |
+-----------+---------+
| Jaguar    | British |
| Chevrolet | America |
| Bugatti   | Italy   |
| Tata      | India   |
| Nissan    | Japan   |
+-----------+---------+
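The above is only for illustration. In practice both files have many more rows and columns, but their structure is identical.
I was asked to do an efficient row-level comparison across all columns. So, rather than doing it programmatically, I proposed using HSQLDB: define a CREATE TEXT TABLE for each file, attach the file with SET SOURCE, and then run an EXCEPT operation between the two tables. I coded it, and it works like a charm. Below is the part of the SQL I wrote to achieve this.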
CREATE TABLE COMPARE_TABLE AS (SELECT SRC.*, 'SRC-TGT' compare_order FROM TABLEA SRC EXCEPT SELECT TGT.*, 'SRC-TGT' compare_order FROM TABLEB TGT) WITH DATA;
INSERT INTO COMPARE_TABLE SELECT TGT.*, 'TGT-SRC' compare_order FROM TABLEB TGT EXCEPT SELECT SRC.*, 'TGT-SRC' compare_order FROM TABLEA SRC;
This gives me a result table like the following (for the sample data above):
COMPARE_TABLE
+------------+----------+---------------+
| Name | Country | Compare_order |
+------------+----------+---------------+
| Ferrari | Italy | SRC-TGT |
| Mercedes | Germany | SRC-TGT |
| Chevrolet | USA | SRC-TGT |
| Chevrolet | America | TGT-SRC |
| Bugatti | Italy | TGT-SRC |
| Tata | India | TGT-SRC |
+------------+----------+---------------+
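For anyone who wants to reproduce the table above without an HSQLDB setup, here is a minimal sketch of the same two-way EXCEPT on the sample data. It uses Python's built-in sqlite3 module as a stand-in for HSQLDB (an assumption made only so the example runs anywhere); the function name `build_compare_rows` is hypothetical, and the two queries are combined with UNION ALL instead of CREATE TABLE plus INSERT.

```python
import sqlite3

# Sample rows from File A and File B in the question.
FILE_A = [("Ferrari", "Italy"), ("Mercedes", "Germany"), ("Jaguar", "British"),
          ("Nissan", "Japan"), ("Chevrolet", "USA")]
FILE_B = [("Jaguar", "British"), ("Chevrolet", "America"), ("Bugatti", "Italy"),
          ("Tata", "India"), ("Nissan", "Japan")]


def build_compare_rows(rows_a=FILE_A, rows_b=FILE_B):
    """Return rows that differ between the two inputs, tagged with the direction."""
    # SQLite in-memory database stands in for the HSQLDB text tables (assumption).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tablea (name TEXT, country TEXT)")
    conn.execute("CREATE TABLE tableb (name TEXT, country TEXT)")
    conn.executemany("INSERT INTO tablea VALUES (?, ?)", rows_a)
    conn.executemany("INSERT INTO tableb VALUES (?, ?)", rows_b)
    rows = conn.execute("""
        SELECT name, country, compare_order FROM (
            SELECT name, country, 'SRC-TGT' AS compare_order FROM tablea
            EXCEPT
            SELECT name, country, 'SRC-TGT' FROM tableb
        )
        UNION ALL
        SELECT name, country, compare_order FROM (
            SELECT name, country, 'TGT-SRC' AS compare_order FROM tableb
            EXCEPT
            SELECT name, country, 'TGT-SRC' FROM tablea
        )
    """).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in build_compare_rows():
        print(row)
```

The row order is not guaranteed, since there is no ORDER BY; only the set of rows should be compared against the table above.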
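From here, I need to determine why each row failed to match. At a minimum, I'd like to sort the rows into three categories: new at source, new at target, and a value mismatch in a column.
In the end, I want my table to look like this: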
COMPARE_TABLE
+------------+----------+---------------+------------------------+
| Name | Country | Compare_order | Failure_Reason |
+------------+----------+---------------+------------------------+
| Ferrari | Italy | SRC-TGT | New at Source |
| Mercedes | Germany | SRC-TGT | New at Source |
| Chevrolet | USA | SRC-TGT | Country value mismatch |
| Chevrolet | America | TGT-SRC | Country value mismatch |
| Bugatti | Italy | TGT-SRC | New at Target |
| Tata | India | TGT-SRC | New at Target |
+------------+----------+---------------+------------------------+
How can I do this? Is it even possible in SQL?
Any help is much appreciated.
Answer 0 (score: 0)
You can do a basic comparison:
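select name, country,
       -- each group has at least one row, so "no tgt rows" means src only;
       -- rows present in both tables get NULL
       (case when sum(case when which = 'tgt' then 1 else 0 end) = 0 then 'DROPPED'
             when sum(case when which = 'src' then 1 else 0 end) = 0 then 'NEW'
        end) as OP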
from ((select 'src' as which, name, country
from tableA
) union all
(select 'tgt', name, country
from tableB
)
) ab
group by name, country;
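But this does not give you a column-by-column comparison. That is a bit harder. I am assuming that Name is unique, so it can be used as the key. The following does the comparison, but produces one row per name: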
select names.name,
       (case when src.country is null then tgt.country
             when tgt.country is null then src.country
             when src.country = tgt.country then src.country
             else src.country || '-->' || tgt.country
        end) as country,
       (case when src.country is null then 'new'
             when tgt.country is null then 'dropped'
             when src.country = tgt.country then 'same'
             else 'changed'
        end) as country_status
from (select name from TableA union select name from TableB
) names left outer join
TableA src
on names.name = src.name left outer join
TableB tgt
on names.name = tgt.name;
Getting multiple rows per name when a column value changes seems harder, although that is also possible.
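One possible way to get the multi-row layout with a Failure_Reason column is sketched below. It keeps the assumption that Name is a unique key: a differing row whose name is missing from the other table is "new", and any other differing row is treated as a value mismatch. The sketch runs on Python's built-in sqlite3 module as a stand-in for HSQLDB (an assumption for portability; the NOT EXISTS and CASE constructs are standard SQL), and the function name `classify_mismatches` is hypothetical.

```python
import sqlite3

# Sample rows from the question: source (File A) and target (File B).
SOURCE_ROWS = [("Ferrari", "Italy"), ("Mercedes", "Germany"), ("Jaguar", "British"),
               ("Nissan", "Japan"), ("Chevrolet", "USA")]
TARGET_ROWS = [("Jaguar", "British"), ("Chevrolet", "America"), ("Bugatti", "Italy"),
               ("Tata", "India"), ("Nissan", "Japan")]


def classify_mismatches(source_rows, target_rows):
    """Return (name, country, compare_order, failure_reason) for each differing row."""
    # SQLite in-memory database stands in for HSQLDB (assumption).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tablea (name TEXT, country TEXT)")
    conn.execute("CREATE TABLE tableb (name TEXT, country TEXT)")
    conn.executemany("INSERT INTO tablea VALUES (?, ?)", source_rows)
    conn.executemany("INSERT INTO tableb VALUES (?, ?)", target_rows)
    # Same two-way EXCEPT as in the question.
    conn.executescript("""
        CREATE TABLE compare_table AS
            SELECT name, country, 'SRC-TGT' AS compare_order FROM tablea
            EXCEPT
            SELECT name, country, 'SRC-TGT' FROM tableb;
        INSERT INTO compare_table
            SELECT name, country, 'TGT-SRC' FROM tableb
            EXCEPT
            SELECT name, country, 'TGT-SRC' FROM tablea;
    """)
    # A differing row is "new" when its key is absent on the other side;
    # otherwise the key exists on both sides and some column value differs.
    rows = conn.execute("""
        SELECT c.name, c.country, c.compare_order,
               CASE
                 WHEN c.compare_order = 'SRC-TGT'
                      AND NOT EXISTS (SELECT 1 FROM tableb b WHERE b.name = c.name)
                   THEN 'New at Source'
                 WHEN c.compare_order = 'TGT-SRC'
                      AND NOT EXISTS (SELECT 1 FROM tablea a WHERE a.name = c.name)
                   THEN 'New at Target'
                 ELSE 'Country value mismatch'
               END AS failure_reason
        FROM compare_table c
        ORDER BY c.compare_order, c.name
    """).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in classify_mismatches(SOURCE_ROWS, TARGET_ROWS):
        print(row)
```

With only two columns, any mismatch on an existing name has to be in Country, so the ELSE branch can name it directly. With more columns, the ELSE branch would need to join back to the other table on Name and compare each column to say which one differs.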