Safely normalizing data via a SQL query

Date: 2009-06-12 17:18:06

Tags: sql denormalization

Suppose I have a customers table:

CREATE TABLE customers (
    customer_number  INTEGER,
    customer_name    VARCHAR(...),
    customer_address VARCHAR(...)
)

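This table has no primary key. However, customer_name and customer_address should be unique for any given customer_number.

It is not uncommon for this table to contain many duplicate customers. To get around the duplication, the following query is used to isolate only the unique customers: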

SELECT
  DISTINCT customer_number, customer_name, customer_address
FROM customers

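Fortunately, the table has traditionally contained accurate data. That is, there has never been a conflicting customer_name or customer_address for any customer_number. However, suppose conflicting data did make it into the table. I would like to write a query that fails, rather than one that returns multiple rows for the customer_number in question.

For example, I tried this query without success: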

SELECT
  customer_number, DISTINCT(customer_name, customer_address)
FROM customers
GROUP BY customer_number

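Is there a way to write such a query using standard SQL? If not, is there a solution in Oracle-specific SQL?

EDIT: The rationale behind the bizarre query:

Truth be told, this customers table does not actually exist (thank goodness). I made it up hoping it would be clear enough to demonstrate the need for the query. However, people have (fortunately) pointed out that, judging by this example, the need for such a query is the least of my worries. So I must now peel away some of the abstraction and hopefully restore my reputation after suggesting such an abomination of a table...

I receive a flat file containing invoices (one per line) from an external system. I read this file line by line, inserting its fields into this table: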

CREATE TABLE unprocessed_invoices (
    invoice_number   INTEGER,
    invoice_date     DATE,
    ...
    -- other invoice columns
    ...
    customer_number  INTEGER,
    customer_name    VARCHAR(...),
    customer_address VARCHAR(...)
)

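As you can see, the data arriving from the external system is denormalized. That is, the external system includes both the invoice data and its associated customer data on the same line. Multiple invoices may share the same customer, so duplicate customer data is possible.

The system cannot begin processing the invoices until all customers are guaranteed to be registered with the system. Therefore, the system must identify the unique customers and register them as necessary. This is why I want a query like the following, because I am working with denormalized data that I have no control over: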

SELECT
  customer_number, DISTINCT(customer_name, customer_address)
FROM unprocessed_invoices
GROUP BY customer_number

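Hopefully this helps clarify the original intent of the question.

EDIT: Examples of good/bad data

To clarify: customer_name and customer_address only have to be unique with respect to a particular customer_number.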

 customer_number | customer_name | customer_address
----------------------------------------------------
 1               | 'Bob'         | '123 Street'
 1               | 'Bob'         | '123 Street'
 2               | 'Bob'         | '123 Street'
 2               | 'Bob'         | '123 Street'
 3               | 'Fred'        | '456 Avenue'
 3               | 'Fred'        | '789 Crescent'

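The first two rows are fine because customer_name and customer_address are the same for customer_number 1.

The middle two rows are fine because customer_name and customer_address are the same for customer_number 2 (even though another customer_number has the same customer_name and customer_address).

The last two rows are not okay because there are two different customer_address values for customer_number 3.

The query I am looking for would fail if run against all six of these rows. However, if only the first four rows actually existed, the query should return: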

 customer_number | customer_name | customer_address
----------------------------------------------------
 1               | 'Bob'         | '123 Street'
 2               | 'Bob'         | '123 Street'

I hope this clarifies what I meant by "conflicting customer_name and customer_address". They have to be unique per customer_number.
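The rule above can be sketched outside pure SQL as a check that raises when a customer_number carries conflicting data. This is only an illustration, assuming Python's sqlite3 module as a stand-in for the real database; the function name unique_customers and the ROWS constant are made up for the example, and the SQL inside is ordinary GROUP BY / HAVING rather than anything Oracle-specific.

```python
import sqlite3

# Sample rows from the example table above.
ROWS = [
    (1, "Bob", "123 Street"),
    (1, "Bob", "123 Street"),
    (2, "Bob", "123 Street"),
    (2, "Bob", "123 Street"),
    (3, "Fred", "456 Avenue"),
    (3, "Fred", "789 Crescent"),
]


def unique_customers(rows):
    """Return the distinct customers, or raise ValueError on a conflict."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE customers ("
            " customer_number INTEGER,"
            " customer_name TEXT,"
            " customer_address TEXT)"
        )
        conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)
        # Collapse exact duplicates first. Any customer_number that still
        # appears more than once has conflicting name/address data.
        conflicts = conn.execute(
            "SELECT customer_number FROM ("
            "  SELECT DISTINCT customer_number, customer_name, customer_address"
            "  FROM customers)"
            " GROUP BY customer_number HAVING COUNT(*) > 1"
            " ORDER BY customer_number"
        ).fetchall()
        if conflicts:
            numbers = [row[0] for row in conflicts]
            raise ValueError(f"conflicting customer_number(s): {numbers}")
        return conn.execute(
            "SELECT DISTINCT customer_number, customer_name, customer_address"
            " FROM customers ORDER BY customer_number"
        ).fetchall()
    finally:
        conn.close()
```

Calling unique_customers(ROWS[:4]) returns the two expected customers, while unique_customers(ROWS) raises because of customer_number 3.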

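I appreciate those who are explaining how to properly import data from external systems. In fact, I am already doing most of that. I purposely hid the details of what I am doing so that it would be easier to focus on the question at hand. This query is not meant to be the only form of validation. I just thought it would make a nice finishing touch (a last line of defense, so to speak). This question was simply designed to investigate what is possible with SQL. :)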

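8 Answers:

Answer 0: (score: 3)

Your approach is flawed. You do not want data that was successfully stored to then throw an error on a select. That is a land mine waiting to go off, and it means you never know when a select might fail.

I recommend that you add a unique key to the table and slowly start modifying your application to use this key, rather than relying on any combination of meaningful data.

You can then stop worrying about duplicate data, which is not really duplicate in the first place. It is entirely possible for two people with the same name to share the same address.

You will also get a performance boost from this approach.

As an aside, I strongly recommend that you normalize your data, that is, break the name into FirstName and LastName (and optionally MiddleName), and break the address field into separate fields for each component (Address1, Address2, City, State, Country, Zip, or whatever).

UPDATE: If I understand your situation correctly (which I am not sure I do), you want to prevent duplicate combinations of name and address from ever occurring in the table (even though that is something that can happen in real life). This is best done with a unique constraint or index on these two fields that prevents the data from being inserted. That is, catch the error before you insert it. That will tell you that the import file or your resulting application logic is bad, and you can then choose to take the appropriate measures.

I still maintain that throwing the error at query time is too late in the game to do anything about it.

Answer 1: (score: 2)

A scalar subquery must return only one row (per result set row...), so you could do something like this: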

select distinct
       customer_number,
       (
       select distinct
              customer_address
         from customers c2
        where c2.customer_number = c.customer_number
       ) as customer_address
  from customers c

Answer 2: (score: 0)

Making a query fail can be tricky...

This will show you whether there are any duplicate records in the table:

select customer_number, customer_name, customer_address
from customers
group by customer_number, customer_name, customer_address
having count(*) > 1

If you simply add a unique index on all three fields, no one can create a duplicate record in the table.
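To see what such an index does and does not catch, here is a small sketch using Python's sqlite3 module as a stand-in database (an assumption for illustration; the index name ux_customers_all and the helper try_insert are made up). Under that assumption, the index rejects an exact repeat of a row, but it still accepts a second address for the same customer_number, because that row differs in one of the indexed columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    " customer_number INTEGER,"
    " customer_name TEXT,"
    " customer_address TEXT)"
)
# Unique index across all three columns, as the answer suggests.
conn.execute(
    "CREATE UNIQUE INDEX ux_customers_all"
    " ON customers (customer_number, customer_name, customer_address)"
)


def try_insert(row):
    """Return True if the row was stored, False if the index rejected it."""
    try:
        conn.execute("INSERT INTO customers VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


first = try_insert((3, "Fred", "456 Avenue"))       # stored
repeat = try_insert((3, "Fred", "456 Avenue"))      # rejected: exact duplicate
conflict = try_insert((3, "Fred", "789 Crescent"))  # stored: address differs
```

So a three-column index on its own would not flag the customer_number 3 case from the question, and it would reject the repeated rows that the invoice feed legitimately contains.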

Answer 3: (score: 0)

The de facto key is name + address, so that is what you need to group by.

SELECT
  Customer_Name,
  Customer_Address,
  CASE WHEN Count(DISTINCT Customer_Number) > 1
    THEN 1/0 ELSE 0 END as LandMine
FROM Customers
GROUP BY Customer_Name, Customer_Address

If you want to look at it from the Customer_Number point of view, then this works well too.

SELECT *, 
CASE WHEN Exists((
  SELECT top 1 1
  FROM Customers c2
  WHERE c1.Customer_Number != c2.Customer_Number
    AND c1.Customer_Name = c2.Customer_Name
    AND c1.Customer_Address = c2.Customer_Address
)) THEN 1/0 ELSE 0 END as LandMine
FROM Customers c1
WHERE Customer_Number = @Number

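Answer 4: (score: 0)

If you want it to fail, you need to have an index. If you do not want to keep an index around, you can simply create a temp table to do all of this.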

CREATE TABLE #temp_customers 
    (customer_number int, 
    customer_name varchar(50), 
    customer_address varchar(50),
    PRIMARY KEY (customer_number),
     UNIQUE(customer_name, customer_address))

INSERT INTO #temp_customers
SELECT DISTINCT customer_number, customer_name, customer_address
FROM customers

SELECT customer_number, customer_name, customer_address
FROM #temp_customers

DROP TABLE #temp_customers

This will fail if there is a problem, but it will keep your duplicate records from causing trouble.
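The same idea can be sketched with Python's sqlite3 module standing in for the real database (an assumption for illustration; load_unique_customers is a made-up name). Note one deliberate difference from the SQL above: this sketch keeps only the PRIMARY KEY on customer_number and leaves out the UNIQUE(customer_name, customer_address) constraint, since the question's sample data has customers 1 and 2 sharing a name and address, and that constraint would reject them.

```python
import sqlite3


def load_unique_customers(rows):
    """Copy distinct customers into a keyed temp table and return them.

    Raises sqlite3.IntegrityError if one customer_number maps to more
    than one distinct (name, address) pair.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE customers ("
            " customer_number INTEGER,"
            " customer_name TEXT,"
            " customer_address TEXT)"
        )
        conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)
        conn.execute(
            "CREATE TEMP TABLE temp_customers ("
            " customer_number INTEGER PRIMARY KEY,"
            " customer_name TEXT,"
            " customer_address TEXT)"
        )
        # DISTINCT removes exact duplicates; the primary key then rejects
        # any customer_number that still appears twice.
        conn.execute(
            "INSERT INTO temp_customers"
            " SELECT DISTINCT customer_number, customer_name, customer_address"
            " FROM customers"
        )
        return conn.execute(
            "SELECT customer_number, customer_name, customer_address"
            " FROM temp_customers ORDER BY customer_number"
        ).fetchall()
    finally:
        conn.close()
```

With the first four sample rows this returns the two customers; adding the two rows for customer_number 3 makes the INSERT fail, which is the "fail instead of returning multiple rows" behavior the question asks for.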

Answer 5: (score: 0)

If you have dirty data, I would clean it up first.

Use this to find the duplicate customer records...

Select * From customers
Where customer_number in 
  (Select Customer_number from customers
  Group by customer_number Having count(*) > 1)
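One thing to watch with a plain count(*) is that it also flags customer numbers whose rows are exact repeats, which the question treats as acceptable. A possible variant, sketched here with Python's sqlite3 module as a stand-in database (an assumption; the helper name numbers is made up), counts distinct names and addresses per customer_number instead:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    " customer_number INTEGER,"
    " customer_name TEXT,"
    " customer_address TEXT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        (1, "Bob", "123 Street"),
        (1, "Bob", "123 Street"),
        (2, "Bob", "123 Street"),
        (2, "Bob", "123 Street"),
        (3, "Fred", "456 Avenue"),
        (3, "Fred", "789 Crescent"),
    ],
)


def numbers(sql):
    """Run a query and return its first column as a list."""
    return [row[0] for row in conn.execute(sql)]


# Every customer_number that appears on more than one row.
repeated = numbers(
    "SELECT customer_number FROM customers"
    " GROUP BY customer_number HAVING COUNT(*) > 1"
    " ORDER BY customer_number"
)

# Only the customer_numbers whose rows disagree on name or address.
conflicting = numbers(
    "SELECT customer_number FROM customers"
    " GROUP BY customer_number"
    " HAVING COUNT(DISTINCT customer_name) > 1"
    "     OR COUNT(DISTINCT customer_address) > 1"
    " ORDER BY customer_number"
)
```

On the question's six sample rows, repeated contains 1, 2 and 3, while conflicting contains only 3. The COUNT(DISTINCT ...) form ignores NULLs, so rows with missing names or addresses would need separate handling.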

Answer 6: (score: 0)

Let's use your distinct query to put the data into a temp table or table variable:
select distinct customer_number, customer_name, customer_address, 
  IDENTITY(int, 1,1) AS ID_Num
into #temp 
from unprocessed_invoices

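Personally, I would add an identity column to unprocessed_invoices if possible. I never do an import without creating a staging table that has an identity column, simply because it makes deleting duplicate records easier.

Now let's query the table to find the problem records. I assume you want to see what is causing the problem rather than just fail.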

Select t1.* from #temp t1
join #temp t2 
  on t1.customer_name = t2.customer_name and t1.customer_address = t2.customer_address 
where t1.customer_number <> t2.customer_number

select t1.* from #temp t1
join 
(select customer_number from #temp group by customer_number having count(*) >1) t2
  on t1.customer_number = t2.customer_number

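You can use variations of these queries to delete the problem records from #temp (depending on whether you choose to keep one or delete all of the possible problems) and then insert from #temp into the production table. You can also send the problem records back to whoever provides you with the data.

Answer 7: (score: -1)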

Select t1.* from #temp t1
join #temp t2 
  on t1.customer_name = t2.customer_name and t1.customer_address = t2.customer_address 
where t1.customer_number <> t2.customer_number

select t1.* from #temp t1
join 
(select customer_number from #temp group by customer_number having count(*) >1) t2
  on t1.customer_number = t2.customer_number