SQL-based data diff: longest common subsequence

Time: 2010-07-17 00:28:59

Tags: sql algorithm diff

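I'm looking for research papers or other writings that apply the Longest Common Subsequence algorithm to SQL tables in order to obtain a "diff" view of the data. Other approaches to the table-diff problem are welcome too. The challenge is that SQL tables have this nasty habit of being rather large, and applying a straightforward algorithm designed for text processing may result in a program that never ends...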

Given table Original

Key  Content
1    This row is unchanged
2    This row is outdated
3    This row is wrong
4    This row is fine as it is

and table New

Key Content
1   This row was added
2   This row is unchanged
3   This row is right
4   This row is fine as it is
5   This row contains important additions

I need to find the Diff

+++ 1 This row was added
--- 2 This row is outdated
--- 3 This row is wrong
+++ 3 This row is right
+++ 5 This row contains important additions
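
For reference, here is a minimal in-memory sketch of what an LCS-based diff does with the two sample tables. It is the textbook dynamic-programming version, not a technique taken from any paper; the function name `lcs_diff` and the `(key, content)` tuple layout are assumptions made for illustration. It needs O(n*m) time and memory for tables of n and m rows, which is exactly why it does not scale to large tables.

```python
def lcs_diff(original, new):
    """Diff two ordered lists of (key, content) rows by content, using LCS.

    Returns a list of (marker, key, content) tuples, where marker is
    '---' for rows only in `original` and '+++' for rows only in `new`.
    """
    n, m = len(original), len(new)
    # lengths[i][j] = length of the LCS of original[i:] and new[j:]
    lengths = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if original[i][1] == new[j][1]:
                lengths[i][j] = lengths[i + 1][j + 1] + 1
            else:
                lengths[i][j] = max(lengths[i + 1][j], lengths[i][j + 1])

    # Walk the table from the top-left corner, emitting non-matching rows.
    diff = []
    i = j = 0
    while i < n and j < m:
        if original[i][1] == new[j][1]:
            i += 1          # row is part of the common subsequence
            j += 1
        elif lengths[i + 1][j] >= lengths[i][j + 1]:
            diff.append(("---", *original[i]))
            i += 1
        else:
            diff.append(("+++", *new[j]))
            j += 1
    diff.extend(("---", key, content) for key, content in original[i:])
    diff.extend(("+++", key, content) for key, content in new[j:])
    return diff


original_rows = [
    (1, "This row is unchanged"),
    (2, "This row is outdated"),
    (3, "This row is wrong"),
    (4, "This row is fine as it is"),
]
new_rows = [
    (1, "This row was added"),
    (2, "This row is unchanged"),
    (3, "This row is right"),
    (4, "This row is fine as it is"),
    (5, "This row contains important additions"),
]

for marker, key, content in lcs_diff(original_rows, new_rows):
    print(marker, key, content)
```

On the sample data this yields the five diff lines listed above, in the same order.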

2 Answers:

Answer 0: (score: 1)

Answer 1: (score: 0)

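This may be too simple for what you're after, and it isn't research :-), just conceptual. I assume you want to compare different approaches for their processing overhead (?).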

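-- This is one half of what you don't want (A): keys of Original rows whose content also exists in New
-- ([Key] is bracketed because KEY is a reserved word in T-SQL)

SELECT o.[Key] FROM tbl_Original o INNER JOIN tbl_New n ON o.Content = n.Content

-- This is the other half of what you don't want (B): keys of New rows whose content also exists in Original

SELECT n.[Key] FROM tbl_Original o INNER JOIN tbl_New n ON o.Content = n.Content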

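-- This is one half of what you do want (C): rows of New with no match in Original

SELECT '+++' AS diff, n.[Key], n.Content FROM tbl_New n WHERE n.[Key] NOT IN ( B )

-- This is the other half of what you do want (D): rows of Original with no match in New

SELECT '---' AS diff, o.[Key], o.Content FROM tbl_Original o WHERE o.[Key] NOT IN ( A )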

-- Combine C and D

( C )
UNION
( D )
ORDER BY diff, [Key]
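
To try the combined query end to end, here is a rough sketch that runs it against SQLite through Python's standard `sqlite3` module. The answer's SQL targets SQL Server, so the SQLite dialect, the `"Key"` quoting, the `row_key` alias and the helper names `build_sample_db` and `table_diff` are all assumptions made for this illustration; the placeholders ( A ) and ( B ) are simply inlined as subqueries.

```python
import sqlite3


def build_sample_db():
    """Create an in-memory database holding the question's two sample tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript('''
        CREATE TABLE tbl_Original ("Key" INTEGER PRIMARY KEY, Content TEXT NOT NULL);
        CREATE TABLE tbl_New      ("Key" INTEGER PRIMARY KEY, Content TEXT NOT NULL);
    ''')
    conn.executemany("INSERT INTO tbl_Original VALUES (?, ?)", [
        (1, "This row is unchanged"),
        (2, "This row is outdated"),
        (3, "This row is wrong"),
        (4, "This row is fine as it is"),
    ])
    conn.executemany("INSERT INTO tbl_New VALUES (?, ?)", [
        (1, "This row was added"),
        (2, "This row is unchanged"),
        (3, "This row is right"),
        (4, "This row is fine as it is"),
        (5, "This row contains important additions"),
    ])
    return conn


# (C) UNION (D), with the subqueries (B) and (A) written out in full.
DIFF_SQL = '''
SELECT '+++' AS diff, n."Key" AS row_key, n.Content AS content
FROM tbl_New n
WHERE n."Key" NOT IN (
    SELECT n2."Key"
    FROM tbl_Original o2 INNER JOIN tbl_New n2 ON o2.Content = n2.Content)
UNION
SELECT '---', o."Key", o.Content
FROM tbl_Original o
WHERE o."Key" NOT IN (
    SELECT o3."Key"
    FROM tbl_Original o3 INNER JOIN tbl_New n3 ON o3.Content = n3.Content)
ORDER BY diff, row_key
'''


def table_diff(conn):
    """Return the diff rows as (marker, key, content) tuples."""
    return conn.execute(DIFF_SQL).fetchall()


rows = table_diff(build_sample_db())
for row in rows:
    print(*row)
```

Note that this is a set-based comparison on Content: it ignores row order, so unlike an LCS diff it will not report rows that merely moved. Because of the ORDER BY, all '+++' rows come out before the '---' rows rather than interleaved as in the question's sample.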

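... Improvements

  • Try creating indexed views on the base tables first
  • Try reducing the length of the Content field to the minimum that still keeps it unique (trial and error), then use that shorter result for your comparison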

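-- e.g. find the minimum length (1000 is arbitrary; it just guarantees the loop exits)

DECLARE @i int
SET @i = 1
-- Keep lengthening the prefix while any two rows still share it.
-- [Table] is a placeholder: substitute the table you want to shorten.
WHILE @i < 1000 AND EXISTS (
    SELECT LEFT(Content, @i) FROM [Table]
    GROUP BY LEFT(Content, @i) HAVING COUNT(*) > 1 )
BEGIN
    SET @i = @i + 1
END
-- @i now holds the shortest unique prefix length (or 1000 if none was found)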
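
One caveat with the prefix idea: a length that is unique within one table is not necessarily safe for comparing two tables, since a New row could share that prefix with a different Original row. A cautious variant is to compute the length over the distinct Content values of both tables together. The sketch below does that in plain Python; the function name `min_unique_prefix_length` and the `limit` parameter are assumptions made for illustration.

```python
def min_unique_prefix_length(contents, limit=1000):
    """Return the shortest prefix length that keeps all distinct values distinct.

    Returns None if no length up to `limit` separates them.
    """
    distinct = set(contents)
    for length in range(1, limit + 1):
        prefixes = {value[:length] for value in distinct}
        if len(prefixes) == len(distinct):
            return length
    return None


original_contents = [
    "This row is unchanged",
    "This row is outdated",
    "This row is wrong",
    "This row is fine as it is",
]
new_contents = [
    "This row was added",
    "This row is unchanged",
    "This row is right",
    "This row is fine as it is",
    "This row contains important additions",
]

# Use the combined value set so the prefix is safe for cross-table comparison.
prefix_length = min_unique_prefix_length(original_contents + new_contents)
print(prefix_length)
```

For the sample data the first 12 characters ("This row is ") are shared by several rows, so the shortest usable prefix is 13 characters.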