Here we compare the entries in a table:
CREATE TABLE a
(id INT PRIMARY KEY,
p1 INT, p2 INT, p3 INT, .. , p15 INT)
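Each p(n) takes a value from 0 to 2.
I have to get all entries that have a unique combination of parameters. This is not a difficult task, so I created a table like this: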
CREATE TEMPORARY TABLE b AS
(SELECT
t1.id,
t2.p1, t2.p2, t2.p3, t2.p4, t2.p5, t2.p6, t2.p7, t2.p8,
t2.p9, t2.p10, t2.p11, t2.p12, t2.p13, t2.p14, t2.p15
FROM
(
SELECT
p1, p2, p3, p4, p5, p6, p7, p8, p9, p10, p11, p12, p13, p14, p15
FROM
a
GROUP BY
p1, p2, p3, p4, p5, p6, p7, p8, p9, p10, p11, p12, p13, p14, p15
HAVING COUNT(*) = 1
)t2
LEFT JOIN a t1 on
t2.p1 = t1.p1
AND t2.p2 = t1.p2
AND t2.p3 = t1.p3
AND t2.p4 = t1.p4
AND t2.p5 = t1.p5
AND t2.p6 = t1.p6
AND t2.p7 = t1.p7
AND t2.p8 = t1.p8
AND t2.p9 = t1.p9
AND t2.p10 = t1.p10
AND t2.p11 = t1.p11
AND t2.p12 = t1.p12
AND t2.p13 = t1.p13
AND t2.p14 = t1.p14
AND t2.p15 = t1.p15)
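This gives us the entries with unique combinations of parameters.
The next step is, for each record in table a, to find all records in table b that differ from it in one, two, or three parameters. There should be no more than one record differing in a single parameter, no more than two records differing in two parameters, and so on.
For example: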
id | p(n)
-----+----------------
1 |000000000000000
2 |000000000000001
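To make the criterion concrete: "differ in n parameters" is the Hamming distance between two rows. The following is only a minimal sketch of that measure outside SQL; the function name `hamming` and the tuple representation of a row are assumptions made for illustration.

```python
def hamming(u, v):
    """Number of positions at which two equal-length parameter tuples differ."""
    if len(u) != len(v):
        raise ValueError("rows must have the same number of parameters")
    return sum(x != y for x, y in zip(u, v))

row1 = (0,) * 15            # id 1 from the example above
row2 = (0,) * 14 + (1,)     # id 2 from the example above
print(hamming(row1, row2))  # 1
```

With 15 parameters, a pair that matches in 12 to 14 positions has a Hamming distance of 3 to 1.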
I created a temporary table:
CREATE TEMPORARY TABLE c AS
(
SELECT
cnt, id1, id2
FROM
(
SELECT
(t1.p1 = t2.p1)+(t1.p2 = t2.p2)
+(t1.p3 = t2.p3) +(t1.p4 = t2.p4) +(t1.p5 = t2.p5)
+(t1.p6 = t2.p6) +(t1.p7 = t2.p7) +(t1.p8 = t2.p8)
+(t1.p9 = t2.p9) +(t1.p10 = t2.p10) +(t1.p11 = t2.p11)
+(t1.p12 = t2.p12) +(t1.p13 = t2.p13) +(t1.p14 = t2.p14)
+(t1.p15 = t2.p15) AS cnt,
t1.id id1,
t2.id id2
FROM
b AS t1,
a AS t2
) AS pairs
WHERE
(cnt BETWEEN 12 AND 14)
AND (id1 < id2)
)
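This gives me a table containing the pairs that differ in 1, 2, or 3 parameters.
But I ran into a problem: the table has about 100,000 entries. It is too large (I am processing the data on a home PC), and creating the table takes a very long time.
Maybe this is the only way to get everything, but does anyone know an analytical approach to this problem that is better than brute-force pairing (possibly not in SQL)? Surely that would solve it faster...
Any hints would be appreciated! Thanks!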
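One possible direction for the analytical approach asked about here, offered only as a sketch and not taken from the answers: since each of the 15 parameters has just 3 possible values, every row has a fixed number of possible neighbours at distance 1 to 3 (30 + 420 + 3640 = 4090). Those neighbours can be generated and looked up in a hash map instead of comparing every pair. The names `neighbours` and `close_pairs` are hypothetical. The sketch assumes all rows fit in memory as tuples, and it works on a single set of rows rather than on the separate tables a and b.

```python
from itertools import combinations, product

VALUES = (0, 1, 2)  # assumed domain of every p(n), as stated in the question

def neighbours(row, max_diff=3):
    """Yield (diff, candidate) for every tuple differing from row in 1..max_diff positions."""
    for k in range(1, max_diff + 1):
        for positions in combinations(range(len(row)), k):
            # For each chosen position, the values other than the current one.
            choices = [[v for v in VALUES if v != row[i]] for i in positions]
            for replacement in product(*choices):
                candidate = list(row)
                for i, v in zip(positions, replacement):
                    candidate[i] = v
                yield k, tuple(candidate)

def close_pairs(rows_by_id, max_diff=3):
    """Return sorted (diff, id1, id2) with id1 < id2 for rows differing in 1..max_diff parameters."""
    ids_by_row = {}
    for row_id, row in rows_by_id.items():
        ids_by_row.setdefault(row, []).append(row_id)
    pairs = []
    for row_id, row in rows_by_id.items():
        for diff, candidate in neighbours(row, max_diff):
            for other_id in ids_by_row.get(candidate, ()):
                if row_id < other_id:  # report each pair once
                    pairs.append((diff, row_id, other_id))
    return sorted(pairs)

# Hypothetical sample data.
rows = {
    1: (0,) * 15,
    2: (0,) * 14 + (1,),
    3: (0,) * 12 + (1, 1, 1),
    4: (2,) * 15,
}
print(close_pairs(rows))  # [(1, 1, 2), (2, 2, 3), (3, 1, 3)]
```

For 100,000 rows this means roughly 100,000 × 4090 ≈ 4.1 × 10^8 hash lookups, compared with about 10^10 row combinations in a full cross join. Whether it is faster in practice on a given machine has not been measured here.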
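Answer 0 (score: 1)
If you want a table that contains only unique entries, you can create a second table in which all the parameter columns are part of a composite primary key: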
-- IGNORE skips rows whose key already exists,
-- so one row per parameter combination is kept.
CREATE TABLE b (
id INT, p1 INT, p2 INT, p3 INT, .. , p15 INT,
PRIMARY KEY (p1, p2, p3, .. , p15))
IGNORE SELECT * FROM a;
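Note that a unique key with IGNORE keeps one row per combination, whereas the question's GROUP BY ... HAVING COUNT(*) = 1 query keeps only combinations that occur exactly once. The following sketch contrasts the two on hypothetical sample data, shortened to 3 parameters per row for readability. Which duplicate a database actually keeps depends on the order in which rows are read; the sketch simply keeps the first one it encounters.

```python
from collections import Counter

# Hypothetical sample: (id, parameters), shortened to 3 parameters per row.
rows = [(1, (0, 0, 1)), (2, (0, 0, 1)), (3, (2, 1, 0))]

# Semantics of the question's query: keep combinations that occur exactly once.
counts = Counter(params for _, params in rows)
occurs_once = [(i, p) for i, p in rows if counts[p] == 1]

# Semantics of a unique key with IGNORE: keep one row per combination
# (here: the first one encountered).
first_seen = {}
for i, p in rows:
    first_seen.setdefault(p, i)
one_per_combination = [(i, p) for p, i in first_seen.items()]

print(occurs_once)          # [(3, (2, 1, 0))]
print(one_per_combination)  # [(1, (0, 0, 1)), (3, (2, 1, 0))]
```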
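Answer 1 (score: 0)
This may not be a complete answer to your question, but if I had a task like this, the first thing I would try is to generalize the query. Having to spell out more than 3 similar columns is hard for me and very error-prone.
So I suggest converting the column data into rows and comparing the differences, for example as follows (pick whatever unpivot method you like; I just used union for the sqlfiddle, and you could use hstore,
as in the post PostgreSQL columns to rows with no explicilty specifying column names / columns):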
with cte1 as (
select id, 'p1' as name, p1 as value from a
union all
select id, 'p2' as name, p2 as value from a
union all
select id, 'p3' as name, p3 as value from a
union all
select id, 'p4' as name, p4 as value from a
), cte2 as (
select
c1.id, sum(case when c1.value = c2.value then 0 else 1 end) as diff
from cte1 as c1
inner join cte1 as c2 on c2.id <> c1.id and c2.name = c1.name
group by c1.id, c2.id
)
select
id, diff, count(*) as cnt
from cte2
group by id, diff
order by id, diff
I assume your table has no duplicates; you can eliminate them beforehand.
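UPDATE
I don't know whether it will help you, but take a look at this question: PostgreSQL, find strings differ by n characters. I asked it to try to help you; check Erwin Brandstetter's answer there.
I created a sql fiddle demo with several approaches. It looks like using levenshtein is the fastest one, but it is not much faster than the original approach.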
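One caveat that may be worth checking before relying on levenshtein for this task: Levenshtein distance also allows insertions and deletions, so for two strings of equal length it can be smaller than the number of differing positions. The sketch below uses a hand-written `levenshtein` function (an illustration, not the PostgreSQL function) and assumes each row is encoded as a 15-character string of digits.

```python
def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(s, t))

def levenshtein(s, t):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1,               # delete from s
                           cur[j - 1] + 1,            # insert into s
                           prev[j - 1] + (a != b)))   # substitute or match
        prev = cur
    return prev[-1]

s = "012012012012012"
t = "120120120120120"  # s shifted left by one character
print(hamming(s, t), levenshtein(s, t))  # 15 2
```

Here every parameter differs, yet the edit distance is only 2. A levenshtein filter would therefore return a superset of the wanted pairs, and its matches would still need an exact per-position check.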