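Duplicate deletion does not work on a table with many NULLs

Time: 2017-12-05 16:30:36

Tags: mysql sql mariadb

Maybe I've been staring at the screen too long, but I have the following [legacy] table that I'm messing with: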

describe t3_test;
+--------------------+------------------+------+-----+---------+----------------+
| Field              | Type             | Null | Key | Default | Extra          |
+--------------------+------------------+------+-----+---------+----------------+
| provnum            | varchar(24)      | YES  | MUL | NULL    |                |
| trgt_mo            | datetime         | YES  |     | NULL    |                |
| mcare              | varchar(2)       | YES  |     | NULL    |                |
| bed2prsn_asst      | varchar(2)       | YES  |     | NULL    |                |
| trnsfr2prsn_asst   | varchar(2)       | YES  |     | NULL    |                |
| tlt2prsn_asst      | varchar(2)       | YES  |     | NULL    |                |
| hygn2prsn_asst     | varchar(2)       | YES  |     | NULL    |                |
| bath2psrn_asst     | varchar(2)       | YES  |     | NULL    |                |
| ampmcare2prsn_asst | varchar(2)       | YES  |     | NULL    |                |
| any2prsn_asst      | varchar(2)       | YES  |     | NULL    |                |
| n                  | float            | YES  |     | NULL    |                |
| pct                | float            | YES  |     | NULL    |                |
| trgt_qtr           | varchar(12)      | YES  |     | NULL    |                |
| recno              | int(10) unsigned | NO   | PRI | NULL    | auto_increment |
| enddate            | date             | YES  |     | NULL    |                |
+--------------------+------------------+------+-----+---------+----------------+
15 rows in set (0.00 sec)

My data looks like this:

"555223","2008-10-01 00:00:00",NULL,"1",NULL,NULL,NULL,NULL,NULL,NULL,"40","93.0233","2008Q4","5767343","2008-12-31"
"555223","2008-10-01 00:00:00",NULL,"1",NULL,NULL,NULL,NULL,NULL,NULL,"40","93.0233","2008Q4","4075309","2008-12-31"
"555223","2008-10-01 00:00:00",NULL,"0",NULL,NULL,NULL,NULL,NULL,NULL,"3","6.97674","2008Q4","4075308","2008-12-31"
"555223","2008-10-01 00:00:00",NULL,"0",NULL,NULL,NULL,NULL,NULL,NULL,"3","6.97674","2008Q4","5767342","2008-12-31"
"555223","2008-10-01 00:00:00","N",NULL,"1",NULL,NULL,NULL,NULL,NULL,"36","83.7209","2008Q4","4075327","2008-12-31"
"555223","2008-10-01 00:00:00","N","1",NULL,NULL,NULL,NULL,NULL,NULL,"36","83.7209","2008Q4","4075323","2008-12-31"
"555223","2008-10-01 00:00:00","Y","1",NULL,NULL,NULL,NULL,NULL,NULL,"4","9.30233","2008Q4","4075325","2008-12-31"
"555223","2008-10-01 00:00:00",NULL,NULL,"0",NULL,NULL,NULL,NULL,NULL,"3","6.97674","2008Q4","4075310","2008-12-31"
"555223","2008-10-01 00:00:00",NULL,NULL,"1",NULL,NULL,NULL,NULL,NULL,"40","93.0233","2008Q4","4075311","2008-12-31"    

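The first two rows of the table are clearly dupes (apart from the auto-increment index "recno"). I have tried half a dozen de-dupe routines and none of them removes the duplicates.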

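At this point I'm not sure what is wrong. Could there be an invisible character somewhere? Could the letters be in a different character encoding? When I dump the data to CSV as listed above, nothing looks different.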

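Do you have a delete routine that works on this table structure and removes any dupes (ignoring the recno field)? I have been staring at this for two days and for some reason it escapes me. (By the way, I am aware of the odd column name bath2psrn_asst.)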

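The original table holds more than 13 million records and is over 3 GB in size, so I'm looking for the most efficient way to kill the dupes. Any ideas?

Here is an example of one technique I tried that did not work: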

DELETE a FROM t3_test as a, t3_test as b WHERE
(a.provnum=b.provnum)
AND (a.trgt_mo=b.trgt_mo OR a.trgt_mo IS NULL AND b.trgt_mo IS NULL)
AND (a.mcare=b.mcare OR a.mcare IS NULL AND b.mcare IS NULL)
AND (a.bed2prsn_asst=b.bed2prsn_asst OR a.bed2prsn_asst IS NULL AND b.bed2prsn_asst IS NULL)
AND (a.trnsfr2prsn_asst=b.trnsfr2prsn_asst OR a.trnsfr2prsn_asst IS NULL AND b.trnsfr2prsn_asst IS NULL)
AND (a.tlt2prsn_asst=b.tlt2prsn_asst OR a.tlt2prsn_asst IS NULL AND b.tlt2prsn_asst IS NULL)
AND (a.hygn2prsn_asst=b.hygn2prsn_asst OR a.hygn2prsn_asst IS NULL AND b.hygn2prsn_asst IS NULL)
AND (a.bath2psrn_asst=b.bath2psrn_asst OR a.bath2psrn_asst IS NULL AND b.bath2psrn_asst IS NULL)
AND (a.ampmcare2prsn_asst=b.ampmcare2prsn_asst OR a.ampmcare2prsn_asst IS NULL AND b.ampmcare2prsn_asst IS NULL)
AND (a.any2prsn_asst=b.any2prsn_asst OR a.any2prsn_asst IS NULL AND b.any2prsn_asst IS NULL)
AND (a.n=b.n OR a.n IS NULL AND b.n IS NULL)
AND (a.pct=b.pct OR a.pct IS NULL AND b.pct IS NULL)
AND (a.trgt_qtr=b.trgt_qtr OR a.trgt_qtr IS NULL AND b.trgt_qtr IS NULL)
AND (a.enddate=b.enddate OR a.enddate IS NULL AND b.enddate IS NULL)
AND (a.recno>b.recno);
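
For reference, the same self-join can be written more compactly with MySQL's NULL-safe equality operator `<=>`, which returns true when both sides are NULL. This is only a sketch of an equivalent statement, not a fix: it compares every column exactly as the statement above does, except that it also treats two NULL provnum values as equal.

```sql
-- Sketch: same logic as the DELETE above, using the NULL-safe operator <=>.
-- Deletes every row that has an identical twin with a lower recno.
DELETE a
FROM   t3_test AS a
JOIN   t3_test AS b
  ON   a.provnum            <=> b.provnum
 AND   a.trgt_mo            <=> b.trgt_mo
 AND   a.mcare              <=> b.mcare
 AND   a.bed2prsn_asst      <=> b.bed2prsn_asst
 AND   a.trnsfr2prsn_asst   <=> b.trnsfr2prsn_asst
 AND   a.tlt2prsn_asst      <=> b.tlt2prsn_asst
 AND   a.hygn2prsn_asst     <=> b.hygn2prsn_asst
 AND   a.bath2psrn_asst     <=> b.bath2psrn_asst
 AND   a.ampmcare2prsn_asst <=> b.ampmcare2prsn_asst
 AND   a.any2prsn_asst      <=> b.any2prsn_asst
 AND   a.n                  <=> b.n
 AND   a.pct                <=> b.pct
 AND   a.trgt_qtr           <=> b.trgt_qtr
 AND   a.enddate            <=> b.enddate
 AND   a.recno > b.recno;
```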

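3 Answers:

Answer 0 (score: 0)

For a table this large, a delete can be very inefficient; all the logging that deletion requires is quite cumbersome.

I would suggest trying the truncate/insert approach instead: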

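-- Keep one row per group of identical values, retaining the lowest recno.
create table temp_t3_test as
     select provnum, trgt_mo, . . .,
            min(recno) as recno,
            enddate
     from t3_test
     group by provnum, trgt_mo, . . ., enddate;

-- Empty the original table, then reload the de-duplicated rows.
truncate table t3_test;

insert into t3_test(provnum, trgt_mo, . . . , recno, enddate)
    select *
    from temp_t3_test;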

Answer 1 (score: 0)

Try:

CREATE TABLE t3_new AS
         (
                  SELECT   provnum,
                           trgt_mo,
                           mcare,
                           bed2prsn_asst,
                           trnsfr2prsn_asst,
                           tlt2prsn_asst,
                           hygn2prsn_asst,
                           bath2psrn_asst,
                           ampmcare2prsn_asst,
                           any2prsn_asst,
                           n,
                           pct,
                           trgt_qtr,
                           Min(recno),
                           enddate
                  FROM     t3_test
                  GROUP BY provnum,
                           trgt_mo,
                           mcare,
                           bed2prsn_asst,
                           trnsfr2prsn_asst,
                           tlt2prsn_asst,
                           hygn2prsn_asst,
                           bath2psrn_asst,
                           ampmcare2prsn_asst,
                           any2prsn_asst,
                           n,
                           pct,
                           trgt_qtr,
                           enddate
         )

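When you use min(recno) without grouping, you are not actually selecting one row per set of duplicates: you get the minimum of all recno values, and that same value is used for every row. To keep one row per set of duplicates, use DISTINCT or GROUP BY as I have done here. I would also suggest dropping recno from the temporary table and using a new auto-increment column in the table you recreate, which avoids gaps in the IDs.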

This works together with the approach Gordon Linoff suggested.
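
A minimal sketch of the "drop recno and renumber" idea might look like this. The table name t3_dedup is hypothetical. CREATE TABLE ... LIKE copies the column definitions and indexes, including the auto_increment on recno, and because recno is left out of the insert list, fresh values are generated without gaps. Note that DISTINCT compares n and pct exactly as stored.

```sql
-- Assumption: t3_dedup is a made-up name for the new, empty copy of the table.
CREATE TABLE t3_dedup LIKE t3_test;

-- recno is omitted on purpose so that auto_increment assigns new values.
INSERT INTO t3_dedup
       (provnum, trgt_mo, mcare, bed2prsn_asst, trnsfr2prsn_asst,
        tlt2prsn_asst, hygn2prsn_asst, bath2psrn_asst,
        ampmcare2prsn_asst, any2prsn_asst, n, pct, trgt_qtr, enddate)
SELECT DISTINCT
        provnum, trgt_mo, mcare, bed2prsn_asst, trnsfr2prsn_asst,
        tlt2prsn_asst, hygn2prsn_asst, bath2psrn_asst,
        ampmcare2prsn_asst, any2prsn_asst, n, pct, trgt_qtr, enddate
FROM    t3_test;
```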

Answer 2 (score: 0)

In this case the problem was not the SQL statement. It was a problem with the data, just not a visible one.

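The two fields declared as type "float" held hidden decimal digits that differed slightly from row to row, so rows that looked identical did not compare as equal. Converting those fields to type DECIMAL(a,b) makes the dupes match, and they can then be removed correctly by the usual means.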
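
As a rough sketch of how this could be checked and fixed: in MySQL, adding 0 to a FLOAT column usually makes the server display the stored value with more digits than the default output, which can reveal the hidden difference. The recno values below are the first two sample rows from the question. The precision and scale in DECIMAL(12,4) are assumptions picked for illustration only; choose values that fit the real data. On a table of this size the ALTER rebuilds the table and may take a while.

```sql
-- Show the float columns as displayed and with their extra stored digits.
SELECT recno,
       n,   n + 0   AS n_full,
       pct, pct + 0 AS pct_full
FROM   t3_test
WHERE  recno IN (5767343, 4075309);

-- Convert both columns to an exact type (precision/scale are placeholders).
ALTER TABLE t3_test
      MODIFY n   DECIMAL(12,4) NULL,
      MODIFY pct DECIMAL(12,4) NULL;
```

After the conversion, values that differed only beyond the chosen scale are rounded to the same number, so GROUP BY, DISTINCT, or a self-join delete will treat those rows as duplicates.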

Special thanks to Gordon Linoff for suggesting that I look into this.