Maybe I've been staring at the screen for too long, but I have the following [legacy] table that I'm messing with:
describe t3_test;
+--------------------+------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------------------+------------------+------+-----+---------+----------------+
| provnum | varchar(24) | YES | MUL | NULL | |
| trgt_mo | datetime | YES | | NULL | |
| mcare | varchar(2) | YES | | NULL | |
| bed2prsn_asst | varchar(2) | YES | | NULL | |
| trnsfr2prsn_asst | varchar(2) | YES | | NULL | |
| tlt2prsn_asst | varchar(2) | YES | | NULL | |
| hygn2prsn_asst | varchar(2) | YES | | NULL | |
| bath2psrn_asst | varchar(2) | YES | | NULL | |
| ampmcare2prsn_asst | varchar(2) | YES | | NULL | |
| any2prsn_asst | varchar(2) | YES | | NULL | |
| n | float | YES | | NULL | |
| pct | float | YES | | NULL | |
| trgt_qtr | varchar(12) | YES | | NULL | |
| recno | int(10) unsigned | NO | PRI | NULL | auto_increment |
| enddate | date | YES | | NULL | |
+--------------------+------------------+------+-----+---------+----------------+
15 rows in set (0.00 sec)
My data looks like this:
"555223","2008-10-01 00:00:00",NULL,"1",NULL,NULL,NULL,NULL,NULL,NULL,"40","93.0233","2008Q4","5767343","2008-12-31"
"555223","2008-10-01 00:00:00",NULL,"1",NULL,NULL,NULL,NULL,NULL,NULL,"40","93.0233","2008Q4","4075309","2008-12-31"
"555223","2008-10-01 00:00:00",NULL,"0",NULL,NULL,NULL,NULL,NULL,NULL,"3","6.97674","2008Q4","4075308","2008-12-31"
"555223","2008-10-01 00:00:00",NULL,"0",NULL,NULL,NULL,NULL,NULL,NULL,"3","6.97674","2008Q4","5767342","2008-12-31"
"555223","2008-10-01 00:00:00","N",NULL,"1",NULL,NULL,NULL,NULL,NULL,"36","83.7209","2008Q4","4075327","2008-12-31"
"555223","2008-10-01 00:00:00","N","1",NULL,NULL,NULL,NULL,NULL,NULL,"36","83.7209","2008Q4","4075323","2008-12-31"
"555223","2008-10-01 00:00:00","Y","1",NULL,NULL,NULL,NULL,NULL,NULL,"4","9.30233","2008Q4","4075325","2008-12-31"
"555223","2008-10-01 00:00:00",NULL,NULL,"0",NULL,NULL,NULL,NULL,NULL,"3","6.97674","2008Q4","4075310","2008-12-31"
"555223","2008-10-01 00:00:00",NULL,NULL,"1",NULL,NULL,NULL,NULL,NULL,"40","93.0233","2008Q4","4075311","2008-12-31"
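The first two rows shown are obviously dupes (apart from the auto-increment index "recno"). I've tried half a dozen dedupe routines and none of them removes the duplicates.
At this point I'm not sure what exactly is wrong. Could there be an invisible character somewhere? Could the letters be in a different character encoding? When I dump the data to CSV as listed above, the rows look no different from each other.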
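One way to check for invisible characters or encoding differences is to compare the raw bytes of the string columns for two rows that look identical. The following is a minimal sketch, assuming MySQL and assuming the two recno values from the first two sample rows exist in t3_test:

```sql
-- HEX() shows the stored bytes; LENGTH() counts bytes and CHAR_LENGTH() counts characters.
-- A trailing space, a control character, or a multi-byte character would show up
-- as a difference between the two result rows.
SELECT recno,
       HEX(provnum)         AS provnum_hex,
       LENGTH(provnum)      AS provnum_bytes,
       CHAR_LENGTH(provnum) AS provnum_chars,
       HEX(bed2prsn_asst)   AS bed2prsn_asst_hex,
       HEX(trgt_qtr)        AS trgt_qtr_hex
FROM t3_test
WHERE recno IN (5767343, 4075309);
```

If the hex strings and lengths are identical for both rows, the string columns are probably not the cause and the other column types are worth a closer look.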
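Do you have a delete routine that would work on this table structure and remove any dupes (ignoring the recno field)? I've been staring at this for two days and, for some reason, it escapes me. (By the way, I know about the odd column name bath2psrn_asst; it is what it is.)
The (original) table holds over 13 million records and is over 3 GB in size, so I'm looking for the most efficient way to kill the dupes. Any ideas?
Here is an example of one technique I tried that didn't work: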
DELETE a FROM t3_test as a, t3_test as b WHERE
(a.provnum=b.provnum)
AND (a.trgt_mo=b.trgt_mo OR a.trgt_mo IS NULL AND b.trgt_mo IS NULL)
AND (a.mcare=b.mcare OR a.mcare IS NULL AND b.mcare IS NULL)
AND (a.bed2prsn_asst=b.bed2prsn_asst OR a.bed2prsn_asst IS NULL AND b.bed2prsn_asst IS NULL)
AND (a.trnsfr2prsn_asst=b.trnsfr2prsn_asst OR a.trnsfr2prsn_asst IS NULL AND b.trnsfr2prsn_asst IS NULL)
AND (a.tlt2prsn_asst=b.tlt2prsn_asst OR a.tlt2prsn_asst IS NULL AND b.tlt2prsn_asst IS NULL)
AND (a.hygn2prsn_asst=b.hygn2prsn_asst OR a.hygn2prsn_asst IS NULL AND b.hygn2prsn_asst IS NULL)
AND (a.bath2psrn_asst=b.bath2psrn_asst OR a.bath2psrn_asst IS NULL AND b.bath2psrn_asst IS NULL)
AND (a.ampmcare2prsn_asst=b.ampmcare2prsn_asst OR a.ampmcare2prsn_asst IS NULL AND b.ampmcare2prsn_asst IS NULL)
AND (a.any2prsn_asst=b.any2prsn_asst OR a.any2prsn_asst IS NULL AND b.any2prsn_asst IS NULL)
AND (a.n=b.n OR a.n IS NULL AND b.n IS NULL)
AND (a.pct=b.pct OR a.pct IS NULL AND b.pct IS NULL)
AND (a.trgt_qtr=b.trgt_qtr OR a.trgt_qtr IS NULL AND b.trgt_qtr IS NULL)
AND (a.enddate=b.enddate OR a.enddate IS NULL AND b.enddate IS NULL)
AND (a.recno>b.recno);
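MySQL's NULL-safe equality operator `<=>` treats two NULLs as equal, so each `(x = y OR x IS NULL AND y IS NULL)` pair above can be written as a single comparison. As a rough diagnostic sketch (again assuming the recno values of the first two sample rows), the same operator can report column by column whether two supposedly identical rows really match:

```sql
-- Each output column is 1 if the two rows agree on that field (NULLs included), 0 otherwise.
SELECT a.provnum            <=> b.provnum            AS provnum_eq,
       a.trgt_mo            <=> b.trgt_mo            AS trgt_mo_eq,
       a.mcare              <=> b.mcare              AS mcare_eq,
       a.bed2prsn_asst      <=> b.bed2prsn_asst      AS bed2prsn_asst_eq,
       a.trnsfr2prsn_asst   <=> b.trnsfr2prsn_asst   AS trnsfr2prsn_asst_eq,
       a.tlt2prsn_asst      <=> b.tlt2prsn_asst      AS tlt2prsn_asst_eq,
       a.hygn2prsn_asst     <=> b.hygn2prsn_asst     AS hygn2prsn_asst_eq,
       a.bath2psrn_asst     <=> b.bath2psrn_asst     AS bath2psrn_asst_eq,
       a.ampmcare2prsn_asst <=> b.ampmcare2prsn_asst AS ampmcare2prsn_asst_eq,
       a.any2prsn_asst      <=> b.any2prsn_asst      AS any2prsn_asst_eq,
       a.n                  <=> b.n                  AS n_eq,
       a.pct                <=> b.pct                AS pct_eq,
       a.trgt_qtr           <=> b.trgt_qtr           AS trgt_qtr_eq,
       a.enddate            <=> b.enddate            AS enddate_eq
FROM t3_test AS a
JOIN t3_test AS b
  ON a.recno = 5767343 AND b.recno = 4075309;
```

Any column that comes back as 0 is one the DELETE's join condition fails on, which narrows down where to look.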
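Answer 0 (score: 0)
For a table this large, delete can be quite inefficient, because all the logging required for the deletions is cumbersome. I would suggest trying the truncate / insert approach: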
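-- Keep one row (the lowest recno) per group of otherwise identical rows.
create table temp_t3_test as (
select provnum, trgt_mo, . . .,
min(recno) as recno,
enddate
from t3_test
group by provnum, trgt_mo, . . ., enddate
);
-- Empty the original table, then reload the de-duplicated rows.
truncate table t3_test;
insert into t3_test(provnum, trgt_mo, . . . , recno, enddate)
select *
from temp_t3_test;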
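Because TRUNCATE TABLE in MySQL cannot be rolled back, it may be worth sanity-checking the temporary table before emptying the original. A small sketch of such a check, assuming the temporary table was created with the name used above:

```sql
-- Run these before the TRUNCATE step.
SELECT COUNT(*) AS original_rows FROM t3_test;
SELECT COUNT(*) AS deduped_rows  FROM temp_t3_test;

-- Every surviving row should carry a distinct recno, so these two numbers should match.
SELECT COUNT(*)              AS total_rows,
       COUNT(DISTINCT recno) AS distinct_recno
FROM temp_t3_test;
```

If deduped_rows is not smaller than original_rows, the GROUP BY did not collapse anything, and truncating would gain nothing.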
Answer 1 (score: 0)
Try:
CREATE TABLE t3_new AS
(
SELECT provnum,
trgt_mo,
mcare,
bed2prsn_asst,
trnsfr2prsn_asst,
tlt2prsn_asst,
hygn2prsn_asst,
bath2psrn_asst,
ampmcare2prsn_asst,
any2prsn_asst,
n,
pct,
trgt_qtr,
Min(recno) AS recno,
enddate
FROM t3_test
GROUP BY provnum,
trgt_mo,
mcare,
bed2prsn_asst,
trnsfr2prsn_asst,
tlt2prsn_asst,
hygn2prsn_asst,
bath2psrn_asst,
ampmcare2prsn_asst,
any2prsn_asst,
n,
pct,
trgt_qtr,
enddate
)
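Note that MIN(recno) by itself does not pick out one row per duplicate: without a grouping it is simply the minimum of recno over all rows. It is the GROUP BY (or a DISTINCT) that collapses the duplicates, as in the query above. I would also suggest dropping recno from the temporary table and using a new auto-increment column in the table you create afterwards, to avoid gaps in the IDs.
This works together with the approach Gordon Linoff suggested.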
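The renumbering idea could be sketched roughly as follows. The table name t3_dedup is hypothetical, and the sketch assumes t3_new was built by the query above; recno is left out of the insert so that MySQL assigns fresh auto-increment values:

```sql
-- t3_dedup is a hypothetical name. LIKE copies the column definitions and indexes
-- of t3_test, including the AUTO_INCREMENT attribute on recno.
CREATE TABLE t3_dedup LIKE t3_test;

-- recno is omitted from both column lists, so new sequential values are generated.
INSERT INTO t3_dedup
       (provnum, trgt_mo, mcare, bed2prsn_asst, trnsfr2prsn_asst, tlt2prsn_asst,
        hygn2prsn_asst, bath2psrn_asst, ampmcare2prsn_asst, any2prsn_asst,
        n, pct, trgt_qtr, enddate)
SELECT  provnum, trgt_mo, mcare, bed2prsn_asst, trnsfr2prsn_asst, tlt2prsn_asst,
        hygn2prsn_asst, bath2psrn_asst, ampmcare2prsn_asst, any2prsn_asst,
        n, pct, trgt_qtr, enddate
FROM t3_new;
```

This only makes sense if nothing else refers to the old recno values, since they are not preserved.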
Answer 2 (score: 0)
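In this case the problem was not the SQL statement. It was a problem with the DATA, just not a visible one.
The two fields of type "float" held hidden decimal digits that differed slightly between the rows. Converting these fields to DECIMAL(a,b) made the dupes show up as such, and they were then removed correctly by the usual means.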
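A hedged sketch of how this could be checked and fixed follows. The recno values come from the first two sample rows, and DECIMAL(12,4) is purely an illustrative assumption, since the precision and scale actually used are not stated here:

```sql
-- A non-zero difference means the stored FLOAT values are not identical,
-- even though both rows display the same number.
SELECT a.n   - b.n   AS n_diff,
       a.pct - b.pct AS pct_diff
FROM t3_test AS a
JOIN t3_test AS b
  ON a.recno = 5767343 AND b.recno = 4075309;

-- Illustrative only: choose a precision and scale that fit the real data.
-- Values are rounded to the chosen scale during the conversion.
ALTER TABLE t3_test
  MODIFY n   DECIMAL(12,4) NULL,
  MODIFY pct DECIMAL(12,4) NULL;
```

On a table of this size the ALTER rewrites every row, so trying it on a copy first is probably the safer route.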
Special thanks to Gordon Linoff for suggesting I look into this.