Disclaimer - I've read many great questions and answers on this already and have tried them too. The only issue is that, given the size of the database, the system gets stuck at "loading" and just sits there. Looking at the total row count, I can see changes happening, but they are not significant, and the system neither gives a warning nor works in pieces. I have a fair idea of how to tweak existing code to make it work, but I'm not a full-time/advanced developer (yet!)
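Problem - I've been working on a database that holds product information but has duplicate values (it was silly of me not to set the "product code" column as unique while importing multiple CSVs into the database). I need help removing the duplicates based on "product code", but I want to "keep one": the row with the most information in the "specification" column.
Database - MySQL. Total records - 36 million+. Total columns - no more than 15 (but of less relevance). Issue - multiple duplicate values based on "product code"; keep the one with the most characters in the "specification" column.
Database details; table name - pro
Column names are; productid - VARCHAR, manPartId - VARCHAR, specification - TEXT
So far I've picked up the following code and given it a try, but the system gets stuck at "loading" and nothing happens. I assume that's because of the large number of records.
The code I tried running in the phpMyAdmin "SQL" section is;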
--------------------------------------------
delete pro
from pro
inner join (
select max(productid) as lastId, manPartId
from pro
group by manPartId
having count(*) > 1) duplic on duplic.manPartId = pro.manPartId
where pro.productid < duplic.lastId;
--------------------------------------------
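The above code was adapted from the original code in "MySQL delete duplicate records but keep latest".
Please help me understand where I'm going wrong. Also note that I do understand the code above only addresses "delete all but keep one", not "keep the one with the most text in the specification column".
Many thanks in advance!
Edit - As suggested by aendeerei, I've made some changes to the details.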
-------------------------------------------------------
productid | manPartId | specification
-------------------------------------------------------
1 ABC1 5MP camera, 2500 MaH, Steel body
2 ABC2 2MP camera, Steel body
3 ABC3 5MP, 6500 MaH, Red
4 ABC1 2500 MaH, Steel body
5 ABC2 5MP camera, plastic body
6 ABC4 5MP camera, 2500 MaH, Steel body
7 ABC5 15MP camera, 4500 MaH
8 ABC2 5MP camera
9 ABC3 15MP, 6500 MaH, Blue body
10 ABC5 2500 MaH, Steel body
-------------------------------------------
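In the example above, I want to delete duplicates based on manPartId but keep the one record that has the most characters in the specification field.
After running the query, I'd expect to see the following data, with unique manPartId values and the longest text in the specification column;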
-------------------------------------------------------
productid | manPartId | specification
---------------------------------------------------------------
1 ABC1 5MP camera, 2500 MaH, Steel body
5 ABC2 5MP camera, plastic body
6 ABC4 5MP camera, 2500 MaH, Steel body
7 ABC5 15MP camera, 4500 MaH, Long life
9 ABC3 15MP, 6500 MaH, Blue body
---------------------------------------------------------------
My apologies if this is still unclear!
Answer 0 (score: 1)
First, as a basis, find the longest specification length for every part (query #1)
SELECT
manPartID,
MAX( CHAR_LENGTH( specification )) longestLength
from
pro
group by
manPartID
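With that as the baseline, now find all rows whose specification has that same longest length. If several rows for a part have exactly the same length, you need to pick one, such as the first ProductID or the latest ProductID to keep... (query #2)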
SELECT
p.manPartID,
MAX( p.productid ) as ProductID
from
pro p
JOIN
( SELECT
manPartID,
MAX( CHAR_LENGTH( specification )) longestLength
from
pro
group by
manPartID ) byLen
ON p.manPartID = byLen.manPartID
AND char_length( p.specification ) = byLen.LongestLength
group by
p.manPartID
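So at this point you have only ONE "ProductID" per "manPartID", based on the longest specification... Now you can delete from the main table every row that is NOT one of the above, as follows. I'm doing a LEFT JOIN to query #2 because I want to compare all records and delete only those NOT FOUND in the keep result set.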
-- MySQL's multi-table DELETE names the target table before FROM
DELETE pro
FROM pro
LEFT JOIN ( entire query #2 above ) Keep
ON pro.productid = Keep.ProductID
WHERE Keep.ProductID IS NULL
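Now, on a table with 36 million records, you probably want to make sure the above works BEFORE blowing away data. So, instead of deleting, I would create a new secondary product table and insert into it, to confirm you are getting what you expect...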
INSERT INTO SomeTempTable
SELECT p1.*
from pro p1
JOIN ( query #2 above ) Keep
ON p1.ProductID = Keep.ProductID
Note that this is a JOIN (not the LEFT JOIN used in the delete), because I only want the products to KEEP.
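For what it's worth, the keep-set logic above can be tried on the ten sample rows from the question before touching the real table. The sketch below is only an illustration under assumptions: it uses Python's built-in sqlite3 module as a stand-in for MySQL, stores productid as an integer, and relies on SQLite's LENGTH(), which counts characters for text values (like CHAR_LENGTH() in MySQL). SomeTempTable is the same placeholder name used above.

```python
import sqlite3

# Sample rows from the question; productid stored as an integer (an assumption).
rows = [
    (1, "ABC1", "5MP camera, 2500 MaH, Steel body"),
    (2, "ABC2", "2MP camera, Steel body"),
    (3, "ABC3", "5MP, 6500 MaH, Red"),
    (4, "ABC1", "2500 MaH, Steel body"),
    (5, "ABC2", "5MP camera, plastic body"),
    (6, "ABC4", "5MP camera, 2500 MaH, Steel body"),
    (7, "ABC5", "15MP camera, 4500 MaH"),
    (8, "ABC2", "5MP camera"),
    (9, "ABC3", "15MP, 6500 MaH, Blue body"),
    (10, "ABC5", "2500 MaH, Steel body"),
]

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL here
conn.execute("CREATE TABLE pro (productid INTEGER, manPartId TEXT, specification TEXT)")
conn.executemany("INSERT INTO pro VALUES (?, ?, ?)", rows)

# Query #2 wrapped around query #1: one productid to keep per manPartId,
# taking the highest productid when several rows share the longest length.
KEEP_SQL = """
    SELECT p.manPartId, MAX(p.productid) AS productid
    FROM pro p
    JOIN (SELECT manPartId, MAX(LENGTH(specification)) AS longestLength
          FROM pro
          GROUP BY manPartId) byLen
      ON p.manPartId = byLen.manPartId
     AND LENGTH(p.specification) = byLen.longestLength
    GROUP BY p.manPartId
"""

# Copy the survivors into a side table instead of deleting anything.
conn.execute(
    f"CREATE TABLE SomeTempTable AS "
    f"SELECT p1.* FROM pro p1 JOIN ({KEEP_SQL}) k ON p1.productid = k.productid"
)

kept_ids = [r[0] for r in conn.execute(
    "SELECT productid FROM SomeTempTable ORDER BY productid")]
distinct_parts = conn.execute(
    "SELECT COUNT(DISTINCT manPartId) FROM pro").fetchone()[0]

print(kept_ids)        # [1, 5, 6, 7, 9]
print(distinct_parts)  # 5
```

A cheap sanity check on the real data would be the same comparison: the row count of the side table should equal the number of distinct manPartId values in pro.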
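I'm sure the table has other columns, so to help query performance I would have the following index on your "pro" table.
(manPartID, specification, productID)
That way the work can be done from the index alone, without going through all the data pages for every record. Note that because specification is a TEXT column, MySQL only accepts it in an index with a prefix length, such as specification(255), and a prefix index cannot supply the full text length, so the benefit may be limited to the grouping on manPartID.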
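Answer 1 (score: 1)
Well, there isn't much I can say here. Just follow the steps (three, plus an intermediary one) and read my comments carefully. I chose a convenient approach for you: one simple query per step. It could also be done in other ways, for example with a stored procedure, or several. But that wouldn't be any better for you, because your task is a one-time process and a very sensitive one. It's best to stay in control of the result of every operation.
You asked me in the comments what you should use as an interface for the task. Well, MySQL Workbench would be a good one for such operations, but it crashes/freezes a lot. phpMyAdmin? Hm... I now use Sequel Pro and I must say I really like it. Can it manage your task? I don't know. But I know for sure one that can: the best MySQL software I've ever used - and which I would certainly buy for personal use too - is SQLyog. A really powerful, stable and robust application. Especially when you're dealing with duplicating databases/database exports: it never disappoints.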
I see that you have VARCHAR as the data type of the productid column. I'd make it something like this:
`productid` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`productid`)
And, if you never want duplicates in the manPartid column, create a UNIQUE index on it.
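To illustrate what such a UNIQUE index buys you, here is a minimal sketch, again using Python's sqlite3 module as a stand-in for MySQL (an assumption made only so the example is runnable). The index name idx_manPartId is hypothetical. Keep in mind that the index can only be created once the existing duplicates are gone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL here
conn.execute(
    "CREATE TABLE pro (productid INTEGER PRIMARY KEY, manPartId TEXT, specification TEXT)"
)
# idx_manPartId is a hypothetical index name.
conn.execute("CREATE UNIQUE INDEX idx_manPartId ON pro (manPartId)")

conn.execute("INSERT INTO pro (manPartId, specification) VALUES ('ABC1', '5MP camera')")
try:
    # Second row with the same manPartId violates the UNIQUE index.
    conn.execute("INSERT INTO pro (manPartId, specification) VALUES ('ABC1', '2MP camera')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM pro").fetchone()[0]
print(duplicate_rejected, row_count)  # True 1
```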
I'd also recommend keeping a uniform naming convention. For example:
- productId or product_id, instead of productid
- manPartId or man_part_id, instead of manPartid
And give the products table the name products.
Now, I've split my answer into two parts: "Steps to follow" and "Results". For each step, I've posted the corresponding result.
Before you start:
Good luck!
=================================================================
STEP 1:
=================================================================
Create a new table proTmp with the following columns:
- manPartid: definition identical with pro.manPartid
- maxLenSpec: maximum specification length of each pro.manPartid.
=================================================================
CREATE TABLE `proTmp` (
`manPartId` varchar(255) DEFAULT NULL,
`maxLenSpec` bigint(20) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
===============================================================
STEP 2:
===============================================================
- Truncate table proTmp;
- Get a dataset with all [pro.manPartid, pro.maxLenSpec] pairs;
- Store the dataset into table proTmp.
===============================================================
TRUNCATE proTmp;
INSERT INTO proTmp (
SELECT
pro.manPartid
, MAX(LENGTH(pro.specification)) AS maxLenSpec
FROM pro
GROUP BY pro.manPartid
);
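A side note on the query above: MySQL's LENGTH() returns the length in bytes, whereas CHAR_LENGTH() (used in the first answer) returns the length in characters. For plain ASCII specifications the two agree, but with multi-byte characters they can differ. The small Python sketch below only mimics the two functions to show the difference; the helper names and the accented sample string are hypothetical.

```python
def char_length(s: str) -> int:
    """Number of characters, analogous to MySQL's CHAR_LENGTH()."""
    return len(s)


def byte_length(s: str, encoding: str = "utf-8") -> int:
    """Number of bytes after encoding, analogous to MySQL's LENGTH()."""
    return len(s.encode(encoding))


ascii_spec = "5MP camera, plastic body"
accented_spec = "5MP caméra, plastic body"  # hypothetical value with one 2-byte character

print(char_length(ascii_spec), byte_length(ascii_spec))        # 24 24
print(char_length(accented_spec), byte_length(accented_spec))  # 24 25
```

If "maximum characters" is what matters, swapping LENGTH for CHAR_LENGTH in the steps would be the safer choice.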
=============================================================
INTERMEDIARY STEP - JUST FOR TEST.
IT ONLY DISPLAYS THE RECORDS WHICH WILL BE KEPT (NOT DELETED) IN STEP 3:
=============================================================
Left join tables pro and proTmp and display only the
records with pro.lenSpec = proTmp.maxLenSpec.
- lenSpec: length of pro.specification
=============================================================
a) Get pro.*, pro.lenSpec and proTmp.* columns, ordered by pro.manPartid.
_________________________________________________________________________
SELECT
a.*
, LENGTH(a.specification) as lenSpec
, b.*
FROM pro AS a
LEFT JOIN proTmp AS b ON b.manPartid = a.manPartid
WHERE LENGTH(a.specification) = b.maxLenSpec
ORDER BY a.manPartid;
b) Get only pro.productid column, ordered by pro.productid.
___________________________________________________________
SELECT a.productid
FROM pro AS a
LEFT JOIN proTmp AS b ON b.manPartid = a.manPartid
WHERE LENGTH(a.specification) = b.maxLenSpec
ORDER BY a.productid;
====================================================================
STEP 3:
====================================================================
Delete all records from pro having pro.lenSpec != proTmp.maxLenSpec.
IMPORTANT: ordered by pro.productid !!!
====================================================================
DELETE FROM pro
WHERE
pro.productid NOT IN (
SELECT a.productid
FROM (SELECT * FROM pro AS tmp) AS a
LEFT JOIN proTmp AS b ON b.manPartid = a.manPartid
WHERE LENGTH(a.specification) = b.maxLenSpec
ORDER BY a.productid
);
------------------------------------------------------------------------------------------------------------------
NOTA BENE:
------------------------------------------------------------------------------------------------------------------
NOTICE THAT I ADDED A NEW RECORD INTO TABLE pro, WITH THE productid = 11 & manPartid = "ABC1". ITS specification
COLUMN HAS THE SAME MAXIMUM LENGTH AS THE RECORD WITH THE productid = 1 & manPartid = "ABC1" !!! IN THE END,
AFTER STEP 3, I.E. AFTER DELETION OF DUPLICATES, BOTH RECORDS SHOULD STILL EXIST IN TABLE pro, BECAUSE THEY BOTH
HAVE THE MAXIMUM LENGTH of specification COLUMN. THEREFORE, THERE WILL STILL EXIST SUCH DUPLICATES IN THE TABLE
pro AFTER DELETION. IN ORDER TO DECIDE WHICH ONE OF THESE DUPLICATES SHOULD REMAIN IN THE TABLE, YOU MUST
THINK ABOUT SOME OTHER CONDITIONS THAN THE ONES WE KNOW FROM YOU AT THIS MOMENT. BUT, FIRST THINGS FIRST...
SEE ALSO THE RESULTS AFTER RUNNING STEP 3.
------------------------------------------------------------------------------------------------------------------
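The tie described in the note can be reproduced on just the three "ABC1" rows. This is a minimal sketch under assumptions: Python's sqlite3 module stands in for MySQL, productid is stored as an integer, and "keep the highest productid" is only one possible tie-breaker, picked here for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL here
conn.execute("CREATE TABLE pro (productid INTEGER, manPartId TEXT, specification TEXT)")
conn.executemany("INSERT INTO pro VALUES (?, ?, ?)", [
    (1, "ABC1", "5MP camera, 2500 MaH, Steel body"),
    (4, "ABC1", "2500 MaH, Steel body"),
    (11, "ABC1", "12345678901234567890123456789012"),
])

# STEP 1 + STEP 2: longest specification length per manPartId.
conn.execute(
    "CREATE TABLE proTmp AS "
    "SELECT manPartId, MAX(LENGTH(specification)) AS maxLenSpec "
    "FROM pro GROUP BY manPartId"
)

# STEP 3: delete every row that is shorter than the maximum for its part.
conn.execute("""
    DELETE FROM pro
    WHERE productid NOT IN (
        SELECT a.productid
        FROM pro AS a
        JOIN proTmp AS b ON b.manPartId = a.manPartId
        WHERE LENGTH(a.specification) = b.maxLenSpec
    )
""")
after_step3 = [r[0] for r in conn.execute("SELECT productid FROM pro ORDER BY productid")]

# Assumed tie-breaker: keep only the highest productid per manPartId.
conn.execute("""
    DELETE FROM pro
    WHERE productid NOT IN (SELECT MAX(productid) FROM pro GROUP BY manPartId)
""")
after_tiebreak = [r[0] for r in conn.execute("SELECT productid FROM pro ORDER BY productid")]

print(after_step3)     # [1, 11]  both rows have length 32, so both survive
print(after_tiebreak)  # [11]
```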
=================================================================
CREATION SYNTAX AND CONTENT OF TABLE pro, USED BY ME:
=================================================================
CREATE TABLE `pro` (
`productid` varchar(255) DEFAULT NULL,
`manPartId` varchar(255) DEFAULT NULL,
`specification` text
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
--------------------------------------------------------
productid manPartId specification
--------------------------------------------------------
1 ABC1 5MP camera, 2500 MaH, Steel body
10 ABC5 2500 MaH, Steel body
2 ABC2 2MP camera, Steel body
3 ABC3 5MP, 6500 MaH, Red
4 ABC1 2500 MaH, Steel body
5 ABC2 5MP camera, plastic body
6 ABC4 5MP camera, 2500 MaH, Steel body
7 ABC5 15MP camera, 4500 MaH
8 ABC2 5MP camera
9 ABC3 15MP, 6500 MaH, Blue body
11 ABC1 12345678901234567890123456789012
===============================================================
STEP 1 - RESULTS: Creation of table proTmp
===============================================================
Just the table proTmp was created, without any content.
===============================================================
STEP 2 - RESULTS: Table proTmp content
===============================================================
----------------------
manPartId maxLenSpec
----------------------
ABC1 32
ABC2 24
ABC3 25
ABC4 32
ABC5 21
============================================================
INTERMEDIARY STEP RESULTS - JUST FOR TEST.
IT ONLY DISPLAYS THE RECORDS WHICH WILL BE KEPT (NOT DELETED) IN STEP 3
============================================================
a) Get pro.*, pro.lenSpec and proTmp.* columns, ordered by pro.manPartid.
_________________________________________________________________________
----------------------------------------------------------------------------------------------
productid manPartId specification lenSpec manPartId maxLenSpec
----------------------------------------------------------------------------------------------
1 ABC1 5MP camera, 2500 MaH, Steel body 32 ABC1 32
11 ABC1 12345678901234567890123456789012 32 ABC1 32
5 ABC2 5MP camera, plastic body 24 ABC2 24
9 ABC3 15MP, 6500 MaH, Blue body 25 ABC3 25
6 ABC4 5MP camera, 2500 MaH, Steel body 32 ABC4 32
7 ABC5 15MP camera, 4500 MaH 21 ABC5 21
b) Get only pro.productid column, ordered by pro.productid.
___________________________________________________________
---------
productid
---------
1
11
5
6
7
9
===========================================================================================
STEP 3 - RESULTS: Table pro after deletion of all duplicates by the two conditions
===========================================================================================
From the log after running the DELETE query:
"No errors, 5 rows affected, taking 6.5 ms"
NOTA BENE: NOTICE THAT THERE ARE STILL TWO RECORDS WITH THE manPartid = "ABC1",
BECAUSE THEY BOTH HAD THE SAME MAXIMUM LENGTH OF THE specification COLUMN !!!
--------------------------------------------------------
productid manPartId specification
--------------------------------------------------------
1 ABC1 5MP camera, 2500 MaH, Steel body
11 ABC1 12345678901234567890123456789012
5 ABC2 5MP camera, plastic body
9 ABC3 15MP, 6500 MaH, Blue body
6 ABC4 5MP camera, 2500 MaH, Steel body
7 ABC5 15MP camera, 4500 MaH
I hope it all works.
Step 1: This step is mandatory. Otherwise you will get the wrong list of records deleted!
Convert the productid column from VARCHAR to:
`productid` bigint(20) unsigned NOT NULL
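The reason the conversion matters is that text values compare character by character, so the "largest" productid among strings is not the largest number. Python strings behave the same way for digit-only values, which makes for a quick illustration (the variable names are hypothetical):

```python
product_ids_as_text = ["1", "5", "6", "7", "9", "11"]

# Text comparison: "9" sorts after "11" because '9' > '1' on the first character.
max_as_text = max(product_ids_as_text)
sorted_as_text = sorted(product_ids_as_text)

# Numeric comparison: what MAX(productid) returns once the column is a bigint.
max_as_number = max(int(p) for p in product_ids_as_text)

print(sorted_as_text)  # ['1', '11', '5', '6', '7', '9']
print(max_as_text)     # 9
print(max_as_number)   # 11
```

The text ordering is the same one visible in the result listings above, where productid 11 appears right after 1.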
Step 2: Run the following DELETE query.
DELETE FROM pro
WHERE pro.productid NOT IN (
SELECT max(b.productid) AS maxPartid
FROM (SELECT * FROM pro AS a) AS b
GROUP BY b.manPartid
);