38 million records - delete duplicate rows based on a column and keep only one

Time: 2017-07-05 04:35:36

Tags: php mysql

Disclaimer - I've already read many fantastic questions and their answers and have given them a try too. The only issue is that, given the database size, the system gets stuck at "loading" and just sits there. Looking at the total number of rows, I can see changes happening, but progress is not significant, and it neither gives a warning nor does the work in pieces. I have a fair idea of how to tweak available code and make it work, but I'm not a full-time/advanced developer (yet!).

Problem - I've been working on a database that contains product information but has duplicate values (it was foolish of me not to set the "product code" column as unique when importing multiple CSVs into the database). I need help deleting duplicates based on "product code", while keeping the one row that has the most information in the "specification" column.

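Database - MySQL
Total records - 36 million+
Total columns - no more than 15 (but of little relevance)
Problem - multiple duplicate values based on "product code"; keep the one with the most characters in the "specification" column

Database details; table name - pro

Column names are; productid - VARCHAR, manPartId - VARCHAR, specification - TEXT

So far I have picked up the following code and given it a try, but the system gets stuck at "loading" and nothing happens. I assume that is because of the large number of records.

The code I tried running in the phpMyAdmin "SQL" section is;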

--------------------------------------------
      delete pro
      from pro
      inner join (
         select max(productid) as lastId, manPartId
           from pro
          group by manPartId
         having count(*) > 1) duplic on duplic.manPartId = pro.manPartId
         where pro.productid < duplic.lastId;
--------------------------------------------

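The above code has been adapted from the original code in "MySQL delete duplicate records but keep latest".

Please help me understand where I am going wrong. Also note that I do understand the code above only solves "delete all but keep one", not "keep the one with the most text in the specification column".

Many thanks in advance!

EDIT - Following aendeerei's suggestion, I've modified some of the details.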

-------------------------------------------------------
productid  | manPartId    |  specification 
-------------------------------------------------------
1            ABC1           5MP camera, 2500 MaH, Steel body
2            ABC2           2MP camera, Steel body
3            ABC3           5MP, 6500 MaH, Red
4            ABC1           2500 MaH, Steel body
5            ABC2           5MP camera, plastic body
6            ABC4           5MP camera, 2500 MaH, Steel body
7            ABC5           15MP camera, 4500 MaH 
8            ABC2           5MP camera
9            ABC3           15MP, 6500 MaH, Blue body
10           ABC5           2500 MaH, Steel body
-------------------------------------------

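In the example above, I want to delete duplicates based on manPartId but keep the one record with the most characters in the specification field.

After running the query, I hope to see the following updated data, with unique manPartId values and the longest text in the specification column;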

-------------------------------------------------------
productid  | manPartId    |  specification 
---------------------------------------------------------------
1            ABC1           5MP camera, 2500 MaH, Steel body
5            ABC2           5MP camera, plastic body
6            ABC4           5MP camera, 2500 MaH, Steel body
7            ABC5           15MP camera, 4500 MaH, Long life
9            ABC3           15MP, 6500 MaH, Blue body
---------------------------------------------------------------

My apologies if it's still not clear!

2 answers:

Answer 0: (score: 1)

First, as the basis, find the longest specification length for each part (query #1)

SELECT 
      manPartID,
      MAX( CHAR_LENGTH( specification )) longestLength
   from
      pro
   group by
      manPartID

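With that as the baseline, now find the rows of each part that have that same longest length. If more than one row has exactly the same length, you need to pick one, such as the first ProductID or the latest ProductID to keep... (query #2)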

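SELECT
      p.manPartID,
      MAX( p.productid ) AS ProductID
   FROM
      pro p
         JOIN
            ( -- query #1: longest specification length per part
              SELECT
                    manPartID,
                    MAX( CHAR_LENGTH( specification )) longestLength
                 FROM pro
                 GROUP BY manPartID ) byLen
          ON p.manPartID = byLen.manPartID
          AND CHAR_LENGTH( p.specification ) = byLen.longestLength
   GROUP BY
      p.manPartID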
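
To check queries #1 and #2 against the ten sample rows from the question, here is a minimal sketch. It rests on several assumptions: it uses Python's built-in sqlite3 module with an in-memory database rather than MySQL, it stores productid as an integer, and it uses SQLite's LENGTH() (which counts characters for text values) in place of CHAR_LENGTH().

```python
import sqlite3

# The ten sample rows from the question.
ROWS = [
    (1, "ABC1", "5MP camera, 2500 MaH, Steel body"),
    (2, "ABC2", "2MP camera, Steel body"),
    (3, "ABC3", "5MP, 6500 MaH, Red"),
    (4, "ABC1", "2500 MaH, Steel body"),
    (5, "ABC2", "5MP camera, plastic body"),
    (6, "ABC4", "5MP camera, 2500 MaH, Steel body"),
    (7, "ABC5", "15MP camera, 4500 MaH"),
    (8, "ABC2", "5MP camera"),
    (9, "ABC3", "15MP, 6500 MaH, Blue body"),
    (10, "ABC5", "2500 MaH, Steel body"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pro (productid INTEGER, manPartId TEXT, specification TEXT)"
)
conn.executemany("INSERT INTO pro VALUES (?, ?, ?)", ROWS)

# Query #2 with query #1 inlined as the derived table byLen.
KEEP_SQL = """
SELECT p.manPartId, MAX(p.productid) AS productid
FROM pro p
JOIN (SELECT manPartId, MAX(LENGTH(specification)) AS longestLength
      FROM pro
      GROUP BY manPartId) byLen
  ON p.manPartId = byLen.manPartId
 AND LENGTH(p.specification) = byLen.longestLength
GROUP BY p.manPartId
ORDER BY p.manPartId
"""

keep = conn.execute(KEEP_SQL).fetchall()
for man_part_id, product_id in keep:
    print(man_part_id, product_id)
```

On the sample data this selects productid 1, 5, 9, 6 and 7 for ABC1 through ABC5, the same rows the question lists as the desired result.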

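So at this point you have only ONE "ProductID" for a single "manPartID", based on the longest specification... Now you can delete from the main table every row that is NOT one of the above, as follows. I'm doing a LEFT JOIN to query #2 because I want to compare all records and delete only those NOT found in the keep result set.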

DELETE pro
   FROM pro
   LEFT JOIN ( entire query #2 above ) Keep
      ON pro.productid = Keep.ProductID
   WHERE Keep.ProductID IS NULL
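
The delete can be rehearsed on a small copy before touching the real table. The sketch below is not the MySQL statement itself: SQLite has no DELETE ... JOIN, so the same anti-join is written with NOT IN, and the delete runs inside a transaction that is committed only if exactly one row per manPartId is left. The in-memory SQLite database, the integer productid and the variable names are assumptions made for illustration.

```python
import sqlite3

ROWS = [
    (1, "ABC1", "5MP camera, 2500 MaH, Steel body"),
    (2, "ABC2", "2MP camera, Steel body"),
    (3, "ABC3", "5MP, 6500 MaH, Red"),
    (4, "ABC1", "2500 MaH, Steel body"),
    (5, "ABC2", "5MP camera, plastic body"),
    (6, "ABC4", "5MP camera, 2500 MaH, Steel body"),
    (7, "ABC5", "15MP camera, 4500 MaH"),
    (8, "ABC2", "5MP camera"),
    (9, "ABC3", "15MP, 6500 MaH, Blue body"),
    (10, "ABC5", "2500 MaH, Steel body"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pro (productid INTEGER, manPartId TEXT, specification TEXT)"
)
conn.executemany("INSERT INTO pro VALUES (?, ?, ?)", ROWS)
conn.commit()

# After the delete there should be exactly one row per distinct manPartId.
expected = conn.execute("SELECT COUNT(DISTINCT manPartId) FROM pro").fetchone()[0]

# Delete every row whose productid is not in the "keep" set (query #2).
conn.execute("""
DELETE FROM pro
WHERE productid NOT IN (
    SELECT MAX(p.productid)
    FROM pro p
    JOIN (SELECT manPartId, MAX(LENGTH(specification)) AS longestLength
          FROM pro
          GROUP BY manPartId) byLen
      ON p.manPartId = byLen.manPartId
     AND LENGTH(p.specification) = byLen.longestLength
    GROUP BY p.manPartId
)
""")

remaining = conn.execute("SELECT COUNT(*) FROM pro").fetchone()[0]
if remaining == expected:
    conn.commit()      # result looks right, make it permanent
else:
    conn.rollback()    # something is off, restore the original rows

kept_ids = [r[0] for r in conn.execute("SELECT productid FROM pro ORDER BY productid")]
print(kept_ids)
```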

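Now, on a table of 36 million records, you probably want to make sure the above works before blowing away data. So, instead of deleting, I would create a new secondary product table and insert into that, to confirm you are getting what you expect...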

INSERT INTO SomeTempTable
SELECT p1.*
   from pro p1
      JOIN ( query #2 above ) Keep
         ON p1.ProductID = Keep.ProductID
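
A sketch of how the staging table could be checked before anything is deleted, again using an in-memory SQLite database as a stand-in for MySQL (an assumption, as are the integer productid and the two verification queries, which are not part of the answer). It copies the rows to keep into SomeTempTable and then looks for any manPartId that appears more than once or not at all.

```python
import sqlite3

ROWS = [
    (1, "ABC1", "5MP camera, 2500 MaH, Steel body"),
    (2, "ABC2", "2MP camera, Steel body"),
    (3, "ABC3", "5MP, 6500 MaH, Red"),
    (4, "ABC1", "2500 MaH, Steel body"),
    (5, "ABC2", "5MP camera, plastic body"),
    (6, "ABC4", "5MP camera, 2500 MaH, Steel body"),
    (7, "ABC5", "15MP camera, 4500 MaH"),
    (8, "ABC2", "5MP camera"),
    (9, "ABC3", "15MP, 6500 MaH, Blue body"),
    (10, "ABC5", "2500 MaH, Steel body"),
]

conn = sqlite3.connect(":memory:")
for table in ("pro", "SomeTempTable"):
    conn.execute(
        f"CREATE TABLE {table} (productid INTEGER, manPartId TEXT, specification TEXT)"
    )
conn.executemany("INSERT INTO pro VALUES (?, ?, ?)", ROWS)

# Copy only the rows to keep; the source table is left untouched.
conn.execute("""
INSERT INTO SomeTempTable
SELECT p1.*
FROM pro p1
JOIN (SELECT MAX(p.productid) AS productid
      FROM pro p
      JOIN (SELECT manPartId, MAX(LENGTH(specification)) AS longestLength
            FROM pro
            GROUP BY manPartId) byLen
        ON p.manPartId = byLen.manPartId
       AND LENGTH(p.specification) = byLen.longestLength
      GROUP BY p.manPartId) Keep
  ON p1.productid = Keep.productid
""")

# Check 1: no manPartId may appear more than once in the staging table.
dupes = conn.execute("""
SELECT manPartId FROM SomeTempTable GROUP BY manPartId HAVING COUNT(*) > 1
""").fetchall()

# Check 2: every manPartId of the source table must be present.
missing = conn.execute("""
SELECT DISTINCT manPartId FROM pro
WHERE manPartId NOT IN (SELECT manPartId FROM SomeTempTable)
""").fetchall()

source_rows = conn.execute("SELECT COUNT(*) FROM pro").fetchone()[0]
print(dupes, missing, source_rows)
```

If both checks come back empty, the staging table holds exactly one row per manPartId.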

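Notice this one is a JOIN (not the LEFT JOIN used in the delete), since I only want the products to be kept.

I'm sure there are other columns on the table, so to help query performance, I would have the following index on your "pro" table.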

(manPartID, specification, productID)

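That way the work can be done from the index rather than going through all the data pages of every record. Note that MySQL can only index a TEXT column such as specification with an explicit prefix length.

Answer 1: (score: 1)

Well, there's not much for me to say here. Just follow the steps (three plus an intermediary one) and read my comments carefully. I chose a convenient approach for you: one simple query per step. It could be done in other ways too, for example with a stored procedure, or several. But that wouldn't be better for you, because your task is a one-time process and a very sensitive one. It's best to stay in control of the result of every operation.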

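You asked me in the comments what you should use as the interface for the task. Well, MySQL Workbench would be good for such an operation, but it crashes/freezes a lot. phpMyAdmin? Hm... I now use Sequel Pro and I must say I really like it. Can it manage your task? I don't know. But I know for sure of one that can: the best MySQL software I've ever used, and one I'll certainly buy for personal use too, is SQLyog. A really powerful, stable and robust application. Especially when you're dealing with database duplication or database exports: it never disappoints.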

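I see that you are using VARCHAR as the data type of the productid column. An auto-incrementing integer primary key would serve better. Like this:

`productid` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`productid`)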

And, if you want to never again have duplicates in the manPartid column, create a UNIQUE index on it.
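
A small sketch of what the UNIQUE index buys you. It uses an in-memory SQLite database through Python's sqlite3 module as a stand-in for MySQL, and the index name uq_manPartId is made up for the example. Once the index exists, a second row with the same manPartId is rejected instead of silently creating a duplicate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pro ("
    "productid INTEGER PRIMARY KEY, manPartId TEXT, specification TEXT)"
)
# Hypothetical index name; the point is the UNIQUE constraint on manPartId.
conn.execute("CREATE UNIQUE INDEX uq_manPartId ON pro (manPartId)")

conn.execute(
    "INSERT INTO pro VALUES (1, 'ABC1', '5MP camera, 2500 MaH, Steel body')"
)

try:
    # Same manPartId again: the unique index refuses the row.
    conn.execute("INSERT INTO pro VALUES (4, 'ABC1', '2500 MaH, Steel body')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM pro").fetchone()[0]
print(rejected, row_count)
```

The index can only be created after the existing duplicates are gone, so it belongs at the end of the clean-up.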

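I'd also recommend keeping a uniform naming convention. Like:

  • productId or product_id, instead of productid
  • manPartId or man_part_id, instead of manPartid

And give the name products to the products table.

Now, I've split my answer into two parts: "Steps to follow" and "Results". For each step I've posted the corresponding results.

Before you start:

Back up your data!

Good luck!

Steps to follow: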

=================================================================
STEP 1:
=================================================================
Create a new table proTmp with the following columns:
- manPartid: definition identical with pro.manPartid
- maxLenSpec: maximum specification length of each pro.manPartid.
=================================================================

    CREATE TABLE `proTmp` (
      `manPartId` varchar(255) DEFAULT NULL,
      `maxLenSpec` bigint(20) DEFAULT NULL
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8;



===============================================================
STEP 2:
===============================================================
- Truncate table proTmp;
- Get a dataset with all [pro.manPartid, pro.maxLenSpec] pairs;
- Store the dataset into table proTmp.
===============================================================

    TRUNCATE proTmp;
    INSERT INTO proTmp (
        SELECT 
            pro.manPartid
            , MAX(LENGTH(pro.specification)) AS maxLenSpec
        FROM pro
        GROUP BY pro.manPartid
    );
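
One detail worth checking in Step 2: in MySQL, LENGTH() returns the length of a string in bytes, while CHAR_LENGTH() returns it in characters, so the two can rank rows differently when a specification contains multi-byte UTF-8 characters. The Python sketch below mimics both measures on two made-up specification strings (they are not from the question's data).

```python
# Two hypothetical specification values, not taken from the question.
specs = ["écran été", "steel body"]

def char_length(text: str) -> int:
    """Number of characters, comparable to MySQL CHAR_LENGTH()."""
    return len(text)

def byte_length(text: str) -> int:
    """Number of UTF-8 bytes, comparable to MySQL LENGTH() on a utf8 column."""
    return len(text.encode("utf-8"))

for spec in specs:
    print(spec, char_length(spec), byte_length(spec))

longest_by_chars = max(specs, key=char_length)
longest_by_bytes = max(specs, key=byte_length)
print(longest_by_chars, "|", longest_by_bytes)
```

For plain ASCII specifications such as the sample rows, both measures agree; if the goal is "most characters", CHAR_LENGTH() is the closer match.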



=============================================================
INTERMEDIARY STEP - JUST FOR TEST.
IT ONLY DISPLAYS THE RECORDS WHICH WILL BE KEPT IN STEP 3:
=============================================================
Left join tables pro and proTmp and display only the 
records with pro.lenSpec = proTmp.maxLenSpec.
- lenSpec: length of pro.specification
=============================================================


a) Get pro.*, pro.lenSpec and proTmp.* columns, ordered by pro.manPartid.
_________________________________________________________________________

    SELECT 
        a.*
        , LENGTH(a.specification) as lenSpec
        , b.*
    FROM pro AS a
    LEFT JOIN proTmp AS b ON b.manPartid = a.manPartid
    WHERE LENGTH(a.specification) = b.maxLenSpec
    ORDER BY a.manPartid;


b) Get only pro.productid column, ordered by pro.productid.
___________________________________________________________

    SELECT a.productid
    FROM pro AS a
    LEFT JOIN proTmp AS b ON b.manPartid = a.manPartid
    WHERE LENGTH(a.specification) = b.maxLenSpec
    ORDER BY a.productid;



====================================================================
STEP 3:
====================================================================
Delete all records from pro having pro.lenSpec != proTmp.maxLenSpec.
IMPORTANT: ordered by pro.productid !!!
====================================================================

    DELETE FROM pro
    WHERE 
        pro.productid NOT IN (
            SELECT a.productid
            FROM (SELECT * FROM pro AS tmp) AS a
            LEFT JOIN proTmp AS b ON b.manPartid = a.manPartid
            WHERE LENGTH(a.specification) = b.maxLenSpec
            ORDER BY a.productid
        );
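
The three steps can be replayed on the eleven test rows with the sketch below. It is a stand-in under stated assumptions: an in-memory SQLite database via Python's sqlite3 module instead of MySQL, SQLite's LENGTH() (characters) for the length, and an extra query at the end, not part of the steps, that lists any manPartId still present more than once.

```python
import sqlite3

# The ten rows from the question plus the extra ABC1 row (productid 11).
ROWS = [
    ("1", "ABC1", "5MP camera, 2500 MaH, Steel body"),
    ("2", "ABC2", "2MP camera, Steel body"),
    ("3", "ABC3", "5MP, 6500 MaH, Red"),
    ("4", "ABC1", "2500 MaH, Steel body"),
    ("5", "ABC2", "5MP camera, plastic body"),
    ("6", "ABC4", "5MP camera, 2500 MaH, Steel body"),
    ("7", "ABC5", "15MP camera, 4500 MaH"),
    ("8", "ABC2", "5MP camera"),
    ("9", "ABC3", "15MP, 6500 MaH, Blue body"),
    ("10", "ABC5", "2500 MaH, Steel body"),
    ("11", "ABC1", "12345678901234567890123456789012"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pro (productid TEXT, manPartId TEXT, specification TEXT)")
conn.executemany("INSERT INTO pro VALUES (?, ?, ?)", ROWS)

# STEP 1 and STEP 2: helper table with the longest specification per part.
conn.execute("CREATE TABLE proTmp (manPartId TEXT, maxLenSpec INTEGER)")
conn.execute("""
    INSERT INTO proTmp
    SELECT manPartId, MAX(LENGTH(specification))
    FROM pro
    GROUP BY manPartId
""")

# STEP 3: delete every row whose specification is shorter than the maximum.
conn.execute("""
    DELETE FROM pro
    WHERE productid NOT IN (
        SELECT a.productid
        FROM pro AS a
        JOIN proTmp AS b ON b.manPartId = a.manPartId
        WHERE LENGTH(a.specification) = b.maxLenSpec
    )
""")

kept = sorted(int(r[0]) for r in conn.execute("SELECT productid FROM pro"))

# Extra check: parts that still have more than one row (equal-length ties).
ties = conn.execute("""
    SELECT manPartId, COUNT(*) FROM pro
    GROUP BY manPartId HAVING COUNT(*) > 1
""").fetchall()
print(kept, ties)
```

The tie query reports ABC1 twice, because productid 1 and 11 share the maximum length; a further rule is needed to choose between them.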

Results of the operations performed:

------------------------------------------------------------------------------------------------------------------
NOTA BENE:
------------------------------------------------------------------------------------------------------------------
NOTICE THAT I ADDED A NEW RECORD INTO TABLE pro, WITH THE productid = 11 & manPartid = "ABC1". ITS specification 
COLUMN HAS THE SAME MAXIMUM LENGTH AS THE RECORD WITH THE productid = 1 & manPartid = "ABC1" !!! IN THE END, 
AFTER STEP 3, I.E. AFTER DELETION OF DUPLICATES, BOTH RECORDS SHOULD STILL EXIST IN TABLE pro, BECAUSE THEY BOTH
HAVE THE MAXIMUM LENGTH OF THE specification COLUMN. THEREFORE, SUCH DUPLICATES WILL STILL EXIST IN THE TABLE 
pro AFTER DELETION. IN ORDER TO DECIDE WHICH ONE OF THESE DUPLICATES SHOULD REMAIN IN THE TABLE, YOU MUST
THINK ABOUT SOME OTHER CONDITIONS THAN THE ONES WE KNOW FROM YOU AT THIS MOMENT. BUT, FIRST THINGS FIRST...
SEE ALSO THE RESULTS AFTER RUNNING STEP 3.
------------------------------------------------------------------------------------------------------------------


=================================================================
CREATION SYNTAX AND CONTENT OF TABLE pro, USED BY ME:
=================================================================

CREATE TABLE `pro` (
  `productid` varchar(255) DEFAULT NULL,
  `manPartId` varchar(255) DEFAULT NULL,
  `specification` text
) ENGINE=InnoDB DEFAULT CHARSET=utf8;


    --------------------------------------------------------
    productid   manPartId   specification
    --------------------------------------------------------
    1           ABC1        5MP camera, 2500 MaH, Steel body
    10          ABC5        2500 MaH, Steel body
    2           ABC2        2MP camera, Steel body
    3           ABC3        5MP, 6500 MaH, Red
    4           ABC1        2500 MaH, Steel body
    5           ABC2        5MP camera, plastic body
    6           ABC4        5MP camera, 2500 MaH, Steel body
    7           ABC5        15MP camera, 4500 MaH
    8           ABC2        5MP camera
    9           ABC3        15MP, 6500 MaH, Blue body
    11          ABC1        12345678901234567890123456789012


===============================================================
STEP 1 - RESULTS: Creation of table proTmp
===============================================================

Just the table proTmp was created, without any content.


===============================================================
STEP 2 - RESULTS: Table proTmp content
===============================================================

    ----------------------
    manPartId   maxLenSpec
    ----------------------
    ABC1        32
    ABC2        24
    ABC3        25
    ABC4        32
    ABC5        21


============================================================
INTERMEDIARY STEP RESULTS - JUST FOR TEST.
IT ONLY DISPLAYS THE RECORDS WHICH WILL BE KEPT IN STEP 3
============================================================


a) Get pro.*, pro.lenSpec and proTmp.* columns, ordered by pro.manPartid.
_________________________________________________________________________

    ----------------------------------------------------------------------------------------------
    productid   manPartId   specification                       lenSpec     manPartId   maxLenSpec
    ----------------------------------------------------------------------------------------------
    1           ABC1        5MP camera, 2500 MaH, Steel body    32          ABC1        32
    11          ABC1        12345678901234567890123456789012    32          ABC1        32
    5           ABC2        5MP camera, plastic body            24          ABC2        24
    9           ABC3        15MP, 6500 MaH, Blue body           25          ABC3        25
    6           ABC4        5MP camera, 2500 MaH, Steel body    32          ABC4        32
    7           ABC5        15MP camera, 4500 MaH               21          ABC5        21


b) Get only pro.productid column, ordered by pro.productid.
___________________________________________________________

    ---------
    productid
    ---------
    1
    11
    5
    6
    7
    9


===========================================================================================
STEP 3 - RESULTS: Table pro after deletion of all duplicates by the two conditions
===========================================================================================

From the log after running the DELETE query:
"No errors, 5 rows affected, taking 6.5 ms"

NOTA BENE: NOTICE THAT THERE ARE STILL TWO RECORDS WITH THE manPartid = "ABC1",
BECAUSE THEY BOTH HAD THE SAME MAXIMUM LENGTH OF THE specification COLUMN !!!

    --------------------------------------------------------
    productid   manPartId   specification
    --------------------------------------------------------
    1           ABC1        5MP camera, 2500 MaH, Steel body
    11          ABC1        12345678901234567890123456789012
    5           ABC2        5MP camera, plastic body
    9           ABC3        15MP, 6500 MaH, Blue body
    6           ABC4        5MP camera, 2500 MaH, Steel body
    7           ABC5        15MP camera, 4500 MaH

I hope it all works.

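EDIT 1:

Delete all records except the one with the largest `productid` for each manPartid:

Step 1: This step must be performed. Otherwise you will get a wrong list of records to delete!

Convert the productid column from VARCHAR to: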

`productid` bigint(20) unsigned NOT NULL
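
Why the conversion matters: on a text column, MAX() compares character by character, so "4" sorts above "11". The sketch below shows the effect with the three ABC1 product IDs from the test data; it uses an in-memory SQLite database via Python's sqlite3 module as a stand-in for MySQL, and the table names as_text and as_int are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical tables: the same IDs stored once as text, once as integers.
conn.execute("CREATE TABLE as_text (productid TEXT)")
conn.execute("CREATE TABLE as_int (productid INTEGER)")

for pid in (1, 4, 11):  # productid values of the ABC1 rows
    conn.execute("INSERT INTO as_text VALUES (?)", (str(pid),))
    conn.execute("INSERT INTO as_int VALUES (?)", (pid,))

# Text comparison: '4' > '11' because '4' > '1' at the first character.
max_text = conn.execute("SELECT MAX(productid) FROM as_text").fetchone()[0]
# Numeric comparison: 11 is the largest value.
max_int = conn.execute("SELECT MAX(productid) FROM as_int").fetchone()[0]

print(repr(max_text), max_int)
```

With the text column, the "largest" ABC1 row would be productid 4 rather than 11, so the wrong rows would be deleted.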

Step 2: Run the following DELETE query.

DELETE FROM pro
WHERE pro.productid NOT IN (
    SELECT max(b.productid) AS maxPartid
    FROM (SELECT * FROM pro AS a) AS b
    GROUP BY b.manPartid
);
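
A sketch of this variant on the eleven test rows, with productid already stored as an integer. As before, the in-memory SQLite database is an assumption standing in for MySQL; SQLite does not need the extra derived table that the MySQL query wraps around pro, so the subquery reads pro directly.

```python
import sqlite3

ROWS = [
    (1, "ABC1", "5MP camera, 2500 MaH, Steel body"),
    (2, "ABC2", "2MP camera, Steel body"),
    (3, "ABC3", "5MP, 6500 MaH, Red"),
    (4, "ABC1", "2500 MaH, Steel body"),
    (5, "ABC2", "5MP camera, plastic body"),
    (6, "ABC4", "5MP camera, 2500 MaH, Steel body"),
    (7, "ABC5", "15MP camera, 4500 MaH"),
    (8, "ABC2", "5MP camera"),
    (9, "ABC3", "15MP, 6500 MaH, Blue body"),
    (10, "ABC5", "2500 MaH, Steel body"),
    (11, "ABC1", "12345678901234567890123456789012"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pro (productid INTEGER, manPartId TEXT, specification TEXT)"
)
conn.executemany("INSERT INTO pro VALUES (?, ?, ?)", ROWS)

# Keep only the row with the largest productid for each manPartId.
conn.execute("""
    DELETE FROM pro
    WHERE productid NOT IN (
        SELECT MAX(productid) FROM pro GROUP BY manPartId
    )
""")

kept = conn.execute(
    "SELECT productid, manPartId, specification FROM pro ORDER BY productid"
).fetchall()
for row in kept:
    print(row)
```

Note that this rule ignores the specification length: for ABC2 it keeps productid 8 ("5MP camera"), not the longer row 5.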