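Deleting duplicate values from a database

Time: 2017-08-16 02:09:34

Tags: mysql sql database

I have a MySQL table that is populated with price values every day. It records an entry each day even if the price has not changed. I want to delete some of the rows that repeat too often: for each stretch during which the price stays the same, I want to keep only the first row and the last row before the price changes.

Example 1)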

   id name     price date
    1 Product1 $6 13/07/2017
    2 Product1 $6 14/07/2017
    3 Product1 $6 15/07/2017
    4 Product1 $7 16/07/2017
    5 Product1 $6 17/07/2017
    6 Product1 $6 18/07/2017
    7 Product1 $6 19/07/2017

From this list, the records with IDs 2 and 6 should be deleted, which gives the following result:

   id name     price date
    1 Product1 $6 13/07/2017
    3 Product1 $6 15/07/2017
    4 Product1 $7 16/07/2017
    5 Product1 $6 17/07/2017
    7 Product1 $6 19/07/2017

Example 2)

   id name     price date
    1 Product1 $6 13/07/2017
    2 Product1 $6 14/07/2017
    3 Product1 $6 15/07/2017
    4 Product1 $6 16/07/2017
    5 Product1 $6 17/07/2017
    6 Product1 $6 18/07/2017
    7 Product1 $6 19/07/2017

There is no price change here, so I can delete all records from 2 to 6:

   id name     price date
    1 Product1 $6 13/07/2017
    7 Product1 $6 19/07/2017

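The IDs are not guaranteed to be incremental, and the dates do not necessarily advance one day at a time.

13 answers:

Answer 0 (score: 5)

You can do this with some creative self-join logic.

Think of three hypothetical rows in the table.

  • Row a, which you want to keep.
  • Row b, which has the same product name and price as row a and a date 1 day after a's date. You want to delete this one.
  • Row c, which has the same product name and price and a date 1 day after b's date. You want to keep this one.

So if you can write a self-join that matches these three rows, you can delete row b.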

DELETE b FROM MyTable AS a
JOIN MyTable AS b ON a.name=b.name AND a.price=b.price AND b.date=a.date + INTERVAL 1 DAY
JOIN MyTable AS c ON b.name=c.name AND b.price=c.price AND c.date=b.date + INTERVAL 1 DAY;

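This works even if several rows meet the conditions for row b. It will delete the first one and then go on to delete the subsequent rows that also match.

This works if you use the DATE data type and store the dates as 'YYYY-MM-DD' rather than 'DD-MM-YYYY'. You should do that anyway.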
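
If the dates currently exist only as text, they have to be converted before they can go into a DATE column. Here is a minimal sketch in Python, assuming the text uses the DD/MM/YYYY form shown in the question. The helper name `to_iso_date` is made up for this example.

```python
from datetime import datetime

def to_iso_date(text):
    """Convert a 'DD/MM/YYYY' string into a 'YYYY-MM-DD' string."""
    return datetime.strptime(text, "%d/%m/%Y").date().isoformat()

print(to_iso_date("13/07/2017"))  # 2017-07-13
```

Strings in the 'YYYY-MM-DD' form also sort in chronological order, which is what date comparisons rely on.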

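Answer 1 (score: 3)

You want to delete every row whose product name and price match those of the rows dated one day before it and one day after it.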

DELETE row_mid
FROM 
  record_table AS row_mid
  JOIN record_table AS row_prev
  JOIN record_table AS row_next
WHERE
  row_mid.name = row_prev.name 
  AND row_mid.price = row_prev.price
  AND row_mid.date = DATE_ADD(row_prev.date, INTERVAL 1 DAY)
  AND row_mid.name = row_next.name
  AND row_mid.price = row_next.price
  AND row_mid.date = DATE_SUB(row_next.date, INTERVAL 1 DAY);

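Answer 2 (score: 3)

Is your MySQL recent enough to support CTEs? This is a pretty interesting problem of the kind I have seen in date scheduling. The code always looks awkward. To check the result without deleting anything, swap the comment markers on the SELECT and DELETE lines, and comment out the `t.[Name] is null` line.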

WITH

cte AS  (
        SELECT a.ID
            , a.[Name]
            , a.[Date]
            , a.Price
            , NextDate = min(npc.[Date])    -- Next price change
            , PrevDate = max(lpc.[Date])    -- Previous price change
        FROM    yourTable as a  -- Base Table
            LEFT JOIN
                yourTable as npc    -- Looking for Next Price Change
            ON a.[Name] = npc.[Name]
                and a.[Date] < npc.[Date]
                and a.Price <> npc.Price
            LEFT JOIN
                yourTable as lpc    -- Looking for Last Price Change
            ON a.[Name] = lpc.[Name]
                and a.[Date] > lpc.[Date]
                and a.Price <> lpc.Price
        GROUP BY a.ID, a.[Name], a.[Date], a.Price
    ) 

----SELECT f.*, [Check] = CASE WHEN t.[Name] is null THEN 'DELETE' ELSE '' END
DELETE f
FROM 
        yourTable as f
    LEFT JOIN
        (
            SELECT [Name], [GoodDate] = Max([Date])
            FROM cte
            GROUP BY [Name], PrevDate
            UNION
            SELECT [Name], [GoodDate] = Min([Date])
            FROM cte
            GROUP BY [Name], PrevDate
            UNION
            SELECT [Name], [GoodDate] = Max([Date])
            FROM cte
            GROUP BY [Name], NextDate
            UNION
            SELECT [Name], [GoodDate] = Min([Date])
            FROM cte
            GROUP BY [Name], NextDate
        ) as t
    ON t.[Name] = f.[Name] and t.[GoodDate] = f.[Date]
WHERE t.[Name] is null
--ORDER BY f.[Name], f.[Date]
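
The grouping idea behind the CTE can also be tried outside the database. The following is a rough Python illustration, not a translation of the query above. For each row it finds the most recent earlier date on which the product had a different price (the role PrevDate plays), groups rows by product name and that date, and keeps only the earliest and latest row of each group. The function name `ids_to_delete_by_prev_change` and the tuple layout are assumptions made for the example, and each name/date pair is assumed to be unique.

```python
from collections import defaultdict

def ids_to_delete_by_prev_change(rows):
    """rows: iterable of (id, name, price, iso_date) tuples."""
    rows = list(rows)
    groups = defaultdict(list)
    for row_id, name, price, date in rows:
        # Latest earlier date on which this product had a different price.
        earlier_changes = [d for _, n, p, d in rows
                           if n == name and d < date and p != price]
        prev_change = max(earlier_changes, default=None)
        groups[(name, prev_change)].append((date, row_id))
    doomed = []
    for members in groups.values():
        members.sort()
        # Keep the first and last row of each group, drop the rest.
        doomed.extend(row_id for _, row_id in members[1:-1])
    return sorted(doomed)

example_1 = [
    (1, "Product1", 6, "2017-07-13"), (2, "Product1", 6, "2017-07-14"),
    (3, "Product1", 6, "2017-07-15"), (4, "Product1", 7, "2017-07-16"),
    (5, "Product1", 6, "2017-07-17"), (6, "Product1", 6, "2017-07-18"),
    (7, "Product1", 6, "2017-07-19"),
]
print(ids_to_delete_by_prev_change(example_1))  # [2, 6]
```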

Answer 3 (score: 3)

You can detect the previous id and the next id of each row, and then select the rows to delete:

SELECT * 
FROM 
  (SELECT 
      *,
      (SELECT next_id.id 
       FROM a next_id 
       WHERE next_id.id > current.id 
       ORDER BY next_id.id ASC LIMIT 1) as next_id,
      (SELECT prev_id.id 
       FROM a prev_id 
       WHERE prev_id.id < current.id 
       ORDER BY prev_id.id DESC LIMIT 1) as prev_id 
   FROM a current) t
WHERE 
   EXISTS (SELECT 1 
           FROM a next 
           WHERE next.name = t.name AND t.price = next.price AND next.id=t.next_id) 
   AND
   EXISTS (SELECT 1 
           FROM a prev 
           WHERE prev.name = t.name AND t.price = prev.price AND prev.id=t.prev_id)

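I tested these queries on both examples. Demo

Update. If the id column is not unique, the logic has to be changed from prev id + next id to prev date + next date. The general concept stays the same either way. The query will look like this: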

SELECT * 
FROM 
  (SELECT 
      *,
      (SELECT next_date.date 
       FROM a next_date 
       WHERE next_date.date > current.date AND next_date.name = current.name
       ORDER BY next_date.date ASC LIMIT 1) as next_date,
      (SELECT prev_date.date
       FROM a prev_date 
       WHERE prev_date.date < current.date AND prev_date.name = current.name
       ORDER BY prev_date.date DESC LIMIT 1) as prev_date
   FROM a current) t
WHERE 
   EXISTS (SELECT 1 
           FROM a next 
           WHERE next.name = t.name AND t.price = next.price AND next.date=t.next_date) 
   AND
   EXISTS (SELECT 1 
           FROM a prev 
           WHERE prev.name = t.name AND t.price = prev.price AND prev.date=t.prev_date)

Demo (second query)
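
For readers who want to check the prev date + next date logic without a database, here is a small Python sketch of the same rule: a row is selected when the closest earlier row and the closest later row of the same product both carry the same price. The function name `ids_to_delete_by_neighbors` and the tuple layout are assumptions for the example, and dates are assumed to be unique per product.

```python
def ids_to_delete_by_neighbors(rows):
    """rows: iterable of (id, name, price, iso_date) tuples."""
    rows = list(rows)
    doomed = []
    for row_id, name, price, date in rows:
        same_name = [r for r in rows if r[1] == name]
        earlier = [r for r in same_name if r[3] < date]
        later = [r for r in same_name if r[3] > date]
        if not earlier or not later:
            continue  # the first and last row of a product are always kept
        prev_row = max(earlier, key=lambda r: r[3])  # closest earlier date
        next_row = min(later, key=lambda r: r[3])    # closest later date
        if prev_row[2] == price and next_row[2] == price:
            doomed.append(row_id)
    return sorted(doomed)

# Ids are out of order and dates have gaps, as the question allows.
sample = [
    (10, "Product1", 6, "2017-07-13"),
    (4, "Product1", 6, "2017-07-20"),
    (7, "Product1", 6, "2017-07-29"),
]
print(ids_to_delete_by_neighbors(sample))  # [4]
```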

Answer 4 (score: 2)

All of your data is repeated, and you want to keep one copy? Your explanation is confusing.

You can keep the oldest row for each price and delete the others.

Answer 5 (score: 2)

I can't write the exact code for your scenario, but you can write a function or procedure that follows this pseudocode:

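// rows: all rows, ordered by name and then by date
rows = allrows
tobeDeleted = []
i = 0
while (i < rows.length){
    // find j, the last row of the run that shares rows[i]'s name and price
    j = i
    while ((j + 1 < rows.length)
           AND (rows[j+1]->name == rows[i]->name)
           AND (rows[j+1]->price == rows[i]->price)){
        j++
    }
    // keep rows[i] (first) and rows[j] (last), delete everything in between
    for (k = i + 1; k < j; k++){
        tobeDeleted.push(rows[k]->id)
    }
    i = j + 1
}

// tobeDeleted contains the ids of the rows to be deleted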
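
A runnable version of the same idea, written in Python only so it can be tested without a database. It is a sketch under assumptions: the rows arrive as `(id, name, price, iso_date)` tuples, and the function name `ids_to_delete_by_runs` is invented for this example.

```python
def ids_to_delete_by_runs(rows):
    """rows: list of (id, name, price, iso_date) tuples in any order."""
    ordered = sorted(rows, key=lambda r: (r[1], r[3]))  # by name, then date
    doomed = []
    i = 0
    while i < len(ordered):
        j = i
        # Move j to the last row of the run sharing name and price with row i.
        while (j + 1 < len(ordered)
               and ordered[j + 1][1] == ordered[i][1]
               and ordered[j + 1][2] == ordered[i][2]):
            j += 1
        # Rows strictly between the first (i) and the last (j) are redundant.
        doomed.extend(r[0] for r in ordered[i + 1:j])
        i = j + 1
    return doomed

example_1 = [
    (1, "Product1", 6, "2017-07-13"), (2, "Product1", 6, "2017-07-14"),
    (3, "Product1", 6, "2017-07-15"), (4, "Product1", 7, "2017-07-16"),
    (5, "Product1", 6, "2017-07-17"), (6, "Product1", 6, "2017-07-18"),
    (7, "Product1", 6, "2017-07-19"),
]
print(ids_to_delete_by_runs(example_1))  # [2, 6]
```

The returned ids could then be passed to a DELETE ... WHERE id IN (...) statement.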

Answer 6 (score: 2)

Try the following query; I hope it helps.

(I don't have MySQL, so I tried to convert the syntax to MySQL. I apologize if there are any syntax errors.)

(I tested it on SQL Server with random dates and different products; it works well and gives the result you want.)

/* Get the data grouped by name, with a new field continousDate that assigns consecutive dates to each product's rows in date order,
then save it to a temporary table called tempWithContinousDate */

CREATE TEMPORARY TABLE tempWithContinousDate (id INT,name varchar(50),price DECIMAL(12,2),date DATE,continousDate DATE);

insert into tempWithContinousDate(id,name,price,date,continousDate)
select id,name,price,date,Date_Add(minimumDate,INTERVAL rn DAY)ContinousDate
from(
select t1.id,t1.name,t1.price,t1.date,min(t2.Date)minimumDate,count(*) rn
          from 
             (select id,name,price,date from yourTable) t1
          inner join 
            (select id,name,price,date from yourTable) t2 
          on t1.name=t2.name and t1.date>=t2.date
 group by t1.id,t1.name,t1.price,t1.date
 ) t




/* Get the data grouped by name and price, with a new field groupDate that is shared by each run of consecutive dates,
then save it to a temporary table called tempData */
CREATE TEMPORARY Table tempData (id INT,name varchar(50),price DECIMAL(12,2),date DATE,groupDate DATE)

insert into tempData(id,name,price,date,groupDate)
select id,name,price,date,DATE_SUB(continousDate, INTERVAL rowNumber DAY) groupDate
from(
select t1.id,t1.name,t1.price,t1.date,t1.continousDate,count(*) rowNumber
          from 
             (select id,name,price,date,continousDate from tempWithContinousDate) t1
          inner join 
            (select id,name,price,date,continousDate from tempWithContinousDate) t2 
          on t1.name=t2.name and t1.price=t2.price and t1.date>=t2.date
 group by t1.id,t1.name,t1.price,t1.date,t1.continousDate
 ) t



 /*select * from yourTable where id  in*/
 delete from yourTable where id not in
(select id from 
 (

/* query to number the rows of every run of consecutive data in ascending date order */
select firstData.id,firstData.name,firstData.price,firstData.date,count(*) rn 
from  tempData firstData
left join  tempData secondData
on firstData.name=secondData.name and firstData.price=secondData.price and firstData.groupDate=secondData.groupDate
and firstData.date>=secondData.date
group by firstData.id,firstData.name,firstData.price,firstData.date


/* query to order every continous data  Descending using the date field */
union all
select firstData.id,firstData.name,firstData.price,firstData.date,count(*) rn 
from  tempData firstData
left join  tempData secondData
on firstData.name=secondData.name and firstData.price=secondData.price and firstData.groupDate=secondData.groupDate
and firstData.date<=secondData.date
group by firstData.id,firstData.name,firstData.price,firstData.date

 )allData where rn=1  

)       

Answer 7 (score: 1)

You can use the code below. Let me know whether it works.

DELETE FROM record_table
WHERE id NOT IN (
    (SELECT MIN(id) FROM record_table GROUP BY name, price),
    (SELECT MAX(id) FROM record_table GROUP BY name, price)
)

Answer 8 (score: 1)

You can use an EXISTS construct to solve your problem:

DELETE FROM test t1
WHERE EXISTS (
    SELECT *
    FROM test t2
    WHERE t1.name = t2.name
      AND t1.price = t2.price
      AND t1.day = DATE_SUB(t2.DAY, INTERVAL 1 DAY)
)
AND EXISTS (
    SELECT *
    FROM test t3
    WHERE t1.name = t3.name
      AND t1.price = t3.price
      AND t1.day = DATE_ADD(t3.DAY, INTERVAL 1 DAY)
)

sqlfiddle demo

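Answer 9 (score: 1)

You can use the following logic:

  1. Rank by price
  2. Group by id, name, price
  3. Get the minimum date
  4. Get the maximum date
  5. See the query and fiddle example below: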

    SET @prev_value = NULL;
    SET @rank_count = 0;
    
    select distinct
      `name`,
      `price`,
      `date`
    from 
    (
      (
      select 
        id,
        name,
        price,
        CASE
          WHEN @prev_value = price THEN @rank_count
          WHEN @prev_value := price THEN @rank_count := @rank_count + 1
        END AS rank,
        min(`date`) as `date`
      from 
        `prices`
       group by 
         `name`, 
         `price`, 
         `rank`
       )
       union distinct
       (
       select 
        id,
        name,
        price,
        CASE
          WHEN @prev_value = price THEN @rank_count
          WHEN @prev_value := price THEN @rank_count := @rank_count + 1
        END AS rank,
        max(`date`) as `date`
      from 
        `prices`
       group by 
         `name`, 
         `price`, 
         `rank`
      )
      order by `id`, `date`
    ) as `result`
    

    sqlfiddle

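Answer 10 (score: 1)

We have to ask ourselves: when may a record be deleted?

Answer: a record may be deleted

  • if there is another record with the same name, the same price, and an earlier date, and no record with the same name and a different price falls between the two dates,

  • and if there is another record with the same name, the same price, and a later date, and no record with the same name and a different price falls between the two dates.

Putting the two requirements into SQL gives the following: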

DELETE FROM PriceTable t
WHERE 
  EXISTS ( SELECT *
           FROM PriceTable tmp1 
           WHERE t.name  = tmp1.name  AND 
                 t.price = tmp1.price AND 
                 t.date  > tmp1.date  AND
                 NOT EXISTS (SELECT * 
                             FROM PriceTable tmp2
                             WHERE t.name    = tmp2.name  AND 
                                   t.price  != tmp2.price AND 
                                   t.date    > tmp2.date  AND 
                                   tmp1.date < tmp2.date 
                            )
         )
  AND
  EXISTS ( SELECT *
           FROM PriceTable tmp1 
           WHERE t.name  = tmp1.name  AND 
                 t.price = tmp1.price AND 
                 t.date  < tmp1.date  AND
                 NOT EXISTS (SELECT * 
                             FROM PriceTable tmp2
                             WHERE t.name    = tmp2.name  AND 
                                   t.price  != tmp2.price AND 
                                   t.date    < tmp2.date  AND 
                                   tmp1.date > tmp2.date 
                            ) 
         );
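
To sanity-check the two requirements without touching the table, the same conditions can be written as a brute-force Python sketch. This illustrates the rule and is not the SQL itself. The names `ids_to_delete_by_exists` and `price_changed_between` are made up for the example, and rows are assumed to be `(id, name, price, iso_date)` tuples.

```python
def ids_to_delete_by_exists(rows):
    """rows: iterable of (id, name, price, iso_date) tuples."""
    rows = list(rows)

    def price_changed_between(name, price, start, end):
        # True if this product has a row with another price strictly between.
        return any(n == name and p != price and start < d < end
                   for _, n, p, d in rows)

    doomed = []
    for row_id, name, price, date in rows:
        has_earlier_twin = any(
            n == name and p == price and d < date
            and not price_changed_between(name, price, d, date)
            for _, n, p, d in rows)
        has_later_twin = any(
            n == name and p == price and d > date
            and not price_changed_between(name, price, date, d)
            for _, n, p, d in rows)
        if has_earlier_twin and has_later_twin:
            doomed.append(row_id)
    return sorted(doomed)

example_1 = [
    (1, "Product1", 6, "2017-07-13"), (2, "Product1", 6, "2017-07-14"),
    (3, "Product1", 6, "2017-07-15"), (4, "Product1", 7, "2017-07-16"),
    (5, "Product1", 6, "2017-07-17"), (6, "Product1", 6, "2017-07-18"),
    (7, "Product1", 6, "2017-07-19"),
]
print(ids_to_delete_by_exists(example_1))  # [2, 6]
```

Because the check compares dates only by order, it does not depend on rows being exactly one day apart.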

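Answer 11 (score: 1)

Edit: on further consideration, it seems this problem cannot be solved reliably with the user-defined-variable trick (other solutions that use it, take note). While I think the solution below will "most likely work 99% of the time", MySQL does not guarantee the order in which variables are evaluated: link 1, link 2

Original answer:

(My working assumption is that products.name is defined as NOT NULL and that products.id and products.price are never negative [a simple patch can be provided if you have to deal with negative values, too].)

Query: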

SET
    @one_prior_id := NULL,
    @one_prior_price := NULL,
    @one_prior_name := NULL,
    @two_prior_id := NULL,
    @two_prior_price := NULL,
    @two_prior_name := NULL
;

SELECT @two_prior_id AS id_to_delete
FROM (
    SELECT *
    FROM products
    ORDER BY name, date
) AS t
WHERE IF(
    (
        (name  = @one_prior_name)
        AND
        (name  = @two_prior_name)
        AND
        (price = @one_prior_price)
        AND
        (price = @two_prior_price)
    ), (
        GREATEST(
            1,
            IFNULL(@two_prior_id := @one_prior_id, 0),
            IFNULL(@two_prior_price := @one_prior_price, 0),
            LENGTH(IFNULL(@two_prior_name := @one_prior_name, 0)),
            IFNULL(@one_prior_id := id, 0),
            IFNULL(@one_prior_price := price, 0),
            LENGTH(IFNULL(@one_prior_name := name, 0))
        )
    ), (
        LEAST(
            0,
            IFNULL(@two_prior_id := @one_prior_id, 0),
            IFNULL(@two_prior_price := @one_prior_price, 0),
            LENGTH(IFNULL(@two_prior_name := @one_prior_name, 0)),
            IFNULL(@one_prior_id := id, 0),
            IFNULL(@one_prior_price := price, 0),
            LENGTH(IFNULL(@one_prior_name := name, 0))
        )
    )
)

Result of the query for your "Example 1":

+--------------+
| id_to_delete |
+--------------+
|            2 |
|            6 |
+--------------+

Result of the query for your "Example 2":

+--------------+
| id_to_delete |
+--------------+
|            2 |
|            3 |
|            4 |
|            5 |
|            6 |
+--------------+

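How the query works:

  • Do a simple "partitioning" of the products table via ORDER BY.

  • Loop over the ordered result set while tracking 2 sets of variables: the first set holds the price and name of the "one prior" row (the "one prior" row is directly above the current row), and the second set holds those of the "two prior" row (the "two prior" row is directly above the "one prior" row).

  • GREATEST and LEAST are identical, except that the former returns a value that IF evaluates as true and the latter a value that IF evaluates as false. The real point of these functions is to update the loop variables.

  • For details on updating variables inside a subquery, see this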
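
The loop described above can be written out as ordinary code, where the order of evaluation is explicit. This is a rough Python sketch of the same bookkeeping, not the SQL. The name `ids_to_delete_streaming` and the `(id, name, price, iso_date)` tuple layout are assumptions for the example.

```python
def ids_to_delete_streaming(rows):
    """rows: iterable of (id, name, price, iso_date) tuples."""
    ordered = sorted(rows, key=lambda r: (r[1], r[3]))  # ORDER BY name, date
    one_prior = None  # the row directly above the current row
    two_prior = None  # the row directly above one_prior
    doomed = []
    for current in ordered:
        if (one_prior is not None and two_prior is not None
                and current[1] == one_prior[1] == two_prior[1]    # same name
                and current[2] == one_prior[2] == two_prior[2]):  # same price
            # one_prior sits between two matching rows, so it is redundant.
            doomed.append(one_prior[0])
        # Shift the window down by one row.
        two_prior, one_prior = one_prior, current
    return doomed

example_1 = [
    (1, "Product1", 6, "2017-07-13"), (2, "Product1", 6, "2017-07-14"),
    (3, "Product1", 6, "2017-07-15"), (4, "Product1", 7, "2017-07-16"),
    (5, "Product1", 6, "2017-07-17"), (6, "Product1", 6, "2017-07-18"),
    (7, "Product1", 6, "2017-07-19"),
]
print(ids_to_delete_streaming(example_1))  # [2, 6]
```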

The actual DELETE:

SET
    @one_prior_id := NULL,
    @one_prior_price := NULL,
    @one_prior_name := NULL,
    @two_prior_id := NULL,
    @two_prior_price := NULL,
    @two_prior_name := NULL
;

DELETE FROM products WHERE id IN (
    SELECT * FROM (
        SELECT @two_prior_id AS id_to_delete
        FROM (
            SELECT *
            FROM products
            ORDER BY name, date
        ) AS t1
        WHERE IF(
            (
                (name  = @one_prior_name)
                AND
                (name  = @two_prior_name)
                AND
                (price = @one_prior_price)
                AND
                (price = @two_prior_price)
            ), (
                GREATEST(
                    1,
                    IFNULL(@two_prior_id := @one_prior_id, 0),
                    IFNULL(@two_prior_price := @one_prior_price, 0),
                    LENGTH(IFNULL(@two_prior_name := @one_prior_name, 0)),
                    IFNULL(@one_prior_id := id, 0),
                    IFNULL(@one_prior_price := price, 0),
                    LENGTH(IFNULL(@one_prior_name := name, 0))
                )
            ), (
                LEAST(
                    0,
                    IFNULL(@two_prior_id := @one_prior_id, 0),
                    IFNULL(@two_prior_price := @one_prior_price, 0),
                    LENGTH(IFNULL(@two_prior_name := @one_prior_name, 0)),
                    IFNULL(@one_prior_id := id, 0),
                    IFNULL(@one_prior_price := price, 0),
                    LENGTH(IFNULL(@one_prior_name := name, 0))
                )
            )
        )
    ) AS t2
)

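Important note

See how the DELETE query above does 2 inner SELECTs? Make sure you include this, or you will inadvertently delete the last row! Try running it without the SELECT (...) AS t2 and you will see what I mean.

Answer 12 (score: 1)

This is the second answer I have submitted for this question, but I think I have finally got it this time: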

DELETE FROM products WHERE id IN (
    SELECT id_to_delete
    FROM (
        SELECT
            t0.id AS id_to_delete,
            t0.price,
            (
                SELECT t1.price
                FROM products AS t1
                WHERE (t0.date < t1.date)
                    AND (t0.name = t1.name)
                ORDER BY t1.date ASC
                LIMIT 1
            ) AS next_price,
            (
                SELECT t2.price
                FROM products AS t2
                WHERE (t0.date > t2.date)
                    AND (t0.name = t2.name)
                ORDER BY t2.date DESC
                LIMIT 1
            ) AS prev_price
        FROM products AS t0
        HAVING (price = next_price) AND (price = prev_price)
    ) AS t
)

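This is a modified version of @vadim_hr's answer.

Edit: below is a different query that filters with JOINs instead of subqueries. On large data sets the JOINs may be faster than the previous query (above), but I will leave the performance testing to you.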

http://sqlfiddle.com/#!9/ee0655/8

SELECT M.id as id_to_delete
FROM
(
    SELECT
        *,
        (@j := @j + 1) AS j
    FROM
    (SELECT * FROM products ORDER BY name ASC, date ASC) AS mmm
    JOIN
    (SELECT @j := 1) AS mm
) AS M     -- the middle table
JOIN
(
    SELECT
        *,
        (@i := @i + 1) AS i
    FROM
    (SELECT * FROM products ORDER BY name ASC, date ASC) AS lll
    JOIN
    (SELECT @i := 0) AS ll
) AS L     -- the left table
ON M.j = L.i
    AND M.name = L.name
    AND M.price = L.price
JOIN
(
    SELECT
        *,
        (@k := @k + 1) AS k
    FROM
    (SELECT * FROM products ORDER BY name ASC, date ASC) AS rrr
    JOIN
    (SELECT @k := 2) AS rr
) AS R     -- the right table
ON M.j = R.k
    AND M.name = R.name
    AND M.price = R.price

Both queries accomplish the same end, and both assume that each name/date pair is unique across rows (as discussed in the comments below).