Question

Suppose we have the following table:

orderId  productId   orderDate              amount    
1        2           2017-01-01 20:00:00    10 
1        2           2017-01-01 20:00:01    10 
1        3           2017-01-01 20:30:10    5 
1        4           2017-01-01 22:31:10    1

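The first two rows are duplicates (for example, the result of a software glitch), because orderId + productId must form a unique key.

I want to remove such duplicates. How can I do this in the most efficient way?

If the orderDate values did not differ by one second, we could use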

SELECT DISTINCT * FROM `table`

Because they do differ, we can use GROUP BY instead:

SELECT `orderId`,`productId`,MIN(`orderDate`),MIN(`amount`)
FROM `table`
GROUP BY `orderId`,`productId`

When there are many columns, I find the latter query very tedious to write. What other options are there?

Update: I am using Snowflake.

Answer 1

If your DBMS supports window functions, then you can number the rows within each orderId/productId group and keep only the first one.
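
The window-function idea could look roughly like the sketch below. It is not the original answer's code. It assumes the table is called orders (a made-up name), that the earliest orderDate is the row to keep, and it uses Python's built-in sqlite3 module as a stand-in for Snowflake so the query can be run locally.

```python
import sqlite3

# Rows copied from the question's table.
SAMPLE_ROWS = [
    (1, 2, "2017-01-01 20:00:00", 10),
    (1, 2, "2017-01-01 20:00:01", 10),
    (1, 3, "2017-01-01 20:30:10", 5),
    (1, 4, "2017-01-01 22:31:10", 1),
]

# Number the rows inside each (orderId, productId) group by orderDate,
# then keep only the first row of every group.
ROW_NUMBER_QUERY = """
    SELECT orderId, productId, orderDate, amount
    FROM (
        SELECT o.*,
               ROW_NUMBER() OVER (
                   PARTITION BY orderId, productId
                   ORDER BY orderDate
               ) AS rn
        FROM orders o
    ) ranked
    WHERE rn = 1
    ORDER BY orderId, productId
"""

def dedupe_with_row_number(rows):
    """Return one row per (orderId, productId), preferring the earliest orderDate."""
    conn = sqlite3.connect(":memory:")  # SQLite stands in for Snowflake here
    conn.execute(
        "CREATE TABLE orders "
        "(orderId INTEGER, productId INTEGER, orderDate TEXT, amount INTEGER)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(ROW_NUMBER_QUERY).fetchall()
    conn.close()
    return result

deduped = dedupe_with_row_number(SAMPLE_ROWS)
for row in deduped:
    print(row)
```

Only the PARTITION BY and ORDER BY columns have to be spelled out, so the query does not grow with the number of columns. On Snowflake the outer filter can probably be shortened with a QUALIFY clause on the same ROW_NUMBER() expression.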

Answer 2

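You can use NOT EXISTS to exclude every record for which a better match exists, that is, a record with the same orderid and productid but an earlier orderdate: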

select * from mytable
where not exists
(
  select *
  from mytable other
  where other.orderid   = mytable.orderid
    and other.productid = mytable.productid
    and other.orderdate < mytable.orderdate
);
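
To check the NOT EXISTS query against the question's data, here is a minimal sketch that loads the four sample rows into an in-memory SQLite database (an assumption made for local testing, not Snowflake itself). The outer table gets the alias t, which is my addition to keep the correlated references unambiguous.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for Snowflake here
conn.execute(
    "CREATE TABLE mytable "
    "(orderid INTEGER, productid INTEGER, orderdate TEXT, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?, ?)",
    [
        (1, 2, "2017-01-01 20:00:00", 10),
        (1, 2, "2017-01-01 20:00:01", 10),
        (1, 3, "2017-01-01 20:30:10", 5),
        (1, 4, "2017-01-01 22:31:10", 1),
    ],
)

# Keep a row only if no other row with the same key has an earlier orderdate.
NOT_EXISTS_QUERY = """
    SELECT * FROM mytable t
    WHERE NOT EXISTS (
        SELECT 1
        FROM mytable other
        WHERE other.orderid   = t.orderid
          AND other.productid = t.productid
          AND other.orderdate < t.orderdate
    )
    ORDER BY orderid, productid
"""
kept_rows = conn.execute(NOT_EXISTS_QUERY).fetchall()
conn.close()

for row in kept_rows:
    print(row)
```

The row stamped 20:00:01 is dropped because a row with the same key and an earlier timestamp exists; the other three rows are kept.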

Answer 3

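It looks as if you want, among the records that share the same orderid and productid, the one with the smallest orderdate. That can be expressed in SQL as follows: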

select * from mytable t where t.orderdate = 
  (select min(t2.orderdate)
   from mytable t2
   where t2.orderid = t.orderid 
     and t2.productid = t.productid);

Note that this query does not eliminate exact duplicates, i.e. rows that agree on orderid, productid and orderdate; but that was not actually asked for.
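
The caveat about exact duplicates can be seen with a small experiment. The data below is hypothetical (the first two rows are identical, including orderdate), and SQLite is again used as a local stand-in for Snowflake.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for Snowflake here
conn.execute(
    "CREATE TABLE mytable "
    "(orderid INTEGER, productid INTEGER, orderdate TEXT, amount INTEGER)"
)
# Hypothetical data: rows 1 and 2 are exact duplicates of each other.
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?, ?)",
    [
        (1, 2, "2017-01-01 20:00:00", 10),
        (1, 2, "2017-01-01 20:00:00", 10),
        (1, 3, "2017-01-01 20:30:10", 5),
    ],
)

MIN_DATE_FILTER = """
    FROM mytable t
    WHERE t.orderdate = (
        SELECT MIN(t2.orderdate)
        FROM mytable t2
        WHERE t2.orderid = t.orderid
          AND t2.productid = t.productid
    )
"""

# Both copies tie for the minimum orderdate, so both survive the filter.
min_date_rows = conn.execute("SELECT t.* " + MIN_DATE_FILTER).fetchall()

# DISTINCT collapses them, but only because every column is identical.
distinct_rows = conn.execute("SELECT DISTINCT t.* " + MIN_DATE_FILTER).fetchall()
conn.close()

print(len(min_date_rows), len(distinct_rows))
```

If the tied rows differed in some other column, such as amount, DISTINCT would not help, and a tie-breaking rule would be needed.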
