Question

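I am trying to determine where I lost data by comparing two sets of data.

The first set consists of a truncated, non-unique barcode and a timestamp accurate to the second, which I have discovered is also not unique. It is stored in a table called restoredData, because the table was built from the text backups that are created nightly.

The second set is actually two tables, items and items_archive. They also have non-unique short barcodes and non-unique timestamps.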

restoredData has 2,437,910 records, one record per item. items has 405,009 and items_archive has 1,589,768, for a total of 1,994,777 rows. So restoredData holds at least 443,133 more records than the union of items and items_archive.

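However, whenever I LEFT JOIN restoredData to the union of items and items_archive, I get 2,437,910 matches, and when I search for rows where the joined side is NULL, i.e. where there is no matching record in items + items_archive, I get a count of 0. I have tried joining on barcode, on timestamp, and on both at once, all with the same result.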

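This must be due to the non-uniqueness of every key I have available. However, if I could allow each row of (SELECT t_stamp, barcode FROM items UNION ALL SELECT t_stamp, barcode FROM items_archive) as allItems to be used for a join only once, i.e. make it unable to match more than one row in restoredData, then I think it would give me the information I am actually looking for: which records were recorded via the text backups but are missing from the items and items_archive tables.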
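The effect described above can be reproduced on a toy scale. The sketch below is a Python script that uses an in-memory SQLite database as a stand-in for MySQL; the table names `restored` and `all_items` and the sample values are made up for illustration. Four identical rows on the left all join to the single identical row on the right, so a search for NULLs finds nothing even though three of them have no partner of their own.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE restored  (barcode TEXT, t_stamp TEXT);
    CREATE TABLE all_items (barcode TEXT, t_stamp TEXT);
""")

# Hypothetical data: the same key four times on the left, once on the right.
key = ("barcode1", "2015-01-01 00:00:00")
conn.executemany("INSERT INTO restored VALUES (?, ?)", [key] * 4)
conn.execute("INSERT INTO all_items VALUES (?, ?)", key)

joined = conn.execute("""
    SELECT r.barcode, r.t_stamp, a.barcode
    FROM restored r
    LEFT JOIN all_items a
      ON a.barcode = r.barcode AND a.t_stamp = r.t_stamp
""").fetchall()

# Every left row finds the one right row, so no NULLs show up.
matched = sum(1 for row in joined if row[2] is not None)
unmatched = sum(1 for row in joined if row[2] is None)
print(matched, unmatched)
```

The single right-hand row is reused for every left-hand row, which is why the NULL count stays at zero.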

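Is this possible in SQL? Or will I have to do it programmatically in Python, iterating through restoredData row by row, looking for a match and, if there is one, removing it so that it cannot be used again?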
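The row-by-row idea can be sketched in Python without deleting anything from the database. This is a minimal sketch, assuming both data sets have already been fetched as lists of (barcode, t_stamp) tuples; the function name `find_unmatched` and the sample rows are hypothetical. A `Counter` tracks how many unused copies of each key remain, which plays the role of "remove it so it cannot be used again".

```python
from collections import Counter

def find_unmatched(restored_rows, item_rows):
    """Return the restored rows left over once every item row has been used at most once."""
    available = Counter(item_rows)   # unused copies remaining per (barcode, t_stamp)
    unmatched = []
    for row in restored_rows:
        if available[row] > 0:
            available[row] -= 1      # consume one copy so it cannot match again
        else:
            unmatched.append(row)    # no unused copy left for this row
    return unmatched

# Hypothetical rows as (barcode, t_stamp) tuples.
restored = [("barcode1", "ts1")] * 4 + [("barcode2", "ts2")]
items = [("barcode1", "ts1"), ("barcode2", "ts2"), ("NO_READ", "ts3")]

leftover = find_unmatched(restored, items)
print(leftover)
```

Each lookup is a dictionary access, so the pass over restoredData is linear in the number of rows rather than a search per row.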

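Another thing: I know the matching is not correct, because my items and items_archive tables contain a special barcode "NO_READ", written when there is an error reading the barcode, but that value is never present in restoredData.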

I am using MySQL 5.6.

For reference:

 restoredData table, 2,437,910 records
 barCode (Varchar(13), non-unique), t_stamp (Datetime, non-unique)

 items and items_archive tables, 1,994,777 records total
 barCode (Varchar(13), non-unique), t_stamp (Datetime, non-unique)

As an example, I may have barcode1, timestamp1 present 4 times in restoredData and only once in items + items_archive, which results in

 restoredData                 items+items_archive
 barcodeCol  t_stampCol       barcode2Col  t_stamp2Col
 barcode1    timestamp1       barcode1     timestamp1
 barcode1    timestamp1       barcode1     timestamp1
 barcode1    timestamp1       barcode1     timestamp1
 barcode1    timestamp1       barcode1     timestamp1

What I want is this

 restoredData                 items+items_archive
 barcodeCol  t_stampCol       barcode2Col  t_stamp2Col
 barcode1    timestamp1       barcode1     timestamp1
 barcode1    timestamp1       NULL         NULL
 barcode1    timestamp1       NULL         NULL
 barcode1    timestamp1       NULL         NULL

Answer 1

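The only way I can think of is to create some temporary tables with an index column, then use that index to build a rank, so that you can use it as a unique column between the two data sets: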

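-- An AUTO_INCREMENT column must be indexed in MySQL, hence the PRIMARY KEY.
CREATE TEMPORARY TABLE items_full (t_stamp datetime, barcode varchar(13), idx int NOT NULL AUTO_INCREMENT PRIMARY KEY);

CREATE TEMPORARY TABLE restored_data (t_stamp datetime, barcode varchar(13), idx int NOT NULL AUTO_INCREMENT PRIMARY KEY);

-- Name the target columns so that idx is filled in automatically.
INSERT INTO items_full (t_stamp, barcode)
SELECT t_stamp, barcode FROM items
UNION ALL
SELECT t_stamp, barcode FROM items_archive;

INSERT INTO restored_data (t_stamp, barcode)
SELECT t_stamp, barcode FROM restoreddata;

-- Rank the duplicates within each (barcode, t_stamp) group on both sides and
-- join on the rank as well, so that each row can pair up at most once.
-- Note: window functions such as DENSE_RANK() require MySQL 8.0 or later.
SELECT aa.t_stamp, aa.barcode, aa.myrank
FROM (SELECT t_stamp, barcode, DENSE_RANK() OVER (PARTITION BY barcode, t_stamp ORDER BY idx) AS myrank FROM restored_data) aa

LEFT JOIN

(SELECT t_stamp, barcode, DENSE_RANK() OVER (PARTITION BY barcode, t_stamp ORDER BY idx) AS myrank FROM items_full) bb

ON bb.t_stamp = aa.t_stamp AND bb.barcode = aa.barcode AND bb.myrank = aa.myrank

-- Keep the restored_data rows for which no unused items_full row is left.
WHERE bb.t_stamp IS NULL;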
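Because window functions are only available from MySQL 8.0, the same numbering has to be produced another way on 5.6. The sketch below shows one option, a correlated COUNT(*) that counts the rows of the same group with an idx at or below the current one. It runs against an in-memory SQLite database from Python purely so that it can be executed as written; the sample rows are hypothetical, and the SQL has not been run against MySQL here. One assumption to check before porting: MySQL documents a restriction on referring to a TEMPORARY table more than once in a single query, so ordinary scratch tables may be needed instead.

```python
import sqlite3

# In-memory SQLite stand-in; idx is auto-assigned in insertion order.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE restored_data (idx INTEGER PRIMARY KEY, barcode TEXT, t_stamp TEXT);
    CREATE TABLE items_full    (idx INTEGER PRIMARY KEY, barcode TEXT, t_stamp TEXT);
""")
conn.executemany("INSERT INTO restored_data (barcode, t_stamp) VALUES (?, ?)",
                 [("barcode1", "ts1")] * 4 + [("barcode2", "ts2")])
conn.executemany("INSERT INTO items_full (barcode, t_stamp) VALUES (?, ?)",
                 [("barcode1", "ts1"), ("barcode2", "ts2"), ("NO_READ", "ts3")])

# Number duplicates 1, 2, 3, ... within each (barcode, t_stamp) group
# without window functions: count the group rows with idx <= the current idx.
RANKED = """
    SELECT t.barcode AS barcode, t.t_stamp AS t_stamp,
           (SELECT COUNT(*) FROM {table} d
             WHERE d.barcode = t.barcode
               AND d.t_stamp = t.t_stamp
               AND d.idx <= t.idx) AS myrank
    FROM {table} t
"""

query = f"""
    SELECT aa.barcode, aa.t_stamp, aa.myrank
    FROM ({RANKED.format(table='restored_data')}) aa
    LEFT JOIN ({RANKED.format(table='items_full')}) bb
      ON bb.barcode = aa.barcode
     AND bb.t_stamp = aa.t_stamp
     AND bb.myrank = aa.myrank
    WHERE bb.barcode IS NULL
    ORDER BY aa.myrank
"""
missing = conn.execute(query).fetchall()
print(missing)
```

The correlated count rescans each group for every row, so on millions of rows an index on (barcode, t_stamp, idx) would likely be needed for it to finish in reasonable time.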

Answer 2

I would start by counting. Wherever the counts per barcode and timestamp do not match, the records concerned have to be examined.

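-- Count the rows per (barcode, t_stamp) on each side, then keep only the keys
-- whose counts differ or that are missing from the item tables altogether.
select
  r.barcode,
  r.t_stamp,
  r.cnt as recover_count,
  coalesce(i.cnt, 0) as itemtables_count
from
(
  select barcode, t_stamp, count(*) as cnt
  from restoreddata
  group by barcode, t_stamp
) r
left join
(
  select barcode, t_stamp, count(*) as cnt
  from
  (
    select barcode, t_stamp from items
    union all
    select barcode, t_stamp from items_archive
  ) all_items  -- renamed from "both", which is a reserved word in MySQL
  group by barcode, t_stamp
) i on  i.barcode = r.barcode
    and i.t_stamp = r.t_stamp
-- The count test belongs in WHERE: inside ON, the left join would still return every row of r.
where i.cnt is null
   or i.cnt <> r.cnt;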
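The same count comparison can be sketched outside the database. This is a minimal sketch, assuming the rows are available as (barcode, t_stamp) tuples; the function name `count_mismatches` and the sample data are hypothetical. Like the left join from restoreddata, it reports only keys that occur in the restored data, so a key that exists solely in the item tables (such as NO_READ) is not listed.

```python
from collections import Counter

def count_mismatches(restored_rows, item_rows):
    """Return (key, recover_count, itemtables_count) for every key whose counts differ."""
    restored_counts = Counter(restored_rows)
    item_counts = Counter(item_rows)
    result = []
    for key, recover_count in sorted(restored_counts.items()):
        itemtables_count = item_counts.get(key, 0)   # 0 when the key is absent
        if recover_count != itemtables_count:
            result.append((key, recover_count, itemtables_count))
    return result

# Hypothetical rows as (barcode, t_stamp) tuples.
restored = [("barcode1", "ts1")] * 4 + [("barcode2", "ts2")]
items = [("barcode1", "ts1"), ("barcode2", "ts2"), ("NO_READ", "ts3")]

mismatches = count_mismatches(restored, items)
print(mismatches)
```

The difference between the two counts for a key is the number of restored rows that have no partner in the item tables.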

Is there a way to restrict rows from the right table of a LEFT JOIN so that each can be used only once?
