Question

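I use Pandas to preprocess CSV datasets and convert them into a SQLite database.

I have a many-to-many relationship between two entities A and B, represented by a junction DataFrame with A2B.columns == ['AId', 'BId']. The uniqueness constraint on A is that each A must have a different set of relations to B.

I want to efficiently remove duplicate As according to this constraint. This is how I do it in Pandas: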

AId_dedup = A2B.groupby('AId').BId.apply(tuple).drop_duplicates().index

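Converting to tuples makes it possible to compare the sets of BIds associated with each AId.

The relation A2B can be viewed as a (sparse boolean) matrix in which 1s indicate a link between an A and a B. I would like to drop the duplicate rows of this matrix, but alas, pd.unstack() cannot produce a sparse matrix. (It would also need efficient row hashing.)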

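My questions are:

What am I trying to do, in relational algebra terms?
Can it be done more efficiently with Pandas or with SQL (preferably the SQLite engine)?

I want to use this operation to find synonyms (duplicate objects) in biological networks, where interactions are represented as tables.

Edit: here is an example of what I want: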

+-----+-----+
| Aid | Bid |
+-----+-----+
|   1 |   1 |
|   1 |   2 |
|   1 |   3 |
|   2 |   1 |
|   2 |   2 |
|   2 |   3 |
|   3 |   1 |
|   3 |   2 |
|   3 |   3 |
|   3 |   4 |
+-----+-----+

A2B = A2B.groupby('AId').BId.apply(tuple)
+-----+-----------+
| Aid |    Bid    |
+-----+-----------+
|   1 | (1,2,3)   |
|   2 | (1,2,3)   |
|   3 | (1,2,3,4) |
+-----+-----------+

A2B = A2B.drop_duplicates()
+-----+-----------+
| Aid |    Bid    |
+-----+-----------+
|   1 | (1,2,3)   |
|   3 | (1,2,3,4) |
+-----+-----------+

Going back to the junction table (not so easy in Pandas):

+-----+-----+
| Aid | Bid |
+-----+-----+
|   1 |   1 |
|   1 |   2 |
|   1 |   3 |
|   3 |   1 |
|   3 |   2 |
|   3 |   3 |
|   3 |   4 |
+-----+-----+
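
The round trip above can be sketched in Pandas by keeping the surviving AIds and filtering the original junction table with isin. This is a minimal sketch on the example data, not a benchmarked solution; the names keys, kept_ids and A2B_dedup are hypothetical, and the BIds are sorted before building the tuple on the assumption that row order should not affect the comparison.

```python
import pandas as pd

# Example junction table from the question.
A2B = pd.DataFrame({
    "AId": [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    "BId": [1, 2, 3, 1, 2, 3, 1, 2, 3, 4],
})

# One hashable key per AId; sorting makes the key independent of row order.
keys = A2B.groupby("AId")["BId"].apply(lambda s: tuple(sorted(s)))

# Keep the first AId seen for each distinct set of BIds.
kept_ids = keys.drop_duplicates().index

# Filter the original junction table down to the surviving AIds.
A2B_dedup = A2B[A2B["AId"].isin(kept_ids)].reset_index(drop=True)
print(A2B_dedup)
```

Filtering with isin avoids having to explode the tuples back into rows.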

Answer 1

If you can recreate the A2B table, create a new table new_A2B with a unique constraint on ('AId', 'BId'), then insert the data like this:

insert into new_A2B select distinct AId, BId from A2B;

New rows can then be inserted using SQLite's ON CONFLICT handling, like this:

insert or ignore into new_A2B values (aid, bid);
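
The two statements above can be tried from Python with the standard sqlite3 module. This is a small sketch on made-up rows in an in-memory database; the column types are an assumption. Note that it removes repeated (AId, BId) rows only, not As that share the same set of Bs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Source table containing a repeated (AId, BId) row.
conn.execute("create table A2B (AId integer, BId integer)")
conn.executemany(
    "insert into A2B values (?, ?)",
    [(1, 1), (1, 1), (1, 2), (2, 1)],
)

# New table with a unique constraint on the pair.
conn.execute(
    "create table new_A2B (AId integer, BId integer, unique (AId, BId))"
)
conn.execute("insert into new_A2B select distinct AId, BId from A2B")

# Later inserts: an existing pair is skipped silently, a new pair is stored.
conn.execute("insert or ignore into new_A2B values (?, ?)", (1, 2))
conn.execute("insert or ignore into new_A2B values (?, ?)", (2, 2))

rows = conn.execute(
    "select AId, BId from new_A2B order by AId, BId"
).fetchall()
print(rows)
```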

If you cannot recreate the A2B table, use distinct when selecting rows from it.

Edit: you can find the duplicate ids with this query:

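select A2B.aid, dup.aid
from A2B
left join A2B as dup on dup.bid = A2B.bid
group by A2B.aid, dup.aid
having count(A2B.bid) = count(dup.bid)
-- the number of shared bids must equal the full bid count of both sides,
-- otherwise an aid whose bids are a strict subset would also match
and count(A2B.bid) = (select count(bid) from A2B as t where t.aid = A2B.aid)
and count(A2B.bid) = (select count(bid) from A2B where aid = dup.aid)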

If needed, you can add a where condition so that each duplicate is reported only once, against the lower id:

where A2B.aid < dup.aid
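
As a rough illustration, the pairwise approach can be run on the question's example data through Python's sqlite3 module. The sketch below is a variant rather than the exact query above: it aliases the outer table as a, compares the shared count against the full bid count of both sides, and assumes each (aid, bid) pair occurs at most once. The names pair_query, duplicate_ids and remaining are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table A2B (aid integer, bid integer, unique (aid, bid))")
conn.executemany(
    "insert into A2B values (?, ?)",
    [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3),
     (3, 1), (3, 2), (3, 3), (3, 4)],
)

# A pair (a, dup) is a duplicate when the number of shared bids equals
# the total number of bids of a and also the total number of bids of dup.
pair_query = """
select a.aid, dup.aid
from A2B as a
join A2B as dup on dup.bid = a.bid
where a.aid < dup.aid
group by a.aid, dup.aid
having count(*) = (select count(*) from A2B as t where t.aid = a.aid)
   and count(*) = (select count(*) from A2B as t where t.aid = dup.aid)
"""
pairs = conn.execute(pair_query).fetchall()
print(pairs)

# Drop the higher id of every duplicate pair from the junction table.
duplicate_ids = sorted({dup for _, dup in pairs})
conn.executemany("delete from A2B where aid = ?", [(d,) for d in duplicate_ids])
remaining = [r[0] for r in conn.execute("select distinct aid from A2B order by aid")]
print(remaining)
```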

Maybe this query will be faster:

with
  c as (select aid, count(1) as c
    from A2B
    group by aid) 
select A2B.aid, dup.aid
from A2B
inner join c as ac on ac.aid = A2B.aid
left join A2B as dup on dup.bid = A2B.bid and A2B.aid < dup.aid 
and exists(select 1 from c where aid = dup.aid and c = ac.c)
group by A2B.aid, dup.aid
having count(A2B.bid) = count(dup.bid)
and count(A2B.bid) = (select count(bid) from A2B where aid = dup.aid)

Edit: here is one more solution you can test (possibly the fastest query):

with
  c as (select aid, min(bid) as f, max(bid) as l, count(1) as c
    --, sum(bid) as s
    from A2B
    group by aid) 
select f.aid, dup.aid
from c as f inner join c as dup 
on f.aid < dup.aid and f.f = dup.f and f.l = dup.l and f.c = dup.c 
--and f.s = dup.s
Where f.c = (
  select count(1) 
  from A2B as t1 
  inner join A2B as t2
  on t1.aid < t2.aid and t1.bid = t2.bid and t1.aid = f.aid and t2.aid = dup.aid)

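You can also try uncommenting the two lines that involve the sum: 'sum(bid) as s' in the CTE and 'and f.s = dup.s' in the join condition.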
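
The idea behind this last query is to compare a cheap per-aid fingerprint (min, max, count and optionally sum of bid) first and to run the exact join only for pairs whose fingerprints match. Below is a hedged sketch of that idea using Python's sqlite3 module; the CTE and column names (stats, lo, hi, n, total) are renamed for readability and are not from the original answer, and the extra rows for aids 4 and 5 are made up to show two different sets with identical fingerprints.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table A2B (aid integer, bid integer, unique (aid, bid))")
conn.executemany(
    "insert into A2B values (?, ?)",
    [(1, 1), (1, 2), (1, 3),
     (2, 1), (2, 2), (2, 3),
     (3, 1), (3, 2), (3, 3), (3, 4),
     # aids 4 and 5 share min, max, count and sum but are different sets
     (4, 1), (4, 2), (4, 5), (4, 6),
     (5, 1), (5, 3), (5, 4), (5, 6)],
)

fingerprint_query = """
with stats as (
  select aid, min(bid) as lo, max(bid) as hi,
         count(1) as n, sum(bid) as total
  from A2B
  group by aid)
select f.aid, dup.aid
from stats as f
inner join stats as dup
  on f.aid < dup.aid and f.lo = dup.lo and f.hi = dup.hi
 and f.n = dup.n and f.total = dup.total
-- exact check: every bid of f must also belong to dup
where f.n = (
  select count(1)
  from A2B as t1
  inner join A2B as t2
    on t1.bid = t2.bid and t1.aid = f.aid and t2.aid = dup.aid)
"""
pairs = conn.execute(fingerprint_query).fetchall()
print(pairs)
```

The fingerprint alone is not sufficient, which is why the final count comparison is kept: aids 4 and 5 pass the fingerprint test but share only two bids.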
