Finding duplicate records that share the same set of many-to-many relations

Asked: 2016-04-02 07:46:56

Tags: python sql sqlite pandas many-to-many

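I use Pandas to preprocess a CSV dataset and convert it into an SQLite database.

I have a many-to-many relationship between two entities A and B, represented by a junction DataFrame with A2B.columns == ['AId', 'BId']. The uniqueness constraint on A is that each A must have a distinct set of relations to B.

I want to efficiently remove duplicate As according to this constraint. This is how I do it in Pandas: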

AId_dedup = A2B.groupby('AId').BId.apply(tuple).drop_duplicates().index

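Converting to tuples makes it possible to compare the sets of BIds associated with each AId.

The relation A2B can be viewed as a (sparse boolean) matrix in which a 1 indicates a link between an A and a B. I want to remove the duplicate rows of this matrix; alas, pd.unstack() cannot produce a sparse matrix. (It would also need efficient row hashing.)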
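The row-hashing idea can be sketched without materializing the matrix at all. The following is only an illustration in plain Python, assuming the junction rows fit in memory; the names `rows`, `b_sets`, `first_seen` and `kept` are introduced here and are not part of the original code.

```python
from collections import defaultdict

# Junction rows (AId, BId), taken from the example in the question.
rows = [(1, 1), (1, 2), (1, 3),
        (2, 1), (2, 2), (2, 3),
        (3, 1), (3, 2), (3, 3), (3, 4)]

# Collect the set of BIds linked to each AId (one sparse matrix row each).
b_sets = defaultdict(set)
for aid, bid in rows:
    b_sets[aid].add(bid)

# A frozenset is hashable, so it can act as the key that identifies a row.
# setdefault keeps the first (lowest) AId seen for each distinct BId set.
first_seen = {}
for aid in sorted(b_sets):
    first_seen.setdefault(frozenset(b_sets[aid]), aid)

kept = sorted(first_seen.values())
print(kept)
```

Using sets rather than tuples also makes the comparison independent of the order in which the BIds appear.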

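My questions are:

  • What is the operation I am trying to perform, in terms of relational algebra?
  • Can it be done more efficiently with Pandas or with SQL (preferably using the SQLite engine)?

I want to use this operation to find synonyms (duplicate objects) in biological networks, where interactions are represented as tables.

Edit: here is an example of what I want: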

+-----+-----+
| Aid | Bid |
+-----+-----+
|   1 |   1 |
|   1 |   2 |
|   1 |   3 |
|   2 |   1 |
|   2 |   2 |
|   2 |   3 |
|   3 |   1 |
|   3 |   2 |
|   3 |   3 |
|   3 |   4 |
+-----+-----+

A2B = A2B.groupby('AId').BId.apply(tuple)
+-----+-----------+
| Aid |    Bid    |
+-----+-----------+
|   1 | (1,2,3)   |
|   2 | (1,2,3)   |
|   3 | (1,2,3,4) |
+-----+-----------+

A2B = A2B.drop_duplicates()
+-----+-----------+
| Aid |    Bid    |
+-----+-----------+
|   1 | (1,2,3)   |
|   3 | (1,2,3,4) |
+-----+-----------+

Going back to the junction table (not so easy in Pandas):

+-----+-----+
| Aid | Bid |
+-----+-----+
|   1 |   1 |
|   1 |   2 |
|   1 |   3 |
|   3 |   1 |
|   3 |   2 |
|   3 |   3 |
|   3 |   4 |
+-----+-----+
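One possible way to get back to the junction table is to filter the original rows by the surviving AIds. This is a minimal sketch, assuming the column names from the question; the sort is added so that tuple comparison does not depend on row order, and `sets` and `deduped` are names introduced here for illustration.

```python
import pandas as pd

# Junction table from the example above.
A2B = pd.DataFrame({
    "AId": [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    "BId": [1, 2, 3, 1, 2, 3, 1, 2, 3, 4],
})

# Sort first so that equal BId sets always produce equal tuples.
sets = A2B.sort_values(["AId", "BId"]).groupby("AId").BId.apply(tuple)

# Index (AIds) of the first occurrence of each distinct BId tuple.
AId_dedup = sets.drop_duplicates().index

# Keep only the junction rows that belong to a surviving AId.
deduped = A2B[A2B.AId.isin(AId_dedup)]
print(deduped)
```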

1 Answer:

Answer 0: (score: 0)

If you can recreate the A2B table: create a new table new_A2B with a unique constraint on ('AId', 'BId'), then insert the data like this:

insert into new_A2B select distinct AId, BId from A2B;

Then make new inserts using SQLite's ON CONFLICT clause, like this:

insert or ignore into new_A2B values (aid, bid);
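The two statements above can be tried from Python with the standard sqlite3 module. This is only a sketch under assumptions: the in-memory database, the column types and the sample rows are made up for illustration. Note that it removes repeated (AId, BId) rows only; it does not by itself detect As that share the same set of Bs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical source table containing one repeated (AId, BId) row.
conn.execute("create table A2B (AId integer, BId integer)")
conn.executemany("insert into A2B values (?, ?)",
                 [(1, 1), (1, 1), (1, 2), (2, 1)])

# New table with the unique constraint described above.
conn.execute(
    "create table new_A2B (AId integer, BId integer, unique (AId, BId))")
conn.execute("insert into new_A2B select distinct AId, BId from A2B")

# Later inserts: a row that violates the constraint is silently skipped.
conn.execute("insert or ignore into new_A2B values (?, ?)", (1, 2))  # skipped
conn.execute("insert or ignore into new_A2B values (?, ?)", (2, 2))  # added

result = conn.execute(
    "select AId, BId from new_A2B order by AId, BId").fetchall()
print(result)
```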

If you cannot recreate the A2B table, use distinct when selecting rows from it.

Edit: You can find the duplicate ids with this query:

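-- assumes A2B contains no repeated (aid, bid) rows
select A2B.aid, dup.aid
from A2B
left join A2B as dup on dup.bid = A2B.bid
group by A2B.aid, dup.aid
-- the number of shared bids must equal the size of BOTH sets;
-- checking only one side would also report subsets as duplicates
having count(dup.bid) = (select count(bid) from A2B as x where x.aid = A2B.aid)
and count(dup.bid) = (select count(bid) from A2B where aid = dup.aid)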

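If needed, you can add a where condition (placed before the group by) so that each duplicate pair is reported only once, with the lower id first: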

where A2B.aid < dup.aid
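A sketch of how a pairwise comparison of this kind could be run through Python's sqlite3 module. It assumes the example data from the question plus one hypothetical extra row set (aid 4, a strict subset of aid 1), and it assumes (aid, bid) rows are unique. This variant compares the number of shared bids against the sizes of both sets, so that a subset is not reported as a duplicate; the aliases `a`, `d`, `x` and `y` are introduced here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table A2B (aid integer, bid integer, unique (aid, bid))")
conn.executemany("insert into A2B values (?, ?)", [
    (1, 1), (1, 2), (1, 3),
    (2, 1), (2, 2), (2, 3),
    (3, 1), (3, 2), (3, 3), (3, 4),
    (4, 1), (4, 2),            # hypothetical: a strict subset of aid 1
])

query = """
select a.aid, d.aid
from A2B as a
join A2B as d on d.bid = a.bid
where a.aid < d.aid
group by a.aid, d.aid
-- shared bids must cover both sets completely
having count(*) = (select count(*) from A2B as x where x.aid = a.aid)
   and count(*) = (select count(*) from A2B as y where y.aid = d.aid)
order by a.aid, d.aid
"""
pairs = conn.execute(query).fetchall()
print(pairs)
```

Each returned pair is (lower id, duplicate id); aid 3 and aid 4 share bids with aid 1 but do not have the same set, so they are not reported.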

Maybe this query will be faster:

with
  c as (select aid, count(1) as c
    from A2B
    group by aid) 
select A2B.aid, dup.aid
from A2B
inner join c as ac on ac.aid = A2B.aid
left join A2B as dup on dup.bid = A2B.bid and A2B.aid < dup.aid 
and exists(select 1 from c where aid = dup.aid and c = ac.c)
group by A2B.aid, dup.aid
having count(A2B.bid) = count(dup.bid)
and count(A2B.bid) = (select count(bid) from A2B where aid = dup.aid) 

Edit: There is one more solution you can test (possibly the fastest query):

with
  -- per-aid signature: smallest bid, largest bid and number of bids
  c as (select aid, min(bid) as f, max(bid) as l, count(1) as c
    --, sum(bid) as s
    from A2B
    group by aid)
select f.aid, dup.aid
from c as f inner join c as dup
-- cheap pre-filter: only compare aids whose signatures match
on f.aid < dup.aid and f.f = dup.f and f.l = dup.l and f.c = dup.c
--and f.s = dup.s
-- exact check: the number of shared bids equals the set size
where f.c = (
  select count(1)
  from A2B as t1
  inner join A2B as t2
  on t1.aid < t2.aid and t1.bid = t2.bid and t1.aid = f.aid and t2.aid = dup.aid)

You can also try uncommenting ", sum(bid) as s" and "and f.s = dup.s".
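To tie this back to the question, here is a sketch that runs a signature-based query of this shape on the example data and then deletes the duplicate ids, which leaves the desired junction table. It assumes (aid, bid) rows are unique; the delete step and the names `pairs` and `remaining` are additions for illustration, not part of the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table A2B (aid integer, bid integer, unique (aid, bid))")
conn.executemany("insert into A2B values (?, ?)", [
    (1, 1), (1, 2), (1, 3),
    (2, 1), (2, 2), (2, 3),
    (3, 1), (3, 2), (3, 3), (3, 4),
])

query = """
with
  c as (select aid, min(bid) as f, max(bid) as l, count(1) as c
    from A2B
    group by aid)
select f.aid, dup.aid
from c as f inner join c as dup
on f.aid < dup.aid and f.f = dup.f and f.l = dup.l and f.c = dup.c
where f.c = (
  select count(1)
  from A2B as t1
  inner join A2B as t2
  on t1.aid < t2.aid and t1.bid = t2.bid
     and t1.aid = f.aid and t2.aid = dup.aid)
order by f.aid, dup.aid
"""
pairs = conn.execute(query).fetchall()

# Drop every aid that shows up as the higher id of a duplicate pair.
conn.executemany("delete from A2B where aid = ?", [(dup,) for _, dup in pairs])

remaining = conn.execute(
    "select aid, bid from A2B order by aid, bid").fetchall()
print(pairs)
print(remaining)
```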