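I use Pandas to preprocess a CSV dataset and convert it into a SQLite database.
I have a many-to-many relationship between two entities A and B, represented by a junction DataFrame with A2B.columns == ['AId', 'BId'].
The uniqueness constraint on A is that each A has a different set of relations to B.
I want to efficiently remove duplicate A's according to this constraint. This is how I do it in Pandas: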
AId_dedup = A2B.groupby('AId').BId.apply(tuple).drop_duplicates().index
Converting to tuples makes it possible to compare the sets of BIds associated with each AId.
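One caveat with the tuple approach: two tuples only compare equal if the BIds appear in the same order. Here is a minimal sketch, assuming the rows of A2B can arrive in any order, that sorts the BIds before building the key. The sample data is made up for illustration.

```python
import pandas as pd

# Hypothetical sample: A 2 has the same BIds as A 1, but in a different row order.
A2B = pd.DataFrame({
    "AId": [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    "BId": [1, 2, 3, 3, 1, 2, 1, 2, 3, 4],
})

# Build an order-independent key per AId: the sorted tuple of its distinct BIds.
keys = A2B.groupby("AId")["BId"].apply(lambda s: tuple(sorted(set(s))))

# Keep the first AId seen for each distinct key.
AId_dedup = keys.drop_duplicates().index
kept = AId_dedup.tolist()
```

Without the sort, A 1 and A 2 in this sample would be treated as different even though they link to the same B's.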
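The relation A2B can be viewed as a (sparse boolean) matrix in which 1s mark a link between an A and a B. I would like to drop the duplicate rows of this matrix, but alas, pd.unstack() cannot produce a sparse matrix. (It would also need efficient row hashing.)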
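To make the sparse-matrix view concrete without building a dense matrix, each row can be stored as the set of column indices that hold a 1, and a frozenset of those indices can serve as the row hash. This is a plain-Python sketch of that idea, not a Pandas or SciPy API; the helper name dedup_sparse_rows is made up.

```python
from collections import defaultdict


def dedup_sparse_rows(pairs):
    """Return the AIds that remain after dropping duplicate sparse rows.

    `pairs` is an iterable of (AId, BId) links, i.e. the non-zero cells.
    """
    # Sparse row representation: AId -> set of BIds that are set to 1.
    rows = defaultdict(set)
    for aid, bid in pairs:
        rows[aid].add(bid)

    # A frozenset is hashable, so it works as a row hash; keep the lowest AId.
    first_seen = {}
    for aid in sorted(rows):
        first_seen.setdefault(frozenset(rows[aid]), aid)
    return sorted(first_seen.values())


pairs = [(1, 1), (1, 2), (1, 3), (2, 3), (2, 1), (2, 2),
         (3, 1), (3, 2), (3, 3), (3, 4)]
kept = dedup_sparse_rows(pairs)
```

Memory use is proportional to the number of links rather than to the full A x B grid.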
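My question is:
I want to use this operation to find synonyms (duplicate objects) in biological networks, where the interactions are represented as tables.
Edit: here is an example of what I want: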
+-----+-----+
| AId | BId |
+-----+-----+
|   1 |   1 |
|   1 |   2 |
|   1 |   3 |
|   2 |   1 |
|   2 |   2 |
|   2 |   3 |
|   3 |   1 |
|   3 |   2 |
|   3 |   3 |
|   3 |   4 |
+-----+-----+
A2B = A2B.groupby('AId').BId.apply(tuple)
+-----+-----------+
| AId | BId       |
+-----+-----------+
|   1 | (1,2,3)   |
|   2 | (1,2,3)   |
|   3 | (1,2,3,4) |
+-----+-----------+
A2B = A2B.drop_duplicates()
+-----+-----------+
| AId | BId       |
+-----+-----------+
|   1 | (1,2,3)   |
|   3 | (1,2,3,4) |
+-----+-----------+
Back to the junction table (not so easy in Pandas):
+-----+-----+
| AId | BId |
+-----+-----+
|   1 |   1 |
|   1 |   2 |
|   1 |   3 |
|   3 |   1 |
|   3 |   2 |
|   3 |   3 |
|   3 |   4 |
+-----+-----+
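For the last step, one possible way back to the junction table is Series.explode, which turns each tuple element into its own row. This is a sketch on the example data above and assumes a Pandas version that provides explode (0.25 or later).

```python
import pandas as pd

A2B = pd.DataFrame({
    "AId": [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    "BId": [1, 2, 3, 1, 2, 3, 1, 2, 3, 4],
})

# One tuple of BIds per AId, then drop the AIds whose tuple was already seen.
grouped = A2B.groupby("AId")["BId"].apply(tuple).drop_duplicates()

# explode() yields one row per tuple element; the result has object dtype,
# so cast back to integers before restoring AId as a regular column.
junction = (
    grouped.explode()
    .astype("int64")
    .rename("BId")
    .reset_index()
)
rows = junction.values.tolist()
```

An alternative with the same result is to filter the original frame with A2B[A2B.AId.isin(grouped.index)], which avoids the round trip through tuples.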
Answer 0 (score: 0)
If you can recreate the A2B table, create a new table with a unique constraint on ('AId', 'BId'), then insert the data like this:
insert into new_A2B select distinct AId, BId from A2B;
After that, do any new inserts through SQLite with a conflict clause, like this:
insert or ignore into new_A2B values (aid, bid);
If you cannot recreate the A2B table, use distinct when selecting rows from it.
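The steps above can be tried from Python with the standard sqlite3 module. This is a minimal sketch using an in-memory database and made-up sample rows; note that it removes repeated (AId, BId) rows, not A's that share the same set of B's.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Original junction table, with one repeated (AId, BId) row for illustration.
con.execute("create table A2B (AId integer, BId integer)")
con.executemany("insert into A2B values (?, ?)",
                [(1, 1), (1, 1), (1, 2), (2, 1)])

# New table with a unique constraint on the pair, filled with distinct rows.
con.execute(
    "create table new_A2B (AId integer, BId integer, unique (AId, BId))")
con.execute("insert into new_A2B select distinct AId, BId from A2B")

# Later inserts: an existing pair is silently skipped, a new pair is stored.
con.execute("insert or ignore into new_A2B values (?, ?)", (1, 2))
con.execute("insert or ignore into new_A2B values (?, ?)", (2, 2))

rows = con.execute(
    "select AId, BId from new_A2B order by AId, BId").fetchall()
con.close()
```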
Edit:
You can find duplicate IDs with this query:
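select A2B.aid, dup.aid
from A2B
left join A2B as dup on dup.bid = A2B.bid
group by A2B.aid, dup.aid
having count(A2B.bid) = count(dup.bid)
-- the shared bids must cover every bid of dup.aid ...
and count(A2B.bid) = (select count(bid) from A2B where aid = dup.aid)
-- ... and every bid of A2B.aid, otherwise a superset would also match
and count(A2B.bid) = (select count(bid) from A2B as own where own.aid = A2B.aid)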
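If needed, you can add a where condition (placed before the group by) so that each duplicate pair is reported only once, with the lower ID first: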
where A2B.aid < dup.aid
Maybe this query will be faster:
with
c as (select aid, count(1) as c
      from A2B
      group by aid)
select A2B.aid, dup.aid
from A2B
inner join c as ac on ac.aid = A2B.aid
left join A2B as dup on dup.bid = A2B.bid and A2B.aid < dup.aid
  and exists(select 1 from c where aid = dup.aid and c = ac.c)
group by A2B.aid, dup.aid
having count(A2B.bid) = count(dup.bid)
  and count(A2B.bid) = (select count(bid) from A2B where aid = dup.aid)
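To check what this query returns, it can be run from Python against the example data from the question. This is only a test harness, assuming an in-memory SQLite database with lowercase column names aid and bid as used in the query.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table A2B (aid integer, bid integer)")
con.executemany(
    "insert into A2B values (?, ?)",
    [(1, 1), (1, 2), (1, 3),
     (2, 1), (2, 2), (2, 3),
     (3, 1), (3, 2), (3, 3), (3, 4)],
)

# The query from the answer, unchanged apart from layout.
query = """
with
c as (select aid, count(1) as c
      from A2B
      group by aid)
select A2B.aid, dup.aid
from A2B
inner join c as ac on ac.aid = A2B.aid
left join A2B as dup on dup.bid = A2B.bid and A2B.aid < dup.aid
  and exists(select 1 from c where aid = dup.aid and c = ac.c)
group by A2B.aid, dup.aid
having count(A2B.bid) = count(dup.bid)
  and count(A2B.bid) = (select count(bid) from A2B where aid = dup.aid)
"""
pairs = con.execute(query).fetchall()
con.close()
```

Each returned row is a (kept, duplicate) pair; on the example data only A 1 and A 2 share the same set of B's.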
Edit: here is one more solution you can test (possibly the fastest query):
with
c as (select aid, min(bid) as f, max(bid) as l, count(1) as c
      --, sum(bid) as s
      from A2B
      group by aid)
select f.aid, dup.aid
from c as f inner join c as dup
  on f.aid < dup.aid and f.f = dup.f and f.l = dup.l and f.c = dup.c
  --and f.s = dup.s
where f.c = (
  select count(1)
  from A2B as t1
  inner join A2B as t2
    on t1.aid < t2.aid and t1.bid = t2.bid and t1.aid = f.aid and t2.aid = dup.aid)
You can also try uncommenting "sum(bid) as s" and "and f.s = dup.s".
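Here is a sketch of that last query with the sum check uncommented, run from Python on made-up data. The data is chosen so that A 1 and A 2 agree on min, max, count and sum of their bids but link to different B's, which shows why the final exact comparison in the where clause is still needed.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table A2B (aid integer, bid integer)")
con.executemany(
    "insert into A2B values (?, ?)",
    [(1, 1), (1, 2), (1, 5), (1, 6),   # A 1 -> {1, 2, 5, 6}
     (2, 1), (2, 3), (2, 4), (2, 6),   # A 2 -> {1, 3, 4, 6}, same summary values
     (3, 1), (3, 2), (3, 5), (3, 6)],  # A 3 -> {1, 2, 5, 6}, true duplicate of A 1
)

query = """
with
c as (select aid, min(bid) as f, max(bid) as l, count(1) as c,
             sum(bid) as s
      from A2B
      group by aid)
select f.aid, dup.aid
from c as f inner join c as dup
  on f.aid < dup.aid and f.f = dup.f and f.l = dup.l
 and f.c = dup.c and f.s = dup.s
where f.c = (
  select count(1)
  from A2B as t1
  inner join A2B as t2
    on t1.aid < t2.aid and t1.bid = t2.bid
   and t1.aid = f.aid and t2.aid = dup.aid)
order by f.aid, dup.aid
"""
pairs = con.execute(query).fetchall()
con.close()
```

The summary columns act only as a cheap pre-filter; the correlated count of shared bids decides whether two A's are really duplicates.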