Find whether a combination of column elements appears in another row

Time: 2019-04-30 11:35:59

Tags: python pandas

I want to extract the rows whose combination of geneA and geneB values occurs in two or more rows. My input file data.xlsx looks like this:

chrA_x          ens_geneA       geneA   chrB            ens_geneB       geneB
chr1:92092600   ENSG00000189195 BTBD8   chr2:164084669  ENSG00000237844 AC016766.1
chr1:121498879  ENSG00000233432 AL592   chr9:2781522    ENSG00000080608 PUM3
chr1:200152569  ENSG00000116833 NR5A2   chr7:112680583  ENSG00000223646 AC002463.1
chr1:205618297  ENSG00000158711 ELK4    chr7:32968816   ENSG00000122642 FKBP9
chr1:92092600   ENSG00000189195 BTBD8   chr2:164084669  ENSG00000237844 AC016766.1
chr1:92092600   ENSG00000189195 BTBD8   chr9:2781522    ENSG00000080608 PUM3

Expected output:

chrA_x          ens_geneA       geneA   chrB            ens_geneB       geneB
chr1:92092600   ENSG00000189195 BTBD8   chr2:164084669  ENSG00000237844 AC016766.1

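So far, my code only finds rows where the geneA value is duplicated and the geneB value is duplicated separately, not rows where the combination is duplicated: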

import pandas as pd
import numpy as np

pd.options.display.max_colwidth = 100
pd.set_option('display.max_columns', None)
df =  pd.read_excel("data.xlsx")
dups = np.logical_and((df[df.duplicated(['geneA'])]), (df[df.duplicated(['geneB'])]))

1 Answer:

Answer 0 (score: 2)

You should first join the columns and then test whether that combined value is duplicated. Assuming the fields contain no commas, a comma can serve as the separator when joining them.
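The answer's original code did not survive on this page, so the following is only a minimal sketch of the approach it describes, not the answerer's exact code. It assumes the column names `geneA` and `geneB` from the question and builds a small DataFrame inline from the question's sample rows instead of reading `data.xlsx`. The `subset=` variant at the end is an alternative the answer does not mention; it avoids the separator assumption altogether.

```python
import pandas as pd

# Sample rows copied from the question; only the two gene columns are needed here.
# In the question these would come from pd.read_excel("data.xlsx").
df = pd.DataFrame({
    "geneA": ["BTBD8", "AL592", "NR5A2", "ELK4", "BTBD8", "BTBD8"],
    "geneB": ["AC016766.1", "PUM3", "AC002463.1", "FKBP9", "AC016766.1", "PUM3"],
})

# Join the two columns into one string per row.
# Assumption: no field contains a comma, so the joined value is unambiguous.
combo = df["geneA"] + "," + df["geneB"]

# duplicated() flags every occurrence after the first, so each repeated
# combination shows up once per extra copy.
dups = df[combo.duplicated()]

# Alternative without string joining: test both columns together.
dups_subset = df[df.duplicated(subset=["geneA", "geneB"])]

print(dups)
```

With the sample data, only the BTBD8 / AC016766.1 pair repeats, which matches the expected output in the question. To list every row belonging to a repeated combination, including the first occurrence, pass `keep=False` to `duplicated()`.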